Sunday, July 26, 2009

Meaning of nucleotide IUPAC codes

Today, anyone with some basic knowledge of biochemistry should be familiar with A, C, G and T, the four bases of DNA, and probably the A-T and G-C Watson-Crick base-pairs as well. The meaning of the one-letter abbreviations is very clear: A for Adenine, C for Cytosine, G for Guanine, and T for Thymine. Of course, in RNA, U (for Uracil) takes the place of the T in DNA.

In the early days when I entered the field of DNA structure, I also learned that R stands for puRine, i.e., A and G, and Y for pYrimidine, i.e., C and T (U). Trained as a chemist, I had no difficulty at all in understanding and remembering them. While processing base sequences in bioinformatics projects, I have come across the IUPAC degeneracy codes of nucleotides, such as S, W, M, K, D, V, etc., but I had never really been able to memorize what they represent, except for N (A, C, G, T).

My confusion was cleared up completely, however, by a web document I happened to find: "Nomenclature for Incompletely Specified Bases in Nucleic Acid Sequences" (1984) by the Nomenclature Committee of the International Union of Biochemistry (NC-IUB). This is the document I wish I had known of from the very beginning. For completeness of this post, here is a summary table of the full set of IUPAC codes. It is based on Table 1 in the above document, except for the uracil and gap entries:

Symbol  Meaning            Origin of designation
-----------------------------------------------------------
G       G                  Guanine
A       A                  Adenine
T       T                  Thymine
C       C                  Cytosine
U       U                  Uracil
R       G or A             puRine
Y       T or C             pYrimidine
M       A or C             aMino
K       G or T             Keto
S       G or C             Strong interaction (3 H bonds)
W       A or T             Weak interaction (2 H bonds)
H       A or C or T        not-G, H follows G in the alphabet
B       G or T or C        not-A, B follows A
V       G or C or A        not-T (not-U), V follows U
D       G or A or T        not-C, D follows C
N       G or A or T or C   aNy
. or -  gap
-----------------------------------------------------------
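
For use in a script, the table above can be written down as a simple lookup. The following is only a minimal sketch in Python: the names IUPAC_CODES, expand and matches are my own and do not come from the NC-IUB document, and U and the gap symbols are left out on the assumption that the input is plain DNA.

```python
# IUPAC nucleotide codes mapped to the bases each symbol stands for,
# transcribed from the table above. U and the gap symbols are omitted
# (assumption: DNA input only).
IUPAC_CODES = {
    "G": "G", "A": "A", "T": "T", "C": "C",
    "R": "GA", "Y": "TC", "M": "AC", "K": "GT",
    "S": "GC", "W": "AT",
    "H": "ACT", "B": "GTC", "V": "GCA", "D": "GAT",
    "N": "GATC",
}


def expand(code):
    """Return the set of bases represented by a single IUPAC symbol."""
    return set(IUPAC_CODES[code.upper()])


def matches(pattern, sequence):
    """Check whether a plain A/C/G/T sequence fits a degenerate pattern."""
    if len(pattern) != len(sequence):
        return False
    return all(
        base.upper() in expand(code)
        for code, base in zip(pattern, sequence)
    )


print(sorted(expand("S")))        # ['C', 'G']
print(matches("GAYTC", "GACTC"))  # True
print(matches("GAYTC", "GAATC"))  # False
```

Here Y accepts C or T at the third position, so GACTC fits the pattern GAYTC while GAATC does not. A symbol outside the table raises a KeyError, which seemed a reasonable default for a sketch like this.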