Saturday, August 15, 2009

Eddy's primer on the BLOSUM62 alignment score matrix

As a Linux/Unix fan, I very much like its philosophy, as summarized by Doug McIlroy: "Write programs that do one thing and do it well. Write programs to work together." In science, I likewise prefer reading an article that focuses on one point and addresses it clearly and thoroughly. Only after the components are completely understood is it possible to combine them in unique, purpose-specific ways. These days, though, such simple and clean articles are no longer as common as they were in the early days, say the 1960s or 70s. This post is the first of a series on such articles that I have found useful, or on computational tricks that I have learned over the years.


I came across the primer titled "Where did the BLOSUM62 alignment score matrix come from?" by Sean Eddy [Nat Biotechnol. 2004 Aug;22(8):1035-6] early this month through the BioConductor mailing list, where it was recommended by Dr. Philipp Pagel as a "well written article". I read the title and the short abstract on PubMed, and then downloaded the PDF version of the whole article.

As a primer, it is only two pages long, and reading it twice does not take much time. It clearly explains the meaning of the BLOSUM62 amino acid score matrix and where it comes from. The number 62, for example, stands for a threshold of 62% identity. Other thresholds, such as 80% (more conservative) or 45% (more divergent), are also possible and may be more suitable for specific applications. As noted by Eddy, "Empirically, the BLOSUM matrices have performed very well. BLOSUM62 has become a de facto standard for many protein alignment programs."

With the above background, it is much easier to understand the fundamental difference between the default scoring systems of FASTA/WU-BLASTN and NCBI BLASTN for alignment of DNA sequences: the former are optimal for alignments in the ‘twilight zone’ (65% identity), while the latter (NCBI BLASTN) is optimal for homologous DNA at a much higher 95% identity level. Being aware of this subtlety is very important for avoiding false conclusions.

More significantly (to me), there is supplementary material: a well-documented, self-contained "ANSI C program for calculating the implicit target frequencies pab of a score matrix". It clarifies every detail for those who want to get to the bottom of the topic.
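
To make the idea concrete, here is a minimal sketch of the same calculation. It is written in Python rather than ANSI C and is not Eddy's program; the function names (solve_lambda, target_frequencies, dna_matrix, implied_identity) are my own, hypothetical choices. It relies on the relation from the primer that a score matrix implies target frequencies p_ab = f_a f_b e^{lambda s_ab}, where lambda is the nonzero solution of \sum_{ab} f_a f_b e^{\lambda s_{ab}} = 1. The DNA demo assumes uniform base frequencies of 0.25, and the match/mismatch pairs +5/-4 and +1/-2 are picked for illustration only, not as a statement of any particular program's defaults.

```python
import math

def solve_lambda(scores, freqs, tol=1e-12):
    """Return the nonzero lambda with sum_ab f_a f_b exp(lambda*s_ab) = 1."""
    n = len(freqs)
    pairs = [(freqs[a] * freqs[b], scores[a][b])
             for a in range(n) for b in range(n)]

    def g(lam):   # left-hand side minus 1; g(0) = 0 is the trivial root
        return sum(w * math.exp(lam * s) for w, s in pairs) - 1.0

    def dg(lam):  # derivative of g; dg(0) is the expected score
        return sum(w * s * math.exp(lam * s) for w, s in pairs)

    if dg(0.0) >= 0.0 or max(s for _, s in pairs) <= 0:
        raise ValueError("need a negative expected score and a positive score")

    # Bracket the positive root: g < 0 just right of zero, g > 0 far right.
    hi = 1.0
    while g(hi) <= 0.0:
        hi *= 2.0
    lo = hi
    while g(lo) >= 0.0:
        lo /= 2.0

    # Bisection narrows the bracket ...
    for _ in range(30):
        mid = 0.5 * (lo + hi)
        if g(mid) > 0.0:
            hi = mid
        else:
            lo = mid

    # ... and Newton/Raphson polishes the estimate, starting right of the root.
    lam = hi
    for _ in range(50):
        step = g(lam) / dg(lam)
        lam -= step
        if abs(step) < tol:
            break
    return lam

def target_frequencies(scores, freqs):
    """Back-calculate the implied target frequencies p_ab."""
    lam = solve_lambda(scores, freqs)
    n = len(freqs)
    p = [[freqs[a] * freqs[b] * math.exp(lam * scores[a][b])
          for b in range(n)] for a in range(n)]
    return lam, p

def dna_matrix(match, mismatch):
    """4x4 match/mismatch score matrix (illustrative assumption)."""
    return [[match if a == b else mismatch for b in range(4)]
            for a in range(4)]

def implied_identity(match, mismatch):
    """Fraction of identical pairs implied by a DNA scoring scheme."""
    _, p = target_frequencies(dna_matrix(match, mismatch), [0.25] * 4)
    return sum(p[a][a] for a in range(4))

if __name__ == "__main__":
    for match, mismatch in [(5, -4), (1, -2)]:
        print(match, mismatch, implied_identity(match, mismatch))
```

Under these assumptions, +5/-4 should come out at roughly 65% implied identity and +1/-2 at roughly 95%, which is the kind of difference discussed above for DNA scoring systems. Bisection is used first because it cannot diverge once the root is bracketed; Newton/Raphson then converges quickly from the right-hand end of the bracket.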

2 comments:

  1. Hi,

    Please tell me where I can get my hands on the supplementary notes and the "ANSI C program for calculating the implicit target frequencies pab of a score matrix". I have tried looking but to no avail.

    Thanks in advance

  2. Hi,

    It is in the Supplementary Notes (doc 81K) at URL:

    http://www.nature.com/nbt/journal/v22/n8/suppinfo/nbt0804-1035_S1.html

    "A program for taking a (possibly arbitrary) alignment score matrix and back-calculating the implied target frequencies pab.

    Doing this requires solving for a nonzero lambda in: \sum_{ab} f_a f_b e^{\lambda s_{ab}} = 1 and this is a good excuse to demo two methods of root-finding: bisection search and the Newton/Raphson method.

    The program is ANSI C, and should compile on any machine with a C compiler:

        % cc -o lambda lambda.c -lm

    Any questions about this program should be addressed directly to the author."

    HTH,

    Xiang-Jun

