As a Linux/Unix fan, I very much like its philosophy, as summarized by Doug McIlroy: "Write programs that do one thing and do it well. Write programs to work together." In science, likewise, I most enjoy reading an article that focuses on one point and addresses it clearly and thoroughly. Only after fully understanding the components is it possible to combine them in unique, purpose-specific ways. These days, though, such simple, clean articles are no longer as common as they were in the early days, say the 1960s or 70s. This post is the first of a series on such articles that I've found useful, or on computational tricks that I have learned over the years.
I came across the primer titled "Where did the BLOSUM62 alignment score matrix come from?" by Sean Eddy [Nat Biotechnol. 2004 Aug;22(8):1035-6] early this month through the BioConductor mailing list, where Dr. Philipp Pagel recommended it as a "well written article". I read the title and the short abstract on PubMed, and then downloaded the PDF of the whole article.
As a primer, it is only two pages long, so reading it twice won't take much time. It clearly explains what the BLOSUM62 amino acid score matrix means and where it comes from. The number 62, for example, stands for a threshold of 62% identity. Other percentages, such as 80% (for more conserved sequences) or 45% (for more divergent ones), are also possible and may be more suitable for specific applications. As noted by Eddy, "Empirically, the BLOSUM matrices have performed very well. BLOSUM62 has become a de facto standard for many protein alignment programs."
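To make the "meaning" of a score concrete: a BLOSUM-style score is a rounded log-odds ratio that compares how often two residues are aligned in trusted alignments (the target frequency p_ab) with how often they would pair by chance (the product of background frequencies f_a and f_b). BLOSUM62 reports this in half-bit units. Here is a minimal sketch of that calculation; the function name and the input frequencies are made up for illustration and are not taken from the primer.

```python
import math


def half_bit_score(p_ab: float, f_a: float, f_b: float) -> int:
    """Rounded log-odds score in half-bit units: 2 * log2(p_ab / (f_a * f_b))."""
    return round(2.0 * math.log2(p_ab / (f_a * f_b)))


# Hypothetical inputs (illustrative only):
# a rare residue that pairs with itself far more often than chance predicts
print(half_bit_score(p_ab=0.0065, f_a=0.013, f_b=0.013))
# a pair that is aligned less often than chance predicts
print(half_bit_score(p_ab=0.001, f_a=0.05, f_b=0.05))
```

A positive score means the pair is seen more often than expected by chance, and a negative score means it is seen less often.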
With the above background, it is much easier to understand the fundamental difference between the default DNA scoring systems of FASTA/WU-BLASTN and NCBI BLASTN: the former is optimal for alignments in the ‘twilight zone’ (65% identity), while the latter (NCBI BLASTN) is optimal for homologous DNA at a much higher 95% identity level. Knowing such subtleties is very important for avoiding false conclusions.
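The link between a match/mismatch score pair and the percent identity it is tuned for can be worked out numerically. The sketch below assumes uniform base composition (each base at 25%), solves for the scale factor lambda that makes the implied target frequencies sum to one, and then reads off the implied fraction of identical positions. The function name is hypothetical, and the score pairs +5/-4 and +1/-2 are examples I picked for illustration, not values quoted in this post.

```python
import math


def implied_identity(match: int, mismatch: int, tol: float = 1e-12) -> float:
    """Fraction of identical positions implied by a DNA match/mismatch score pair.

    Assumes uniform base frequencies (0.25 each). Finds lambda > 0 such that
    0.25 * exp(lambda * match) + 0.75 * exp(lambda * mismatch) == 1 by bisection.
    """
    if match <= 0 or 0.25 * match + 0.75 * mismatch >= 0:
        raise ValueError("need a positive match score and a negative expected score")

    def excess(lam: float) -> float:
        # Sum of implied target frequencies minus 1; negative just above lambda = 0
        return 0.25 * math.exp(lam * match) + 0.75 * math.exp(lam * mismatch) - 1.0

    lo, hi = 1e-9, 1.0
    while excess(hi) <= 0:  # widen the bracket until the sign changes
        hi *= 2.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if excess(mid) > 0:
            hi = mid
        else:
            lo = mid
    lam = (lo + hi) / 2.0
    # 4 identical base pairs, each with background probability 1/16
    return 0.25 * math.exp(lam * match)


for scores in [(5, -4), (1, -2)]:
    print(scores, f"{implied_identity(*scores):.1%}")
```

Under these assumptions, +5/-4 comes out at roughly 65% identity and +1/-2 at roughly 95%, which shows how strongly the choice of scores shapes what an alignment program is best at finding.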
More significantly (to me), there is supplementary material -- a well-documented, self-contained "ANSI C program for calculating the implicit target frequencies p_ab of a score matrix": it clarifies every detail for those who want to get to the bottom of the topic.