Saturday, August 15, 2009

Eddy's primer on the BLOSUM62 alignment score matrix

As a Linux/Unix fun, I like its philosophy, as summarized by Doug McIlroy, very much: "Write programs that do one thing and do it well. Write programs to work together." In science, I enjoy more reading an article that focuses on one point and address it clearly and thoroughly. It is only after a complete understanding of the components can it be possible to combine them in unique, purpose-specific ways. Those days, though, such type of simple-and-clean articles is no longer that common as it was in the early days, say 1960s or 70s. This post is the first of a series on such articles I've found useful, or on computational tricks that I have learned over the years.

I came across the primer titled "Where did the BLOSUM62 alignment score matrix come from?" by Sean Eddy [Nat Biotechnol. 2004 Aug;22(8):1035-6] early this month through the BioConductor mailing list where it was recommended by Dr. Philipp Pagel as a "well written article". I read the title and the short abstract from PubMed, and then download the PDF version of the whole article.

As a primer, it is only two pages long (short), and reading twice won't take that much time. It explains the meaning of the BLOSUM62 amino acid score matrix and where is comes from clearly. The number 62, for example, stands for a threshold of 62% identity. Other percentages, such as 80% (more Conservative), 45% (more divergent) are also possible and may be more suitable for specific applications. As noted by Eddy, "Empirically, the BLOSUM matrices have performed very well. BLOSUM62 has become a de facto standard for many protein alignment programs."

With the above background, it is much easier to understand the fundamental difference of the default scoring systems between FASTA/WU-BLASTN vs NCBI BLASTN for alignment of DNA sequences: the former is optimal for alignments at the ‘twilight zone’ (65% identity), while the later (NCBI BLASTN) is optimal for homologous DNA at a much higher 95% identity level. Knowing of such subtlety is very important in avoid making false conclusions.

More significantly (to me), there is a supplemental material -- a well-documented, self-contained "ANSI C program for calculating the implicit target frequencies pab of a score matrix": it clarifies every details for those who want to get to the bottom of the topic.

Sunday, August 9, 2009

On third-party interfaces to 3DNA

One clear sign of 3DNA's acceptance by the nucleic-acid structure-related scientific community is its integration into a wide range of GUI/web-interfaces (e.g., NDB, w3DNA, 3D-DART etc) and many other packages. As the author and maintainer of the 3DNA software package, I am (of course) very happy to see this happen: I can easily imagine that better accessibility would make 3DNA open to an even larger community, including non-experts in nucleic acid structures (e.g., for educational purpose). It should be emphasized, however, that point-and-click interfaces to 3DNA (and any other serious scientific software, for that matter), while convenient to occasional users and for routine jobs, have their limitations.

Overall, a point-and-click interface is only sensible to well-defined, routine tasks. In the context of 3DNA, the following list of tasks (not necessarily complete) is perfectly suitable:
  • Generate one of the 55 fiber models, where everything (model number, sequence or number of repeats) can be unambiguously defined through a web-form
  • Build an arbitrary DNA model with a user-specific parameter set
  • Create a blocview image for a nucleic-acid containing structure specified in PDB format
  • Analyze a regular (i.e., not deviating too much) double helix structure, be it in A-, B-, or Z-form
  • Calculate a list of backbone torsion angles and other parameters (with the "-s" option of find_pair and then analyze)
  • Find a list of all possible (RNA) base-pairs fulfilling a specific set of geometric criteria (with the "-p" option of find_pair).
Other than those (and carefully selected/qualified other functionality), users should be very careful in what they get from running a third-party tool outputting 3DNA-related parameters. I know of the so many subtleties in using 3DNA to solve a specific problem, and most of the time the command-line driven style is the most effective, as it allows for easy automation of try-and-error iterations. Software developers who integrate 3DNA into their packages should make the limitations clear to their users, and direct any 3DNA-specific problems to the 3DNA forum.

In today's informatics world, there are so many "easy-to-use" tools available, claiming to be able to solve all sorts of problems (well, that's understandable -- otherwise, how could one get published, especially in the big journals/magazines?). Any serious scientist, however, should know what he/she is doing: it is easy to get some (magic) numbers by clicking a button, but understand clearly what the numbers mean is yet another story. I cannot emphasize more the importance of knowing one's tools, including their limitations.