Saturday, August 22, 2009

Fit a least squares plane to a set of points

In the early days when I first got into the field of DNA structures, I was interested in knowing details on how the "best mean (base-pair) plane" is actually defined. I could not find a solution from the articles I read at that time. For clarification, I sent emails to some authors whose papers mentioned the concept, still the results were less than satisfactory.

The mystery was finally solved after I read the following two articles from early days (traced initially from a comment in the source code of Dickerson's NewHelix, if I recall it correctly):
  • D. M. Blow (1960). ``To Fit a Plane to a Set of Points by Least Squares.'' Acta Cryst., 13, 168.
  • V. Schomaker et al. (1959). ``To Fit a Plane or a Line to a Set of Points by Least Squares.'' Acta Cryst., 12, 600-604.
As is clear from the titles, they focused just on least-squares plane/line fitting algorithms. More specifically, the worked example(s) help make the underlying concept clear.

As a concrete example, let's work out the least-square plane of an adenine using Octave/Matlab. The implementation here is much simpler than those in the literature cited above. Note: this example is mentioned in the 3DNA home page in the Technical Details section, so the order of atoms is kept the same as there.

The x-, y-, and z-coordinates are as follows:
xyz = [
16.461 17.015 14.676 % 91 N9
15.775 18.188 14.459 % 100 C4
14.489 18.449 14.756 % 99 N3
14.171 19.699 14.406 % 98 C2
14.933 20.644 13.839 % 97 N1
16.223 20.352 13.555 % 95 C6
16.984 21.297 12.994 % 96 N6
16.683 19.056 13.875 % 94 C5
17.918 18.439 13.718 % 93 N7
17.734 17.239 14.207 % 92 C8

Cmtx = cov(xyz)
[V, LAMBDA] = eig(Cmtx)

The covariance matrix (Cmtx), its corresponding eigenvectors (V) and eigenvalues (LAMBDA) are as follows:
Cmtx =
1.66800 -0.50154 -0.32530
-0.50154 2.06703 -0.58401
-0.32530 -0.58401 0.30605

V =
0.27367 0.83264 -0.48147
0.32243 0.39220 0.86152
0.90617 -0.39102 -0.16113

0.00001 0.00000 0.00000
0.00000 1.58452 0.00000
0.00000 0.00000 2.45656
The smallest eigenvalue of Cmtx, 0.00001 is close to zeros, indicating that the base is almost perfectly planar. The corresponding unit eigenvector, i.e., the least squares base normal is: [0.27367 0.32243 0.90617]. Of course, the geometric average of the atoms defines a point the ls-plane pass through.

Referring back to the concept of "best mean (base-pair) plane", it could simply be defined using all base atoms of the pair, or alternatively, as the mean of the two base normals. Obviously, they would lead to (slight) numerical differences. Astute viewers would also notice some other subtleties.

Sunday, August 16, 2009

Curves+ vs 3DNA

While browsing Nucleic Acids Research recently, I noticed the paper titled "Conformational analysis of nucleic acids revisited: Curves+" by Dr. Lavery et al. [advance access published online on July 22, 2009]. I read it through carefully during the weekend and played around with the software. Overall, I was fairly impressed, and also happy to see that "It [Curves+] adopts the generally accepted reference frame for nucleic acid bases and no longer shows any significant difference with analysis programs such as 3DNA for intra- or inter-base pair parameters."

Anyone who has ever worked on nucleic acid structures (especially DNA) should be familiar with Curves, an analysis program that has been widely used over the past twenty years. Only in recent years has 3DNA become popular. By and large, though, it is my opinion that 3DNA and Curves are constructive competitors in nucleic acid structure analysis with complementary functionality. As I put it six years ago, before the 13th Conversation at Albany: "Curves has special features that 3DNA does not want to repeat/compete (e.g. global parameters, groove dimension parameters). Nevertheless, we provide an option in a 3DNA utility program (find_pair) to generate input to Curves directly from a PDB data file" on June 6, 2003, and emphasized again on June 09, 2003: "We also see Curves unique in defining global parameters, bending analysis and groove dimensions." 3DNA's real strength, as demonstrated in our 2008 Nature Protocols paper, lies in its integrated approach that combines nucleic acid structure analysis, rebuilding, and visualization into a single software package (see image to the right and above).

Now the nucleic acid structure community is blessed with the new Curves+, which "is algorithmically simpler and computationally much faster than the earlier Curves approach", yet still provides its 'hallmark' curvilinear axis and "a full analysis of groove widths and depth". When I read the text, I especially liked the INTRODUCTION section, which provides a nice summary of relevant background information on nucleic acid conformational analysis. An important feature of Curves+ is its integration of the analysis of molecular dynamics trajectories. In contrast, 3DNA lacks direct support in this area (even though I know of such applications from questions posted on the 3DNA forum), mostly due to the fact that I am not an 'energetic' person. Of special note is a policy-related advantage Curves+ has over 3DNA: Curves+ is distributed freely, and with source code available. On the other hand, due to Rutgers' license constraints and various other (undocumented) reasons, 3DNA users are still having difficulty in accessing 3DNA v2.0 I compiled several months ago!

It is worth noting that the major differences in slide (+0.47 Å) and X-displacement (+0.77 Å) in Curves+ vs the old Curves (~0.5 Å and ~0.8 Å, respectively) are nearly exactly those uncovered a decade ago in "Resolving the discrepancies among nucleic acid conformational analyses" [Lu and Olson, J Mol Biol. 1999 Jan 29; 285(4):1563-75]:
Except for Curves, which defines the local frame in terms of the canonical B-DNA fiber structure (Leslie et al., 1980), the base origins are roughly coincident in the different schemes, but are significantly displaced (~0.8 Å along the positive x-axis) from the Curves reference. As illustrated below, this offset gives rise to systematic discrepancies of ~0.5 Å in slide and ~0.8 Å in global x-displacement in Curves compared with other programs, and also contributes to differences in rise at kinked steps. (p. 1566)

Please note that Curves+ has introduced new name list variables — most notably, lib= — and other subtle format changes, thus rendering the find_pair generated input files (with option '-c') no longer valid. However, it would be easy to manually edit the input file to make it work for Curves+, since the most significant part — i.e., specifying paired nucleotides — does not change. Given time and upon user request, however, I would consider to write a new script to automate the process.

Overall, it is to the user community's advantage to have both 3DNA and Curves+ or a choice between the two programs, and I am more than willing to build a bridge between them to make users' lives easier.