Xiang-Jun's Corner

Saturday, November 14, 2009

How does shear affect the twist angle of a dinucleotide step?

A recent post in the 3DNA forum, titled "NUPARM vs X3DNA twist values", made me rethink the issue of how or why shear affects the twist angle of a dinucleotide step.

To me, this problem has long been solved as demonstrated by the following two well-cited publications:

The Tsukuba report, a.k.a. "A Standard Reference Frame for the Description of Nucleic Acid Base-pair Geometry". When Dr. Olson and I were drafting this report, I clearly felt the need to caution the community about the intrinsic correlations between base-pair parameters and the associated step parameters (Figure 3 there) to avoid possible misinterpretations in structural analysis. This is especially the case for the effect of shear on twist, since the G–U wobble base-pair is common in RNA and it has a ~2.0 Å shear.

The 3DNA 2003 NAR paper. There is a subsection on the "Treatment of non-Watson–Crick base pairing motifs", and Figure 3 there addresses this issue specifically:
"Large Shear of the G–U wobble base pair influences the calculated but not the ‘observed’ Twist. The 3DNA numerical values of Twist, 20° (top) and 43° (bottom), differ from the visualization of nearly equivalent Twist suggested by the angle between successive C1'···C1' vectors (finely dotted lines)."

It was thus a bit surprising that such a question still pops up. On second thought, however, it is quite understandable: one cannot expect everyone to read those two papers, not to mention remembering such details. So I am glad that this question was brought to my attention, and it made me think about possible ways to document more thoroughly the many 3DNA-related "technical details" that are crucial for a better understanding of nucleic acid structures.

Coming back to the effect of shear on the twist angle, the figure at the left shows a G–U wobble pair example (top), and a simple rationale: the base-pair is approximately 10 Å by 5 Å (as defined in SCHNArP/3DNA), so a 2 Å shear will lead to an angle:

atan2(2, 10) * 180 / pi = 11.3 degrees

(i.e., the red dotted line relative to the bottom horizontal line).

To a first-order approximation, that is the difference between using RC8–YC6 (or C1'–C1') and using the base-centered mean y-axis of the pair for calculating the twist angle. So whenever one has a G–U wobble pair next to a normal Watson-Crick pair, there would be a ~11 degree difference in the "calculated" twist angle between the two approaches (NewHelix/CEHS/SCHNAaP/NUPARM vs 3DNA/Curves+). Moreover, when a G–U wobble is next to a U–G wobble pair, the difference would be doubled to ~23 degrees!
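The back-of-the-envelope numbers above can be reproduced with a few lines of Octave. This is only a sketch of the rough rationale, not of how any of the programs actually compute twist; the variable names (bp_length, shear, delta_one, delta_two) are made up for illustration, and the 10 Å and 2 Å values are the approximate figures quoted above.

```octave
bp_length = 10;   % approximate long edge of the base-pair block, in angstroms
shear = 2;        % approximate shear of a G-U wobble pair, in angstroms

% Tilt of the long axis of the pair caused by the shear (one wobble pair)
delta_one = atan2(shear, bp_length) * 180 / pi   % about 11.3 degrees

% G-U next to U-G: the two pairs are skewed in opposite senses,
% so the two contributions add up
delta_two = 2 * delta_one                        % about 22.6 degrees
```

The doubled value, about 22.6 degrees, is what is rounded to ~23 degrees in the text.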

It is worth mentioning that the issue here (as in other similar cases) is not which number is "correct" or which is "wrong": a number is a number. It is its interpretation that matters, and it is here that "details" do count.

Sunday, November 8, 2009

It's sad to hear that Warren DeLano, author of PyMOL, passed away

From a couple of mailing lists, I heard the sad news that Warren DeLano, author of PyMOL, passed away on Tuesday morning, November 3rd. He was only 37!

I have never met Dr. DeLano personally, nor have I even communicated with him by email, but I am very aware of PyMOL, the de facto standard nowadays for molecular graphics. In writing the 3DNA Nature Protocols paper, I dug more deeply into PyMOL. I was impressed by its interactive interface to .r3d files (Raster3D) and the high quality ray-traced images it produced. So I came up with a Perl script (x3dna_r3d2png) to convert automatically from a 3DNA generated .r3d file to a PNG image through the PyMOL engine.

Through his seminal contributions to PyMOL, Dr. DeLano achieved something very few others in computational chemistry/biology can match: he successfully mobilized literally thousands of software programmers and ordinary users from many disciplines to join him in producing phenomenal pictures, each of which is worth a thousand words!

It was due to Dr. DeLano's vision that he made PyMOL open source, so the community now has the opportunity to continue supporting and further improving the software. At this stage, however, no one is likely to know the PyMOL code to the depth Dr. DeLano did, not to mention the leadership and enthusiasm that he brought to the project. Whatever the case, the community will undoubtedly appreciate Dr. DeLano's valuable contributions.

Thanks, Dr. DeLano, for bringing PyMOL to the world!

Saturday, October 31, 2009

How to calculate torsion angle?

Given the x-, y-, and z-coordinates of four points in 3-dimensional (3D) space, how to calculate the torsion angle? Overall, this is a well-solved problem in structural biology, and one can find detailed descriptions of the algorithm in textbooks and on-line documents. The algorithm for calculating a torsion angle is implemented in virtually every software package in structural chemistry or biology.

Basic as it is, however, in my experience, it is very important to have a detailed appreciation of how the method works in order to really get into the 3D world. Here is a worked example using Octave/Matlab of my simplified, geometry-based implementation of how to calculate torsion angle, including how to determine its sign. No theory or (complicated) mathematical formula, just a step-by-step illustration of how I solve this problem.

Coordinates of four points in variable abcd:

abcd = [ 21.350  31.325  22.681
22.409  31.286  21.483
22.840  29.751  21.498
23.543  29.175  22.594 ];

Two auxiliary functions: norm_vec() to normalize a vector; get_orth_norm_vec() to get the orthogonal component (normalized) of a vector with reference to another vector, which should have been normalized.
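```
% Normalize a vector to unit length.
function ovec = norm_vec(vec)
  ovec = vec / norm(vec);
endfunction

% Remove from vec its component along vref (a unit vector),
% then normalize what is left.
function ovec = get_orth_norm_vec(vec, vref)
  temp = vec - vref * dot(vec, vref);
  ovec = norm_vec(temp);
endfunction
```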

Get three vectors: b_c is the normalized vector b→c; b_a_orth is the orthogonal component (normalized) of vector b→a with reference to b→c; c_d_orth is similarly defined, as the orthogonal component (normalized) of vector c→d with reference to b→c.

b_c = norm_vec(abcd(3, :) - abcd(2, :))
  % [0.2703158  -0.9627257   0.0094077]
b_a_orth = get_orth_norm_vec(abcd(1, :) - abcd(2, :), b_c)
  % [-0.62126  -0.16696   0.76561]
c_d_orth = get_orth_norm_vec(abcd(4, :) - abcd(3, :), b_c)
  % [0.41330   0.12486   0.90199]

Now the torsion angle is the angle between the two vectors, b_a_orth and c_d_orth, and can be easily calculated from their dot product. The sign of the torsion angle is determined by the relative orientation of the cross product of the same two vectors with reference to the middle vector b→c. Here they are in opposite directions, thus the torsion angle is negative.
```
angle_deg = acos(dot(b_a_orth, c_d_orth)) * 180 / pi
  % 65.609
% named sgn to avoid shadowing Octave's built-in sign() function
sgn = dot(cross(b_a_orth, c_d_orth), b_c)
  % -0.91075
if (sgn < 0)
  angle_deg = -angle_deg  % -65.609
endif
```

A related concept is the so-called dihedral angle, or more generally the angle between two planes. As long as the normal vectors to the two corresponding planes are defined, the angle between them is easy to work out.
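As a cross-check, the same torsion angle can be obtained from the normals of the two planes defined by points (a, b, c) and (b, c, d). The following is a minimal sketch of that plane-normal view, not code taken from SCHNAaP or 3DNA; the names b1, b2, b3, n1, n2 and b2_hat are introduced here only for illustration. Using atan2 gives the signed angle directly, without a separate sign test.

```octave
% Same four points as in the worked example above
abcd = [ 21.350  31.325  22.681
         22.409  31.286  21.483
         22.840  29.751  21.498
         23.543  29.175  22.594 ];

b1 = abcd(2, :) - abcd(1, :);   % a -> b
b2 = abcd(3, :) - abcd(2, :);   % b -> c (the middle bond)
b3 = abcd(4, :) - abcd(3, :);   % c -> d

n1 = cross(b1, b2);             % normal to the plane through a, b, c
n2 = cross(b2, b3);             % normal to the plane through b, c, d
b2_hat = b2 / norm(b2);

% The cosine part comes from n1 . n2, the sine part from (n1 x n2) . b2_hat;
% both carry the same factor |n1||n2|, so atan2 needs no normalization.
torsion_deg = atan2(dot(cross(n1, n2), b2_hat), dot(n1, n2)) * 180 / pi
  % about -65.61, the same value as in the step-by-step example
```

The unsigned angle between the two planes is simply the angle between n1 and n2; the projection onto the middle bond is what supplies the sign.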

Moreover, the method to calculate twist angles of helical nucleic acid structures in SCHNAaP and 3DNA is essentially the same.

Friday, October 30, 2009

Upgrade to Ubuntu 9.10

I have now upgraded to Ubuntu 9.10 Karmic Koala, released on October 29 by Canonical Ltd. The system is up and running, even though there are still some (minor) issues to be resolved. Overall, it was an exciting exploration, and the Internet and Google search were essential for solving most of the problems.

Normally, I am not that quick to catch up with a new software release; I wait a few more weeks until most initial bugs have been fixed. I was quick this time mainly because I had been trapped in an earlier 9.10 alpha release (in the development branch), when I tried to fix a printing problem (without success). Worse yet, my VirtualBox for running Windows XP suddenly did not start, SCIM Chinese input stopped functioning, etc. It cost me quite some time and trouble to get my Linux box back to work. So over the past few months, I did not perform any update, but eagerly waited for the stable release. It is a relief that the new official 9.10 release does solve my problems.

With Ubuntu 9.10, I now have access to OpenOffice 3.1.1 and LyX 1.6.4 for text processing, GNU Emacs 22.2.1, and GCC 4.4.1, among other things. It is nice to stay current, not only in science, but also in IT.

It is worth mentioning that Ubuntu is "an ethical concept of African origin emphasizing community, sharing and generosity." It is amazing how useful and robust the free, open-source software can be!

Friday, October 23, 2009

Sharing published data?

A letter in the recent issue of Science [VOL 326 23 OCTOBER 2009] titled "The Antidote to Bias in Research" by Allison mentioned the Nature Genetics article on the issue of widespread non-repeatability of published microarray gene expression studies. It also touched on another recent study titled "Empirical Study of Data Sharing by Authors Publishing in PLoS Journals" by Savage and Vickers [PLoS ONE 4(9): e7078]:

In another study, refusal to share data despite policies requiring sharing was nearly ubiquitous among authors publishing in Public Library of Science journals.

I then read the Savage and Vickers article, and was surprised to find that only one author out of 10 sent them an original data set. Therefore, the authors wrote:

In conclusion, our findings suggest that explicit journal policies requiring data sharing do not lead to authors making their data sets available to independent investigators.

I can understand why the repeatability rate reported in the Nature Genetics paper is so low, given the many details associated with a published table or figure. However, it is a bit hard to explain why it is so difficult to share just the original dataset associated with a published work. I would imagine that the authors would be honored that others care about their publications and contact them directly.

In my experience, I have also frequently been simply ignored when asking for clarification of some details in published work, or when requesting some datasets. The highest success rate is in asking corresponding authors for PDF reprints.

If the basic principles of scientific publications are strictly enforced, even just by the big journals (so many, nowadays), a lot of unfounded big claims would be gone or can be easily seen through. Well, I know that's only possible in an ideal world.

How to quantify the relative geometry between two base-pairs? — Part 2

In Part 1 of this series, and indeed on other occasions in my blog, the 3DNA home page and forum, I have given credit to several related programs from which 3DNA has benefited. Before going into details on how the various parameters are calculated in 3DNA, however, I feel it is in order to make clear my philosophy on creating scientific software in general, and 3DNA in particular.

Some underlying considerations

Whenever possible/applicable, I always try to get a thorough understanding of the related software available. By thorough I mean at the source code level, to see how an algorithm is implemented. The benefits of doing this are severalfold:
1. Not to reinvent the wheel, but to build upon previous work.
2. It is the most effective way to learn: many math formulae or text descriptions of algorithms, while useful for getting general ideas, lack detail or are vague in nature. In bioinformatics, most of the fundamental mathematics is not new. It is more about a novel combination of various known parts, applied to a specific problem. Reading the source code is the only way to see the implementation details unambiguously.
3. With a clear understanding of how the parameters are actually calculated, one can make better use of a software tool, know its limitations and thus avoid misinterpretations.
4. To make objective, convincing (even to the original authors) comparisons/comments on existing tools, and importantly, to create something better if needed.

Specifically, 3DNA has benefited the most from my SCHNAaP/SCHNArP programs: the underlying algorithms in 3DNA for calculating the various structural parameters (propeller, buckle, shear, roll, slide, x-displacement, inclination etc) are exactly the same as there; analyze and rebuild are direct derivatives from SCHNAaP (a for analysis) and SCHNArP (r for rebuilding/reconstruction), respectively; the ideas of standard stacking diagrams, the Zp parameter, building atomic models with sugar-phosphate backbone, and the base/bp rectangular block representations are also from SCHNAaP/SCHNArP. I also borrowed ideas from Babcock's RNA, Bansal's NUPARM, Dickerson's NewHelix/FreeHelix etc., which will be mentioned explicitly in the following sections.
Code-wise, 3DNA was created from scratch in strict ANSI C (over 25K lines). In addition to implementing a unique combination of existing algorithms, I have added many significant novel methods, including the find_pair program. However, being feature-rich has never been my goal. Instead, I only consider adding new functions that I understand clearly and find useful. On the other hand, I am always quick to fix bugs.

Overall, it seems that I am an outlier rather than the norm in scientific programming. The reasons could be as follows:

It is hard to understand other people's implementations of algorithms since one needs to read between the lines. This is especially the case with an unfamiliar computer language, or with undocumented or bad code. It is easier to create one's own program than to understand and modify third-party software.
For publication purposes, it is more attractive to work on something "new" instead of building on others' work.

Thus, in the bioinformatics field, there are far more "new" methods papers published than refinements of previous ones. However, what is claimed or appears to be novel may not be so under the hood, due to the different terminologies used in different fields. Moreover, while it is easy to create a self-claimed "new" program, others could simply do the same. In my experience, it is well worth the effort to really understand leading programs if one is serious about getting into an informatics field: it was really revealing and exciting when I went through CEHS, NewHelix/FreeHelix, RNA, Curves etc. Then I knew clearly what "the state-of-the-art" of the field was and how I could do better: that's exactly how 3DNA came about!

In the following sections, I will provide details on how 3DNA calculates the two sets of parameters to quantify the relative geometry of a dinucleotide step.

Sunday, October 18, 2009

Duplicate tab function in Adobe READER 9.1 is very handy

In reading a scientific paper, one often needs to jump back and forth to the tables, figures, references, etc. that are referred to. These days, PDF is the standard way to share an e-document, and Acroread is the norm to view its content. I used to open two copies of the same document in two instances of Acroread (or one Acroread, one xpdf) so I could read continuously in one while moving around in the other, mostly at the end for references. This works, but is obviously less than ideal.

One day, purely by chance, I (mouse) right-clicked the tab for the document I was viewing, and noticed that it popped up with two options: Detach Tab and Duplicate Tab. Clicking Duplicate Tab led to duplication of the same document in another tab. Actually, this process can be repeated more than once, thus allowing for multiple views of the same document simultaneously. Very neat!

Nowadays, whenever I read a scientific publication in Acroread, I often duplicate tab to have two views. It is only a click away to switch between the two. Thus when a citation is referred to in the main text, I can immediately see at the reference section what it is about.

A word of caution: I am using Ubuntu 9.04, with Acroread v9.1.2 05/25/2009. I have no idea from which version and on what platform such function was added.

How to quantify the relative geometry between two base-pairs? — Part 1

In the field of double helical DNA (and RNA) structures, the following two sets of parameters are normally used to quantify the relative geometry of the two base pairs in a dinucleotide step:

shift, slide, rise, tilt, roll, and twist, which I call stacking or simple step parameters
x-displacement, y-displacement, helical rise, inclination, tip, and helical twist, which I call helical parameters

3DNA calculates all of these parameters. Over the years, I have been approached quite a few times with questions on how to interpret the local helical parameters. In the literature, I have noticed numerous cases of confusion that the community still has over the two sets of parameters, especially with regard to the issue of "twist vs helical twist" and "rise vs. helical rise". This series of blog posts aims to clarify the problem by providing some background information and step-by-step worked examples illustrating how the parameters are actually calculated in 3DNA (which follows SCHNAaP/SCHNArP). In my experience, knowing how the parameters are derived is the key to understanding what they mean, and thus to avoiding misinterpretations.

Part 1 — background information

The Calladine and El-Hassan Scheme (CEHS) calculates only the six step parameters, i.e., shift, slide, rise, tilt, roll, and twist. The original CEHS implementation was a few FORTRAN subroutines taking advantage of the code base of Dickerson's well-known NewHelix program. In developing SCHNAaP/SCHNArP, a collaborative project between the then Sheffield and Cambridge groups, I also went through the source code of NewHelix to get a thorough knowledge of its internals in order to integrate its nice features into SCHNAaP.
One thing I noticed was x-displacement, a parameter characterizing the "hole" of A-DNA in top view. The global helical axis from NewHelix was (still is) appealing to me, especially for a short, relatively straight fragment. Even for a clearly non-straight duplex, e.g., in superhelical nucleosome core particle DNA, or a severely kinked DNA, the deviation from a regular linear helix serves as a good parameter to quantify its overall non-linearity. As a side note, a single so-called bending angle is often misleading: for example, the bending angle as commonly cited in the literature (mostly calculated from Curves) is strongly influenced by the two terminal base-pairs.
So I devised a parallel set of global helical parameters following the CEHS scheme of angle combination: here instead of roll-tilt, I used tip-inclination. The definition of global helical axis and the point the helix passes through are based on my simplified implementation in ANSI C of the algorithms used by NewHelix.
Thus SCHNAaP calculates both a set of local CEHS step parameters, and a set of global helical parameters — a unique combination of CEHS and NewHelix, made possible only after a thorough understanding of both methods. SCHNArP was developed to complete the circle, i.e., to rebuild a structure given a set of parameters, either the local step parameters: shift, slide, rise, tilt, roll, and twist, or the global helical parameters: x-displacement, y-displacement, helical rise, inclination, tip, and helical twist.
The RNA (Running Nucleic Acids) program by Babcock et al. calculates a set of local helical parameters. While working at Rutgers in Dr. Olson's group, I was interested in knowing how the local helical axis and the point it passes through were defined in the RNA program, since it would naturally substitute for the global one in SCHNAaP/SCHNArP to make two sets of purely local parameters. As it turned out, however, the algorithm was finally implemented in the 3DNA software package, as summarized in the email I sent out before the 13th Conversation at Albany:
The way to calculate the local helical axis and helical parameters in 3DNA is *essentially the same* as in RNA, but expressed in a much simpler way. First, the local helical axis is calculated as dx-times-dy, following Bansal, which gives the *same* result as the "single rotation axis" detailed in the RNA paper. I still could not figure out WHY, but verified this numerically. The location where the helix passes through is based directly on the RNA paper (i.e., following Chasles's theorem). The procedure to calculate helical parameters, i.e., tip/inclination/x-disp/y-disp/etc, follows the SCHNAaP paper (i.e., with tip-inclination combination, which is consistent with the roll-tilt, propeller-buckle combinations in 3DNA). What amazed us is that this much simpler tip-inclination implementation in 3DNA gives *exactly* the same numerical values as the original RNA algorithm.
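The numerical agreement between "dx-times-dy" and the single rotation axis is easy to check in Octave. The sketch below is not 3DNA code: the two frames R1 and R2 are made-up orientation matrices (columns are the x-, y- and z-axes of each base-pair frame), and rotz, roty, h_cross and h_rot are names introduced here for illustration. One possible rationale, offered as an aside: for a rotation R about an axis h, the vector (R - I)v is perpendicular to h for any v, so dx and dy are both perpendicular to h and their cross product must lie along h (up to sign).

```octave
% Elementary rotation matrices, angles in degrees
rotz = @(t) [cosd(t) -sind(t) 0; sind(t) cosd(t) 0; 0 0 1];
roty = @(t) [cosd(t) 0 sind(t); 0 1 0; -sind(t) 0 cosd(t)];

R1 = eye(3);                            % first base-pair frame (hypothetical)
R2 = rotz(18) * roty(10) * rotz(18);    % second frame (hypothetical values)

% "dx-times-dy": cross product of the differences of the x- and y-axes
dx = R2(:, 1) - R1(:, 1);
dy = R2(:, 2) - R1(:, 2);
h_cross = cross(dx, dy);
h_cross = h_cross / norm(h_cross)

% Single rotation axis of the rotation taking frame 1 to frame 2,
% read off from the antisymmetric part of the rotation matrix
R = R2 * R1';
h_rot = [R(3, 2) - R(2, 3); R(1, 3) - R(3, 1); R(2, 1) - R(1, 2)];
h_rot = h_rot / norm(h_rot)

axis_difference = norm(h_cross - h_rot)   % zero to within rounding error
```

For these frames the two unit vectors coincide; in general they are expected to agree only up to sign, and the cross product degenerates when the rotation axis is perpendicular to the z-axis of the first frame.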
In summary, the two sets of local parameters as implemented in 3DNA are based on a unique combination of nice features from several programs well-known in the community, including SCHNAaP/SCHNArP (CEHS), NewHelix, RNA, and NUPARM. However, the 3DNA calculated parameters are numerically different from any of them, partially because 3DNA adopts the standard base reference frame. To the best of my knowledge, 3DNA is the only software that allows for rigorous conversion between the two sets of local parameters — a utility program in 3DNA called step_hel illustrates this simple fact. Again, as I wrote before the 13th Conversation at Albany:
Take each base-pair as a rigid block, only 6 parameters are required to relate one bp to the other. Two sets are commonly used: one set is "shift, slide, rise, tilt, roll and twist", and the other set is "x-displacement, y-displacement, helical rise, inclination, tip, and helical twist". Obviously, these two sets of parameters should be directly convertible, as demonstrated by Calladine and Drew in their 1984 JMB B-to-A transition paper, and illustrated in 3DNA manuscript.
Links to other programs referred to in this section:
- SCHNAaP/SCHNArP by Lu et al.
- The RNA program by Babcock et al.
- FreeHelix98 by Dickerson
- NUPARM from Bansal et al.
- Standards, especially the email exchanges before the 13th Conversation at Albany

Saturday, October 10, 2009

Blogger's duo-editing modes allow for flexibility and convenience

In my experience using Blogger over the past several months, I have begun to appreciate more its double editing mode: Compose and Edit HTML. The Compose mode is a simple WYSIWYG editor, convenient for most common tasks. Once in a while, however, I get stuck with some nasty formatting issues that could drive one crazy to fix. This is where the Edit HTML mode comes in handy, which allows for full flexibility in editing raw HTML.

In principle, I am pretty competent with HTML and could use the Edit HTML mode directly. However, raw HTML is verbose and thus not that convenient. So I normally start a blog post with the Compose mode for most of the content, and switch to the Edit HTML mode only when necessary. Blogger makes the switch between the two modes a simple button click. This is in contrast to the single editing mode in phpBB3 (BBCode) used by the 3DNA forum, which does not allow for direct access to HTML.

Ideally, a software tool should be both flexible and convenient. In reality, however, not many software tools strike a balance between the two factors. Blogger is a nice example. Interestingly, in composing this post, I switched back and forth between the two editing modes on a couple of occasions: once after copying and pasting "Edit HTML", to stop the red coloring of the following text, and once to qualify the 3DNA forum link text.