Saturday, December 19, 2009

ORCID -- an international research identification system?

From the Nature news article titled "Credit where credit is due" (462:7275, p. 825, December 17, 2009), I came across the ORCID initiative:
Name ambiguity and attribution are persistent, critical problems imbedded in the scholarly research ecosystem. The ORCID Initiative represents a community effort to establish an open, independent registry that is adopted and embraced as the industry’s de facto standard. Our mission is to resolve the systemic name ambiguity, by means of assigning unique identifiers linkable to an individual's research output, to enhance the scientific discovery process and improve the efficiency of funding and collaboration.
Overall, I think it is a good idea. If properly implemented and widely adopted, ORCID could help solve many of the issues associated with the various ways of spelling a person's name due to, e.g., cultural differences. For example, put the Chinese way, one's family name comes before one's given name, just the opposite of the Western convention. Additionally, when a given name has two characters (quite common), there could be a space, a hyphen (as I normally put in Xiang-Jun), or nothing in between. Combined with possible first-name initials, there are already many ways to spell out a Chinese name.
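
The combinatorial growth of spellings is easy to see with a short sketch (the helper function and the sample name below are purely illustrative, not part of ORCID):

```python
def name_variants(family, given):
    """Sketch: enumerate common Western-style spellings of a Chinese name
    whose given name has two syllables, e.g. family="Lu", given=("Xiang", "Jun")."""
    # the two given-name syllables may be fused, hyphenated, or spaced,
    # or reduced to initials
    given_forms = [
        given[0] + given[1].lower(),            # Xiangjun
        given[0] + "-" + given[1],              # Xiang-Jun
        given[0] + " " + given[1],              # Xiang Jun
        given[0][0] + "." + given[1][0] + ".",  # X.J.
    ]
    variants = set()
    for g in given_forms:
        variants.add(family + ", " + g)  # family name first (Chinese order)
        variants.add(g + " " + family)   # given name first (Western order)
    return sorted(variants)

print(name_variants("Lu", ("Xiang", "Jun")))  # 8 distinct spellings already
```

And this is before counting capitalization quirks, middle initials, or transliteration differences.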

The above Nature article, “Credit Where Credit is Due”, helps introduce the ORCID Initiative. As a specific example, it points to another article on page 843, where Nature profiles a research group trying to "complete the reference human genome sequence, which is still full of errors nearly a decade after the first draft was announced in 2000." Nature acknowledges that "It is essential work", "But it is also work that offers few academic rewards beyond the satisfaction of a job well done — it is unlikely to result in a high profile publication." Hopefully, by adopting the ORCID system, contributions of such types (e.g., software support and maintenance) will be more properly acknowledged by the scientific community.

Given the high profile of the founding parties, I am hopeful that the ORCID initiative will move forward as promised. I will keep an eye on it and see how it evolves.

Friday, December 18, 2009

Ribosomal structure: it helps to know some background information

This year's Nobel Prize in Chemistry has been awarded to Venki Ramakrishnan, Tom Steitz, and Ada Yonath "for studies of the structure and function of the ribosome."

My connection with ribosomal structure began with the 50S large subunit of Haloarcula marismortui solved by Tom Steitz's group. Ever since the fully refined crystal structure at 2.4 Å resolution was published in 2001 (PDB entry: 1jj2; NDB code: rr0033), I have been using it to check 3DNA's applicability. In the two 3DNA papers (2003 NAR and 2008 NP), 1jj2 was used as an example to illustrate how find_pair can identify higher-order base associations in complicated RNA-containing structures. At the time, though, my understanding of the ribosomal RNA structure was purely geometrical: for quite a while, I was overwhelmed by the various biological terminologies, including the various S-es: the 50S large ribosomal subunit vs. the 23S and 5S rRNAs; and of course, the 30S small subunit vs. the 16S rRNA.

Over the past year or so, I have become more interested in RNA structures. After reading a lot of related articles, I gradually feel things are becoming clearer than before. Nevertheless, something was still missing, since my focus has (mostly) been on recent X-ray crystal structure-related work. My understanding of the ribosomal structure was finally put into context, thanks to the following two recent publications:
These two papers not only summarized the significance of the work of the three Nobel laureates — "the atomic resolution structures of the ribosomal subunits provide an extraordinary context for understanding one of the most fundamental aspects of cellular function: protein synthesis" — but also provided background information on decades of work by other players, including Harry Noller, Peter Moore, and Joachim Frank. Solving the ribosomal structure serves as a good example of how scientific research is both cooperative and competitive in nature.

Friday, December 11, 2009

Not all PDB entries are reliable; some could be plain fake

With interest, I have browsed the recent thread in the PDB mailing list (pdb-l), "Retraction of 12 Structures", posted by Michael Sadowski and followed up by Kevin Karplus et al. The story is about Krishna Murthy, a former scientist at the University of Alabama at Birmingham (UAB), who is alleged to have fabricated protein structures and published papers based on them. Here is an informative comment by firebug36 from the above link:
I am a protein crystallographer myself, so just trust me - the results this gentleman [Murthy] published were falsified, and not in a smart way. The structures [for C3b] deposited in the Protein Data Bank made no physical sense.

Allegations against the UAB group were first brought to light by several prominent people in the field, not by UAB officials:

http://www.nature.com/nature/journal/v448/n7154/full/nature06102.html

According to Kevin Karplus's post, "several of the PDB files by Krishna Murthy's group were identified as problematic in the RosettaHoles paper". Naturally, then, comes the question: "should we remove ALL the PDB files from Krishna Murthy's group as suspect?"

The way Murthy's case came to the spotlight may be the exception rather than the norm. Had he not published his C3b structure in Nature, where it caught the attention of leading crystallographers (Bert Janssen, Randy Read, Axel Brünger and Piet Gros), Murthy might still be publishing protein structures today. In a sense, it is hard to believe how Murthy could have falsified 12 protein structures and published 9 papers in prestigious journals (including Nature, Cell, PNAS, JMB, Biochemistry, JBC, etc.), which have been cited 449 times.

The PDB contains the state-of-the-art experimental data on bio-macromolecular structures. Yet the archive is certainly full of inconsistencies/errors of various types. It would be helpful to know how many PDB entries are largely or partially wrong, and which can be taken as the "gold standard" as far as data quality is concerned.

This case offers an excellent lesson for those performing data mining on macromolecular structures. Nowadays, PDB structures are numerous and keep increasing rapidly, but they are clearly of varying quality. Structural bioinformatics is about solving biology problems using informatics tools. Thus, knowing the caveats of your data (how reliable are they?) and tools (what are their limitations?) is a prerequisite to drawing sound scientific conclusions.

Sunday, December 6, 2009

3DNA in the PCCP nucleic acid simulations themed issue

While checking 3DNA-related citations through Web of Science for this past week, I found a total of nine, as follows:
  1. Five times to the 3DNA 2003 NAR paper
  2. Once to the 3DNA 2008 NP paper
  3. Three times to the 2001 standard base reference frame paper
Most interestingly, all the citations are from the same nucleic acid simulations themed issue of Physical Chemistry Chemical Physics 11 (45). Honestly, I was quite (pleasantly) surprised by this fact, so I browsed the articles online. Edited by Charles Laughton and Modesto Orozco, the 2009 PCCP "themed issue exemplifies the rich diversity of cutting-edge research in the field of nucleic acids simulation." Indeed, quite a few well-known experts are among the authors of the two perspectives and 16 papers.

While not an "energetic" person myself, over the years I have kept an eye on MD simulations and MM calculations of nucleic acid structures. It is my pleasure to see that 3DNA is being widely used (certainly more than I originally expected) by the nucleic acid simulations community. Given time, and with a suitable collaborator, I am open to considering adapting 3DNA to currently available MD simulation packages to make life easier for practitioners in this "dynamic" field.

Saturday, December 5, 2009

See the effect of C preprocessing with the -E option of gcc

Recently, I was interested in better understanding one part of a C program, which is very generic, covering a lot of ground with preprocessing directives (#if ... #endif). However, I wanted to see the minimum that covers the section I cared about. I vaguely remembered from reading the book "An Introduction to GCC" a while ago that there is an option in the gcc compiler that can stop the process right after the preprocessing step (just as the -c option stops after the compilation step). A quick check of "man gcc" revealed that it is -E:
-E Stop after the preprocessing stage; do not run the compiler proper. The output is in the form of preprocessed source code, which is sent to the standard output.

Input files which don't require preprocessing are ignored.

After setting the proper macros and running gcc with -E, I could immediately focus on the component I cared about and get what I wanted.

I have been using gcc for more than ten years, yet there are still handy tricks to uncover.


Here is a simplified example showing the effect of the -E option. The following C function is saved in file "check_gcc_E.c".

--------------------------------------------------------------------------
/* note: <stdio.h> is left out here so the -E output stays short */
#define GREETING "Hello World"

void check_gcc_E(void) {
#ifdef VERBOSE
printf("Macro VERBOSE is defined.\n");
printf("So you see: '%s'\n", GREETING);
#endif
printf("Hello everyone!\n");
}
--------------------------------------------------------------------------

gcc -E -DVERBOSE check_gcc_E.c

--------------------------------------------------------------------------
# 1 "check_gcc_E.c"
# 1 "<built-in>"
# 1 "<command-line>"
# 1 "check_gcc_E.c"

void check_gcc_E(void) {

printf("Macro VERBOSE is defined.\n");
printf("So you see: '%s'\n", "Hello World");

printf("Hello everyone!\n");
}
--------------------------------------------------------------------------

gcc -E check_gcc_E.c

--------------------------------------------------------------------------
# 1 "check_gcc_E.c"
# 1 "<built-in>"
# 1 "<command-line>"
# 1 "check_gcc_E.c"

void check_gcc_E(void) {
printf("Hello everyone!\n");
}
--------------------------------------------------------------------------

Friday, November 20, 2009

Registered COPPA Users in phpBB3

Recently, I received an email from a 3DNA user who had registered at the 3DNA forum but could not see anything at all once logged in. The 3DNA forum is based on phpBB3 and has been running for over three years now. So at the very beginning, I thought: how could that be? I had never heard of any such problem/complaint from 3DNA forum registrants before. I even created a temporary test login account and found no problem. So I communicated with the user and asked her/him to log in using my test account, and again everything was fine!

To reproduce the problem, I logged in as the user, and found one thing suspicious: the user was in the group "Registered COPPA Users", not the normal "Registered Users". I did not know what COPPA stood for. So I googled the phrase "Registered COPPA users", and the top hit led me to the phpBB3 documentation on Group Management; the section I was interested in reads as follows:
Registered COPPA users are basically the same as registered users, except that they fall under the COPPA, or Child Online Privacy Protection Act, law, meaning that they are under the age of 13 in the U.S.A. Managing the permissions this usergroup has is important in protecting these users. COPPA doesn't apply to users living outside of the U.S.A. and can be disabled altogether.

So a registered COPPA user is, by definition, under the age of 13. By default, phpBB3 does not even allow such a child to read any content! In the context of the 3DNA forum, this policy simply does not make any sense — the contents (in the public section) are viewable by anyone without registration.

It turned out that at the registration stage, the first question is: "To continue with the registration procedure please tell us when you were born." Two dynamically generated dates are given: one is "Before" a date defining an age over 13, and the other "On or after" it, for under 13. Obviously, the 3DNA user mentioned above clicked the wrong button.

Once I knew where the problem was and how it was created, fixing it was straightforward. Interestingly, when I then checked the 3DNA forum's registered users, I found five of them in the "Registered COPPA Users" group. Obviously, the previous mis-registered users did not complain — they possibly lost interest in pursuing the matter further — so the issue did not surface until recently.

In the real world we live in, what seems simple may not be. Nothing should be taken for granted.

3DNA in PDB

As mentioned previously, PDB makes use of blocview (part of 3DNA) to generate the simple yet effective images for nucleic-acid-containing structures. That's the connection I knew of between 3DNA and PDB. By pure chance, however, I recently noticed the 3DNA entry in PDB — it is actually a protein structure, completely unrelated to the 3DNA software package!

Just out of curiosity, I browsed the abstract of the Liu et al. article, titled "Halogenated benzenes bound within a non-polar cavity in T4 lysozyme provide examples of I...S and I...Se halogen-bonding" [J Mol Biol. 2009 Jan 16;385(2):595-605]. I then downloaded the full PDF version of the paper and read it through carefully. This work studied the binding interactions of benzenes with the internal cavity of the L99A mutant of T4 lysozyme. The authors demonstrated that the center of the phenyl ring can be shifted by more than one angstrom due to different halogen substitutions (where the 3DNA entry corresponds to C6H5I), and (further) proved the concept that "the protein is flexible and adapts to the size and shape of the ligand". At better than 2.0 Å resolution, they also observed the I...S and I...Se halogen bonds.

I became interested in this paper not just because of the name 3DNA, for which a quick browse of the abstract would have been sufficient. This paper also reminded me of an earlier article I published, titled "Influence of fluorine on aromatic interactions":
Non-covalent interactions between aromatic ligands influence the conformations of metal complexes, and the system [M(OAr)2L2] has been used to investigate the difference between phenyl–phenyl, phenyl–pentafluorophenyl and pentafluorophenyl–pentafluorophenyl interactions. X-Ray crystal structures show that pentafluorophenyl groups adopt partially stacked orientations with the two aromatic rings close to parallel and with significant π overlap. In contrast, phenyl groups are skewed away from each other with only edge-to-face contacts. Phenyl–pentafluorophenyl interactions adopt a coplanar fully stacked geometry. These results have been rationalised on the basis of energy calculations (carried out blind) using a variety of empirical models for treating weak non-covalent interactions. The major cause of the different behaviour of the three systems lies in the electrostatic interactions between the π systems.

Knowing the pattern of a PDB id — 4 characters long, where the first character is a numeral (in practice 1-9) and the rest can be either numerals or letters — I played around with some other possible ids with my name initials in them. Indeed I found one, 1XJL, a protein structure of human annexin A2 in the presence of calcium ions. If you are biomolecular structure-oriented, why not have a try with some ids of special meaning to you — you might be related to the PDB in some unexpected way!
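
As a sketch, the id pattern and the search for initials can be written in a few lines of Python (the regular expression reflects my reading of the legacy 4-character format, so treat it as an assumption rather than an official definition):

```python
import re

# legacy PDB id: a leading digit followed by three alphanumeric characters
PDB_ID = re.compile(r"^[1-9][A-Za-z0-9]{3}$")

assert PDB_ID.match("3DNA")      # the protein entry mentioned above
assert PDB_ID.match("1XJL")      # human annexin A2
assert not PDB_ID.match("XJL1")  # must start with a digit

# candidate ids embedding the initials "XJL": one per leading digit
candidates = ["%dXJL" % d for d in range(1, 10)]
print(candidates)
```

Checking each of the nine candidates against the PDB is then a quick manual exercise.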

Saturday, November 14, 2009

How does shear affect the twist angle of a dinucleotide step?

A recent post in the 3DNA forum, titled "NUPARM vs X3DNA twist values", made me rethink the issue of how and why shear affects the twist angle of a dinucleotide step.

To me, this problem has long been solved as demonstrated by the following two well-cited publications:
  1. The Tsukuba report, a.k.a. "A Standard Reference Frame for the Description of Nucleic Acid Base-pair Geometry". When Dr. Olson and I were drafting this report, I clearly felt the need to caution the community about the intrinsic correlations between base-pair parameters and the associated step parameters (Figure 3 there) to avoid possible misinterpretations in structural analysis. This is especially the case for the effect of shear on twist, since the G–U wobble base pair is common in RNA and it has a ~2.0 Å shear.

  2. The 3DNA 2003 NAR paper. There is a subsection on the "Treatment of non-Watson–Crick base pairing motifs", and its Figure 3 addresses the issue specifically:
    "Large Shear of the G–U wobble base pair influences the calculated but not the ‘observed’ Twist. The 3DNA numerical values of Twist, 20° (top) and 43° (bottom), differ from the visualization of nearly equivalent Twist suggested by the angle between successive C1'···C1' vectors (finely dotted lines)."
It was thus a bit surprising that such a question still pops up. On second thought, however, it is quite understandable: one cannot expect everyone to read those two papers, let alone remember such details. So I am glad that this question was brought to my attention, and it got me thinking about possible ways to document more thoroughly the many 3DNA-related "technical details" that are crucial for a better understanding of nucleic acid structures.

Coming back to the effect of shear on the twist angle, the figure at the left shows a G–U wobble pair example (top), and a simple rationale: the base pair is approximately 10 Å by 5 Å (as defined in SCHNArP/3DNA), so a 2 Å shift will lead to an angle:
atan2(2, 10) * 180 / pi = 11.3 degrees
(i.e., the red dotted line relative to the bottom horizontal line).

To a first-order approximation, that is the difference between using the RC8–YC6 (or C1'–C1') vector vs. the base-centered mean y-axis of the pair for calculating the twist angle. So whenever a G–U wobble pair sits next to a normal Watson–Crick pair, there will be a ~11 degree difference in "calculated" twist angle between the two approaches (NewHelix/CEHS/SCHNAaP/NUPARM vs. 3DNA/Curves+). Moreover, when a G–U wobble is next to a U–G wobble pair, the difference is doubled to ~23 degrees!
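
The arithmetic is easy to verify (the 10 Å and 2 Å dimensions below are the approximate values quoted in this post):

```python
from math import atan2, degrees

long_axis = 10.0  # approximate base-pair long dimension, in angstroms
shear = 2.0       # approximate shear of a G-U wobble pair, in angstroms

# angular offset introduced by the shear
offset = degrees(atan2(shear, long_axis))
print(round(offset, 1))      # ~11.3: G-U wobble next to a Watson-Crick pair

# a G-U wobble next to a U-G wobble shears in opposite directions,
# so the apparent difference in calculated twist roughly doubles
print(round(2 * offset, 1))  # ~22.6 degrees
```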

It is worth mentioning that the issue here (as in other similar cases) is not which number is "correct" or which is "wrong": a number is a number. It is its interpretation that matters, and it is here that "details" do count.

Sunday, November 8, 2009

It's sad to hear that Warren DeLano, author of PyMOL, passed away

From a couple of mailing lists, I heard the sad news that Warren DeLano, author of PyMOL, passed away on Tuesday morning, November 3rd. He was only 37!

I never met Dr. DeLano personally, nor did I ever communicate with him by email, but I am very aware of PyMOL, nowadays the de facto standard for molecular graphics. In writing the 3DNA Nature Protocols paper, I dug more deeply into PyMOL. I was impressed by its interactive interface to .r3d files (Raster3D) and the high-quality ray-traced images it produced. So I came up with a Perl script (x3dna_r3d2png) to convert a 3DNA-generated .r3d file automatically to a PNG image through the PyMOL engine.

Through his seminal contributions to PyMOL, Dr. DeLano achieved something very few others in computational chemistry/biology can match: he successfully mobilized literally thousands of software programmers and ordinary users from multiple disciplines to join him in producing phenomenal pictures, each of which is worth a thousand words!

It was due to Dr. DeLano's vision that PyMOL was made open source, so the community now has the opportunity to continue supporting and further improving the software. At this stage, however, no one is likely to know the PyMOL code to the depth Dr. DeLano did, not to mention match the leadership and enthusiasm that he brought to the project. Whatever the case, the community undoubtedly appreciates Dr. DeLano's valuable contributions.

Thanks, Dr. DeLano, for bringing PyMOL to the world!

Saturday, October 31, 2009

How to calculate a torsion angle

Given the x-, y-, and z-coordinates of four points in three-dimensional (3D) space, how does one calculate the torsion angle? Overall, this is a well-solved problem in structural biology, and one can find detailed descriptions of the algorithm in textbooks and online documents. The algorithm for calculating torsion angles is implemented in virtually every software package in structural chemistry or biology.

Basic as it is, however, in my experience it is very important to have a detailed appreciation of how the method works in order to really get into the 3D world. Here is a worked example, using Octave/Matlab, of my simplified, geometry-based implementation of how to calculate a torsion angle, including how to determine its sign. No theory or (complicated) mathematical formulae, just a step-by-step illustration of how I solve this problem.
  1. Coordinates of four points in variable abcd:
    abcd = [ 21.350  31.325  22.681
    22.409 31.286 21.483
    22.840 29.751 21.498
    23.543 29.175 22.594 ];
  2. Two auxiliary functions: norm_vec() to normalize a vector; get_orth_norm_vec() to get the orthogonal component (normalized) of a vector with reference to another vector, which should have been normalized.
    function ovec = norm_vec(vec)
    ovec = vec / norm(vec);
    endfunction

    function ovec = get_orth_norm_vec(vec, vref)
    temp = vec - vref * dot(vec, vref);
    ovec = norm_vec(temp);
    endfunction
  3. Get three vectors: b_c is the normalized vector bc; b_a_orth is the orthogonal component (normalized) of vector ba with reference to bc; c_d_orth is similarly defined, as the orthogonal component (normalized) of vector cd with reference to bc.
    b_c = norm_vec(abcd(3, :) - abcd(2, :))
    % [0.2703158 -0.9627257 0.0094077]
    b_a_orth = get_orth_norm_vec(abcd(1, :) - abcd(2, :), b_c)
    % [-0.62126 -0.16696 0.76561]
    c_d_orth = get_orth_norm_vec(abcd(4, :) - abcd(3, :), b_c)
    % [0.41330 0.12486 0.90199]
  4. Now the torsion angle is the angle between the two vectors b_a_orth and c_d_orth, and can be easily calculated from their dot product. The sign of the torsion angle is determined by the orientation of the cross product of the same two vectors relative to the middle vector bc. Here they are in opposite directions, thus the torsion angle is negative.
    angle_deg = acos(dot(b_a_orth, c_d_orth)) * 180 / pi
    % 65.609
    sign = dot(cross(b_a_orth, c_d_orth), b_c)
    % -0.91075
    if (sign < 0)
    angle_deg = -angle_deg % -65.609
    endif
A related concept is the so-called dihedral angle, or more generally the angle between two planes. As long as the normal vectors to the two corresponding planes are defined, the angle between them is easy to work out.
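
As a minimal sketch of that last point, here is the plane-plane angle computed from two normal vectors, written in Python/NumPy for brevity (the function name is mine):

```python
import numpy as np

def plane_angle(n1, n2):
    """Angle, in degrees, between two planes given their normal vectors."""
    n1 = n1 / np.linalg.norm(n1)
    n2 = n2 / np.linalg.norm(n2)
    # clip guards against values like 1.0000000000000002 from round-off
    return np.degrees(np.arccos(np.clip(np.dot(n1, n2), -1.0, 1.0)))

# the xy-plane (normal along z) vs. a plane tilted 45 degrees about the x-axis
print(plane_angle(np.array([0.0, 0.0, 1.0]),
                  np.array([0.0, 1.0, 1.0])))  # 45.0 (up to round-off)
```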

Moreover, the method to calculate twist angles of helical nucleic acid structures in SCHNAaP and 3DNA is essentially the same.

Friday, October 30, 2009

Upgrade to Ubuntu 9.10

I have now upgraded to Ubuntu 9.10 Karmic Koala, released on October 29 by Canonical Ltd. The system is up and running, even though there are still some (minor) issues to be resolved. Overall, it was an exciting exploration, and the Internet and Google searches were essential for solving most of the problems.

Normally, I am not that quick to catch up with a new software release, but rather wait a few weeks until most initial bugs have been fixed. I was quick this time mainly because I had been trapped in an earlier 9.10 alpha release (in the development branch), when I tried to fix a printing problem (without success). Worse yet, suddenly my VirtualBox for running Windows XP would not start, SCIM Chinese input stopped functioning, etc. It cost me quite some time and trouble to get my Linux box back to work. So over the past few months, I did not perform any updates, but eagerly waited for the stable release. It is a relief that the new official 9.10 release does solve my problems.

With Ubuntu 9.10, now I have access to OpenOffice 3.1.1 and Lyx 1.6.4 for text processing, GNU Emacs 22.2.1, and GCC 4.4.1, among other things. It is nice to stay current, not only in science, but also in IT.

It is worth mentioning that Ubuntu is "an ethical concept of African origin emphasizing community, sharing and generosity." It is amazing how useful and robust the free, open-source software can be!

Friday, October 23, 2009

Sharing published data?

A letter in the recent issue of Science [Vol. 326, 23 October 2009] titled "The Antidote to Bias in Research" by Allison mentioned the Nature Genetics article on the issue of widespread non-repeatability of published microarray gene expression studies. It also touched on another recent study, titled "Empirical Study of Data Sharing by Authors Publishing in PLoS Journals" by Savage and Vickers [PLoS ONE 4(9): e7078]:
In another study, refusal to share data despite policies requiring sharing was nearly ubiquitous among authors publishing in Public Library of Science journals.
I then read the Savage and Vickers article, and was surprised to find that only one author out of 10 sent them an original data set. Therefore, the authors wrote:
In conclusion, our findings suggest that explicit journal policies requiring data sharing do not lead to authors making their data sets available to independent investigators.
I can understand from the Nature Genetics paper why the repeatability rate is so low, given the many details associated with a published table or figure. However, it is a bit hard to explain why it is so difficult to share just the original dataset associated with a published work. I would imagine that the authors would be honored that others care about their publications and contact them directly.

In my experience, I have also frequently been simply ignored when asking for clarification of details in published work, or when requesting datasets. The highest success rate is in asking corresponding authors for PDF reprints.

If the basic principles of scientific publication were strictly enforced, even just by the big journals (so many, nowadays), a lot of unfounded big claims would be gone or could easily be seen through. Well, I know that's only possible in an ideal world.

How to quantify the relative geometry between two base pairs — Part 2

In Part 1 of this series, and indeed on other occasions in my blog, the 3DNA home page, and the forum, I have credited several related programs from which 3DNA has benefited. Before going into details on how the various parameters are calculated in 3DNA, however, I feel it is in order to make clear my philosophy on creating scientific software in general, and 3DNA in particular.

Some underlying considerations

  • Whenever possible/applicable, I always try to get a thorough understanding of the related software available. By thorough I mean at the source code level, to see how an algorithm is implemented. The benefits of doing this are severalfold:
    1. Not to reinvent the wheel, but to build upon previous work.
    2. It is the most effective way to learn — many math formulae or textual descriptions of algorithms, while useful for getting general ideas, are lacking in detail or vague in nature. In bioinformatics, most of the fundamental mathematics is not new. It is more about a novel combination of various known parts, applied to a specific problem. Reading the source code is the only way to see unambiguously the implementation details.
    3. With a clear understanding of how the parameters are actually calculated, one can make better use of a software tool, know its limitations and thus avoid misinterpretations.
    4. To make objective, convincing (even to the original authors) comparisons/comments on existing tools, and importantly, to create something better if needed.
  • Specifically, 3DNA has benefited the most from my SCHNAaP/SCHNArP programs: the underlying algorithms in 3DNA for calculating the various structural parameters (propeller, buckle, shear, roll, slide, x-displacement, inclination, etc.) are exactly the same as there — analyze and rebuild are direct derivatives of SCHNAaP (a for analysis) and SCHNArP (r for rebuilding/reconstruction), respectively; the ideas of standard stacking diagrams, the Zp parameter, building atomic models with a sugar-phosphate backbone, and the base/bp rectangular block representations are also from SCHNAaP/SCHNArP. I also borrowed ideas from Babcock's RNA, Bansal's NUPARM, Dickerson's NewHelix/FreeHelix, etc., which will be mentioned explicitly in the following sections.
  • Code-wise, 3DNA was created from scratch in strict ANSI C (over 25K lines). In addition to implementing a unique combination of existing algorithms, I have added many significant novel methods, including the find_pair program. However, being feature-rich has never been my goal. Instead, I only consider adding new functions that I understand clearly and find useful. On the other hand, I am always quick to fix bugs.
Overall, it seems that I am an outlier rather than the norm in scientific programming. The reasons could be as follows:
  1. It is hard to understand other people's implementations of algorithms, since it requires reading between the lines. This is especially the case with an unfamiliar computer language, or with undocumented or bad code. It is easier to create one's own program than to understand and modify third-party software.
  2. For publication purposes, it is more attractive to work on something "new" instead of building on others' work.
Thus, in the bioinformatics field, there are far more "new" methods papers published than refinements of previous ones. However, what is claimed or appears to be novel may not be the case under the hood, due to different terminologies used in different fields. Moreover, while it is easy to create a self-claimed "new" program, others could simply do the same. In my experience, it is well worth the effort to really understand the leading programs if one is serious about getting into an informatics field: it was really revealing and exciting when I went through CEHS, NewHelix/FreeHelix, RNA, Curves, etc. Then I knew clearly what the "state-of-the-art" of the field was and how I could do better: that's exactly how 3DNA came about!

In the following sections, I will provide details on how 3DNA calculates the two sets of parameters that quantify the relative geometry of a dinucleotide step.

Sunday, October 18, 2009

The Duplicate Tab function in Adobe Reader 9.1 is very handy

In reading a scientific paper, one often needs to jump back and forth to referenced tables, figures, references, etc. These days, PDF is the standard way to share an e-document, and Acroread is the norm for viewing its content. I used to open two copies of the same document in two instances of Acroread (or one Acroread, one xpdf) so I could read continuously in one while moving around in the other, mostly at the end for references. This works, but is obviously less than ideal.

One day, purely by chance, I right-clicked (with the mouse) the tab of the document I was viewing, and noticed a popup with two options: Detach Tab and Duplicate Tab. Clicking Duplicate Tab duplicated the same document in another tab. Actually, this process can be repeated more than once, allowing for multiple simultaneous views of the same document. Very neat!

Nowadays, whenever I read a scientific publication in Acroread, I often duplicate the tab to have two views. It is only a click away to switch between the two. Thus, when a citation is referred to in the main text, I can immediately see in the reference section what it is about.

A word of caution: I am using Ubuntu 9.04, with Acroread v9.1.2 (05/25/2009). I have no idea in which version and on which platforms this function was added.

How to quantify the relative geometry between two base pairs — Part 1

In the field of double-helical DNA (and RNA) structures, the following two sets of parameters are normally used to quantify the relative geometry of the two base pairs in a dinucleotide step:
  • shift, slide, rise, tilt, roll, and twist — which I call stacking or simple step parameters
  • x-displacement, y-displacement, helical rise, inclination, tip, and helical twist — which I call helical parameters
3DNA calculates all of these parameters. Over the years, I have been approached quite a few times with questions about interpreting the local helical parameters. In the literature, I've noticed numerous confusions the community still has over the two sets of parameters, especially with regard to "twist vs. helical twist" and "rise vs. helical rise". This series of blog posts aims to clarify the problem by providing some background information and step-by-step worked examples illustrating how the parameters are actually calculated in 3DNA (which follows SCHNAaP/SCHNArP). In my experience, knowing how the parameters are derived is the key to understanding what they mean, and thus to avoiding misinterpretations.

Part 1 — background information

  1. The Calladine and El-Hassan Scheme (CEHS) calculates only the six step parameters, i.e., shift, slide, rise, tilt, roll, and twist. The original CEHS implementation was a few FORTRAN subroutines taking advantage of the code base of Dickerson's well-known NewHelix program. In developing SCHNAaP/SCHNArP, a collaborative project between the then Sheffield and Cambridge groups, I also went through the source code of NewHelix to get a thorough knowledge of its internals, in order to integrate its nice features into SCHNAaP.

    One thing I noticed was x-displacement, a parameter characterizing the "hole" of A-DNA in a top view. The global helical axis from NewHelix was (and still is) appealing to me, especially for a short, relatively straight fragment. Even for a clearly non-straight duplex, e.g., the superhelical DNA in the nucleosome core particle, or a severely kinked DNA, the deviation from a regular linear helix serves as a good measure of its overall non-linearity. As a side note, a single so-called bending angle is often misleading — for example, the bending angle as commonly cited in the literature (mostly calculated with Curves) is strongly influenced by the two terminal base pairs.

    So I devised a parallel set of global helical parameters following the CEHS scheme of angle combination: instead of roll-tilt, I used tip-inclination. The definitions of the global helical axis and the point the helix passes through are based on my simplified ANSI C implementation of the algorithms used by NewHelix.

    Thus SCHNAaP calculates both a set of local CEHS step parameters, and a set of global helical parameters — a unique combination of CEHS and NewHelix, made possible only after a thorough understanding of both methods. SCHNArP was developed to complete the circle, i.e., to rebuild a structure given a set of parameters, either the local step parameters: shift, slide, rise, tilt, roll, and twist, or the global helical parameters: x-displacement, y-displacement, helical rise, inclination, tip, and helical twist.

  2. The RNA (Running Nucleic Acids) program by Babcock et al. calculates a set of local helical parameters. While working in Dr. Olson's group at Rutgers, I was interested in knowing how the local helical axis and the point it passes through were defined in the RNA program, since it would naturally substitute for the global one in SCHNAaP/SCHNArP to make two sets of purely local parameters. The algorithm was eventually implemented in the 3DNA software package, as summarized in the email I sent out before the 13th Conversation at Albany:
    The way to calculate the local helical axis and helical parameters in 3DNA is *essentially the same* as in RNA, but expressed in a much simpler way. First, the local helical axis is calculated as dx-times-dy, following Bansal, which gives the *same* result as the "single rotation axis" detailed in the RNA paper. I still could not figure out WHY, but verified this numerically. The location where the helix passes through is based directly on the RNA paper (i.e., following Chasles's theorem). The procedure to calculate helical parameters, i.e., tip/inclination/x-disp/y-disp/etc, following the SCHNAaP paper (i.e, with tip-inclination combination, which is consistent with the roll-tilt, propeller-buckle combinations in 3DNA). What amazed us is that this much simpler tip-inclination implementation in 3DNA gives *exactly* the same numerical values as the original RNA algorithm.
  3. In summary, the two sets of local parameters as implemented in 3DNA are based on a unique combination of nice features from several programs well-known in the community, including SCHNAaP/SCHNArP (CEHS), NewHelix, RNA, and NUPARM. However, the 3DNA calculated parameters are numerically different from any of them, partially because 3DNA adopts the standard base reference frame. To the best of my knowledge, 3DNA is the only software that allows for rigorous conversion between the two sets of local parameters — a utility program in 3DNA called step_hel illustrates this simple fact. Again, as I wrote before the 13th Conversation at Albany:
    Take each base-pair as a rigid block, only 6 parameters are required to relate one bp to the other. Two sets are commonly used: one set is "shift, slide, rise, tilt, roll and twist", and the other set is "x-displacement, y-displacement, helical rise, inclination, tip, and helical twist". Obviously, these two sets of parameters should be directly convertible, as demonstrated by Calladine and Drew in their 1984 JMB B-to-A transition paper, and illustrated in 3DNA manuscript.
  4. Links to other programs referred to in this section:
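As a side note to the email quoted in point 2 above, the dx-times-dy rule is easy to play with numerically. Below is a minimal numpy sketch (function names are mine, not 3DNA's) that builds two frames related by an arbitrary rotation, computes the dx × dy axis, and checks that it is parallel to the single rotation axis. The reason turns out to be simple: a rotation leaves the axis component of any vector unchanged, so both dx and dy are perpendicular to the rotation axis, and their cross product must therefore lie along it.

```python
import numpy as np

def rotation_matrix(axis, theta):
    """Rodrigues' formula: rotation by theta (radians) about a unit axis."""
    x, y, z = axis / np.linalg.norm(axis)
    K = np.array([[0.0, -z, y], [z, 0.0, -x], [-y, x, 0.0]])
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)

def helical_axis(frame1, frame2):
    """dx-times-dy axis: cross product of the differences of the two
    frames' x-axes and y-axes (frames stored as 3x3 column matrices)."""
    dx = frame2[:, 0] - frame1[:, 0]
    dy = frame2[:, 1] - frame1[:, 1]
    h = np.cross(dx, dy)
    return h / np.linalg.norm(h)

# Two frames related by a rotation about an arbitrary (made-up) axis
axis = np.array([0.3, -0.5, 0.81])
axis /= np.linalg.norm(axis)
frame1 = np.eye(3)
frame2 = rotation_matrix(axis, np.radians(36.0)) @ frame1

h = helical_axis(frame1, frame2)
# A rotation leaves the axis component of any vector unchanged, so dx and
# dy are both perpendicular to the rotation axis; dx x dy is therefore
# parallel to the "single rotation axis" of the step (up to sign).
print(np.allclose(abs(np.dot(h, axis)), 1.0))  # True
```

This is only an illustration of the geometric fact, not of 3DNA's full procedure (which goes on to locate the point the helix passes through via Chasles's theorem).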

Saturday, October 10, 2009

Blogger's dual editing modes allow for flexibility and convenience

In my experience using Blogger over the past several months, I have come to appreciate its dual editing modes: Compose and Edit HTML. The Compose mode is a simple WYSIWYG editor, convenient for most common tasks. Once in a while, however, I get stuck on some nasty formatting issue that could drive one crazy to fix. This is where the Edit HTML mode comes in handy: it allows full flexibility in editing the raw HTML.

In principle, I am competent enough with HTML to use the Edit HTML mode directly. However, raw HTML is verbose and thus not that convenient. So I normally start a blog post in the Compose mode for most of the content, and switch to the Edit HTML mode only when necessary. Blogger makes switching between the two modes a simple button click. This is in contrast to the single editing mode (BBCode) of phpBB3, used by the 3DNA forum, which does not allow direct access to HTML.

Ideally, a software tool should be both flexible and convenient. In reality, however, not many software tools strike a balance between the two. Blogger is a nice example. Interestingly, in composing this post, I switched back and forth between the two editing modes on a couple of occasions: once after a copy-and-paste, to stop the red coloring of the following text, and once to qualify the 3DNA forum link text.

Friday, October 9, 2009

Fiber models in 3DNA make it easy to build regular DNA helices

3DNA contains 55 fiber models compiled from literature. To the best of my knowledge, this is the most comprehensive collection of its kind (detailed list given below), including:
  1. Chandrasekaran & Arnott (from #1 to #43) — the most well-known set of fiber models
  2. Alexeev et al. (#44-#45)
  3. van Dam & Levitt (#46-#47)
  4. Premilat & Albiser (#48-#55)
The utility program fiber puts the generation of all these fiber models behind a simple, consistent interface, and can generate structures in either standard PDB or PDBML format. Some of the models can be built with an arbitrary sequence of A, C, G and T (e.g., A-/B-/C-DNA from calf thymus), while others are of fixed sequence (e.g., Z-DNA with GC repeats). The sequence can be specified either on the command line or in a plain text file.

Once 3DNA is properly installed, the command-line interface is the most versatile and convenient way to generate, e.g., a regular double-stranded DNA (mostly B-DNA) of arbitrary sequence. Moreover, the w3DNA and 3DART web interfaces to 3DNA make it easy to generate a regular DNA model, especially for occasional use or educational purposes.

Theoretically, there is nothing worth showing off in 3DNA's fiber model generation functionality. However, it serves as a clear example of the difference between a "proof of concept" and a practical software application. I initially decided to work on this issue simply for my own convenience. At that time, I had access to A-DNA and B-DNA fiber model generators, each as a separate program. Moreover, the constructed models did not comply with the PDB format in atom naming and other subtleties.

I started with the Chandrasekaran & Arnott fiber models, for which I had a copy of the data files. Nevertheless, there were many details to work out, typos to correct, etc., to put them in a consistent framework. For the other models, I read each original publication and typed the raw atomic cylindrical coordinates for each model into the computer. Again, quite a few inconsistencies popped up between the different publications, which span decades.

Overall, it was quite a tedious undertaking, requiring great attention to detail. I am glad that I did it: I learned so much from the process, and more importantly, others can benefit from my effort. As I put it in the 3DNA Nature Protocols paper (BOX 6 | FIBER-DIFFRACTION MODELS),

In preparing this set of fiber models, we have taken great care to ensure the accuracy and consistency of the models. For completeness and user verification, 3DNA includes, in addition to 3DNA-processed files, the original coordinates collected from the literature.

For those who want to understand what's going on under the hood, there is no better way than to try to reproduce the process using, e.g., fiber B-DNA as an example.

From the very beginning, I had expected the fiber functionality to be easily reachable and thus beneficial to all those who are interested in nucleic acid structures, especially for building a regular DNA duplex of a chosen sequence. My sense is that the fiber program has not been as widely used as it should have been, probably because many people are not aware of its existence or capabilities. Hopefully, this blog post will help get my message across.


PS. Given below is the content of the README file for fiber models in 3DNA
1. The repeating units of each fiber structure are mostly based on the
   work of Chandrasekaran & Arnott (from #1 to #43). More recent fiber
   models are based on Alexeev et al. (#44-#45), van Dam & Levitt (#46
   -#47) and Premilat & Albiser (#48-#55).

2. Clean up of each residue
   a. currently ignore hydrogen atoms [can be easily added]
   b. change ME/C7 group of thymine to C5M
   c. re-assign O3' atom to be attached with C3'
   d. change distance unit from nm to A [most of the entries]
   e. re-ordering atoms according to the NDB convention

3. Fix up of problem structures.
   a. str#8 has no N9 atom for guanine
   b. str#10 is not available from the disk, manually input
   c. str#14 C5M atom was named C5 for thymine, resulting in two C5 atoms
   d. str#17 has wrong assignment of O3' atom on Guanine
   e. str#33 has wrong C6 position in U3
   f. str#37 to str#41 were typed in manually following Arnott's
        new list as given in "Oxford Handbook of Nucleic Acid Structure"
        edited by S. Neidle (Oxford Press, 1999)
   g. str#38 coordinates for N6(A) and N3(T) are WRONG as given in the
        original literature
   h. str#39 and #40 have the same O3' coordinates for the 2nd strand

4. str#44 & 45 have fixed strand II residues (T)

5. str#46 & 47 have +z-axis upwards (based on BI.pdb & BII.pdb)

6. str#48 to 55 have +z-axis upwards

List of 55 fiber structures

id#  Twist   Rise        Structure description
    (degrees)  (A)
-------------------------------------------------------------------------------
 1   32.7   2.548  A-DNA  (calf thymus; generic sequence: A, C, G and T)
 2   65.5   5.095  A-DNA  poly d(ABr5U) : poly d(ABr5U)
 3    0.0  28.030  A-DNA  (calf thymus) poly d(A1T2C3G4G5A6A7T8G9G10T11) :
                                        poly d(A1C2C3A4T5T6C7C8G9A10T11)
 4   36.0   3.375  B-DNA  (calf thymus; generic sequence: A, C, G and T)
 5   72.0   6.720  B-DNA  poly d(CG) : poly d(CG)
 6  180.0  16.864  B-DNA  (calf thymus) poly d(C1C2C3C4C5) : poly d(G6G7G8G9G10)
 7   38.6   3.310  C-DNA  (calf thymus; generic sequence: A, C, G and T)
 8   40.0   3.312  C-DNA  poly d(GGT) : poly d(ACC)
 9  120.0   9.937  C-DNA  poly d(G1G2T3) : poly d(A4C5C6)
10   80.0   6.467  C-DNA  poly d(AG) : poly d(CT)
11   80.0   6.467  C-DNA  poly d(A1G2) : poly d(C3T4)
12   45.0   3.013  D-DNA  poly d(AAT) : poly d(ATT)
13   90.0   6.125  D-DNA  poly d(CI) : poly d(CI)
14  -90.0  18.500  D-DNA  poly d(A1T2A3T4A5T6) : poly d(A1T2A3T4A5T6)
15  -60.0   7.250  Z-DNA  poly d(GC) : poly d(GC)
16  -51.4   7.571  Z-DNA  poly d(As4T) : poly d(As4T)
17    0.0  10.200  L-DNA  (calf thymus) poly d(GC) : poly d(GC)
18   36.0   3.230  B'-DNA alpha poly d(A) : poly d(T) (H-DNA)
19   36.0   3.233  B'-DNA beta2 poly d(A) : poly d(T) (H-DNA  beta)
20   32.7   2.812  A-RNA  poly (A) : poly (U)
21   30.0   3.000  A'-RNA poly (I) : poly (C)
22   32.7   2.560  Hybrid poly (A) : poly d(T)
23   32.0   2.780  Hybrid poly d(G) : poly (C)
24   36.0   3.130  Hybrid poly d(I) : poly (C)
25   32.7   3.060  Hybrid poly d(A) : poly (U)
26   36.0   3.010  10-fold poly (X) : poly (X)
27   32.7   2.518  11-fold poly (X) : poly (X)
28   32.7   2.596  Poly (s2U) : poly (s2U) (symmetric base-pair)
29   32.7   2.596  Poly (s2U) : poly (s2U) (asymmetric base-pair)
30   32.7   3.160  Poly d(C) : poly d(I) : poly d(C)
31   30.0   3.260  Poly d(T) : poly d(A) : poly d(T)
32   32.7   3.040  Poly (U) : poly (A) : poly(U) (11-fold)
33   30.0   3.040  Poly (U) : poly (A) : poly(U) (12-fold)
34   30.0   3.290  Poly (I) : poly (A) : poly(I)
35   31.3   3.410  Poly (I) : poly (I) : poly(I) : poly(I)
36   60.0   3.155  Poly (C) or poly (mC) or poly (eC)
37   36.0   3.200  B'-DNA beta2  Poly d(A) : poly d(U)
38   36.0   3.240  B'-DNA beta1  Poly d(A) : poly d(T)
39   72.0   6.480  B'-DNA beta2  Poly d(AI) : poly d(CT)
40   72.0   6.460  B'-DNA beta1  Poly d(AI) : poly d(CT)
41  144.0  13.540  B'-DNA  Poly d(AATT) : poly d(AATT)
42   32.7   3.040  Poly(U) : poly d(A) : poly(U) [cf. #32]
43   36.0   3.200  Beta Poly d(A) : Poly d(U) [cf. #37]
44   36.0   3.233  Poly d(A) : poly d(T) (Ca salt)
45   36.0   3.233  Poly d(A) : poly d(T) (Na salt)
46   36.0   3.38   B-DNA (BI-type nucleotides; generic sequence: A, C, G and T)
47   40.0   3.32   C-DNA (BII-type nucleotides; generic sequence: A, C, G and T)
 48   87.8   6.02   D(A)-DNA  poly d(AT) : poly d(AT) (right-handed)
 49   60.0   7.20   S-DNA  poly d(CG) : poly d(CG) (C_BG_A, right-handed)
 50   60.0   7.20   S-DNA  poly d(GC) : poly d(GC) (C_AG_B, right-handed)
51   31.6   3.22   B*-DNA  poly d(A) : poly d(T)
52   90.0   6.06   D(B)-DNA  poly d(AT) : poly d(AT) [cf. #48]
 53  -38.7   3.29   C-DNA (generic sequence: A, C, G and T) (deprecated)
54   32.73  2.56   A-DNA (generic sequence: A, C, G and T) [cf. #1]
55   36.0   3.39   B-DNA (generic sequence: A, C, G and T) [cf. #4]
-------------------------------------------------------------------------------
#1-#41 based on Struther Arnott: ``Polynucleotide secondary structures:
     an historical perspective'', pp. 1-38 in ``Oxford Handbook of Nucleic
     Acid Structure'' edited by Stephen Neidle (Oxford Press, 1999).

#42-#43 based on Chandrasekaran & Arnott: ``The Structures of DNA
     and RNA Helices in Oriented Fibers'', pp. 31-170 in ``Landolt-Bornstein
     Numerical Data and Functional Relationships in Science and Technology''
     edited by W. Saenger (Springer-Verlag, 1990).

#44-#45 based on Alexeev et al., ``The structure of poly(dA) . poly(dT)
     as revealed by an X-ray fiber diffraction''. J. Biomol. Str. Dyn, 4,
     pp. 989-1011, 1987.

#46-#47 based on van Dam & Levitt, ``BII nucleotides in the B and C forms
     of natural-sequence polymeric DNA: a new model for the C form of DNA''.
     J. Mol. Biol., 304, pp. 541-561, 2000.

#48-#55 based on Premilat & Albiser, ``A new D-DNA form of poly(dA-dT) .
     poly(dA-dT): an A-DNA type structure with reversed Hoogsteen Pairing''.
     Eur. Biophys. J., 30, pp. 404-410, 2001 (and several other publications).

Sunday, October 4, 2009

3DNA in molecular dynamics simulations

While updating 3DNA citations last Friday, I came across the paper "Flexibility of Short-Strand RNA in Aqueous Solution as Revealed by Molecular Dynamics Simulation: Are A-RNA and A′-RNA Distinct Conformational Structures?" by Ouyang et al. [Aust. J. Chem. 2009, 62, 1054–1061]. Through molecular dynamics (MD) simulations over a 30 ns period, the authors found that "the identification of distinct A-RNA and A′-RNA structures ... may not be generally relevant in the context of RNA in the aqueous phase." Overall, the paper is nicely written.

I have never performed MD simulations in my research experience, so normally I only read abstracts of such publications just to get a general idea of the main conclusions. What had attracted my attention to this work was its simultaneous citations to three earlier papers:
This is quite unusual, since nowadays it is far more common to cite only 3DNA itself (mostly the 2003 NAR and/or 2008 NP papers). So I decided to have a look at the whole paper. It turned out that the authors used the extra citations to justify their choice of 3DNA over Curves for calculating the helical parameters, including
"the three main structural descriptors commonly used to differentiate between the two forms of RNA – namely major groove width, inclination and the number of base pairs in a helical twist [turn]".
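The third descriptor in the quote above is simple arithmetic: the number of base pairs per helical turn is 360° divided by the helical twist. A trivial check, using the twist values for generic fiber B-DNA and A-DNA from the list in my earlier fiber-models post:

```python
def bp_per_turn(helical_twist_deg):
    """Base pairs per helical turn, from the helical twist in degrees/step."""
    return 360.0 / helical_twist_deg

print(bp_per_turn(36.0))  # canonical B-DNA: 10.0 bp/turn
print(bp_per_turn(32.7))  # fiber A-DNA: ~11.0 bp/turn
```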

I communicated with the authors about the availability of Curves+. Specifically, using one case (413D, one of the three structures in their Table 1, "Comparison of different results by CURVES and 3DNA programs"), I tried to illustrate that Curves+ and 3DNA now give directly comparable parameters.


Of course, I am glad to see 3DNA being applied to molecular dynamics simulations of nucleic acid structures.
Hopefully, more such applications will show up in the future, and I am willing to offer my help in ways that make sense to me.

Saturday, October 3, 2009

Whenever in doubt, check with the author

Once in a while, I send emails to authors of papers I am interested in, sometimes simply to ask for PDF reprints, but mostly to request clarification of points I cannot fully understand. Of course, the responses I receive vary significantly: some authors are responsive and able to answer my questions concretely; others respond less professionally; and in no small percentage of cases, I get no feedback at all. Whatever the case, though, sending querying emails is convenient, and the responses I get (even no response at all) are informative. Naturally, I take more seriously the papers whose authors are responsive. On the other hand, as far as I can remember, I have never ignored a reader's question about my own publications.

Seeking clarification about a piece of scientific software from its authors or maintainers is even more important, due to the inherent subtleties of (undocumented) details, as is common in (bio)informatics. In supporting 3DNA over the years, I've experienced quite a few cases where the authors of an article were misinformed in their judgment of 3DNA's functionality. In one case, I read a paper claiming that 3DNA cannot handle Hoogsteen base pairs while Curves can. A few email exchanges with the corresponding author (who was very responsive and professional) revealed that an internally modified version of Curves had been used. More recently, I found a paper claiming that find_pair from 3DNA failed to identify some base pairs in DNA-protein complexes where a new method succeeded. I asked for the list of missing pairs, and immediately noticed that simply relaxing some criteria recovered virtually all of them. Thus, to make a convincing comparison of scientific software, it is crucial to check with the original authors to avoid misunderstandings. Serious scientific software developers and maintainers always welcome users' feedback. Why not ask for clarification if one really wants to make a (strong) point in a comparison? Of course, it is another story for unsupported software.

The Internet age has provided unprecedented convenience for scientific communication. It would be a pity not to take full advantage of it. One simple and important thing to do is: whenever in doubt, ask for clarification from the corresponding author of a publication or the maintainer of a piece of software.

Sunday, September 27, 2009

On reproducibility of scientific publications

In the September 25, 2009 issue of Science (Vol. 325, pp. 1622-3), I read with interest the letter from Osterweil et al., "Forecast for Reproducible Data: Partly Cloudy", and the response from Nelson. This exchange of views highlights how difficult, and how important, it is for one research team to precisely reproduce the results of another when elaborate computation is involved. As is well known, subtle differences in computer hardware and software, different versions of the same software, or even different options of the same version could all play a role. Without specifying those details, it is virtually impossible to repeat a publication exactly.

This reminds me of a recent paper "Repeatability of published microarray gene expression analyses" by Ioannidis et al. [Nat Genet. 2009, 41(2):149-55]. In the abstract, the authors summarized their findings:
Here we evaluated the replication of data analyses in 18 articles on microarray-based gene expression profiling published in Nature Genetics in 2005-2006. One table or figure from each article was independently evaluated by two teams of analysts. We reproduced two analyses in principle and six partially or with some discrepancies; ten could not be reproduced. The main reason for failure to reproduce was data unavailability, and discrepancies were mostly due to incomplete data annotation or specification of data processing and analysis.

Specifically, please note that:
  1. The authors are experts on microarray analysis, not occasional application software users.
  2. These 18 articles surveyed were published in Nature Genetics, one of the top journals of its field.
  3. Not a single analysis could be reproduced exactly: two were reproduced in principle, six only partially, and the other ten not at all.
Without being able to reproduce exactly the results from others, it is hard to build upon previous work and move forward. The various reasons for lack of reproducibility listed by Ioannidis et al. are certainly not limited to microarray analysis. As far as I can tell, they also exist in the fields of RNA structure analysis and predictions, energetics of protein-DNA interactions, quantum mechanics calculations, and molecular dynamics simulations etc.

In my experience and understanding, the methods section of a journal article is not, and should not aim to be, detailed enough for exact duplication by a qualified reader. Instead, most such reproducibility issues would disappear if journals required authors to provide the raw data, the detailed procedures used to process the data, and the software versions and options used to generate the figures and tables reported in the publication. Such information could be made available on journal or author websites. This is an effective way to solve the problem, especially for computational, informatics-related articles. Over the years, for papers on which I am the first author or to which I have made major contributions, I've always kept a folder for each article containing every detail (data files, scripts, etc.) so that the published tables and figures can be reproduced precisely. This has turned out to be extremely helpful when I want to refer back to earlier publications, or when I am asked by readers for further details.

As noted by Osterweil et al., "repeatability, reproducibility, and transparency are the hallmarks of the scientific enterprise." To really achieve that goal, every scientist needs to pay more attention to details and be responsive. Do not be fooled by the impressive introduction or extensive discussion (which are important, of course) in a paper: to get to the bottom of something, it is usually the details that count.

Saturday, September 26, 2009

RiboClub 10th Annual Meeting -- it was a good one!

I attended the RiboClub 10th Annual Meeting, held September 21 to 23 at Hotel Cheribourg, Orford, Quebec. Every day, the schedule was full from early morning to late night. Indeed, it was so tight and intensive that I could not even find time to walk around and enjoy the beautiful season. Overall, though, the meeting was well-organized and had a fantastic scientific program.

The RiboClub was founded ten years ago by researchers at the University of Sherbrooke, Quebec. Initially it was local; the club then extended to nearby areas, and then to the whole of Canada. As time went on, its influence also crossed the border, so at its 10th anniversary several leading RNA experts from the USA (including Phillip Sharp, Jack Szostak, Tim Nilsen, Tom Steitz et al.) also participated.

I noticed the RiboClub meeting earlier this year when I was writing up a manuscript on an RNA structural motif. As hinted in another post, "Does 3DNA work for RNA?", I've recently been attracted to the field of RNA structures. Using 3DNA, we have uncovered a simple RNA-specific interaction that is biologically relevant, yet virtually ignored by the community. Attending an RNA meeting would allow me to learn more about RNA and to get our message across to a wider audience.

The meeting was mostly about various aspects of RNA biology. Three-dimensional structure is an integral part of that, yet no specific session was devoted to it. The same applied to bioinformatics tools and applications. Thus, for example, Tom Steitz's talk on ribosome structures was under the session titled "Translation: targets and impact".

The organizers obviously paid attention to the seating of meeting participants at the dining tables. On Tuesday (Sept. 22) night, I sat next to Dr. Andrew MacMillan from the University of Alberta. It was a nice surprise to learn that Dr. MacMillan works on "structural and functional characterization of spliceosome assembly and activation." I took the opportunity to read the abstracts from his lab and to talk to him about my findings related to pre-mRNA splicing. He visited my poster the next day (Wednesday, Sept. 23) and we discussed it in more detail. At the Gala dinner on Wednesday, I was seated between Dr. Paul Griffiths, "a philosopher of science with a focus on biology and psychology", and Dr. W. Ford Doolittle, a leading scientist in comparative genomics. It was a valuable experience to hear them and others around the table talk about politics and science-related issues.

It is worth noting that Wednesday's dinner speaker was Alexander Rich. Wearing his tie from the famous RNA Tie Club, Dr. Rich talked about "The era of RNA awakening: structural biology of RNA in the early years." At the age of 85, he still spoke clearly and logically. His storytelling style was very effective, and his talk was well received by the audience. For those interested in knowing more about Dr. Rich's work, I would strongly recommend his article "The excitement of discovery."

The "Neo-Traditional Quebec Music" show on Wednesday night was exciting and relaxing, a welcome contrast after the three-day intensive scientific program. Although I did not understand the music that well, I stayed until the very end.

Saturday, September 5, 2009

Double helix groove width parameters from 3DNA

In the 3DNA output (from the analyze program) for a DNA/RNA duplex structure, there is a section on "Minor and major groove widths: direct P-P distances and refined P-P distances which take into account the directions of the sugar-phosphate backbones". The underlying algorithm is that of El Hassan and Calladine (1998), ``Two Distinct Modes of Protein-induced Bending in DNA'', J. Mol. Biol., v282, pp. 331-343. Note that 5.8 Å needs to be subtracted from the P-P distances to account for the van der Waals radii of the two phosphate groups (2.9 Å each), and to allow comparison with NewHelix/FreeHelix and Curves.

Using 3DNA fiber models #1 for A-DNA (calf thymus) and #4 for B-DNA (calf thymus), the groove widths are as follows:
                Minor Groove          Major Groove
              P-P     Refined       P-P     Refined
-----------------------------------------------------
A-DNA (#1)    18.5     16.7         15.2     11.1
B-DNA (#4)    11.7     11.7         17.2     17.2
-----------------------------------------------------
The table shows clearly that for A-DNA, the refined minor and major groove widths are smaller than their corresponding non-refined counterparts (i.e., those based on direct P-P distances). For B-DNA, there is no change between the two sets. It should be noted that in real structures (i.e., not perfectly regular ones, as in X-ray crystal structures in the NDB/PDB), there are nearly always some differences between the refined and direct P-P distances. As a general rule, the refined set should be used.

One of the key structural differences between A- and B-DNA is their opposite groove dimensions: for B-DNA, the major groove width (~17 Å) is about 5 Å wider than the minor groove width (~12 Å); whereas for A-DNA, the major groove width (~11 Å) is narrower than the minor groove width (~17 Å) by a similar amount. Since the grooves provide binding sites, the difference between A- and B-DNA grooves has important implications in DNA (groove) recognitions by ligands or proteins.

In retrospect, I implemented the El Hassan and Calladine algorithm for calculating groove widths mainly because of its simplicity: I can understand clearly how it works visually. The algorithm is described in a two-page appendix of the paper cited above. For those who are interested in DNA structures in general, and in how groove widths are defined in particular, I would strongly recommend reading the appendix carefully and trying to implement it: there is no substitute for first-hand experience. For idealized cases, such as the fiber A- and B-DNA above, the implementation should be straightforward. To be more realistic, an implementation should also account for missing phosphate groups in some structures (for testing purposes, simply delete one P atom from a structure).
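As a concrete starting point for such an implementation exercise, here is a minimal sketch of the direct (unrefined) P-P distance, with the 5.8 Å van der Waals correction mentioned above. The cross-strand phosphate pairing (the `offset` parameter) is left as an explicit assumption here, not the actual El Hassan and Calladine convention, which is spelled out in their appendix; the refined distances, which involve the backbone directions, are not attempted either.

```python
import numpy as np

def direct_groove_width(p1, p2, i, offset, vdw=5.8):
    """Direct P-P groove width: cross-strand phosphorus-phosphorus distance,
    minus 5.8 A for the van der Waals radii (2.9 A each) of the two
    phosphate groups.  p1 and p2 are (N, 3) arrays of P coordinates, one
    row per residue; the cross-strand pairing `offset` is an illustrative
    assumption, not the El Hassan-Calladine convention."""
    d = np.linalg.norm(p1[i] - p2[i + offset])
    return d - vdw

# Synthetic check: two P atoms 17.5 A apart give a width of 11.7 A,
# matching the fiber B-DNA minor-groove value quoted in this post.
p1 = np.array([[0.0, 0.0, 0.0]])
p2 = np.array([[17.5, 0.0, 0.0]])
print(round(direct_groove_width(p1, p2, 0, 0), 1))  # 11.7
```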

As is obvious, 3DNA does not calculate groove depths. Over the years, I have been approached with requests/suggestions to provide such parameters to complement the groove widths. However, for various reasons, none of the proposed algorithms fits well with 3DNA. As a general principle, I do not add new functionality to 3DNA simply for the sake of it. I must understand a new piece clearly in order to integrate it with the rest, and to be able to respond concretely to possible questions from users.

Some emacs tricks

As a Linux/Unix fan, I am very familiar with vi and use it for quick and simple text editing. Over the years, however, I have been using emacs the most: I like its color coding and its programming-language-specific editing modes.

Emacs is well-known for its extensibility, so there are many ways to customize it to suit one's taste. In my experience, I have found the following settings convenient:
  • Highlight the current line with a background color (here "greenyellow")
    (require 'highlight-current-line)
    (highlight-current-line-on t)
    (highlight-current-line-set-bg-color "greenyellow")
  • Set transient mark mode on, so that selected text becomes more obvious
    (setq transient-mark-mode t)
  • Show line number and column number
    (setq line-number-mode t)
    (setq column-number-mode t)
There are many emacs features that could come in handy, and I am always trying to learn more of them.

Sunday, August 30, 2009

JMB celebrates 50 years of protein structure determination

In the September 11, 2009 issue (vol. 392, issue 1) of JMB, there is a series of three articles on the determination of the first two protein structures (myoglobin and haemoglobin), an achievement accomplished by Perutz and Kendrew and their colleagues at the Cambridge MRC laboratory in the 1950s. Of special interest in this series are the three authors — Bror Strandberg, Richard Dickerson and Michael Rossmann — leading scientists in structural biology who were then postdocs actively involved in the late stages of the structure determination.

These reviews are vividly written and provide interesting background information and some technical details on X-ray crystal structure determination (especially on phase angle), and have the following titles:
  1. "Building the Ground for the First Two Protein Structures: Myoglobin and Haemoglobin" by Bror Strandberg
  2. "Myoglobin: A Whale of a Structure!" by Richard Dickerson
  3. "Recollection of the Events Leading to the Discovery of the Structure of Haemoglobin" by Michael Rossmann
With limited computing power and software support, protein structure determination was a difficult task at that time; Dickerson and Rossmann had to write their own programs to perform some of the calculations. Yet this firsthand experience in both experiment and code development, on a significant project in a famous lab, may in part account for their later success in structural biology.

I have known of Dickerson's work on nucleic acid structures for a while, first through the famous Drew-Dickerson dodecamer (CGCGAATTCGCG), and I am intimately familiar with his NewHelix/FreeHelix programs. Nevertheless, it was only after reading his article above that I became aware of his early protein experience. I like Dickerson's writing a lot. For example, commenting on the different styles of Kendrew and Perutz, he wrote: "John was the mentor, guide, and organizer. …… In contrast, Max was a hands-on bench biochemist whose center of gravity was always the laboratory itself. …… Both styles had their merits: one learned from John, but one learned with Max."

An interesting point from Rossmann's article is his description of a secret he kept to himself for many decades: "I [Rossmann] had been privileged to work on the haemoglobin project with Max, but it was also a project that Max had given his whole life to develop. In my enthusiasm to look at the results, I stole the final discovery from Max. …… With the realization of what I had done, all desire to explore further was completely gone." While I vaguely remembered this story from reading the book "Max Perutz and the Secret of Life" by Georgina Ferry several months ago, Rossmann's personal account makes it unforgettable.

It is worth noting that Perutz and Rossmann were among those few who initiated the PDB in 1971 at a Cold Spring Harbor meeting. Finally, given the expertise of the three authors, it is not surprising to read in the Epilogue that "Indeed, structural biology has become the unifying factor of just about every aspect of biology."

Saturday, August 22, 2009

Fit a least squares plane to a set of points

In the early days when I first got into the field of DNA structures, I was interested in knowing exactly how the "best mean (base-pair) plane" is defined. I could not find a solution in the articles I read at the time. For clarification, I emailed some authors whose papers mentioned the concept, but the answers were less than satisfactory.

The mystery was finally solved after I read the following two articles from early days (traced initially from a comment in the source code of Dickerson's NewHelix, if I recall it correctly):
  • D. M. Blow (1960). ``To Fit a Plane to a Set of Points by Least Squares.'' Acta Cryst., 13, 168.
  • V. Schomaker et al. (1959). ``To Fit a Plane or a Line to a Set of Points by Least Squares.'' Acta Cryst., 12, 600-604.
As is clear from the titles, they focus just on least-squares plane/line fitting algorithms; the worked examples, in particular, help make the underlying concept clear.

As a concrete example, let's work out the least-squares plane of an adenine using Octave/Matlab. The implementation here is much simpler than those in the literature cited above. Note: this example is mentioned in the Technical Details section of the 3DNA homepage, so the order of atoms is kept the same as there.

The x-, y-, and z-coordinates are as follows:
xyz = [
16.461 17.015 14.676 % 91 N9
15.775 18.188 14.459 % 100 C4
14.489 18.449 14.756 % 99 N3
14.171 19.699 14.406 % 98 C2
14.933 20.644 13.839 % 97 N1
16.223 20.352 13.555 % 95 C6
16.984 21.297 12.994 % 96 N6
16.683 19.056 13.875 % 94 C5
17.918 18.439 13.718 % 93 N7
17.734 17.239 14.207 % 92 C8
];

Cmtx = cov(xyz)
[V, LAMBDA] = eig(Cmtx)

The covariance matrix (Cmtx), its corresponding eigenvectors (V) and eigenvalues (LAMBDA) are as follows:
Cmtx =
1.66800 -0.50154 -0.32530
-0.50154 2.06703 -0.58401
-0.32530 -0.58401 0.30605

V =
0.27367 0.83264 -0.48147
0.32243 0.39220 0.86152
0.90617 -0.39102 -0.16113

LAMBDA =
0.00001 0.00000 0.00000
0.00000 1.58452 0.00000
0.00000 0.00000 2.45656
The smallest eigenvalue of Cmtx, 0.00001, is close to zero, indicating that the base is almost perfectly planar. The corresponding unit eigenvector, i.e., the least-squares base normal, is [0.27367 0.32243 0.90617]. Of course, the geometric center of the atoms defines a point through which the least-squares plane passes.
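For those more comfortable with Python, the same calculation can be cross-checked with numpy; this is a direct translation of the Octave snippet above (np.cov with rowvar=False matches Octave's cov, and np.linalg.eigh returns eigenvalues in ascending order, so the first eigenvector is the least-squares normal, up to an arbitrary sign):

```python
import numpy as np

# x-, y-, z-coordinates of the ten adenine base atoms, same order as above
xyz = np.array([
    [16.461, 17.015, 14.676],  # N9
    [15.775, 18.188, 14.459],  # C4
    [14.489, 18.449, 14.756],  # N3
    [14.171, 19.699, 14.406],  # C2
    [14.933, 20.644, 13.839],  # N1
    [16.223, 20.352, 13.555],  # C6
    [16.984, 21.297, 12.994],  # N6
    [16.683, 19.056, 13.875],  # C5
    [17.918, 18.439, 13.718],  # N7
    [17.734, 17.239, 14.207],  # C8
])

C = np.cov(xyz, rowvar=False)      # 3x3 covariance matrix (columns = x, y, z)
evals, evecs = np.linalg.eigh(C)   # eigenvalues in ascending order
normal = evecs[:, 0]               # eigenvector of the smallest eigenvalue
centroid = xyz.mean(axis=0)        # a point the least-squares plane passes through
```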

Referring back to the concept of the "best mean (base-pair) plane", it could simply be defined using all base atoms of the pair, or alternatively via the mean of the two base normals. Obviously, the two definitions lead to (slightly) different numerical results. Astute readers will also notice some other subtleties.

Sunday, August 16, 2009

Curves+ vs 3DNA

While browsing Nucleic Acids Research recently, I noticed the paper titled "Conformational analysis of nucleic acids revisited: Curves+" by Dr. Lavery et al. [advance access published online on July 22, 2009]. I read it through carefully during the weekend and played around with the software. Overall, I was fairly impressed, and also happy to see that "It [Curves+] adopts the generally accepted reference frame for nucleic acid bases and no longer shows any significant difference with analysis programs such as 3DNA for intra- or inter-base pair parameters."

Anyone who has ever worked on nucleic acid structures (especially DNA) should be familiar with Curves, an analysis program that has been widely used over the past twenty years. Only in recent years has 3DNA become popular. By and large, though, it is my opinion that 3DNA and Curves are constructive competitors in nucleic acid structure analysis with complementary functionality. As I put it six years ago, before the 13th Conversation at Albany: "Curves has special features that 3DNA does not want to repeat/compete (e.g. global parameters, groove dimension parameters). Nevertheless, we provide an option in a 3DNA utility program (find_pair) to generate input to Curves directly from a PDB data file" on June 6, 2003, and emphasized again on June 09, 2003: "We also see Curves unique in defining global parameters, bending analysis and groove dimensions." 3DNA's real strength, as demonstrated in our 2008 Nature Protocols paper, lies in its integrated approach that combines nucleic acid structure analysis, rebuilding, and visualization into a single software package (see image to the right and above).

Now the nucleic acid structure community is blessed with the new Curves+, which "is algorithmically simpler and computationally much faster than the earlier Curves approach", yet still provides its 'hallmark' curvilinear axis and "a full analysis of groove widths and depth". I especially liked the INTRODUCTION section, which provides a nice summary of relevant background information on nucleic acid conformational analysis. An important feature of Curves+ is its integrated analysis of molecular dynamics trajectories. In contrast, 3DNA lacks direct support in this area (even though I know of such applications from questions posted on the 3DNA forum), mostly because I am not an 'energetic' person. Of special note is a policy-related advantage Curves+ has over 3DNA: Curves+ is distributed freely, with source code available. On the other hand, due to Rutgers' license constraints and various other (undocumented) reasons, 3DNA users still have difficulty accessing the 3DNA v2.0 that I compiled several months ago!

It is worth noting that the major differences in slide (+0.47 Å) and X-displacement (+0.77 Å) in Curves+ vs the old Curves (~0.5 Å and ~0.8 Å, respectively) are nearly exactly those uncovered a decade ago in "Resolving the discrepancies among nucleic acid conformational analyses" [Lu and Olson, J Mol Biol. 1999 Jan 29; 285(4):1563-75]:
Except for Curves, which defines the local frame in terms of the canonical B-DNA fiber structure (Leslie et al., 1980), the base origins are roughly coincident in the different schemes, but are significantly displaced (~0.8 Å along the positive x-axis) from the Curves reference. As illustrated below, this offset gives rise to systematic discrepancies of ~0.5 Å in slide and ~0.8 Å in global x-displacement in Curves compared with other programs, and also contributes to differences in rise at kinked steps. (p. 1566)

Please note that Curves+ introduces new namelist variables — most notably, lib= — and other subtle format changes, rendering the find_pair-generated input files (with option '-c') no longer valid. However, it is easy to manually edit an input file to make it work with Curves+, since the most significant part — specifying the paired nucleotides — does not change. Given time, and upon user request, I would consider writing a new script to automate the process.

Overall, it is to the user community's advantage to have a choice between 3DNA and Curves+, and I am more than willing to build a bridge between the two programs to make users' lives easier.

Saturday, August 15, 2009

Eddy's primer on the BLOSUM62 alignment score matrix

As a Linux/Unix fan, I very much like its philosophy, as summarized by Doug McIlroy: "Write programs that do one thing and do it well. Write programs to work together." In science, likewise, I most enjoy reading an article that focuses on one point and addresses it clearly and thoroughly. Only after completely understanding the components is it possible to combine them in unique, purpose-specific ways. These days, though, such simple-and-clean articles are no longer as common as they were in the early days, say the 1960s or 70s. This post is the first of a series on such articles I've found useful, or on computational tricks that I have learned over the years.


I came across the primer titled "Where did the BLOSUM62 alignment score matrix come from?" by Sean Eddy [Nat Biotechnol. 2004 Aug;22(8):1035-6] earlier this month through the BioConductor mailing list, where it was recommended by Dr. Philipp Pagel as a "well written article". I read the title and the short abstract on PubMed, and then downloaded the PDF of the full article.

As a primer, it is only two pages long, and reading it twice won't take much time. It clearly explains what the BLOSUM62 amino acid score matrix means and where it comes from. The number 62, for example, stands for a clustering threshold of 62% sequence identity. Other percentages, such as 80% (more conservative) or 45% (more divergent), are also possible and may be more suitable for specific applications. As noted by Eddy, "Empirically, the BLOSUM matrices have performed very well. BLOSUM62 has become a de facto standard for many protein alignment programs."

With the above background, it is much easier to understand the fundamental difference between the default scoring systems of FASTA/WU-BLASTN and NCBI BLASTN for DNA sequence alignment: the former is optimal for alignments in the 'twilight zone' (65% identity), while the latter (NCBI BLASTN) is optimal for homologous DNA at a much higher 95% identity level. Knowing such subtleties is very important for avoiding false conclusions.

More significantly (to me), there is a supplemental file -- a well-documented, self-contained "ANSI C program for calculating the implicit target frequencies pab of a score matrix" -- which clarifies every detail for those who want to get to the bottom of the topic.
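At the heart of the primer is the log-odds formula behind every matrix entry: s(a,b) = (1/lambda) ln[p_ab/(f_a f_b)], where p_ab is the target frequency of the aligned pair a:b and f_a, f_b are the background frequencies of the two residues. A minimal Python sketch of that forward direction (the frequencies below are made-up illustrative numbers, not actual BLOSUM62 values):

```python
import math

def log_odds_score(p_ab, f_a, f_b, lam=math.log(2) / 2):
    """Score for an aligned residue pair: (1/lambda) * ln(p_ab / (f_a * f_b)).
    With lambda = ln(2)/2 the score comes out in half-bit units, the scale
    used by the BLOSUM matrices (entries are then rounded to integers)."""
    return math.log(p_ab / (f_a * f_b)) / lam

# Hypothetical pair: background frequency 5% each, observed aligned together
# twice as often as chance expects (0.005 vs 0.05 * 0.05 = 0.0025)
score = log_odds_score(0.005, 0.05, 0.05)  # 2 * log2(2) = 2 half-bits
```

Eddy's supplemental C program solves the inverse problem -- recovering the implicit p_ab from a published integer score matrix -- which is the harder and more illuminating direction.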

Sunday, August 9, 2009

On third-party interfaces to 3DNA

One clear sign of 3DNA's acceptance by the nucleic-acid-structure community is its integration into a wide range of GUI/web interfaces (e.g., NDB, w3DNA, 3D-DART) and many other packages. As the author and maintainer of the 3DNA software package, I am (of course) very happy to see this happen: better accessibility opens 3DNA to an even larger community, including non-experts in nucleic acid structures (e.g., for educational purposes). It should be emphasized, however, that point-and-click interfaces to 3DNA (and to any other serious scientific software, for that matter), while convenient for occasional users and routine jobs, have their limitations.

Overall, a point-and-click interface is only sensible for well-defined, routine tasks. In the context of 3DNA, the following tasks (the list is not necessarily complete) are perfectly suitable:
  • Generate one of the 55 fiber models, where everything (model number, sequence or number of repeats) can be unambiguously defined through a web-form
  • Build an arbitrary DNA model from a user-specified parameter set
  • Create a blocview image for a nucleic-acid containing structure specified in PDB format
  • Analyze a regular (i.e., not deviating too much) double helix structure, be it in A-, B-, or Z-form
  • Calculate a list of backbone torsion angles and other parameters (with the "-s" option of find_pair and then analyze)
  • Find a list of all possible (RNA) base-pairs fulfilling a specific set of geometric criteria (with the "-p" option of find_pair).
Other than those (and other carefully selected/qualified functionality), users should be very careful about what they get from running a third-party tool that outputs 3DNA-related parameters. I know of so many subtleties in using 3DNA to solve a specific problem, and most of the time the command-line-driven style is the most effective, as it allows for easy automation of trial-and-error iterations. Software developers who integrate 3DNA into their packages should make these limitations clear to their users, and direct any 3DNA-specific problems to the 3DNA forum.

In today's informatics world, there are many "easy-to-use" tools claiming to solve all sorts of problems (understandably -- otherwise, how could one get published, especially in the big journals/magazines?). Any serious scientist, however, should know what he/she is doing: it is easy to get some (magic) numbers by clicking a button, but understanding clearly what those numbers mean is another story. I cannot overemphasize the importance of knowing one's tools, including their limitations.

Saturday, August 8, 2009

Prefatory articles in Annual Review of Biochemistry make good reading

Over time, I have read several (mostly structural-biology-related) prefatory articles published in the Annual Review of Biochemistry and found them quite interesting. Some examples:
  • In Vol. 78 (July 2009), James Wang's article, titled "A Journey in the World of DNA Rings and Beyond", told a story behind the discovery of DNA topoisomerase, an enzyme that converts one form of DNA ring to another.
  • In Vol. 73 (July 2004), Alexander Rich wrote about "The Excitement of Discovery" as a scientist. Rich's research has often been "on the question how molecular structure leads to biological function". I am especially impressed by his vivid description of the work with Crick on the structure of collagen (p.10-12) and its "strong positive effect" on his psyche:
    For one thing, I began to develop some self-assurance in my ability to carry out research and make discoveries. I believe that a form of “scientific maturation” is an important component in developing a confident thrust into research work.
  • In Vol. 40 (July 1971), John Edsall provided interesting insights about his editorial work in his article titled "Some Personal History and Reflections from the Life of a Biochemist".
Overall, such articles are well worth reading -- they provide background information and context for the discoveries made by these leading scientists.