Friday, October 23, 2009

Sharing published data?

A letter in the recent issue of Science [VOL 326 23 OCTOBER 2009] titled "The Antidote to Bias in Research" by Allison mentioned the Nature Genetics article on the widespread non-repeatability of published microarray gene expression studies. It also touched on another recent study titled "Empirical Study of Data Sharing by Authors Publishing in PLoS Journals" by Savage and Vickers [PLoS ONE 4(9): e7078]:
In another study, refusal to share data despite policies requiring sharing was nearly ubiquitous among authors publishing in Public Library of Science journals.
I then read the Savage and Vickers article, and was surprised to find that only one author out of 10 sent them an original data set. Therefore, the authors wrote:
In conclusion, our findings suggest that explicit journal policies requiring data sharing do not lead to authors making their data sets available to independent investigators.
I can understand why the repeatability rate reported in the Nature Genetics paper is so low, given the many details associated with a published table or figure. However, it is a bit hard to explain why it is so difficult to share just the original dataset associated with a published work. I would imagine that authors would be honored that others care about their publications and contact them directly.

In my experience, my own requests for clarification of details in published work, or for datasets, have also frequently been simply ignored. The highest success rate comes from asking corresponding authors for PDF reprints.

If the basic principles of scientific publication were strictly enforced, even just by the big journals (so many, nowadays), a lot of unfounded big claims would disappear or could be easily seen through. Well, I know that's only possible in an ideal world.

How to quantify the relative geometry between two base-pairs? — Part 2

In Part 1 of this series, and indeed on other occasions in my blog and on the 3DNA home page and forum, I have acknowledged several related programs from which 3DNA has benefited. Before going into details on how the various parameters are calculated in 3DNA, however, I feel it is in order to make clear my philosophy on creating scientific software in general, and 3DNA in particular.

Some underlying considerations

  • Whenever possible/applicable, I always try to get a thorough understanding of the related software available. By thorough I mean at the source code level, to see how an algorithm is implemented. The benefits of doing this are severalfold:
    1. Not to reinvent the wheel, but to build upon previous work.
    2. It is the most effective way to learn: many math formulae or text descriptions of algorithms, while useful for getting general ideas, lack detail or are vague. In bioinformatics, most of the fundamental mathematics is not new. It is more about a novel combination of various known parts, applied to a specific problem. Reading the source code is the only way to see the implementation details unambiguously.
    3. With a clear understanding of how the parameters are actually calculated, one can make better use of a software tool, know its limitations and thus avoid misinterpretations.
    4. To make objective, convincing (even to the original authors) comparisons/comments on existing tools, and importantly, to create something better if needed.
  • Specifically, 3DNA has benefited the most from my SCHNAaP/SCHNArP programs. The underlying algorithms in 3DNA for calculating the various structural parameters (propeller, buckle, shear, roll, slide, x-displacement, inclination, etc.) are exactly the same as those used there: analyze and rebuild are direct derivatives of SCHNAaP (a for analysis) and SCHNArP (r for rebuilding/reconstruction), respectively. The ideas of standard stacking diagrams, the Zp parameter, building atomic models with a sugar-phosphate backbone, and the base/bp rectangular block representations are also from SCHNAaP/SCHNArP. I also borrowed ideas from Babcock's RNA, Bansal's NUPARM, Dickerson's NewHelix/FreeHelix, etc., which will be mentioned explicitly in the following sections.
  • Code-wise, 3DNA was created from scratch in strict ANSI C (over 25K lines). In addition to implementing a unique combination of existing algorithms, I have added many significant novel methods, including the find_pair program. However, being feature-rich has never been my goal. Instead, I only consider adding new functions that I understand clearly and find useful. On the other hand, I am always quick to fix bugs.
Overall, it seems that I am an outlier rather than the norm in scientific programming. The reasons could be as follows:
  1. It is hard to understand other people's implementations of algorithms, since one needs to read between the lines. This is especially the case with an unfamiliar computer language, or with undocumented or bad code. It is easier to create one's own program than to understand and modify third-party software.
  2. For publication purposes, it is more attractive to work on something "new" instead of building on others' work.
Thus, in the bioinformatics field, there are far more "new" methods papers published than refinements of previous ones. However, what is claimed or appears to be novel may not be so under the hood, due to the different terminologies used in different fields. Moreover, while it is easy to create a self-claimed "new" program, others could simply do the same. In my experience, it is well worth the effort to really understand the leading programs if one is serious about getting into an informatics field: it was really revealing and exciting when I went through CEHS, NewHelix/FreeHelix, RNA, Curves, etc. Then I knew clearly what "the state of the art" of the field was and how I could do better: that's exactly how 3DNA came about!

In the following sections, I will provide details on how 3DNA calculates the two sets of parameters to quantify the relative geometry of a dinucleotide step.

Sunday, October 18, 2009

Duplicate tab function in Adobe READER 9.1 is very handy

In reading a scientific paper, one often needs to jump back and forth to the tables, figures, and references being referred to. These days, PDF is the standard way to share an e-document, and Acroread is the norm for viewing its content. I used to open two copies of the same document in two instances of Acroread (or one Acroread, one xpdf) so I could read continuously in one while moving around in the other, mostly at the end for references. This works, but is obviously less than ideal.

One day, purely by chance, I (mouse) right-clicked the tab for the document I was viewing, and noticed that it popped up with two options: Detach Tab and Duplicate Tab. Clicking Duplicate Tab led to duplication of the same document in another tab. Actually, this process can be repeated more than once, thus allowing for multiple views of the same document simultaneously. Very neat!

Nowadays, whenever I read a scientific publication in Acroread, I often duplicate the tab to have two views. It is only a click away to switch between the two. Thus, when a citation is referred to in the main text, I can immediately see in the reference section what it is about.

A word of caution: I am using Ubuntu 9.04, with Acroread v9.1.2 05/25/2009. I have no idea in which version and on which platforms this function was added.

How to quantify the relative geometry between two base-pairs? — Part 1

In the field of double helical DNA (and RNA) structures, the following two sets of parameters are normally used to quantify the relative geometry of the two base pairs in a dinucleotide step:
  • shift, slide, rise, tilt, roll, and twist (which I call stacking or simple step parameters)
  • x-displacement, y-displacement, helical rise, inclination, tip, and helical twist (which I call helical parameters)
3DNA calculates all of these parameters. Over the years, I have been approached quite a few times with questions on how to interpret the local helical parameters. In the literature, I have repeatedly noticed the confusion the community still has over the two sets of parameters, especially with regard to "twist vs. helical twist" and "rise vs. helical rise". This series of blog posts aims to clarify the problem by providing some background information and step-by-step worked examples illustrating how the parameters are actually calculated in 3DNA (which follows SCHNAaP/SCHNArP). In my experience, knowing how the parameters are derived is the key to understanding what they mean, and thus to avoiding misinterpretations.
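
As a rough illustration of why "twist" and "helical twist" (and "rise" and "helical rise") are different numbers for the same step, here is a minimal Python sketch. It is not the 3DNA code. It assumes a CEHS-style composition of the step rotation, Rz(twist/2 - phi) Ry(Gamma) Rz(twist/2 + phi), with Gamma the combined roll-tilt angle and phi its phase, and it takes the local helical axis as the cross product of the changes in the x- and y-axes (the dx-times-dy construction mentioned later in this post). One way to see why that cross product gives the rotation axis: a rotation displaces every vector in a direction perpendicular to its axis, so dx and dy both lie in the plane normal to the axis. The function name step_to_helical and the helper names are hypothetical, chosen only for this example.

```python
import math

def rot_z(a):
    c, s = math.cos(a), math.sin(a)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def rot_y(a):
    c, s = math.cos(a), math.sin(a)
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def matvec(m, v):
    return [sum(m[i][k] * v[k] for k in range(3)) for i in range(3)]

def cross(u, v):
    return [u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0]]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def step_to_helical(shift, slide, rise, tilt, roll, twist):
    """Hypothetical helper: angles in degrees, distances in any length unit.

    Base pair 1 is taken as the identity frame at the origin.
    Assumes a nonzero rotation whose axis is not perpendicular to z.
    Returns (helical axis, helical twist in degrees, helical rise).
    """
    t, r, w = (math.radians(x) for x in (tilt, roll, twist))
    gamma = math.hypot(t, r)   # combined roll-tilt angle
    phi = math.atan2(t, r)     # roll = gamma*cos(phi), tilt = gamma*sin(phi)

    # Assumed CEHS-style composition for base pair 2 and the middle frame.
    rot2 = matmul(matmul(rot_z(w / 2 - phi), rot_y(gamma)), rot_z(w / 2 + phi))
    mid = matmul(matmul(rot_z(w / 2 - phi), rot_y(gamma / 2)), rot_z(phi))

    # Shift, slide, rise are components of the displacement in the middle frame.
    disp = matvec(mid, [shift, slide, rise])

    # Helical axis from the changes in the x- and y-axes (dx times dy).
    dx = [rot2[i][0] - (1.0 if i == 0 else 0.0) for i in range(3)]
    dy = [rot2[i][1] - (1.0 if i == 1 else 0.0) for i in range(3)]
    h = cross(dx, dy)
    norm = math.sqrt(dot(h, h))
    h = [c / norm for c in h]

    # Unsigned rotation angle about the axis; adequate for ordinary
    # right-handed steps.
    trace = rot2[0][0] + rot2[1][1] + rot2[2][2]
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    helical_twist = math.degrees(math.acos(cos_angle))

    # Helical rise is the displacement projected onto the helical axis.
    helical_rise = dot(disp, h)
    return h, helical_twist, helical_rise

if __name__ == "__main__":
    for roll in (0.0, 12.0):
        axis, h_twist, h_rise = step_to_helical(0.0, 0.0, 3.4, 0.0, roll, 31.0)
        print(f"roll={roll:5.1f}  helical twist={h_twist:.2f}  "
              f"helical rise={h_rise:.2f}")
```

With zero roll and tilt the helical values coincide with twist and rise. Under the assumptions above, once roll (or tilt) is nonzero the helical axis tips away from the base-pair normal, so the helical twist comes out larger than the twist and, for zero shift and slide, the helical rise comes out smaller than the rise.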

Part 1 — background information

  1. The Calladine and El-Hassan Scheme (CEHS) calculates only the six step parameters, i.e., shift, slide, rise, tilt, roll, and twist. The original CEHS implementation was a few FORTRAN subroutines taking advantage of the code base of Dickerson's well-known NewHelix program. In developing SCHNAaP/SCHNArP, a collaborative project between the then Sheffield and Cambridge groups, I also went through the source code of NewHelix to get a thorough knowledge of its internals in order to integrate its nice features into SCHNAaP.

    One thing I noticed was x-displacement, a parameter characterizing the "hole" of A-DNA in top view. The global helical axis from NewHelix was (and still is) appealing to me, especially for a short, relatively straight fragment. Even for a clearly non-straight duplex, e.g., in superhelical nucleosome core particle DNA, or a severely kinked DNA, the deviation from a regular linear helix serves as a good parameter to quantify its overall non-linearity. As a side note, a single so-called bending angle is often misleading: for example, the bending angle as commonly cited in the literature (mostly calculated from Curves) is strongly influenced by the two terminal base-pairs.

    So I devised a parallel set of global helical parameters following the CEHS scheme of angle combination: here instead of roll-tilt, I used tip-inclination. The definition of global helical axis and the point the helix passes through are based on my simplified implementation in ANSI C of the algorithms used by NewHelix.

    Thus SCHNAaP calculates both a set of local CEHS step parameters, and a set of global helical parameters — a unique combination of CEHS and NewHelix, made possible only after a thorough understanding of both methods. SCHNArP was developed to complete the circle, i.e., to rebuild a structure given a set of parameters, either the local step parameters: shift, slide, rise, tilt, roll, and twist, or the global helical parameters: x-displacement, y-displacement, helical rise, inclination, tip, and helical twist.

  2. The RNA (Running Nucleic Acids) program by Babcock et al. calculates a set of local helical parameters. While working at Rutgers in Dr. Olson's group, I was interested in knowing how the local helical axis and the point it passes through were defined in the RNA program, since it would naturally substitute for the global one in SCHNAaP/SCHNArP to make two sets of purely local parameters. As it turned out, however, the algorithm was finally implemented in the 3DNA software package, as summarized in the email I sent out before the 13th Conversation at Albany:
    The way to calculate the local helical axis and helical parameters in 3DNA is *essentially the same* as in RNA, but expressed in a much simpler way. First, the local helical axis is calculated as dx-times-dy, following Bansal, which gives the *same* result as the "single rotation axis" detailed in the RNA paper. I still could not figure out WHY, but verified this numerically. The location where the helix passes through is based directly on the RNA paper (i.e., following Chasles's theorem). The procedure to calculate helical parameters, i.e., tip/inclination/x-disp/y-disp/etc, following the SCHNAaP paper (i.e, with tip-inclination combination, which is consistent with the roll-tilt, propeller-buckle combinations in 3DNA). What amazed us is that this much simpler tip-inclination implementation in 3DNA gives *exactly* the same numerical values as the original RNA algorithm.
  3. In summary, the two sets of local parameters as implemented in 3DNA are based on a unique combination of nice features from several programs well-known in the community, including SCHNAaP/SCHNArP (CEHS), NewHelix, RNA, and NUPARM. However, the 3DNA calculated parameters are numerically different from any of them, partially because 3DNA adopts the standard base reference frame. To the best of my knowledge, 3DNA is the only software that allows for rigorous conversion between the two sets of local parameters — a utility program in 3DNA called step_hel illustrates this simple fact. Again, as I wrote before the 13th Conversation at Albany:
    Take each base-pair as a rigid block, only 6 parameters are required to relate one bp to the other. Two sets are commonly used: one set is "shift, slide, rise, tilt, roll and twist", and the other set is "x-displacement, y-displacement, helical rise, inclination, tip, and helical twist". Obviously, these two sets of parameters should be directly convertible, as demonstrated by Calladine and Drew in their 1984 JMB B-to-A transition paper, and illustrated in 3DNA manuscript.
  4. Links to other programs referred to in this section: