Friday, October 23, 2009

Sharing published data?

A letter in a recent issue of Science [Vol. 326, 23 October 2009] titled "The Antidote to Bias in Research" by Allison mentioned the Nature Genetics article on the widespread non-repeatability of published microarray gene expression studies. It also touched on another recent study, "Empirical Study of Data Sharing by Authors Publishing in PLoS Journals" by Savage and Vickers [PLoS ONE 4(9): e7078]:
In another study, refusal to share data despite policies requiring sharing was nearly ubiquitous among authors publishing in Public Library of Science journals.
I then read the Savage and Vickers article, and was surprised to find that only one author out of ten sent them an original data set. As the authors wrote:
In conclusion, our findings suggest that explicit journal policies requiring data sharing do not lead to authors making their data sets available to independent investigators.
I can understand why the repeatability rate reported in the Nature Genetics paper is so low, given the many details associated with a published table or figure. However, it is harder to explain why it is so difficult to share just the original dataset associated with a published work. I would imagine that authors would be honored that others care enough about their publications to contact them directly.

In my experience, my requests for clarification of details in published work, or for datasets, have also frequently been simply ignored. The highest success rate is in asking corresponding authors for PDF reprints.

If the basic principles of scientific publication were strictly enforced, even just by the big journals (of which there are so many nowadays), a lot of unfounded big claims would disappear or be easily seen through. Well, I know that's only possible in an ideal world.

How to quantify the relative geometry between two base-pairs? — Part 2

In part I of this series, and indeed on other occasions in my blog, the 3DNA home page and the forum, I have acknowledged several related programs from which 3DNA has benefited. Before going into details on how the various parameters are calculated in 3DNA, however, I feel it is in order to make clear my philosophy on creating scientific software in general, and 3DNA in particular.

Some underlying considerations

  • Whenever possible/applicable, I always try to get a thorough understanding of related software available. By thorough I mean at the source-code level, to see how an algorithm is implemented. The benefits of doing this are several:
    1. Not to reinvent the wheel, but to build upon previous work.
    2. It is the most effective way to learn: many math formulae or text descriptions of algorithms, while useful for getting general ideas, lack detail or are vague in nature. In bioinformatics, most of the fundamental mathematics is not new; it is more about a novel combination of various known parts, applied to a specific problem. Reading the source code is the only way to see the implementation details unambiguously.
    3. With a clear understanding of how the parameters are actually calculated, one can make better use of a software tool, know its limitations, and thus avoid misinterpretations.
    4. It enables objective comparisons of (and comments on) existing tools that are convincing, even to the original authors, and, importantly, the creation of something better if needed.
  • Specifically, 3DNA has benefited most from my SCHNAaP/SCHNArP programs: the underlying algorithms in 3DNA for calculating the various structural parameters (propeller, buckle, shear, roll, slide, x-displacement, inclination, etc.) are exactly the same as those there; analyze and rebuild are direct derivatives of SCHNAaP (a for analysis) and SCHNArP (r for rebuilding/reconstruction), respectively; and the ideas of standard stacking diagrams, the Zp parameter, building atomic models with a sugar-phosphate backbone, and the base/bp rectangular-block representations also come from SCHNAaP/SCHNArP. I also borrowed ideas from Babcock's RNA, Bansal's NUPARM, and Dickerson's NewHelix/FreeHelix, etc., which will be mentioned explicitly in the following posts.
  • Code-wise, 3DNA was created from scratch in strict ANSI C (over 25K lines). In addition to implementing a unique combination of existing algorithms, I have added many significant novel methods, including the find_pair program. However, being feature-rich has never been my goal. Instead, I only consider adding new functions that I understand clearly and find useful. On the other hand, I am always quick to fix bugs.
Overall, it seems that I am an outlier rather than the norm in scientific programming. The reasons could be as follows:
  1. It is hard to understand other people's implementations of algorithms, since doing so requires reading between the lines. This is especially the case with an unfamiliar computer language, or with undocumented or bad code. It is easier to create one's own program than to understand and modify third-party software.
  2. For publication purposes, it is more attractive to work on something "new" instead of building on others' work.
Thus, in the bioinformatics field, there are far more "new" methods papers published than refinements of previous ones. However, what is claimed, or appears, to be novel may not be the case under the hood, due to the different terminologies used in different fields. Moreover, while it is easy to create a self-claimed "new" program, others can simply do the same. In my experience, it is well worth the effort to really understand the leading programs if one is serious about getting into an informatics field: it was truly revealing and exciting when I went through CEHS, NewHelix/FreeHelix, RNA, Curves, etc. Only then did I know clearly what the state of the art of the field was and how I could do better: that's exactly how 3DNA came about!

In the following posts, I will provide details on how 3DNA calculates the two sets of parameters that quantify the relative geometry of a dinucleotide step.

Sunday, October 18, 2009

Duplicate tab function in Adobe READER 9.1 is very handy

When reading a scientific paper, one often needs to jump back and forth to referenced tables, figures, references, etc. These days, PDF is the standard way to share an e-document, and Acroread is the norm for viewing its content. I used to open two copies of the same document in two instances of Acroread (or one in Acroread, one in xpdf) so I could read continuously in one while moving around in the other, mostly at the end for references. This works, but it is obviously less than ideal.

One day, purely by chance, I right-clicked the tab of the document I was viewing, and noticed a popup with two options: Detach Tab and Duplicate Tab. Clicking Duplicate Tab duplicated the document in another tab. This process can be repeated more than once, allowing for multiple simultaneous views of the same document. Very neat!

Nowadays, whenever I read a scientific publication in Acroread, I often duplicate the tab to have two views. Switching between the two is only a click away. Thus, when a citation is referred to in the main text, I can immediately see in the reference section what it is about.

A word of caution: I am using Ubuntu 9.04, with Acroread v9.1.2 (05/25/2009). I have no idea in which version and on what platforms this function was added.

How to quantify the relative geometry between two base-pairs? — Part 1

In the field of double-helical DNA (and RNA) structures, the following two sets of parameters are normally used to quantify the relative geometry of the two base pairs in a dinucleotide step:
  • shift, slide, rise, tilt, roll, and twist — which I call stacking or simple step parameters
  • x-displacement, y-displacement, helical rise, inclination, tip, and helical twist — which I call helical parameters
3DNA calculates all of these parameters. Over the years, I have been approached quite a few times with questions on interpreting the local helical parameters. In the literature, I've repeatedly noticed the confusion the community still has over the two sets of parameters, especially regarding "twist vs. helical twist" and "rise vs. helical rise". This series of blog posts aims to clarify the problem by providing some background information and step-by-step worked examples illustrating how the parameters are actually calculated in 3DNA (which follows SCHNAaP/SCHNArP). In my experience, knowing how the parameters are derived is the key to understanding what they mean, and thus to avoiding misinterpretations.

Part 1 — background information

  1. The Calladine and El-Hassan scheme (CEHS) calculates only the six step parameters, i.e., shift, slide, rise, tilt, roll, and twist. The original CEHS implementation was a few FORTRAN subroutines taking advantage of the code base of Dickerson's well-known NewHelix program. In developing SCHNAaP/SCHNArP, a collaborative project between the then Sheffield and Cambridge groups, I also went through the source code of NewHelix to gain a thorough knowledge of its internals, in order to integrate its nice features into SCHNAaP.

    One thing I noticed was x-displacement, a parameter characterizing the "hole" of A-DNA in top view. The global helical axis from NewHelix was (and still is) appealing to me, especially for a short, relatively straight fragment. Even for a clearly non-straight duplex, e.g., the superhelical DNA of the nucleosome core particle, or a severely kinked DNA, the deviation from a regular linear helix serves as a good parameter to quantify its overall non-linearity. As a side note, a single so-called bending angle is often misleading: for example, the bending angle as commonly cited in the literature (mostly calculated with Curves) is strongly influenced by the two terminal base pairs.

    So I devised a parallel set of global helical parameters following the CEHS scheme of angle combination: instead of roll-tilt, I used tip-inclination. The definitions of the global helical axis and the point the helix passes through are based on my simplified ANSI C implementation of the algorithms used by NewHelix.

    Thus SCHNAaP calculates both a set of local CEHS step parameters and a set of global helical parameters, a unique combination of CEHS and NewHelix made possible only by a thorough understanding of both methods. SCHNArP was developed to complete the circle, i.e., to rebuild a structure given a set of parameters, either the local step parameters (shift, slide, rise, tilt, roll, and twist) or the global helical parameters (x-displacement, y-displacement, helical rise, inclination, tip, and helical twist).

  2. The RNA (Running Nucleic Acids) program by Babcock et al. calculates a set of local helical parameters. While working in Dr. Olson's group at Rutgers, I was interested in knowing how the local helical axis and the point it passes through were defined in the RNA program, since a local axis would naturally substitute for the global one in SCHNAaP/SCHNArP, giving two sets of purely local parameters. The algorithm was eventually implemented in the 3DNA software package, as summarized in the email I sent out before the 13th Conversation at Albany:
    The way to calculate the local helical axis and helical parameters in 3DNA is *essentially the same* as in RNA, but expressed in a much simpler way. First, the local helical axis is calculated as dx-cross-dy, following Bansal, which gives the *same* result as the "single rotation axis" detailed in the RNA paper. I still could not figure out WHY, but verified this numerically. The location where the helix passes through is based directly on the RNA paper (i.e., following Chasles's theorem). The procedure to calculate the helical parameters, i.e., tip/inclination/x-disp/y-disp/etc., follows the SCHNAaP paper (i.e., with the tip-inclination combination, which is consistent with the roll-tilt and propeller-buckle combinations in 3DNA). What amazed us is that this much simpler tip-inclination implementation in 3DNA gives *exactly* the same numerical values as the original RNA algorithm.
  3. In summary, the two sets of local parameters as implemented in 3DNA are based on a unique combination of nice features from several programs well known in the community, including SCHNAaP/SCHNArP (CEHS), NewHelix, RNA, and NUPARM. However, the 3DNA-calculated parameters are numerically different from all of them, partly because 3DNA adopts the standard base reference frame. To the best of my knowledge, 3DNA is the only software that allows for rigorous conversion between the two sets of local parameters; a utility program in 3DNA called step_hel illustrates this simple fact. Again, as I wrote before the 13th Conversation at Albany:
    Taking each base pair as a rigid block, only six parameters are required to relate one bp to the other. Two sets are commonly used: one set is "shift, slide, rise, tilt, roll and twist", and the other set is "x-displacement, y-displacement, helical rise, inclination, tip, and helical twist". Obviously, these two sets of parameters should be directly convertible, as demonstrated by Calladine and Drew in their 1984 JMB B-to-A transition paper, and illustrated in the 3DNA manuscript.
  4. Links to other programs referred to in this section:
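The core geometric facts in this section can be sketched numerically. The following Python snippet is a minimal illustration of my own, not 3DNA's actual code: it builds two base-pair frames related by a known screw rotation (the axis, angle, and origin displacement are made-up numbers), recovers the local helical axis as the cross product of dx and dy as described in the email quoted above, and contrasts "rise" with "helical rise". The mid-step z-axis is crudely approximated here by averaging the two z-axes, a simplification of 3DNA's actual mid-step frame.

```python
import numpy as np

def rotation_about(axis, theta):
    """Rodrigues' formula: rotation matrix for angle theta about a unit axis."""
    x, y, z = axis / np.linalg.norm(axis)
    K = np.array([[0.0, -z, y],
                  [z, 0.0, -x],
                  [-y, x, 0.0]])
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)

# Frame of base pair 1: columns are its x-, y-, and z-axes; origin at 0.
R1 = np.eye(3)
o1 = np.zeros(3)

# Frame of base pair 2: rotate frame 1 by 36 degrees about a tilted axis
# (hypothetical numbers, chosen only to make the step non-planar), with
# the origin displaced roughly along the helix.
true_axis = np.array([0.1, 0.2, 1.0])
true_axis /= np.linalg.norm(true_axis)
R2 = rotation_about(true_axis, np.radians(36.0)) @ R1
o2 = np.array([0.5, -0.3, 3.4])

# Local helical axis as dx cross dy (Bansal's recipe, quoted above):
dx = R2[:, 0] - R1[:, 0]
dy = R2[:, 1] - R1[:, 1]
h = np.cross(dx, dy)
h /= np.linalg.norm(h)

# Why this works: for a rotation R about an axis a, R v - v is
# perpendicular to a for ANY vector v, so both dx and dy lie in the
# plane perpendicular to the rotation axis, and their cross product
# recovers that axis (up to sign, which 3DNA fixes by convention).
assert np.isclose(abs(np.dot(h, true_axis)), 1.0)

# "rise" projects the origin displacement onto the mid-frame z-axis,
# while "helical rise" projects it onto the local helical axis; the two
# differ whenever the step is bent.
z_mid = R1[:, 2] + R2[:, 2]
z_mid /= np.linalg.norm(z_mid)
rise = np.dot(o2 - o1, z_mid)
helical_rise = np.dot(o2 - o1, h)
print(rise, helical_rise)
```

Running the sketch shows the recovered axis matching the one used to build the step, and a visible gap between rise and helical rise for this bent step, which is exactly the "rise vs. helical rise" distinction this series sets out to clarify.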

Saturday, October 10, 2009

Blogger's dual editing modes allow for flexibility and convenience

In my experience using Blogger over the past several months, I have come to appreciate its dual editing modes: Compose and Edit HTML. The Compose mode is a simple WYSIWYG editor, convenient for most common tasks. Once in a while, however, I get stuck with some nasty formatting issue that can drive one crazy to fix. This is where the Edit HTML mode comes in handy, as it allows full flexibility in editing the raw HTML.

In principle, I am pretty competent with HTML and could use the Edit HTML mode directly. However, raw HTML is verbose and thus not that convenient. So I normally start a blog post in the Compose mode for most of the content, and switch to the Edit HTML mode only when necessary. Blogger makes switching between the two modes a simple button click. This is in contrast to the single editing mode (BBCode) of the phpBB3 software used by the 3DNA forum, which does not allow direct access to HTML.

Ideally, a software tool should be both flexible and convenient. In reality, however, not many software tools strike a balance between the two. Blogger is a nice example. Interestingly, in composing this post, I switched back and forth between the two editing modes on a couple of occasions: once after a copy-and-paste, to stop the red coloring of the following text, and again to qualify the 3DNA forum link text.

Friday, October 9, 2009

Fiber models in 3DNA make it easy to build regular DNA helices

3DNA contains 55 fiber models compiled from the literature. To the best of my knowledge, this is the most comprehensive collection of its kind (detailed list given below), including:
  1. Chandrasekaran & Arnott (from #1 to #43) — the most well-known set of fiber models
  2. Alexeev et al. (#44-#45)
  3. van Dam & Levitt (#46-#47)
  4. Premilat & Albiser (#48-#55)
The utility program fiber puts the generation of all these fiber models behind a simple, consistent interface, and can generate structures in either standard PDB or PDBML format. Some of the models can be built with an arbitrary sequence of A, C, G, and T (e.g., A-/B-/C-DNA from calf thymus), while others are of fixed sequence (e.g., Z-DNA with GC repeats). The sequence can be specified either on the command line or in a plain text file.

Once 3DNA is properly installed, the command-line interface is the most versatile and convenient way to generate, e.g., a regular double-stranded DNA (mostly B-DNA) of arbitrary sequence. Moreover, the w3DNA and 3DART web interfaces to 3DNA make it easy to generate a regular DNA model, especially for occasional use or educational purposes.

Theoretically, there is nothing worth showing off in 3DNA's fiber-model generation functionality. However, it serves as a clear example of the difference between a "proof of concept" and a practical software application. I initially decided to work on this issue simply for my own convenience. At the time, I had access to A-DNA and B-DNA fiber-model generators, each a separate program. Moreover, the constructed models did not comply with the PDB format in atom naming and other subtleties.

I started with the Chandrasekaran & Arnott fiber models, for which I had a copy of the data files. Even so, there were many details to work out, typos to correct, etc., to put them into a consistent framework. For the other models, I read each original publication and typed the raw atomic cylindrical coordinates of each model into the computer. Again, quite a few inconsistencies popped up between different publications spanning decades.

Overall, it was quite a tedious undertaking, requiring great attention to detail. I am glad that I did it: I learned so much from the process, and more importantly, others can benefit from my effort. As I put it in the 3DNA Nature Protocols paper (BOX 6 | FIBER-DIFFRACTION MODELS),

In preparing this set of fiber models, we have taken great care to ensure the accuracy and consistency of the models. For completeness and user verification, 3DNA includes, in addition to 3DNA-processed files, the original coordinates collected from the literature.

For those who want to understand what's going on under the hood, there is no better way than to try to reproduce the process using, e.g., fiber B-DNA as an example.
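Geometrically, that reproduction boils down to converting each atom's cylindrical coordinates (r, phi, z) in the repeating unit into Cartesian coordinates for successive residues, applying the model's twist and rise. Here is a minimal Python sketch of that one step, my own illustration rather than 3DNA's actual code, using the calf-thymus B-DNA values from the list below (model #4: 36-degree twist, 3.375 A rise); the single atom coordinate is made up purely for illustration:

```python
import math

# Per-residue helical symmetry of calf-thymus B-DNA (model #4 below):
TWIST = math.radians(36.0)   # rotation per residue
RISE = 3.375                 # translation per residue, in Angstroms

def helix_coords(r, phi, z, n):
    """Cartesian coordinates of an atom given in cylindrical (r, phi, z)
    within the repeating unit, placed in the n-th residue of the helix."""
    a = phi + n * TWIST
    return (r * math.cos(a), r * math.sin(a), z + n * RISE)

# A made-up atom at radius 9.0 A, azimuth 30 degrees, height 0; the real
# models tabulate such (r, phi, z) values for every atom in the unit.
r, phi, z = 9.0, math.radians(30.0), 0.0

# That atom's position in residues 0..10 of one strand:
strand = [helix_coords(r, phi, z, n) for n in range(11)]

# Ten residues complete one full turn (10 x 36 = 360 degrees), so residue
# 10 sits directly above residue 0, one pitch (10 x 3.375 = 33.75 A) up.
print(strand[0], strand[10])
```

The same loop, run over every atom of the repeating unit (and its symmetry mates on the second strand), is essentially what regenerates a whole fiber model; the remaining work is the bookkeeping of atom naming and ordering discussed above.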

From the very beginning, I expected the fiber functionality to be easily reachable and thus beneficial to all those interested in nucleic acid structures, especially for building a regular DNA duplex of a chosen sequence. My sense is that the fiber program has not been used as widely as it should have been, probably because many people are not aware of its existence or capabilities. Hopefully, this blog post will help get my message across.


PS. Given below is the content of the README file for fiber models in 3DNA
1. The repeating units of each fiber structure are mostly based on the
   work of Chandrasekaran & Arnott (from #1 to #43). More recent fiber
   models are based on Alexeev et al. (#44-#45), van Dam & Levitt (#46
   -#47) and Premilat & Albiser (#48-#55).

2. Clean up of each residue
   a. currently ignore hydrogen atoms [can be easily added]
   b. change ME/C7 group of thymine to C5M
   c. re-assign O3' atom to be attached with C3'
   d. change distance unit from nm to A [most of the entries]
   e. re-ordering atoms according to the NDB convention

3. Fix up of problem structures.
   a. str#8 has no N9 atom for guanine
   b. str#10 is not available from the disk, manually input
   c. str#14 C5M atom was named C5 for thymine, resulting in two C5 atoms
   d. str#17 has wrong assignment of O3' atom on Guanine
   e. str#33 has wrong C6 position in U3
   f. str#37 to str#41 were typed in manually following Arnott's
        new list as given in "Oxford Handbook of Nucleic Acid Structure"
        edited by S. Neidle (Oxford Press, 1999)
   g. str#38 coordinates for N6(A) and N3(T) are WRONG as given in the
        original literature
   h. str#39 and #40 have the same O3' coordinates for the 2nd strand

4. str#44 & 45 have fixed strand II residues (T)

5. str#46 & 47 have +z-axis upwards (based on BI.pdb & BII.pdb)

6. str#48 to 55 have +z-axis upwards

List of 55 fiber structures

id#  Twist   Rise        Structure description
    (degrees) (A)
-------------------------------------------------------------------------------
 1   32.7   2.548  A-DNA  (calf thymus; generic sequence: A, C, G and T)
 2   65.5   5.095  A-DNA  poly d(ABr5U) : poly d(ABr5U)
 3    0.0  28.030  A-DNA  (calf thymus) poly d(A1T2C3G4G5A6A7T8G9G10T11) :
                                        poly d(A1C2C3A4T5T6C7C8G9A10T11)
 4   36.0   3.375  B-DNA  (calf thymus; generic sequence: A, C, G and T)
 5   72.0   6.720  B-DNA  poly d(CG) : poly d(CG)
 6  180.0  16.864  B-DNA  (calf thymus) poly d(C1C2C3C4C5) : poly d(G6G7G8G9G10)
 7   38.6   3.310  C-DNA  (calf thymus; generic sequence: A, C, G and T)
 8   40.0   3.312  C-DNA  poly d(GGT) : poly d(ACC)
 9  120.0   9.937  C-DNA  poly d(G1G2T3) : poly d(A4C5C6)
10   80.0   6.467  C-DNA  poly d(AG) : poly d(CT)
11   80.0   6.467  C-DNA  poly d(A1G2) : poly d(C3T4)
12   45.0   3.013  D-DNA  poly d(AAT) : poly d(ATT)
13   90.0   6.125  D-DNA  poly d(CI) : poly d(CI)
14  -90.0  18.500  D-DNA  poly d(A1T2A3T4A5T6) : poly d(A1T2A3T4A5T6)
15  -60.0   7.250  Z-DNA  poly d(GC) : poly d(GC)
16  -51.4   7.571  Z-DNA  poly d(As4T) : poly d(As4T)
17    0.0  10.200  L-DNA  (calf thymus) poly d(GC) : poly d(GC)
18   36.0   3.230  B'-DNA alpha poly d(A) : poly d(T) (H-DNA)
19   36.0   3.233  B'-DNA beta2 poly d(A) : poly d(T) (H-DNA  beta)
20   32.7   2.812  A-RNA  poly (A) : poly (U)
21   30.0   3.000  A'-RNA poly (I) : poly (C)
22   32.7   2.560  Hybrid poly (A) : poly d(T)
23   32.0   2.780  Hybrid poly d(G) : poly (C)
24   36.0   3.130  Hybrid poly d(I) : poly (C)
25   32.7   3.060  Hybrid poly d(A) : poly (U)
26   36.0   3.010  10-fold poly (X) : poly (X)
27   32.7   2.518  11-fold poly (X) : poly (X)
28   32.7   2.596  Poly (s2U) : poly (s2U) (symmetric base-pair)
29   32.7   2.596  Poly (s2U) : poly (s2U) (asymmetric base-pair)
30   32.7   3.160  Poly d(C) : poly d(I) : poly d(C)
31   30.0   3.260  Poly d(T) : poly d(A) : poly d(T)
32   32.7   3.040  Poly (U) : poly (A) : poly(U) (11-fold)
33   30.0   3.040  Poly (U) : poly (A) : poly(U) (12-fold)
34   30.0   3.290  Poly (I) : poly (A) : poly(I)
35   31.3   3.410  Poly (I) : poly (I) : poly(I) : poly(I)
36   60.0   3.155  Poly (C) or poly (mC) or poly (eC)
37   36.0   3.200  B'-DNA beta2  Poly d(A) : poly d(U)
38   36.0   3.240  B'-DNA beta1  Poly d(A) : poly d(T)
39   72.0   6.480  B'-DNA beta2  Poly d(AI) : poly d(CT)
40   72.0   6.460  B'-DNA beta1  Poly d(AI) : poly d(CT)
41  144.0  13.540  B'-DNA  Poly d(AATT) : poly d(AATT)
42   32.7   3.040  Poly(U) : poly d(A) : poly(U) [cf. #32]
43   36.0   3.200  Beta Poly d(A) : Poly d(U) [cf. #37]
44   36.0   3.233  Poly d(A) : poly d(T) (Ca salt)
45   36.0   3.233  Poly d(A) : poly d(T) (Na salt)
46   36.0   3.38   B-DNA (BI-type nucleotides; generic sequence: A, C, G and T)
47   40.0   3.32   C-DNA (BII-type nucleotides; generic sequence: A, C, G and T)
48   87.8   6.02   D(A)-DNA  poly d(AT) : poly d(AT) (right-handed)
49   60.0   7.20   S-DNA  poly d(CG) : poly d(CG) (C_BG_A, right-handed)
50   60.0   7.20   S-DNA  poly d(GC) : poly d(GC) (C_AG_B, right-handed)
51   31.6   3.22   B*-DNA  poly d(A) : poly d(T)
52   90.0   6.06   D(B)-DNA  poly d(AT) : poly d(AT) [cf. #48]
53  -38.7   3.29   C-DNA (generic sequence: A, C, G and T) (deprecated)
54   32.73  2.56   A-DNA (generic sequence: A, C, G and T) [cf. #1]
55   36.0   3.39   B-DNA (generic sequence: A, C, G and T) [cf. #4]
-------------------------------------------------------------------------------
List #1 to #41 based on Struther Arnott: ``Polynucleotide secondary structures:
     an historical perspective'', pp. 1-38 in ``Oxford Handbook of Nucleic
     Acid Structure'' edited by Stephen Neidle (Oxford Press, 1999).

#42-#43 based on Chandrasekaran & Arnott: ``The Structures of DNA
     and RNA Helices in Oriented Fibers'', pp. 31-170 in ``Landolt-Bornstein
     Numerical Data and Functional Relationships in Science and Technology''
     edited by W. Saenger (Springer-Verlag, 1990).

#44-#45 based on Alexeev et al., ``The structure of poly(dA) . poly(dT)
     as revealed by an X-ray fiber diffraction''. J. Biomol. Str. Dyn, 4,
     pp. 989-1011, 1987.

#46-#47 based on van Dam & Levitt, ``BII nucleotides in the B and C forms
     of natural-sequence polymeric DNA: a new model for the C form of DNA''.
     J. Mol. Biol., 304, pp. 541-561, 2000.

#48-#55 based on Premilat & Albiser, ``A new D-DNA form of poly(dA-dT) .
     poly(dA-dT): an A-DNA type structure with reversed Hoogsteen Pairing''.
     Eur. Biophys. J., 30, pp. 404-410, 2001 (and several other publications).

Sunday, October 4, 2009

3DNA in molecular dynamics simulations

While updating 3DNA citations last Friday, I came across the paper "Flexibility of Short-Strand RNA in Aqueous Solution as Revealed by Molecular Dynamics Simulation: Are A-RNA and A′-RNA Distinct Conformational Structures?" by Ouyang et al. [Aust. J. Chem. 2009, 62, 1054–1061]. Through molecular dynamics (MD) simulations over a 30 ns period, the authors found that "the identification of distinct A-RNA and A′-RNA structures ... may not be generally relevant in the context of RNA in the aqueous phase." Overall, the paper is nicely written.

I have never performed MD simulations in my research, so normally I read only the abstracts of such publications, just to get a general idea of the main conclusions. What attracted my attention to this work was its simultaneous citation of three earlier papers:
This is quite unusual, since nowadays it is far more common to cite only 3DNA itself (mostly the 2003 NAR and/or 2008 NP papers). So I decided to have a look at the whole paper. It turned out that the authors used the extra citations to justify their choice of 3DNA over Curves for calculating the helical parameters, including
"the three main structural descriptors commonly used to differentiate between the two forms of RNA – namely major groove width, inclination and the number of base pairs in a helical twist [turn]".

I communicated with the authors about the availability of Curves+. Specifically, using one case (413D, one of the three structures in their Table 1, "Comparison of different results by CURVES and 3DNA programs"), I tried to illustrate the point that Curves+ and 3DNA now give directly comparable parameters.


Of course, I am glad to see 3DNA being applied to molecular dynamics simulations of nucleic acid structures.
Hopefully, more such applications will show up in the future, and I am willing to offer my help in ways that make sense to me.

Saturday, October 3, 2009

Whenever in doubt, check with the author

Once in a while, I send emails to the authors of papers I am interested in, sometimes simply to ask for PDF reprints, but mostly to request clarification of points I cannot fully understand. Of course, the responses I receive vary significantly: some authors are responsive and able to answer my questions concretely, while others respond less professionally, and in no small percentage of cases I get no feedback at all. Whatever the case, though, sending such queries is convenient, and the responses I get (even no response at all) are informative. Naturally, I take more seriously the papers whose authors are responsive. On the other hand, as far as I can remember, I have never ignored a reader's question about my publications.

Seeking clarification about scientific software from its author(s) or maintainer(s) is even more important, due to the inherent subtleties of (undocumented) details, as is common in (bio)informatics. In supporting 3DNA over the years, I've seen quite a few cases where the authors of an article were misinformed in their judgment of 3DNA's functionality. In one case, I read a paper claiming that 3DNA cannot handle Hoogsteen base pairs while Curves can. A few email exchanges with the corresponding author (who was very responsive and professional) revealed that an internally modified version of Curves had been used. More recently, I found a paper claiming that find_pair from 3DNA failed to identify some base pairs in DNA-protein complexes where a new method succeeded. I asked for the list of missed pairs, and immediately noticed that simply relaxing some criteria recovered virtually all of them. Thus, to make a convincing comparison of scientific software, it is crucial to check with the original authors to avoid misunderstandings. Serious scientific software developers and maintainers always welcome users' feedback. Why not ask for clarification if one really wants to make a (strong) point in a comparison? Of course, it is another story for unsupported software.

The Internet age has brought unprecedented convenience to scientific communication. It would be a pity not to take full advantage of it. One simple and important thing to do: whenever in doubt, ask for clarification from the corresponding author of a publication or the maintainer of a piece of software.

Sunday, September 27, 2009

On reproducibility of scientific publications

In the September 25, 2009 issue of Science (Vol. 325, pp. 1622-3), I read with interest the letter from Osterweil et al., "Forecast for Reproducible Data: Partly Cloudy", and the response from Nelson. This exchange of views highlights how difficult, and how important, it is for one research team to precisely reproduce the results of another when elaborate computation is involved. As is well known, subtle differences in computer hardware and software, different versions of the same software, or even different options of the same version can all play a role. Without those details specified, it is virtually impossible to repeat a publication exactly.

This reminds me of a recent paper "Repeatability of published microarray gene expression analyses" by Ioannidis et al. [Nat Genet. 2009, 41(2):149-55]. In the abstract, the authors summarized their findings:
Here we evaluated the replication of data analyses in 18 articles on microarray-based gene expression profiling published in Nature Genetics in 2005-2006. One table or figure from each article was independently evaluated by two teams of analysts. We reproduced two analyses in principle and six partially or with some discrepancies; ten could not be reproduced. The main reason for failure to reproduce was data unavailability, and discrepancies were mostly due to incomplete data annotation or specification of data processing and analysis.

Specifically, please note that:
  1. The authors are experts on microarray analysis, not occasional application software users.
  2. The 18 articles surveyed were published in Nature Genetics, one of the top journals in its field.
  3. Not a single analysis could be reproduced exactly: two were reproduced in principle, six only partially, and the other ten not at all.
Without being able to reproduce others' results exactly, it is hard to build upon previous work and move forward. The various reasons for lack of reproducibility listed by Ioannidis et al. are certainly not limited to microarray analysis. As far as I can tell, they also exist in the fields of RNA structure analysis and prediction, energetics of protein-DNA interactions, quantum mechanics calculations, molecular dynamics simulations, etc.

In my experience and understanding, the methods section of a journal article is not, and should not aim to be, detailed enough for exact duplication by a qualified reader. Instead, most such reproducibility issues would be gone if journals required authors to provide the raw data, the detailed procedures used to process the data, and the software versions and options used to generate the figures and tables reported in the publication. Such information could be made available on journal or author websites. This is an effective way to solve the problem, especially for computational, informatics-related articles. Over the years, for papers where I am the first author or have made major contributions, I've always kept a folder for each article including every detail (data files, scripts, etc.) so that the published tables and figures can be reproduced precisely. This has turned out to be extremely helpful when I want to refer back to earlier publications, or when readers ask me for further details.

As noted by Osterweil et al., "repeatability, reproducibility, and transparency are the hallmarks of the scientific enterprise." To really achieve that goal, every scientist needs to pay more attention to details and be responsive. Do not be fooled by an impressive introduction or extensive discussion (important as they are, of course) in a paper: to get to the bottom of something, it is usually the details that count.