Saturday, October 3, 2009

Whenever in doubt, check with the author

Once in a while, I send emails to the authors of papers I am interested in, sometimes simply to ask for PDF reprints, but mostly to request clarification of points I cannot fully understand. The responses I have received vary significantly: some authors are responsive and answer my questions concretely; others respond less professionally; and in no small percentage of cases, I get no feedback at all. Whatever the outcome, sending such queries is convenient, and the responses I get (even no response at all) are informative. Naturally, I take more seriously the papers whose authors are responsive. For my part, as far as I can remember, I have never ignored a reader's question about my own publications.

Seeking clarification about a piece of scientific software from its author(s) or maintainer(s) is even more important, because of the inherent subtleties of (undocumented) details, as is common in (bio)informatics. In supporting 3DNA over the years, I've come across quite a few cases where authors of articles were misinformed in their judgments about 3DNA's functionality. In one case, I read a paper claiming that 3DNA could not handle Hoogsteen base pairs while Curves could. A few email exchanges with the corresponding author (who was very responsive and professional) revealed that an internally modified version of Curves had been used. More recently, I found a paper claiming that find_pair from 3DNA failed to identify some base pairs in DNA-protein complexes where a new method succeeded. I asked for the list of missed pairs and immediately noticed that simply relaxing some of the pairing criteria recovered virtually all of them. Thus, to make a convincing comparison of scientific software, it is crucial to check with the original authors to avoid misunderstandings. Serious developers and maintainers of scientific software always welcome users' feedback. Why not ask for clarification if one really wants to make a (strong) point in a comparison? Of course, it is another story for unsupported software.

The Internet age has brought unprecedented convenience to scientific communication, and it would be a pity not to take full advantage of it. One simple and important habit: whenever in doubt, ask for clarification from the corresponding author of a publication or the maintainer of a piece of software.

Sunday, September 27, 2009

On reproducibility of scientific publications

In the September 25, 2009 issue of Science (Vol. 325, pp. 1622-3), I read with interest the letter from Osterweil et al., "Forecast for Reproducible Data: Partly Cloudy", and the response from Nelson. This exchange of views highlights how difficult, and how important, it is for one research team to reproduce precisely the results of another when elaborate computation is involved. As is well known, subtle differences in computer hardware and software, different versions of the same software, or even different options within the same version can all play a role. Without those details specified, it is virtually impossible to repeat a publication exactly.

This reminds me of a recent paper "Repeatability of published microarray gene expression analyses" by Ioannidis et al. [Nat Genet. 2009, 41(2):149-55]. In the abstract, the authors summarized their findings:
Here we evaluated the replication of data analyses in 18 articles on microarray-based gene expression profiling published in Nature Genetics in 2005-2006. One table or figure from each article was independently evaluated by two teams of analysts. We reproduced two analyses in principle and six partially or with some discrepancies; ten could not be reproduced. The main reason for failure to reproduce was data unavailability, and discrepancies were mostly due to incomplete data annotation or specification of data processing and analysis.

Specifically, please note that:
  1. The authors are experts in microarray analysis, not occasional users of the software.
  2. The 18 articles surveyed were published in Nature Genetics, one of the top journals in the field.
  3. Not a single analysis could be reproduced exactly: two were reproduced in principle, six only partially, and the other ten not at all.
Without being able to reproduce others' results exactly, it is hard to build upon previous work and move forward. The various reasons for irreproducibility listed by Ioannidis et al. are certainly not limited to microarray analysis. As far as I can tell, they also exist in the fields of RNA structure analysis and prediction, energetics of protein-DNA interactions, quantum mechanical calculations, molecular dynamics simulations, and so on.

In my experience and understanding, the methods section of a journal article is not, and should not aim to be, detailed enough for exact duplication by a qualified reader. Instead, most such reproducibility issues would disappear if journals required authors to provide the raw data, the detailed procedures used to process the data, and the software versions and options used to generate the figures and tables reported in the publication. Such information could be made available on journal or author websites. This is an effective way to address the problem, especially for computational, informatics-related articles. Over the years, for papers on which I am the first author or to which I have made major contributions, I've kept a folder for each article containing every detail (data files, scripts, etc.) needed to reproduce the published tables and figures precisely. This has turned out to be extremely helpful when I want to refer back to earlier publications, or when readers ask me for further details.
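To make the idea concrete, here is a minimal sketch in Python (purely illustrative, not my actual scripts; the file and script names are hypothetical) of how each generated figure or table could be accompanied by a small manifest recording the raw-data checksums, the exact command and options, and the software environment used:

    #!/usr/bin/env python
    """Illustrative sketch: write a manifest next to each generated figure or
    table, recording the inputs, the exact command, and the environment used."""

    import hashlib
    import json
    import platform
    import subprocess
    import sys
    from datetime import datetime, timezone


    def sha1_of(path):
        """Checksum of an input data file, so later reruns can verify the data."""
        h = hashlib.sha1()
        with open(path, "rb") as fh:
            for chunk in iter(lambda: fh.read(1 << 16), b""):
                h.update(chunk)
        return h.hexdigest()


    def run_and_log(command, inputs, manifest_path):
        """Run the analysis command and save a JSON manifest of how it was run."""
        result = subprocess.run(command, capture_output=True, text=True)
        manifest = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "command": command,                       # exact program and options
            "inputs": {p: sha1_of(p) for p in inputs},
            "python": sys.version,
            "platform": platform.platform(),          # hardware/OS details matter too
            "return_code": result.returncode,
        }
        with open(manifest_path, "w") as fh:
            json.dump(manifest, fh, indent=2)
        return result


    if __name__ == "__main__":
        # Hypothetical example: regenerate "figure 2" from a raw data file.
        run_and_log(["python", "make_figure2.py", "raw_data.csv"],
                    ["raw_data.csv"], "figure2.manifest.json")

With such a manifest sitting next to each figure or table, anyone (including the original author years later) can see exactly which data, which options, and which software environment produced it.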

As noted by Osterweil et al., "repeatability, reproducibility, and transparency are the hallmarks of the scientific enterprise." To really achieve that goal, every scientist needs to pay more attention to details and be responsive. Do not be fooled by an impressive introduction or extensive discussion (important as they are) in a paper: to get to the bottom of something, it is usually the details that count.