Sunday, September 27, 2009

On reproducibility of scientific publications

In the September 25, 2009 issue of Science (Vol. 325, pp. 1622-1623), I read with interest the letter from Osterweil et al., "Forecast for Reproducible Data: Partly Cloudy," and the response from Nelson. This exchange of views highlights both the difficulty and the importance of one research team precisely reproducing results from another when elaborate computation is involved. As is well known, subtle differences in computer hardware and software, different versions of the same software, or even different options within the same version can all play a role. Without those details specified, it is virtually impossible to repeat a publication exactly.

This reminds me of a recent paper "Repeatability of published microarray gene expression analyses" by Ioannidis et al. [Nat Genet. 2009, 41(2):149-55]. In the abstract, the authors summarized their findings:
Here we evaluated the replication of data analyses in 18 articles on microarray-based gene expression profiling published in Nature Genetics in 2005-2006. One table or figure from each article was independently evaluated by two teams of analysts. We reproduced two analyses in principle and six partially or with some discrepancies; ten could not be reproduced. The main reason for failure to reproduce was data unavailability, and discrepancies were mostly due to incomplete data annotation or specification of data processing and analysis.

Specifically, please note that:
  1. The authors are experts in microarray analysis, not occasional users of application software.
  2. The 18 articles surveyed were published in Nature Genetics, one of the top journals in its field.
  3. Not a single analysis could be reproduced exactly: two were reproduced in principle, six only partially, and the other ten not at all.
Without being able to reproduce others' results exactly, it is hard to build upon previous work and move forward. The various reasons for lack of reproducibility listed by Ioannidis et al. are certainly not limited to microarray analysis. As far as I can tell, they also exist in the fields of RNA structure analysis and prediction, the energetics of protein-DNA interactions, quantum mechanical calculations, and molecular dynamics simulations.

In my experience and understanding, the methods section of a journal article is not, and should not aim to be, detailed enough for exact duplication by a qualified reader. Instead, most such reproducibility issues would disappear if journals required authors to provide the raw data, the detailed procedures used to process the data, and the software versions and options used to generate the figures and tables reported in the publication. Such information could be made available on journal or author websites. This is an effective way to solve the problem, especially for computational, informatics-related articles.

Over the years, for papers on which I am the first author or to which I have made major contributions, I have always kept a folder for each article containing every detail (data files, scripts, etc.) so that the published tables and figures can be repeated precisely. This has turned out to be extremely helpful when I want to refer back to earlier publications, or when readers ask me for further details.
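As a concrete illustration, here is a minimal sketch in Python of how such a per-article folder might be driven by a single script that first records the software environment and then regenerates every published figure. The file and script names (reproduce.py, fig1_expression.py, and so on) are hypothetical placeholders for illustration, not the actual layout of any particular paper:

    # reproduce.py -- log the software environment, then regenerate all figures
    import platform
    import subprocess
    import sys
    from datetime import datetime, timezone

    def log_environment(path="VERSIONS.txt"):
        """Record the platform, interpreter, and packages used for this run."""
        with open(path, "w") as f:
            f.write("run date : %s\n" % datetime.now(timezone.utc).isoformat())
            f.write("platform : %s\n" % platform.platform())
            f.write("python   : %s\n" % sys.version.splitlines()[0])
            # Freeze the installed packages so a reader can rebuild the setup.
            pkgs = subprocess.run([sys.executable, "-m", "pip", "freeze"],
                                  capture_output=True, text=True)
            f.write(pkgs.stdout)

    def main():
        log_environment()
        # Each script reads from the data files kept in this folder and
        # writes exactly one of the published figures.
        for script in ["fig1_expression.py", "fig2_clustering.py"]:
            subprocess.run([sys.executable, script], check=True)

    if __name__ == "__main__":
        main()

With everything kept under one folder like this, rerunning the script months or years later shows immediately whether the published results still come out the same, and the VERSIONS.txt file records exactly which software produced them.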

As noted by Osterweil et al., "repeatability, reproducibility, and transparency are the hallmarks of the scientific enterprise." To truly achieve this goal, every scientist needs to pay more attention to details and be responsive. Do not be fooled by the impressive introduction or extensive discussion (which are important, of course) in a paper: to get to the bottom of something, it is usually the details that count.
