Friday, October 23, 2009

How to quantify the relative geometry between two base-pairs? — Part 2

In part 1 of this series, and indeed on other occasions in my blog, the 3DNA home page and the forum, I have credited several related programs from which 3DNA has benefited. Before going into details on how the various parameters are calculated in 3DNA, however, I feel it is in order to make clear my philosophy on creating scientific software in general, and 3DNA in particular.

Some underlying considerations

  • Whenever possible and applicable, I try to get a thorough understanding of the related software available. By thorough I mean at the source-code level, to see how an algorithm is actually implemented. The benefits of doing this are severalfold:
    1. Not to reinvent the wheel, but to build upon previous work.
    2. It is the most effective way to learn: many math formulae or text descriptions of algorithms, while useful for getting the general idea, lack detail or are vague. In bioinformatics, most of the fundamental mathematics is not new. It is more about a novel combination of known parts, applied to a specific problem. Reading the source code is the only way to see the implementation details unambiguously.
    3. With a clear understanding of how the parameters are actually calculated, one can make better use of a software tool, know its limitations and thus avoid misinterpretations.
    4. To make objective, convincing (even to the original authors) comparisons/comments on existing tools, and importantly, to create something better if needed.
  • Specifically, 3DNA has benefited the most from my SCHNAaP/SCHNArP programs: the underlying algorithms in 3DNA for calculating the various structural parameters (propeller, buckle, shear, roll, slide, x-displacement, inclination, etc.) are exactly the same as those used there. The analyze and rebuild programs are direct derivatives of SCHNAaP (a for analysis) and SCHNArP (r for rebuilding/reconstruction), respectively; the ideas of standard stacking diagrams, the Zp parameter, building atomic models with a sugar-phosphate backbone, and the base/bp rectangular block representations also come from SCHNAaP/SCHNArP. I also borrowed ideas from Babcock's RNA, Bansal's NUPARM, Dickerson's NewHelix/FreeHelix, etc., which will be mentioned explicitly in the following sections.
  • Code-wise, 3DNA was created from scratch in strict ANSI C (over 25K lines). In addition to implementing a unique combination of existing algorithms, I have added many significant novel methods, including the find_pair program. However, being feature-rich has never been my goal. Instead, I only consider adding new functions that I understand clearly and find useful. On the other hand, I am always quick to fix bugs.
Overall, it seems that I am an outlier rather than the norm in scientific programming. The reasons could be as follows:
  1. It is hard to understand other people's implementations of algorithms, since one needs to read between the lines. This is especially the case with an unfamiliar computer language, or with undocumented or bad code. It is easier to create one's own program than to understand and modify third-party software.
  2. For publication purposes, it is more attractive to work on something "new" than to build on others' work.
Thus, in the bioinformatics field, there are far more "new" methods papers published than refinements of previous ones. However, what is claimed or appears to be novel may not be so under the hood, because different fields use different terminology. Moreover, while it is easy to create a self-proclaimed "new" program, others could simply do the same. In my experience, it is well worth the effort to really understand the leading programs if one is serious about getting into an informatics field: it was truly revealing and exciting when I went through CEHS, NewHelix/FreeHelix, RNA, Curves, etc. Only then did I know clearly what the state of the art in the field was and how I could do better: that's exactly how 3DNA came about!

In the following sections, I will provide details on how 3DNA calculates the two sets of parameters to quantify the relative geometry of a dinucleotide step.
