Xiang-Jun's Corner

Saturday, June 13, 2009

blocview: a simple, effective visualization tool for nucleic acid structures

The 'blocview' Perl script in 3DNA was designed as a simple tool that nevertheless illustrates key features of nucleic acid structures effectively. Specifically, the 'bloc' part of the name means 'block', i.e., the base rectangular block in the Calladine-Drew style to show clearly the size (larger purine vs. smaller pyrimidine), identity (by color: red for A, yellow for C, green for G, and blue for T), and the groove (with minor groove edge filled in black); and the 'view' part means the most extended view (as defined by the principal axes of inertia). 'blocview' calls several utility programs of 3DNA and MolScript (for protein ribbons and nucleic acid backbone rods) to prepare the scenes, and then uses Raster3D (more specifically 'render') or PyMol (as of 3DNA v2.0) to generate a PNG image.

It has been my pleasure to notice that 'blocview' generated images are used in the NDB for each and every structure. See for example, an arbitrarily picked link on X-ray Drug-DNA Complexes. A nice thing about these images is that they can be generated automatically: using PDB id 1z8v as an example, the following command

blocview -i=1z8v.png 1z8v.pdb

would produce the image (named 1z8v.png above) shown in the right. In this representation, one can see clearly that there are two unpaired Gs (green) at each of the 5' end of the two DNA chains (red and yellow rods), and a drug molecule (ball-and-stick) binds in the minor groove (black edge of the rectangular blocks). Moreover, the first C-G pair has pronounced negative propeller twist.

It was a nice surprise when I found (several years ago) that such simple images have also be adopted by the PDB, prominently at the summary page, for every nucleic acid containing structure. There are so many sophisticated molecular graphics tools out there, why 'blocview'? After all, it is just a small utility tool in the 3DNA suite of programs, and I have never written a paper on it. Clearly, 'blocview' fills a niche.

On the other hand, given the effectiveness of this simple representation and its adoption by the NDB and PDB, I am surprised to found that 'blocview' images are not used that much in the literature. Maybe the situation will change as 3DNA gets more widely used, and when people pay more attention to its versatile functionalities: certainly, there are more handy features in 3DNA!

More blocview command-line options are available to suit some other common needs: just type 'blocview -h' for details. Furthermore, many aspects of the images can be fine-tuned by configuration files. Astute viewers could have noticed subtle differences of the 1z8v images shown here (above right), and in the NDB and PDB websites. Since 'blocview' is written in Perl script, users can easily check the implementation details to figure out exactly how it works.

Friday, June 12, 2009

Review of scientific manuscripts

While reading an editorial of Maddox [Nature. 1995 Dec 7; 378(6557):521-3], I found the following comment quite interesting:

It's a good joke (which I have often used) that Watson and Crick's paper on the structure of DNA could not be published now. If is only necessary to imagine what people would say if it reached them in the mail: "It's all model-building, just speculation, and such data they have are not theirs but Rosalind Franklin's!" Some would complain that the sentence beginning, "It has not escaped our attention..." is certainly unsubstantiated, and must be an attempt to claim credit for developments in genetics that lies years ahead.

Maddox continued to imagine what could have happened behind the scene that led to the publication of Watson and Crick's paper, together with Franklin's crucial experimental work illustrating the helical signature of crystalline DNA.

In some sense (to my understanding), Watson and Crick were lucky, not only in that they solved the puzzle of DNA structure elegantly by connecting the various experimental pieces together, but also they worked in the Cavendish Laboratory at Cambridge University and had the support from Sir Lawrence Bragg. One could easily imagine what would have happened if they did this work in a not so famous lab, without support from a leading, powerful figure.

In a scientific field (e.g. RNA), what appears to be a common sense, a decade old issue, could have something significant when view from a novel perspective. Such unconventional results, however, are often hard to get across the review process, especially for new comers to a field and without backing of big names. In principle, the reviewers should be fair to the author, being specific with his/her comments, especially in case of negative ones. That would be far more convincing to the authors than some vague, general remarks. As a general rule, I won't accept to review a manuscript for a journal unless I have time to read it thorough carefully, and the expertise to understand the details, in order to provide some concrete comments.

In this regard, it is worth reading the essay by Keith Manchester titled "Historical Opinion: Erwin Chargaff and his 'rules' for the base composition of DNA: why did he fail to see the possibility of complementarity?" [Trends Biochem Sci. 2008 Feb;33(2):65-70]. For completeness, the abstract is quoted below:

Erwin Chargaff was one of the more interesting and colourful figures of the historic decade that heralded the proposal of the double helical structure of DNA by Watson and Crick in 1953. In describing Chargaff's important contribution to the study of DNA, particularly its base composition, this article seeks to suggest why, despite his substantial achievements, he failed to anticipate some of the key features of the Watson-Crick model, particularly complementarity between bases--a failure that left him deeply embittered for the rest of his life.

Saturday, June 6, 2009

ANSI C program memory check with valgrind

Over the years, ANSI C has been my selected language for scientific computations, from SCHNAaP/SCHNArP, 3DNA, to my current research projects. As noted by K&R in the preface of "The C Programming Language" book (2nd edition), C "wears well as one's experience with it grows". To me, ANSI C is a language that the better I know it, the more I love it.

Using gcc with strict compiler options -ansi -pedantic -Werror -Wall -W -Wshadow, my ANSI C programs (e.g., 3DNA) compile cleanly without any change on all the OSes I have access to. Any C programmer should be well aware of the memory leak type of bugs that are so difficult to fix, since it is not checked by gcc. That's where valgrind comes in and I have been using it in 3DNA and my current projects on a regular basis.

The usefulness of the valgrind toolset becomes obvious when a SCHNArP user posted a question about some possible memory leak bugs. Traditionally, SCHNArP works interactively from command line and builds one model (DNA structure) at a time. When the program exists, any meomry leaks are reclaimed by the system so such bugs do not show up. At the time SCHNAaP/SCHNArP was developed, I was not aware of valgrind (or maybe it even did not exist yet!). The user has been using some functions from SCHNArP to build many structures automatically, thus any small amount of memory leaks scales up and eventually aborts out-of-memory. As a test case, the user supplied me with a reproducible example as follows:

    while(1){
        GLH_build();    
    }

Without knowing of valgrind and actually applied it in this situation, fixing such bugs would have been too time consuming than I could manage to spend. However, with the help of valgrind, solving this puzzle did not take that long at all. More specifically, recompiling the sample program with -g option to have debug information, and running it with valgrind --leak-check=full prefixed immediately revealed a couple of locations with memory leaks. Knowing where the problems are, fixing them became straightforward. Now SCHNArP is more robust.

Overall, valgrind is an essential tool for any serious C programers.

Friday, June 5, 2009

3DNA citations reach over 300

When checking Web of Science today, I found, to my great satisfaction, that citations to the 3DNA 2003 NAR paper have reached over 300. While the number is still not comparable to true classics, it is nevertheless a significant one. It is worth noting that the 300+ citations span 77 scientific journals in the broad biomedical and chemical research fields, including Cell, Nature and Science (well, if CNS really means anything). The top six journals (with over double-digit citations) are:

49 -- NUCLEIC ACIDS RESEARCH
27 -- JOURNAL OF MOLECULAR BIOLOGY
23 -- JOURNAL OF THE AMERICAN CHEMICAL SOCIETY
19 -- BIOCHEMISTRY
13 -- JOURNAL OF PHYSICAL CHEMISTRY B
11 -- BIOPHYSICAL JOURNAL

In retrospect, the wide acceptance of 3DNA by the community is certainly not by chance. Listed below are some of the major factors:

3DNA is built upon the best selected features from a thorough understanding of seven popular DNA structural analysis programs, including Curves, NewHelix/FreeHelix, CEHS, CompDNA, and RNA (Running Nucleic Acids by Babcock and Olson). This is summarized back in June 2003, nearly five years ago, in an email I sent to a group of experts before a Workshop at the 13th Conversation at Albany (NY):

Looking back, 3DNA has never been intended to set up a new standard for the exact definition of various parameters. We selected what made best sense to us from what were already available. This was made possible by knowing the exact technical details of how other programs work. In the meantime, 3DNA also provide a program called "cehs" which gives the original CEHS/SCHNAaP parameters, to which Freehelix parameters would be similar. Believe it or not, "cehs -r" would give the authentic RNA parameters. We also see Curves unique in defining global parameters, bending analysis and groove dimensions."
Active support through emails and the 3DNA forum. Over the years, I have always striven to get back to each 3DNA-related question quickly and concretely. In my memory, I have never ignored users' questions. Instead, I have taken each and every question as a way to improve 3DNA, and refined the code accordingly.
3DNA's integrated approach that incorporates structural analysis, modeling building, and visualization in a single suite of software package. Our 2008 Nature Protocols paper provides some examples.
3DNA has been extensively checked against all NDB entries before each major release, so that I can confidently say that it works in real world, not just in some contrived limited example cases.

I still remember that the number of citations to 3DNA was less than 150 nearly two years ago, when I started to wrote the first draft of our 2008 Nature Protocols paper. Now it is more than doubled! I would blog on this topic again when the number reaches 500.

In my opinion, some of 3DNA features are still (heavily) underused. Now that we have a sizable user community, 3DNA could only become better and would be more widely used. I have every reason to believe that in the not-so-distant-future, the citations to 3DNA would reach over 1000.

For current status of 3DNA citations, check Google Scholar.

Monday, May 25, 2009

Examples make it clear

In reading scientific publications, I always quickly get confused with long, complicated equations. Initially, I thought it was because of my lack of domain knowledge: when I get more familiar with a field, things will become easier. Of course, this statement is true. However, it is not exactly the point in the following two senses:

Having been in the field of nucleic acid structures for more than ten years, I can claim I know the field good enough. Yet, I can still easily get lost with large portions of mathematical formula in an article. An example is the Babcock and Olson 1994 JMB paper titled "Nucleic acid structure analysis. Mathematics for local Cartesian and helical structure parameters that are truly comparable between structures". Overall, it is a solid piece of work but I doubt how many people in this field really understand it above the conceptual level.
When I get to the bottom of something, things do become truly simple. But even then the equations still feel complicated. In resolving the discrepancies among nucleic acid conformational analyses, I went through the source code to understanding from the bottom up how the Babcock and Olson method works. It is actually not that complicated, and the implementation with regard to finding the local helical axis and calculating a set of helical parameters can be greatly simplified using a geometric approach, as in SCHNAaP/3DNA. Moreover, some details from the above Babcock and Olson 1994 JMB paper are actually inaccurate.

Another wonderful case I have had is with the CEHS scheme (due to Calladine and El Hassan) to calculate a set of local base-pair parameters. It was only after checking an example did I uncover a subtle discrepancy between the simple CEHS formula (in its original form) and its description, in the draft version of Dr. El Hassan's PhD thesis. This experience led to the joint SCHNAaP/SCHNArP papers with El Hassan, and inspired me later on to work through NewHelix, Curves, CompDNA etc DNA structure analysis programs to reveal "the reasons why the different algorithms come to conflicting structural interpretations".

So, if you are a beginner in a research field, do not be scared off by the fancy mathematical formula. If you want to really understand something, work through an example and/or check through the source code implementations. For the authors, if you would like to get your message across, provide step-by-step examples to the key points you want to make for others to follow.

Saturday, May 16, 2009

Fiber, analyze and rebuild in 3DNA

While browsing the current issue of Nucleic Acids Research (May 2009, Vol. 37, No. 8), I noticed, surprisingly, the article titled Reconstitution of ‘floral quartets’ in vitro involving class B and class E floral homeotic proteins by Rainer Melzer and Günter Theißen from Jena, Germany.

From the title and abstract, I would not expect 3DNA could play a role here. Yet in section Modeling a floral quartet, the authors used fiber to building a B-DNA model of floral quartets, using sequencebetween the XhoI and XbaI sites of probe A, then analyze the structure, and modified the resultant file 'bp_step.par' accordingly to their specifications, and rebuild a customized DNA model. In the following paragraph, the authors used the analyze/rebuild pair again for the CArG box bound to serum response factor.

This is the intended usage of the 3DNA software package I have had in mind from its initial design. I emphasized such versatile, integrated approach unique with 3DNA in our 2008 Nature Protocols paper. As time goes by, I am confident that more people will begin to appreciate this approach.

Friday, May 15, 2009

PDB ATOM coordinates record

PDB format is one of the standard formats for biological macromolecular structures (proteins, DNA/RNA, their complexes, etc). It came into existence when the initial Brookhaven PDB was established in 1971. Over the years, PDBML and mmCIF have been proposed as substitutes to allow for more flexibilities, yet PDB format is still the mostly commonly used one. Software dealing with PDB structures each has its own parser, at least for the Coordinate Section (especially the ATOM and HETATM records).

Overall, PDB format is simple and very well documented. The simplicity lies just in its 'rigidity', in FORTRAN 77 style. The ATOM/HETATM record description is excerpted below for easy reference:


COLUMNS        DATA  TYPE    FIELD        DEFINITION
-------------------------------------------------------------------------------------
1 -  6         Record name   "ATOM  "
7 - 11         Integer       serial       Atom  serial number.
13 - 16        Atom          name         Atom name.
17             Character     altLoc       Alternate location indicator.
18 - 20        Residue name  resName      Residue name.
22             Character     chainID      Chain identifier.
23 - 26        Integer       resSeq       Residue sequence number.
27             AChar         iCode        Code for insertion of residues.
31 - 38        Real(8.3)     x            Orthogonal coordinates for X in Angstroms.
39 - 46        Real(8.3)     y            Orthogonal coordinates for Y in Angstroms.
47 - 54        Real(8.3)     z            Orthogonal coordinates for Z in Angstroms.
55 - 60        Real(6.2)     occupancy    Occupancy.
61 - 66        Real(6.2)     tempFactor   Temperature  factor.
77 - 78        LString(2)    element      Element symbol, right-justified.
79 - 80        LString(2)    charge       Charge  on the atom.

It won't take much time/lines in a script language such as Perl/Python/Ruby etc to extract specific information one is interested in, e.g., atomic coordinates. However, there are some subtleties that are beyond simple script parsers. On top of that, one needs to understand that not all self-claimed PDB files are standard compliant.

More specifically, a decent PDB format parser must take the following into considerations:

The four-character atom name specified in columns 13 to 16. Each biological molecule has a convention in naming atoms. For example, the two H-bonds of the A-T pair are between " N1 " (A) to " N3 " (T), and " N6 " (A) to " O4 " (T). In this regard, it worths noting that babel/openbabel converted PDB files do not follow such naming convention.
The one-character alternate location indicator (altLoc) in column 17
The one-character residue insertion code (iCode) in column 27.
Others details to follow ...

If you are serious about your PDB format parser, there is a very simple thing to do: run it against all the NDB entries if you are working on nucleic acid structures (that's what I did for 3DNA), and the whole PDB entries if you are interested in protein structures. If it does not crash and does what it has been designed for, then congratulate yourself!

Saturday, May 9, 2009

Running Windows on Ubuntu Linux with VirtualBox

Ever since I switched from MS DOS to Unix and then Linux, I have never felt the desire to move back to the Windows world. Yet in the real world, MS Windows is an operating system (OS) that you simply can't avoid. For example, I compiled 3DNA v2.0 in Cygwin and MinGW, both in Windows. For such an one time process, my dual boot-able Ubuntu and Vista has worked just fine. However, when writing an article in a collaborative setting, access to Word has becomes a must and switching back and forth between Linux and Windows is really a pain -- not just time consuming, but also the inconvenience in communicating files between the two OSes. OpenOffice is fine for personal, occasional use, but I have experienced some (subtle) compatibility issues with Word, especially with EndNote support for citations and bibliography.

In the past couple of days, I installed Sun VirtualBox 2.2.2 onto my Ubuntu 9.04 with the following settings:

Windows XP (MS Office 2003, EndNote X2)
512 MB base memory
12 MB video memory
10 GB hard disk
With Guest Additions installed

I have played around for a little while, and everything seems to work just fine. Now I have full access to MS Office (Word, PowerPoint etc) and EndNote. I really like the feature of "Shared Folders": it allows me to directly access files in my host Linux box from within the guest Windows XP.

The VirtualBox UserManual.pdf is pretty well-written: after reading carefully the first four chapters, I did not have much trouble in getting the system up and running. I also referenced the following two blog posts:

Run Windows in a virtual machine using Ubuntu and Virtualbox

Virtualization in Linux: A Review of Four Software Choices

Wednesday, May 6, 2009

PhD pages on DNA base-stacking interactions

I was recently approached by Dr. Victor Zhurkin, a well-known computational biologist at the NIH, asking for some details about the sigma/pi charges used in my JMB publication on DNA base-stacking interactions. It was a paper based on my PhD thesis with Dr. Chris Hunter at Sheffield University. Ever since its publication a decade ago, this is the first time I have been asked questions about it (Dr. Hunter was the corresponding author).

This request was both a nice surprise, and at the same time, a challenge since I do not have the original files at hand. Over the years, I have never been a (big) fan of energy calculations. So unlike my SCHNAaP/SCHNArP papers for which I do keep the original source code and data files to reproduce the results published back-to-back in JMB, I do not have direct access to the original files for del Re (sigma) and Huckel (pi) charges used in my base-stacking interactions paper. I did find the HP 60 meter Data Cartridge I took from Sheffield which should contain all the details of my PhD work, but it is hard these days to find a machine to read it. At the meantime, I found a hard copy of my PhD thesis, which contains further details. So I used my digital camera, took pictures of 11 pages that are relevant, and made it available online in a tarball file.

The key feature of my DNA-stacking interaction work is the distributed charge model for the aromatic bases. This charge model was based on the seminal work of "The Nature of Pi-Pi Interactions" by Hunter and Sanders published in JACS in 1990. This simple distributed charge model catches the aromaticity concept nicely, making it easily understandable to bench chemists and biologist. As I have just checked it out, this 1990 JACS paper has been cited 2280+ times, according to Web of Science. My 1997 JMB DNA-stacking interaction paper has received 78 citations -- certainly not comparable to the JACS paper, but overall not bad at all. (Google Scholar gives much smaller citation numbers, for unknown reasons).

Looking back, the DNA base-stacking project was the main theme of my PhD work. It was the close collaboration with El Hassan in Dr. Chris Calladine's group at Cambridge University that introduced me to the elegant CEHS scheme, and based on which I created the SCHNAaP/SCHNArP software programs. The mathematical rigor of the CEHS scheme, i.e., its complete reversibility with the analysis/rebuilding of dinucleotides, put my theoretical energy calculation on a solid basis for comparison with oligonucleotide X-ray crystal structures. The "detour" to SCHNAaP/SCHNArP reflected my personal interest, and was the starting point that later led to the resolution of the long-standing discrepancies among nucleic acid conformational analyses, the establishment of standard base reference frame, and the creation of the now popular 3DNA software package.