Monday, December 13, 2010

Extract images from PDF files using 'pdfimages'

Once in a while, I need to extract an image (or a portion thereof) from a PDF file. Usually, I open the PDF file using 'preview' or Adobe 'Acrobat Reader', and take a screenshot which is then cropped to the desired slice. This manual method "works", albeit a bit tedious.

Recently, I came across the handy command-line utility program 'pdfimages' which allows for automatic extraction of all images from a PDF file. The basic usage is simply:
pdfimages   yourfile.pdf   prefix
This will extract all images contained in yourfile.pdf to files prefix-001.ppm, prefix-002.ppm etc. With option "-j", images in DCT format are saved as JPEG files.
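For example, to save DCT-encoded images directly as JPEG files (still assuming the input is yourfile.pdf):
pdfimages -j yourfile.pdf prefix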

Sunday, December 5, 2010

Nice blog on English writing by Lynn Gaertner-Johnston

A while back, I became curious about the difference between "entitled" and "titled". A colleague referred me to a blog post on this topic by Lynn Gaertner-Johnston. After reading the post, it became clear that over the years I had made the same mistake numerous times. I then corrected nearly all occurrences of "entitled" to "titled" in my blog and 3DNA forum posts. Ever since, I have been following Lynn Gaertner-Johnston's blog, titled "Business Writing", on a regular basis, and have found it helpful in improving my English writing.

Sunday, November 28, 2010

Under-citation of method papers?

Recently in the CCP4BB, there has been an interesting thread with extensive discussion of "Citations in supplementary material". The original poster refers to the recent Acta Cryst D editorial of the same title, in which the authors highlight the issue of under-citation of papers published in the International Union of Crystallography (IUCr) journals.

The main point is that method papers are more likely to be cited in the supplementary materials only, which are not indexed by PubMed, Scopus, Web of Science or Google Scholar etc. As a result, they are statistically undercounted, and "Journals and scientists that focus on publishing methodologically oriented papers are particularly affected." Specifically, through a survey of articles on protein or nucleic acid structure determination published in Cell, Nature, Science, and PNAS in 2009, the authors found that "almost half of all references to publications in IUCr journals end up being published in the supplementary material only."

The findings of the editorial resonate with my observations, and I cannot agree more with the authors that "in the end, methods need to be continuously developed and refined in order to ensure progress."

On the other end of the spectrum, some highly influential method papers are heavily cited. As an extreme case, the large number of citations to the 2008 paper "A short history of SHELX" by George Sheldrick has rocketed the impact factor of Acta Crystallographica A up roughly 20-fold, to 49.9, this year!

Sunday, November 21, 2010

Belorussian translation of 3DNA webpages

Recently, I communicated with Paul Bukhovko on the translation of the 3DNA webpages http://rutchem.rutgers.edu/~xiangjun/3DNA/ into Belorussian. As the author of the original website, referred to as http://rutchem.rutgers.edu/~olson/3DNA/ (which is simply a soft link to the above URL) in the 2003 3DNA NAR paper, I was pleasantly surprised when Paul asked for permission to perform the translation, which I gladly granted.

Regarding the process, Paul commented:
Was a pleasure to translate this page! It's kinda fresh and related to my professional interests, so I thought - why not, if the author allows to do so.

The translated page is at URL: http://www.movavi.com/opensource/3DNA-be. Interestingly, when I used Google Translate to convert the Belorussian version back to English, the outcome is pretty readable. In contrast, when the original English version is directly translated to Chinese, the result is beyond recognition!

Sunday, November 14, 2010

Proper labeling of O1P and O2P atoms in a phosphate group

Recently, a question on the 3DNA o1p_o2p utility program in the forum led me to reflect on the proper labeling of O1P/O2P atoms in a phosphate group. As is well known, in DNA/RNA structures the phosphate group (see figure below left) connects two neighboring nucleosides.

The two nonbridging oxygen atoms of the phosphate group (the horizontal Os in the O-P=O line, left) are named O1P and O2P in PDB files (see also figure to the right). Stereochemically, O1P and O2P are also designated as pro-R and pro-S oxygens, respectively.

Presumably, the structural files in the PDB and NDB databases should be consistent and follow the standard nomenclature. In practice, however, some entries in the NDB had mislabeled O1P/O2P atoms (e.g., adh026). I first noticed this issue when I superposed the A-DNA adh026 onto its 3DNA-rebuilt version with the sugar-phosphate backbone. I observed an unreasonably large RMSD only for the octamer adh026, while the RMSDs were much smaller (as expected) for the B-DNA dodecamer bdl084 and the 146-bp nucleosomal DNA in pd0001 (see $X3DNA/examples/analyze_rebuild distributed with 3DNA v2.0). Since the standard building blocks in 3DNA were applied consistently, I traced the cause of the large RMSD to the PDB file of adh026 itself, and finally identified the mislabeling of the O1P/O2P atoms as the culprit.

The utility program o1p_o2p was written specifically to check whether the O1P/O2P atoms are properly labeled in a PDB file. In a phosphate group, if O1P/O2P are correctly labeled, then following O1P-->O2P-->O5' in a right-handed sense points the thumb in the direction of O3' (see the figure at the upper right). As always, how this actually works is best illustrated with an example. The example below uses the second phosphate of adh026 (as distributed with 3DNA), with a GNU Octave script. Here the O1P/O2P atoms are mislabeled, since direction has a negative value; for a properly labeled phosphate group, direction would be positive.

#ATOM      6  O3*   G A   1       8.396  -3.995  -1.948  1.00 30.86           O  
#ATOM     20  P     G A   2       8.163  -3.069  -0.619  1.00 32.38           P  
#ATOM     21  O1P   G A   2       7.401  -1.917  -1.218  1.00 32.09           O  
#ATOM     22  O2P   G A   2       7.280  -3.934   0.195  1.00 34.05           O  
#ATOM     23  O5*   G A   2       9.600  -2.800  -0.121  1.00 29.41           O  

P   = [8.163  -3.069  -0.619]
O1P = [7.401  -1.917  -1.218]
O2P = [7.280  -3.934   0.195]
O3  = [8.396  -3.995  -1.948]
O5  = [9.600  -2.800  -0.121]

O1P_to_O2P = O2P - O1P                    # -0.12100  -2.01700   1.41300
O2P_to_O5 = O5 - O2P                      #  2.32000   1.13400  -0.31600
O1P_O2P_O5 = cross(O1P_to_O2P, O2P_to_O5) # -0.96497   3.23992   4.54223
P_to_O3 = O3 - P                          #  0.23300  -0.92600  -1.32900

direction = dot(O1P_O2P_O5, P_to_O3)    # -9.2616 < 0: O1P/O2P mislabeled
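The same check can be packaged into a small function. Below is a minimal Octave sketch of my own (not part of 3DNA); the function name and arguments are hypothetical, and each argument is a 1x3 vector of coordinates taken from the PDB file:

function ok = o1p_o2p_ok(P, O1P, O2P, O3, O5)
  % true if the O1P/O2P labels follow the right-hand rule described above
  n  = cross(O2P - O1P, O5 - O2P);   % normal of the O1P-->O2P-->O5' circuit
  ok = dot(n, O3 - P) > 0;           % positive projection onto P-->O3' means correct labels
end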

The O1P/O2P labeling issue is just a little detail I came across while developing 3DNA. Nevertheless, it serves as an excellent example of the subtleties that must be taken care of in scientific programming.

Please note that as of 2008, in the remediated PDB/NDB entry adh026, the mislabeled O1P/O2P pair has been corrected. More generally, O1P/O2P atoms have now been renamed OP1/OP2, respectively. 3DNA v2.0 takes care of such naming changes internally; for generated PDB files, however, 3DNA still adopts the conventional O1P/O2P labeling.

Saturday, November 6, 2010

Transparency in the peer-review process of scientific papers

In the Nov. 4, 2010 issue of Nature, there is an interesting Comment, titled "Transparency showcases strength of peer review", by Bernd Pulverer, head of scientific publications at the European Molecular Biology Organization and chief editor of The EMBO Journal. In this article, Pulverer "reflects on his experience at The EMBO Journal of publishing referees’ reports, authors’ responses and editors’ comments alongside papers."

The peer-review process of scientific articles has traditionally been a "black box": (anonymous) reviewers' reports, editors' comments, and authors' responses – extremely valuable information in shaping the final form of published papers – are all hidden from public view. In the Internet era, technology (e.g., online space) is no longer an issue. Now The EMBO Journal has led the way, and "the experience has been overwhelmingly positive." Hopefully, other leading journals (e.g., Nature and Science) will follow this example. After all, making the peer-review process transparent is an excellent means of increasing the accountability of science and scientific publications.

Overall, this article is well-written, succinct and logical, and it touches on an important topic in scientific publication. Over the past few days, I have read several "Review Process Files" accompanying papers I am interested in, e.g., "Recognition of the amber UAG stop codon by release factor RF1", and found them highly revealing.

Saturday, October 30, 2010

Publication of scientific programming code

Recently on the Nature website, I read with great interest a news article, titled "Publish your computer code: it is good enough", by Nick Barnes, a professional software engineer:
Freely provided working code — whatever its quality — improves programming and enables others to engage with your research
Clearly the author knows the "trade secret" in scientific programming. He lists several common reasons why scientists are reluctant to share their source code, and then provides his responses:
  1. The code is low quality — "software in all trades is written to be good enough for the job intended". All software has bugs. Sharing code would help improve the code itself and advance the research field.
  2. Not a common practice — this is going to change or is already changing.
  3. Demand for support — "Nobody is entitled to demand technical support for freely provided code."
  4. Intellectual property issues — The most valuable part "lies in your expertise"; code not backed by skilled experts is called abandonware. (I cannot agree more with this point.)
  5. Polishing code takes time/effort — no need to; just supply the original code used in your publication, as supplementary material or on a website.
As is evident from the many comments, this essay resonates well with the community. As an active computational scientist for over a decade, I share mostly the same opinions. Essentially, transparency of source code ensures the repeatability of scientific publications. In the field of computational biology (bioinformatics), it is virtually impossible to reproduce a published figure/table exactly without direct access to the details, including the source code.

Saturday, October 23, 2010

Chi (χ) torsion angle characterizes base/sugar relative orientation

Except for pseudouridine, a nucleoside in DNA/RNA contains an N-glycosidic bond that connects the base to the sugar. The chi (χ) torsion angle, which characterizes the relative base/sugar orientation, is defined by O4'-C1'-N1-C2 for pyrimidines (C, T and U), and O4'-C1'-N9-C4 for purines (A and G).

Normally (as in A- and B-form DNA/RNA duplexes), χ falls in the range of +90° to +180° or –90° to –180° (equivalently, 180° to 270°), corresponding to the anti conformation (Figure below, top). Occasionally, χ takes values in the range of –90° to +90°, corresponding to the syn conformation (Figure below, bottom). Note that in left-handed Z-DNA with a CG-repeating sequence, the purine G is in the syn conformation whilst the pyrimidine C is anti.
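To make the ranges concrete, here is a minimal Octave sketch of my own (not part of 3DNA) that classifies a χ value given in degrees in the -180° to +180° convention:

chi = -157.2;                     % hypothetical value, typical of the anti range
if (chi > -90 && chi < 90)
  conformation = 'syn'
else
  conformation = 'anti'
end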




Presumably, the χ-related anti/syn conformation is a very basic and simple concept. In essence, though, the N-glycosidic bond and the corresponding χ torsion angle illustrate that the base and sugar are two separate entities, i.e., there is an internal degree of freedom between them. In this respect, it is worth noting that the Leontis-Westhof sugar edge for base-pair classification corresponds to the anti form only. When a base is flipped over into the syn conformation, the "sugar edge", defined in connection with the minor (shallow) groove side of a nitrogenous base, simply does not exist.

Base-flipping (the anti/syn conformation switch) is one of the factors associated with the two possible relative orientations in a base pair, characterized explicitly in 3DNA as type M+N or M–N since the 2003 NAR paper (Figure 2, linked below). I reemphasized this distinction in our 2010 GpU dinucleotide platform paper (in particular, see supplementary Figure S2). Unfortunately, this subtle (but crucial, in my opinion) point has never been taken seriously (or at all) by the RNA community, even with 3DNA's wide adoption. However, as people get to know 3DNA better and treat RNA base-pair classification more rigorously, I have no doubt they will begin to appreciate the simplicity of this explicit distinction and the resultant full quantification of each and every possible base pair using standard geometric parameters.




On a related issue, current versions of 3DNA (v1.5 and v2.0) output only the χ torsion angle without providing the anti/syn classification. This defect, and many others, will hopefully be rectified in future releases of 3DNA.

Friday, October 15, 2010

Improving the design of existing code by refactoring

Another software engineering book I read recently is "Refactoring: Improving the Design of Existing Code" by Martin Fowler. According to the author,
Refactoring is the process of changing a software system in such a way that it does not alter the external behavior of the code yet improves its internal structure. It is a disciplined way to clean up code that minimizes the chances of introducing bugs. In essence when you refactor you are improving the design of the code after it has been written.

The book is practical in nature; it not only explains the principles but also provides a detailed account of over 70 commonly used refactorings. As vividly explained by the author in the first paragraph of Chapter 1, "Refactoring, a First Example", "it is with examples that I [the author] can see what is going on." This approach fits my style perfectly: to really understand a topic, I always find a worked example far more effective than general principles. While the examples in the book are illustrated in Java, the basic ideas apply well to other object-oriented or even procedural languages (such as C).

Over the years since I left Dr. Olson's laboratory at Rutgers, I have been maintaining and continuously refining 3DNA. I have taken each user's question as an opportunity to fix bugs and improve its design, thus making the code more robust and efficient. The majority of my effort, as I now realize, has been "refactoring" existing 3DNA code to make it easier to maintain and extend. Reading through this book has given me the chance to put my practices into a broader context. I will surely take advantage of some refactoring examples from the book for further refinements of 3DNA.

Wednesday, October 13, 2010

NSMB editorial: "Go figure"

In the October 2010 issue of Nature Structural & Molecular Biology (NSMB, Vol. 17, No. 10) there is another interesting one-page editorial, titled "Go figure", which provides tips on how to make a scientific figure that may be worth 1,000 words:
A picture may be worth a thousand words, but ensuring that those words make sense is important, especially in the context of a scientific figure. Here are some tips for making your figures count.
A recap of the tips is given below; by and large, they all follow conventional wisdom:
  • General considerations: Each figure should make just one point and be self-explanatory.
  • See guidelines. "At all stages, the figures should be clear and legible."
  • How many figures? The figures should complement the Results section, and be included only when necessary.
  • How many panels? Better only one; multiple panels "should be logically connected."
  • What’s in a label? Keep it succinct, but make the figure self-explanatory.
  • Getting colorful. Use color wisely and consistently.
  • A legendary figure. The figure legend should be concise and informative.
  • A model paper. Better have a figure (at the end) of the final model that conveys "the big picture". Honestly, I do not quite get this point.
As pointed out by the author, "These are just a few guidelines and suggestions for handling figures." Overall, "simplicity rules in scientific figures, as in life." I guess no one would argue with such general advice. However, it would be even more helpful to illustrate the points with concrete examples (I know that seems to be beyond the scope of a one-page editorial).

Thursday, September 30, 2010

Further details of the DNA story revealed by Crick's lost correspondence

In the September 30 issue of Nature (Vol. 467, pp519-524), there is an interesting account of "The lost correspondence of Francis Crick" by Gann & Witkowski. The newly found letters, mostly between Crick and Wilkins, unveil further background information on the exciting DNA story. As the authors put it, "Strained relationships and vivid personalities leap off the pages."

I read Watson's "The Double Helix" book a while ago, and overall I am quite familiar with the DNA story. Still, I found this account fascinating: it provides a "CAST LIST" in "The search for the structure of DNA" with photos (p521); and it succinctly summarizes the relationships among the key players. In science, no other story shows more dramatically the collaborative and competitive nature among scientists working on similar projects.


Franklin’s X-ray diffraction photograph 51 of B-form DNA (see figure above, from Wikipedia), with its unambiguous evidence that DNA was helical, proved crucial for Watson and Crick to determine the structure of DNA. Indeed, the Watson-Crick DNA model corresponds to the B-form DNA, with its base-pairs in the middle, parallel to each other and perpendicular to the linear helical axis.

From this Nature account, however, I noticed for the first time a subtle detail: when B-form DNA photograph 51 was shown to Watson by Wilkins in early 1953, Franklin also already had the A-form DNA diffraction pattern. According to the authors,

It was the A-structure diffraction pattern that had led Franklin away from believing that DNA, in that form at least, was helical, despite her already having produced the most persuasive helical pictures of the B structure — including photograph 51. The crystalline DNA gave better quality diffraction data, more suited to her painstaking, quantitative approach, and so she focused on the A form during 1952. It was at this time that she and Gosling made a handwritten, black-edged funeral card announcing the death of “DNA Helix (crystalline)”.

When Crick had the opportunity to look at the A-form DNA diffraction picture, on 5 June 1953, he wrote (to Wilkins):
This is the first time I have had an opportunity for a detailed study of the picture of Structure A, and I must say I am glad I didn’t see it earlier, as it would have worried me considerably.

I am reading "Blink: The Power of Thinking Without Thinking", a book by Malcolm Gladwell. The above case serves as a vivid example.

Sunday, September 19, 2010

New NDB entries have IDs starting with the prefix NA?

Recently, while reading the article titled "Designing Triple Helical Fragments: The Crystal Structure of the Undecamer d(TGGCCTTAAGG) Mimicking T.AT Base Triplets" by Van Hecke (Crystal Growth & Design), I noticed the following:
The atomic coordinates and structure factors have been deposited in the Protein Data Bank41 and Nucleic Acid Database42 (PDB and NDB entry codes 3L1Q and NA0392, respectively).
The NDB id NA0392 reminded me of an email communication I had with Dr. Olson early this year when she told me of the id change of new NDB entries. Occupied with other issues, I did not pay much attention to this point until recently.

Over the years, the NDB has established itself prominently as "a repository of three-dimensional structural information about nucleic acids". Traditionally, an NDB id has some associated "meanings", with noticeable exceptions, as discussed one year ago in my blog post "PDB id vs NDB id". Thus, it is not surprising that the NDB decided to change the naming of new ids. However, I could not find any announcement of the id change on the NDB website; a Google search on "NDB id" did not uncover anything new either. Luckily, the NDB provides search by release date ("Released Since"). After a few tries, I traced the start of the new id policy to around March 2010.

The new NDB id convention appears to be NA (presumably standing for Nucleic Acid) followed by 4 digits. As more entries become available, it is not hard to imagine that the number of digits will have to be expanded to 5, 6, or even more. Naturally, the PDB id, with its fixed 4-character length (up to now), is (far) more consistent than the NDB id.

Sunday, September 12, 2010

"Code Complete", a practical handbook of software construction

Over the summer, I read quite a few books on C programming, and more generally on software construction. Among them, the book titled "Code Complete" (2nd edition) by Steve McConnell is the most comprehensive: with over 900 pages, it certainly serves as "a practical handbook of software construction".

I have been writing scientific software applications for over twenty years, using a variety of programming languages. Gradually, writing code per se is taking less time; nowadays by far the most time-consuming parts are (1) understanding a scientific topic thoroughly in order to implement it in new code, and (2) maintaining/refining/adapting an existing codebase to meet the needs of its user community (e.g., 3DNA). It is for the latter that I've found "Code Complete" very helpful.

As an example, following the book's recommendations (1) "Use boolean variables to simplify complicated tests" (pp301-2) and (2) "Simplify complicated tests with boolean function calls" (p359), I've easily improved the readability of some parts of my code.
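As a flavor of recommendation (1), here is a minimal Octave-style sketch of my own (the variable names are hypothetical, not from 3DNA's C code): pulling a compound condition into well-named boolean variables makes the intent of the test obvious.

base = 'G';  chi = -120;                        % hypothetical inputs
is_purine = strcmp(base, 'A') || strcmp(base, 'G');
is_anti   = (chi <= -90) || (chi >= 90);
if (is_purine && is_anti)
  disp('purine in the anti conformation');      % the common case, now self-documenting
end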

Overall, the book is well-written and full of practical advice; I enjoyed reading it.

Sunday, September 5, 2010

Identification of C-H...N/O H-bonds using 3DNA

Recently, I came across an interesting article by Kiliszek et al., titled "Atomic resolution structure of CAG RNA repeats: structural insights and implications for the trinucleotide repeat expansion diseases". In addition to its biological implications, this paper uses 3DNA to deal with non-canonical-base-pair-containing helical structures in a sensible way:
The helical parameters were calculated using 3DNA (29). Sequence-independent measures were used, based on vectors connecting the C1' atoms of the paired residues, to avoid computational artefacts arising from non-canonical base pairing.
Another significant point, which is the focus of this post, is the observation that "All the adenosines are in the anti-conformation and the only interaction within each A-A pair is a single C2-H2...N1 hydrogen bond." (Figure below) Given the 0.95 Å ultrahigh resolution of structure 3nj6/na0608, it is likely that this type of A-A pair is real.


The find_pair program in the currently distributed versions of 3DNA (v2.0 and before), however, does not identify this A-A pair (thus the corresponding NDB list of "Base Pair Step Parameters" is incomplete) for the following two reasons:
  1. No H-bond exists between N/O base atoms – currently a requirement for a base-pair.
  2. By default, only N/O atoms are used in defining H-bonds (see tag hb_atoms in file "misc_3dna.par"). Nevertheless, by adding C as a possible H-bond atom and manually editing the find_pair-generated input file for analyze, the 3DNA structural parameters can be calculated as usual.
I have updated find_pair in 3DNA to identify such C-H...N/O H-bonds automatically (see below for entry 3nj6). After further refinement and validation, future releases of 3DNA will make this functionality available.

    1   95  #    1 | ...1>A:...1_:[..G]G-----C[..C]:..10_:A<...2
    2   94  #    2 | ...1>A:...2_:[..G]G-----C[..C]:...9_:A<...2
    3   93  #    3 | ...1>A:...3_:[..C]C-----G[..G]:...8_:A<...2
    4   92  #    4 | ...1>A:...4_:[..A]A-**--A[..A]:...7_:A<...2
    5   91  #    5 | ...1>A:...5_:[..G]G-----C[..C]:...6_:A<...2
    6   90  #    6 | ...1>A:...6_:[..C]C-----G[..G]:...5_:A<...2
    7   89  #    7 | ...1>A:...7_:[..A]A-**--A[..A]:...4_:A<...2
    8   88  #    8 | ...1>A:...8_:[..G]G-----C[..C]:...3_:A<...2
    9   87  #    9 | ...1>A:...9_:[..C]C-----G[..G]:...2_:A<...2
   10   86  #   10 | ...1>A:..10_:[..C]C-----G[..G]:...1_:A<...2
##### Criteria: 4.00  0.00  15.00  2.50  65.00  4.50  7.50   [ O N C]
##### 2 non-Watson-Crick base-pairs, and 1 helix (0 isolated bps)
##### Helix #1 (10): 1 - 10

The Jmol paper: a paradigm shift in crystallographic visualization

From the Jmol mailing list, I noticed Prof. Robert Hanson's announcement titled "Jmol and crystallography: the paper". Finally, an "official" paper has been published on Jmol, an increasingly popular molecular visualization tool.

The paper is categorized under "teaching and education" and is summarized as follows:
Recent advances in molecular and crystallographic visualization methods are allowing instructors unprecedented opportunities to enhance student learning using virtual models within a familiar web-browser context. In step with these advances, the latest versions of the Jmol molecular visualization applet offer capabilities that hold potential for revolutionizing the way students learn about symmetry, uncertainty and the overall enterprise of molecular structure determination.
Nevertheless, Jmol is certainly as useful for research as other commonly used molecular visualization tools, such as PyMOL and RasMol. Specifically, Jmol applets have become dominant in web-based applications/resources (including the RCSB PDB). In fact, the HTML version of the paper itself makes extensive use of the Jmol applet for interactive manipulation of over 20 figures.

Over the past couple of years, I have witnessed the many new features added to Jmol, driven by an active and friendly user community. I have been particularly impressed by Prof. Hanson's quick and to-the-point responses to users' feature requests and bug reports, including my request for Jmol's "Support of 'alchemy' format for rectangular schematic base-pair geometry". Overall, the paper presents a nice overview of some of Jmol's new functionality, and could become highly cited.

Reading through the paper, I was just a bit surprised that PyMOL is not mentioned. Nowadays, Jmol and PyMOL are apparently the top two molecular graphics software tools, with common and complementary functionality.

Saturday, August 21, 2010

Calculation of the chi (χ) torsion angle of pseudouridines in RNA structures

In RNA structures, chi (χ) is defined as the torsion angle about the N-glycosidic bond between the ribose sugar and the canonical A/C/G/U bases or their modified variants. Specifically, for pyrimidines (C and U), χ is defined by O4'-C1'-N1-C2; and for purines (A and G) by O4'-C1'-N9-C4. See the following figure from the IMB Jena Image Library:



Pseudouridine (5-ribosyluracil, PSU) was the first identified modified nucleoside in RNA and is the most abundant. PSU is unique in that it has a C-glycosidic bond instead of the N-glycosidic bond common to all other nucleosides, canonical or modified. It thus poses a problem as to how to calculate the χ torsion angle: should it be O4'-C1'-C5-C4 reflecting the actual glycosidic bond connection, or should the conventional definition O4'-C1'-N1-C2 still be applied literally? As a concrete example, the figure below shows the (slightly) different numerical values, as given by the two definitions, for PSU 6 on chain A of PDB/NDB entry 3cgp/ar0093.


Needless to say, the definition of the χ torsion angle of PSU in RNA structures is a very subtle/minor point, and I am not aware of any discussion of this issue in the literature (I'd appreciate your sharing related information in the comments). In 3DNA, PSU is identified explicitly, and χ is defined by O4'-C1'-C5-C4. In the NDB and a couple of other tools I've played with, χ for PSU is defined by O4'-C1'-N1-C2. Again using 3cgp/ar0093 (figure above) as an example, 3DNA gives -162.7°, whilst the NDB gives -163.9°. Hopefully, this post will help clear up the confusion for those who care about such little details.
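For reference, the underlying torsion-angle calculation itself is straightforward. Here is a minimal Octave sketch of my own (not 3DNA's code, e.g., saved as torsion.m); p1 to p4 are 1x3 coordinate vectors of the four atoms in order, e.g., O4', C1', and the two base atoms of whichever χ definition is adopted:

function ang = torsion(p1, p2, p3, p4)
  % torsion angle (degrees) for atoms p1-p2-p3-p4, about the p2-p3 bond;
  % standard sign convention, result in the range -180 to +180
  b1 = p2 - p1;  b2 = p3 - p2;  b3 = p4 - p3;
  ang = atan2(norm(b2) * dot(b1, cross(b2, b3)), ...
              dot(cross(b1, b2), cross(b2, b3))) * 180 / pi;
end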

Saturday, August 14, 2010

PDB format, how many variants are there?

In the field of structural biology of macromolecules (proteins and nucleic acids), the PDB format is still the primary one, even though it has shown its age/limitations, and mmCIF and PDBML have been introduced as alternatives.

In 3DNA, PDB is currently the only accepted format for the analysis routines (analyze/cehs), the primary format for rebuilt structures (rebuild/fiber), and various other output structure files (e.g., bestpairs.pdb etc.). By and large, 3DNA follows the PDB format documented at the RCSB website. Over the years of maintaining and supporting 3DNA and its forum, however, I have come across other variants of the PDB format: software packages customize PDB in subtle, and sometimes substantial, ways that cause problems for 3DNA. Some examples follow:
  • Recently, a user tried to analyze an RNA duplex using w3DNA, but found that "the server seems not to recognize the file". It turned out that the so-called PDB file did not have the residue name/number and chain id properly aligned in columns. For example, among other issues, the chain id in the problematic PDB was not at column #22 as specified by the standard (see the sketch after this list).
  • A couple of months back, another user asked how to handle the AMBER-introduced ribonucleotide abbreviations (RA/RC/RG/RU). Compared to the above "PDB" variant, the AMBER PDB is well-behaved, and can easily be recognized by 3DNA after introducing the corresponding one-letter matches in file baselist.dat. Interestingly, a couple of years ago, DA/DC/DG/DT was introduced into the PDB format to distinguish deoxyribonucleotides (DNA) from ribonucleotides A/C/G/U (RNA). The AMBER convention, by using RA/RC/RG/RU for RNA, appears to be more symmetric.
  • From my (limited) experience in helping a 3DNA user several years ago, I know that CHARMM uses three-letter residue names for DNA (e.g., ADE/CYT/GUA/THY). There is actually a short Perl utility program (x3dna2charmm_pdb) in 3DNA for converting 3DNA generated PDB file to that recognized by CHARMM.
  • As noted in another post, "PDB ATOM coordinates record", Babel/Open Babel converted PDB files (mostly of small molecules) do not follow atom naming conventions.
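To illustrate the column-alignment point, here is a minimal Octave sketch of my own (not how 3DNA parses PDB files internally) that picks apart one ATOM record strictly by the fixed columns of the standard format; the record is the second-phosphate O1P line from adh026 shown earlier in this blog, truncated after the coordinates:

rec = 'ATOM     21  O1P   G A   2       7.401  -1.917  -1.218';
atom_name = strtrim(rec(13:16));     % columns 13-16: atom name      -> 'O1P'
res_name  = strtrim(rec(18:20));     % columns 18-20: residue name   -> 'G'
chain_id  = rec(22);                 % column  22   : chain id       -> 'A'
res_seq   = str2double(rec(23:26));  % columns 23-26: residue number -> 2
xyz = [str2double(rec(31:38)), str2double(rec(39:46)), str2double(rec(47:54))]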
As widely used and standard as the PDB format is, there are still many variants in common use. Thus, when you run into a problem related to a PDB file, it helps to check whether the file conforms to the standard. With 3DNA, I've always tried to accommodate as many PDB variants as possible, so that users can run 3DNA on them transparently.

Saturday, August 7, 2010

Calculation of DNA bending angle

Once in a while, I receive the question (via email or on the 3DNA forum) of how to calculate a DNA bending angle, mostly in connection with DNA-protein complexes. This question certainly qualifies as an FAQ; indeed, I did consider writing an entry on it for the 3DNA website. However, I've hesitated to do so because the topic is more subtle and complicated than it appears.

On its face, an angle is defined by two vectors; let's call them a and b. If each is normalized, then the angle (in degrees) between them is acos(dot(a, b)) * 180/pi. Geometrically, after moving the tails of the two vectors to the same position (e.g., the origin), the vectors normally define a plane, unless a and b are strictly parallel (0°) or anti-parallel (180°).

DNA structures are three-dimensional, and normally far more complicated than a single number can quantify. The concept of a DNA bending angle, as I understand it, is only applicable to DNA structures with two relatively straight fragments (as in CAP-DNA complexes). In such situations, one can fit a least-squares (ls) linear helical axis to each of the two fragments and calculate the angle between them. To this end, 3DNA outputs the following section when it judges that the input structure is not strongly curved; here 355d/bdl084, which is distributed with 3DNA, is used as an example:
Global linear helical axis defined by equivalent C1' and RN9/YN1 atom pairs                                    
Deviation from regular linear helix: 3.30(0.52)
Helix:    -0.127  -0.275  -0.953
HETATM 9998  XS    X X 999      17.536  25.713  25.665
HETATM 9999  XE    X X 999      12.911  15.677  -9.080
Average and standard deviation of helix radius:
P: 9.42(0.82), O4': 6.37(0.85),  C1': 5.85(0.86)
Here the Helix: line gives the normalized vector along the "best-fit" helical axis. The two HETATM records provide the two end points of the helix, and they are directly related to the Helix: line by a simple relation. Following the above example, we have (Octave/Matlab code):

XE = [12.911 15.677 -9.080];
XS = [17.536 25.713 25.665];

dd = XE - XS
% -4.6250 -10.0360 -34.7450

Helix = dd / norm(dd)
% -0.12685 -0.27526 -0.95296 ==> [-0.127 -0.275 -0.953]

The two HETATM records can easily be added to the original PDB file to display the helical axis in a molecular graphics program (e.g., RasMol, Jmol, PyMOL). Moreover, the two helix vectors can be used to reorient the original PDB structure into a view in which one helical fragment lies along the x-axis and the other in the xy-plane. As documented in detail in recipe #4, "Automatic identification of double-helical regions in a DNA–RNA junction" (2008 3DNA Nature Protocols paper), "The chosen view allows for easy visualization and protractor measurement of the overall bending angle between the two relatively straight helices."
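Given the two fitted axes – one per relatively straight fragment, each taken from the Helix: line of a separate 3DNA run – the bending angle follows directly. A minimal Octave sketch, where helix1 is the vector from the output above and helix2 is a made-up axis for a hypothetical second fragment:

helix1 = [-0.127 -0.275 -0.953];     % from the Helix: line above
helix2 = [ 0.401  0.116 -0.909];     % hypothetical axis of the other fragment
helix1 = helix1 / norm(helix1);      % ensure both are unit vectors
helix2 = helix2 / norm(helix2);
bending_angle = acos(dot(helix1, helix2)) * 180 / pi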

The following points are well worth noting:
  • The ls fitting procedure used in 3DNA follows SCHNAaP, which was based on the algorithm in the well-known NewHelix program, maintained by Dr. Richard Dickerson up to the 1990s. While fitting a global linear helical axis to a strongly curved DNA structure clearly makes little sense for the derived parameters (NewHelix itself has been replaced by FreeHelix, also from Dickerson), I do believe it is meaningful to fit a linear helix to a relatively straight DNA fragment. That is why I have kept this functionality in SCHNArP and 3DNA. The bending-angle calculation is a case in point: it provides an "intuitive" way for biologists to understand how the bending angle is calculated, since the angle can actually be measured directly.
  • Instead of directly ls-fitting a linear helical axis with 3DNA, one can alternatively superimpose a regular fiber model onto the DNA fragment, and then derive the straight helical axis from the fitted coordinates. The two approaches normally give slightly different numerical values, as would be expected.
  • Overall, the bending angle is (at most) an approximate measure of DNA curvature. In my opinion, the concept is only applicable for comparing a set of structures, each with two relatively straight fragments. Even in such cases, the spatial relationship between two helical fragments is more complicated than a single (bending) angle can quantify. In particular, it is not meaningful to make a selling point of slight differences in DNA bending angles.

Friday, August 6, 2010

Genomic code for nucleosome positioning?

Over the past few years, research on nucleosome positioning has been a hot topic: Segal et al., through a series of high-profile publications, have proposed that "the intrinsic nucleosome sequence preferences contribute substantially to nucleosome organization and chromatin function in vivo" (i.e., there exists a "genomic code") and have developed bioinformatics tools to predict patterns of nucleosome positions from genomic DNA sequence. However, as is clear from the title of a paper by Zhang et al., "Intrinsic histone-DNA interactions are not the major determinant of nucleosome positions in vivo", it is highly debatable whether the nucleosome sequence preferences are strong enough to be called a "code" in the canonical sense of the genetic code.

While I am (generally) aware of the controversy about the "genomic code", my understanding of the issue became much clearer after reading the correspondence between the two sides published in the August 2010 issue (Vol. 17, No. 8) of Nature Structural & Molecular Biology [NSMB]:
  • "Nucleosome sequence preferences influence in vivo nucleosome organization" (pp918 - 920) by Kaplan … and Segal
  • "Evidence against a genomic code for nucleosome positioning" Reply to “Nucleosome sequence preferences influence in vivo nucleosome organization” (pp920 - 923) by Zhang … and Struhl
  • "A preoccupied position on nucleosomes" (p923) by Pugh
Reading carefully through the text, it becomes clear how much the little details matter: 20-bp vs. 40-bp windows? How strong must the preference be to be claimed as a "code"? Pugh clarifies another source of confusion, the terminology of nucleosome ‘occupancy’ vs. ‘positioning’. In my understanding, the dispute over such details itself speaks strongly against the existence of a "genomic code".

It should be noted that early work by Drew & Travers (JMB85) and Satchwell et al. (JMB86) provided evidence that the rotational positioning of DNA about the histone octamer is regulated by DNA sequence-dependent bending preferences. Goodsell & Dickerson quantified the Satchwell trinucleotide model by assigning a set of roll angles to predict DNA bending from base sequence. This bending model, along with five others, was implemented in SCHNArP for building approximate 3D DNA structures (with fixed values for the other parameters, such as slide and twist). However, this functionality was dropped from 3DNA, simply because of the arbitrariness/subtleties in choosing the parameters.

Sunday, August 1, 2010

Principal axes of inertia as a means to reorient a molecule

In structural biology, it is often helpful to reorient a molecule into a specific view for easy visualization and comparison. With no a priori knowledge of a particular feature, the (objective) principal axes of inertia can be used as a means to automatically reset a structure into its most extended orientation (and two other orthogonal views). In 3DNA, rotate_mol has such functionality, which is exploited by blocview to generate the characteristic, simplified molecular images adopted by the RCSB PDB and NDB.

Recently, the thread (initiated by Dr. Pascal Auffinger) at the 3DNA forum titled "rotate_mol ambiguous orientations" has led me to think about the topic more deeply. As always, the issue is best illustrated with an example. The file 355d_A6_T19.pdb contains the coordinates of the base pair A6-T19 of PDB entry 355d. Let xyz be an Octave (Matlab) variable containing the coordinates of all atoms in file 355d_A6_T19.pdb; the following m-file gives the detailed procedure:
xyz = [
13.954 12.080 15.071
13.092 11.246 15.953
14.769 11.453 14.020
13.056 13.202 14.406
12.184 13.965 15.222
11.308 14.846 14.376
12.033 15.991 13.896
10.654 14.188 13.168
9.262 14.480 13.256
11.373 14.820 11.974
11.810 16.185 12.511
13.051 16.742 11.953
14.253 16.097 11.802
15.197 16.853 11.305
14.585 18.084 11.106
15.066 19.322 10.605
16.329 19.520 10.209
14.198 20.357 10.535
12.938 20.162 10.946
12.370 19.048 11.444
13.259 18.034 11.497
14.865 29.619 5.710
14.172 30.844 5.281
16.107 29.695 6.515
13.846 28.701 6.510
12.684 28.174 5.864
11.909 27.311 6.826
12.626 26.090 7.129
11.584 27.960 8.173
10.189 27.833 8.395
12.409 27.166 9.184
12.544 25.808 8.521
13.716 24.979 8.867
13.496 23.688 9.310
12.381 23.202 9.467
14.631 22.971 9.551
15.943 23.388 9.393
16.852 22.613 9.632
16.107 24.759 8.938
17.488 25.309 8.745
14.995 25.475 8.711 ]

cov_mtx = cov(xyz)

% cov_mtx =
% 3.4094 2.3860 -1.4222
% 2.3860 33.7089 -15.7175
% -1.4222 -15.7175 7.8786

[v, lambda] = eig(cov_mtx)

% v =
% 0.094830 0.992835 -0.072709
% 0.419520 -0.106092 -0.901525
% 0.902779 -0.054989 0.426575

% lambda =
% 0.42535 0 0
% 0 3.23319 0
% 0 0 41.33836

x = v(:, 2);
y = v(:, 3);
z = cross(x, y);

% z =
% -0.094830
% -0.419520
% -0.902779

num = size(xyz, 1);
xyz2 = (xyz - ones(num, 1) * mean(xyz)) * [x y z]

% xyz2 =
% 1.148136 10.032849 -0.514731
% 0.332293 11.223635 -0.879359
% 2.081609 10.090517 0.619843
% .................................
% 2.277813 -4.170312 -0.501254
% 3.601180 -4.848890 -0.688714
% 1.110301 -4.831784 -0.491249
Among other subtleties, note that:
  1. lambda 3 has the largest value (41.33836), i.e., with the biggest variance, and its corresponding eigenvector is set as the vertical y-axis.
  2. lambda 2 shows intermediate variation (3.23319), and its corresponding eigenvector is set as the horizontal x-axis.
  3. The z-axis is derived as the cross-product of x- and y-axes to ensure a right-handed frame. Note that it is the opposite of eigenvector 1.
  4. The original xyz coordinates are shifted to their geometric center [xyz - ones(num, 1) * mean(xyz)], and then projected onto the preferred frame to get the transformed coordinates (xyz2), with the molecule in its most extended view (vertical, then horizontal).

Saturday, July 24, 2010

3DNA for the analysis of molecular dynamics simulations

To my surprise and delight, 3DNA has been increasingly used in the analysis of trajectories from molecular dynamics (MD) simulations of nucleic acid structures. While I have known of CHARMM/AMBER for a long time and understand the basic principles of MD simulations, I have never performed any simulations (yet). Over the years, I have been mostly interested in methods development and algorithm implementation for a better understanding of nucleic acid structures, and have always used NDB/PDB entries for my tests/applications. Thus, unlike Curves+, 3DNA lacks direct support for the analysis of MD trajectories.

From the MD-related questions I have received over email and on the 3DNA forum, I sense that users do not have much trouble applying 3DNA to the task. One issue does stand out, however: the use of find_pair on each and every structure in a trajectory to prepare an input file that contains the base-pairing information analyze needs to calculate the various parameters. As an example, a recent article by Ricci et al., "Molecular Dynamics of DNA: Comparison of Force Fields and Terminal Nucleotide Definitions" (J Phys Chem B. 2010 Jul 8. [Epub ahead of print]), says that:
It is noteworthy that, because the DNA structure endured strong deformations in these simulations, 3DNA could not perform the analysis of the angles for all the 5000 snapshots. Actually, for system 3 the calculation of the backbone parameters could be performed in only 59% of the snapshots, whereas in system 4 only 18% of the configurations allowed this calculation.
Dr. Netz, the corresponding author, confirmed to me that:
I don't think this is a 3DNA problem, but is rather an effect of the distortion caused by the problems in parameterization of the GROMOS forcefield.
Essentially, though, the problem lies in the coupled use of find_pair/analyze: for strongly deformed DNA structures along the simulation trajectories, some base pairs melt out and are thus no longer recognized by find_pair. One does not need to run find_pair on each structure, however; simply prepare a template input file for analyze, with the help of find_pair and probably some manual editing, and then all the structures can be analyzed, no matter how distorted they are. Furthermore, skipping the find_pair step should speed up the analysis of the thousands of structures typical of MD simulations.

In fact, 3DNA (from v1.5) has a Perl script named nmr_strs that implements the above idea in a preliminary form. The script is intended to serve as a starting point for the analysis of multiple similar structures, in an NMR ensemble or from MD simulations. See 3DNA wiki page on nmr_strs for details.

In the long run, a better integration between 3DNA and AMBER or other MD packages would solve this problem upstream, thus making users' lives (downstream) much easier. Until that happens, I hope this post helps clarify the issue somewhat in this important application area of 3DNA. As always, I greatly value your comments.

Friday, July 16, 2010

PDB fingerprints make salient features of PDB entries obvious at a glance

In the past couple of days, I have followed with great interest the thread "PDBprints - salient, at-a-glance info about PDB entries" in the CCP4BB. The idea behind PDBprints (short for PDB fingerprints) is very simple: now with over 66 thousand entries, no one could possibly be intimately familiar with many of the structures in the PDB. Yet "it would be very useful to have a way of obtaining at-a-glance information about the salient features of one or more PDB entries." In this scheme, key features (e.g., experimental method, species, protein/NA/ligand components, published or not) are represented by stylized icons called PDBlogos; a set of these organized in a specific order gives the PDBprints.

The initial announcement, posted on Thursday by Gerard Kleywegt (the current Director of PDBe), has been well received by the community. I am very impressed by the quality of the (over a dozen) pieces of feedback on icon coloring for the presence/absence of a feature, the proper icons for species, etc. I am also pleasantly surprised by the quick and concrete response Gerard posted today, addressing the various issues.

Over the years, I have taken PDB/NDB mostly as data repositories; I regularly download (update) nucleic-acid containing structures for local analyses. Among the three wwPDB sites (RCSB, PDBe and PDBj), I've only visited the RCSB site on a regular basis. Now I find more features are available from PDBe, which I will browse more in the future.

Saturday, July 3, 2010

In 3DNA, analyze and rebuild are two sides of the same coin

Recently, I noticed the following two papers:
Both cite 3DNA, and in both the combined use of its analyze/rebuild components has played a significant role.

When I first had access to the CEHS scheme, I was immediately attracted by its mathematical rigor, which allows for complete reversibility between DNA structural analysis and model rebuilding. Historically, CEHS was the initial seed of the SCHNAaP/SCHNArP programs, and analyze/rebuild in 3DNA were derived directly from them.

Frequently, I think of analyze/rebuild as two sides of the same coin: starting from a nucleic acid structure (e.g., a DNA double helix), the analyze program gives a set of base-pair (propeller, buckle, etc.) and step (roll, slide, etc.) parameters. These structural parameters can then be used to rebuild the structure, which is virtually identical in base geometry (i.e., leaving the sugar-phosphate backbone aside) to the original. Conversely, analyzing the rebuilt structure gives back exactly the same set of structural parameters describing the relative base geometry.

Reversibility is essentially a simple concept, but a very powerful one. Over the years, I have been glad to see more people taking advantage of this unique 3DNA feature in addressing real-world problems related to nucleic acid structures. I am confident we will see more such applications.

Sunday, June 27, 2010

Get all torsion angles in a nucleic acid structure

In the field of nucleic acid structure analysis, a commonly calculated set of parameters is the torsion angles. Specifically, they include the main chain and chi (χ) torsion angles defined as follows:
  • α: O3'(i-1)-P-O5'-C5'
  • β: P-O5'-C5'-C4'
  • γ: O5'-C5'-C4'-C3'
  • δ: C5'-C4'-C3'-O3'
  • ε: C4'-C3'-O3'-P(i+1)
  • ζ: C3'-O3'-P(i+1)-O5'(i+1)
  • χ: for pyrimidines (Y, i.e., T, U, C): O4'-C1'-N1-C2; for purines (R, i.e., A, G): O4'-C1'-N9-C4
A related set of parameters characterizes the sugar conformation:
  • ν0: C4'-O4'-C1'-C2'
  • ν1: O4'-C1'-C2'-C3'
  • ν2: C1'-C2'-C3'-C4'
  • ν3: C2'-C3'-C4'-O4'
  • ν4: C3'-C4'-O4'-C1'
  • tm: amplitude of pseudorotation of the sugar ring
  • P: phase angle of pseudorotation of the sugar ring
These torsion angles are clearly defined and are readily available from various informatics programs. Not surprisingly, 3DNA also provides the complete set of DNA/RNA backbone torsions, calculated robustly and efficiently. The key is the "-s" option of the find_pair program, which is nevertheless little used, mostly because it is not the default.

Using the Haloarcula marismortui 50S large ribosomal subunit as an example (1jj2), the output file 1jj2.outs from the following command contains all the above mentioned main chain and sugar conformational parameters:
find_pair -s 1jj2.pdb stdout | analyze
Please see the Jena "Nucleic acid backbone parameters" website for a diagram (based on Saenger's book) defining the various backbone torsion angles.

Sunday, June 20, 2010

Subscribing to mailing lists in daily digest mode

I am currently on quite a few mailing lists that are relevant to my research topics in a broad sense, including Jmol, pdb-l and CCP4. Over the years, I've found it very handy to keep informed about a field by following its (major) mailing list. However, I could easily get lost with so many posts each day on some active lists. So subscribing in daily digest mode, provided by most mailing-list management software, has become my norm: I receive only one (or a few) email(s) per day from a mailing list, and can easily browse through the subject lines to decide whether to read further on a specific topic.

Among the three lists mentioned above, Jmol is quite active, with several core contributors, most notably Prof. Robert Hanson. The PDB mailing list has unbelievably low volume, with less than one post per day. The CCP4 bulletin board is overall the best organized; I am amazed by the knowledge base built up from the posts there – it is a great resource in structural biology.

The AMBER mailing list and the Computational Chemistry List (CCL) are the other two lists I subscribed to before. However, I could not switch to the daily digest mode; soon my mailbox was flooded with many unrelated posts, so I had to unsubscribe from them.

Journal impact factor and individual researcher h-index

The June 17, 2010 issue of Nature (Vol. 465, No. 7300) has extensive discussions on assessing the impact (influence/significance) of a journal or an individual researcher using quantitative metrics. Not surprisingly, there is no consensus. Over the past 20 years, the field of bibliometrics (scientometrics) has seen a ten-fold explosion in publications. In fact, the title of the editorial is "Assessing assessment"; the question boils down to whether we should use a single quantitative metric or a combination of several, and which metrics to choose among the many possibilities.

Among the various viewpoints, I agree most with David Pendlebury, a citation analyst at Thomson Reuters: "No one enjoys being measured — unless he or she comes out on top." "Importantly, publication-based metrics provide an objective counterweight ... to bias of many kinds." "there are dangers [to] put too much faith in them. A quantitative profile should always be used to foster discussion, rather than to end it."

Among the many currently available metrics, it seems fair to say:
  • impact factor (IF), introduced in 1963, is the most important metric for measuring the impact of a journal. Nowadays, it is common to see the IF of a journal on its home page; for example, Nucleic Acids Research currently has an IF of 6.878. However, as emphasized by Anthony van Raan, director of the Centre for Science and Technology Studies at Leiden University in the Netherlands: "You should never use the journal impact factor to evaluate research performance for an article or for an individual — that is a mortal sin." Indeed, "In 2005, 89% of Nature’s impact factor was generated by 25% of the articles."
  • h-index, introduced in 2005 by Hirsch, is currently the most influential metric for quantifying the productivity and impact of an individual researcher. An h-index is "defined as the number of papers with citation number ≥h". It "gives an estimate of the importance, significance, and broad impact of a scientist’s cumulative research contributions." I found it very informative to read the original, 4-page-long PNAS paper to understand clearly why the h-index is defined that way and to be aware of some of its caveats (see the sketch after this list).
  • number of citations is undoubtedly the most objective metric to measure the impact of a publication.
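As promised above, here is a minimal Octave sketch of the h-index definition, using a made-up citation record:

citations = [23 15 12 8 4 2 0];               % hypothetical citation counts, one per paper
sorted    = sort(citations, 'descend');
h_index   = sum(sorted >= (1:numel(sorted)))  % largest h with h papers cited >= h times; gives 4 here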
Overall, it is easy to come up with a quantitative measure; the point is what that number really means. Along the same line, it is crucially important to know exactly how a metric is calculated. No metric can be perfect; however, a metric that is defined transparently and applied consistently is more objective and convincing than other means of assessment.

Friday, June 11, 2010

3DNA citations from mainland Chinese researchers

Over the past couple of weeks, I have been quite surprised to notice the following four papers, by researchers from mainland China, that cite 3DNA:
In more than ten years, this is the first time (to the best of my knowledge) that 3DNA has been cited by scientists from mainland China, and it so happens that four citations come in a row. I am very happy to see that structural biology is taking shape in China. Surely, I expect more 3DNA-citing papers from China in the coming years.

block_atom, a little used 3DNA utility script

Following my previous post, "How is a base-pair rectangular block defined in 3DNA?", I feel it is in order to mention a tiny Perl script named block_atom that seems to have been little used by the 3DNA user community.

The basic idea of block_atom is to generate an ALCHEMY data file that combines the atomic and block representations of a nucleic-acid-containing structure in PDB format. The ALCHEMY file can then be displayed interactively using RasMol (v2.6.4, by its original author) or Jmol.

Again, the point is best illustrated with an example. Here I am using 355d, the famous Drew-Dickerson B-DNA dodecamer solved at high resolution by Williams and colleagues. Let the input PDB file be 355d.pdb; the output file is then automatically named 355d.alc, which can be displayed in RasMol with the options '-alchemy -noconnect'.
block_atom 355d.pdb
rasmol -alchemy -noconnect 355d.alc
One view of the RasMol-rendered image is shown below; it shows clearly the minor groove (colored blue) of the DNA duplex and the various intra-base-pair deformations (e.g., propeller and buckle):
I initially wrote this utility script mostly for verification purposes, i.e., to make sure that the base reference frames are defined properly. For those who want to know more about 3DNA, having a look at the script (which is tiny) and understanding how it works could be a good exercise.

Saturday, June 5, 2010

How is a base-pair rectangular block defined in 3DNA?

One of 3DNA's unique features is its base-pair (and base) rectangular blocks, as shown in the figure below. Since such schematics were introduced by Calladine and Drew in their Understanding DNA book, I normally refer to this as the Calladine-Drew style representation.


By default, a base-pair [BP, (a)] has a dimension of 10x4.5x0.5 Å; a purine [R, (b) left] 4.5x4.5x0.5 Å; a pyrimidine [Y, (b) right] 3x4.5x0.5 Å; and a mean base [M, (c)], which is exactly half of the base-pair, 5x4.5x0.5 Å.

The blocks are put into specific files: Block_BP.alc, Block_R.alc, and Block_Y.alc respectively. To use M for R and Y, one just needs to copy file Block_M.alc to overwrite Block_R.alc and Block_Y.alc in the current directory or the system directory ($X3DNA/config/). These blocks are used in the rebuilding and visualization components of 3DNA (see my blog post "blocview: a simple, effective visualization tool for nucleic acid structures").

The blocks are stored in the ALCHEMY format for easy specification of nodes and edges, and because ALCHEMY is a format supported by RasMol (and Jmol). As an example, Block_BP.alc has the following content:
   12 ATOMS,    12 BONDS
1 N -2.2500 5.0000 0.2500
2 N -2.2500 0.5000 0.2500
3 N -2.2500 0.5000 -0.2500
4 N -2.2500 5.0000 -0.2500
5 C 2.2500 3.5000 0.2500
6 C 2.2500 0.5000 0.2500
7 C 2.2500 0.5000 -0.2500
8 C 2.2500 3.5000 -0.2500
9 C -2.2500 5.0000 0.2500
10 C -2.2500 0.5000 0.2500
11 C -2.2500 0.5000 -0.2500
12 C -2.2500 5.0000 -0.2500
1 1 2
2 2 3
3 3 4
4 4 1
5 5 6
6 6 7
7 7 8
8 5 8
9 9 5
10 10 6
11 11 7
12 12 8
Astute viewers may observe that nodes 1-4 are identified as N (nitrogen) and have exactly the same coordinates as nodes 9-12 (C, carbon). This is a trick so that RasMol can display the minor-groove edge in a different color than the other five sides of the rectangular block, as shown in the following figure:

Note that the rectangular block is preset in the standard base reference frame. Thus the nodes have y-coordinates of +5 Å and -5 Å along the long edge of the base pair, and x-coordinates of +2.25 Å and -2.25 Å along the short edge.

As an extra bonus of storing the blocks in external ALCHEMY text files, the dimensions of the blocks can be readily changed. For example, the depth of a block (its z-extent) can easily be increased from 0.5 to 1.0 Å to make it thicker (see $X3DNA/config/Block_BP1.alc). Moreover, the blocks do not even need to be rectangular – see $X3DNA/config/block/Block_R_nr.alc for an example.
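
To make the construction concrete, here is a minimal Python sketch (mine, not part of 3DNA; the write_block() helper and the exact column widths are my own choices) that writes a rectangular block of arbitrary dimensions in the ALCHEMY layout shown above:

# make_block.py -- write a rectangular block in ALCHEMY format (illustrative sketch)
def write_block(filename, length=10.0, width=4.5, thickness=0.5):
    y, x, z = length / 2, width / 2, thickness / 2
    # corners 1-4 form the minor-groove edge (typed N); 5-8 the opposite edge (C);
    # 9-12 duplicate 1-4 as C so the connecting edges are not colored as nitrogen
    corners = [(-x,  y,  z), (-x, -y,  z), (-x, -y, -z), (-x,  y, -z),
               ( x,  y,  z), ( x, -y,  z), ( x, -y, -z), ( x,  y, -z)]
    atoms  = [(i + 1, 'N' if i < 4 else 'C', c) for i, c in enumerate(corners)]
    atoms += [(i + 9, 'C', corners[i]) for i in range(4)]
    bonds = [(1, 2), (2, 3), (3, 4), (4, 1), (5, 6), (6, 7), (7, 8), (8, 5),
             (9, 5), (10, 6), (11, 7), (12, 8)]
    with open(filename, 'w') as out:
        out.write('%5d ATOMS, %5d BONDS\n' % (len(atoms), len(bonds)))
        for num, name, (cx, cy, cz) in atoms:
            out.write('%5d %-2s %9.4f %9.4f %9.4f\n' % (num, name, cx, cy, cz))
        for num, (a, b) in enumerate(bonds, 1):
            out.write('%5d %5d %5d\n' % (num, a, b))

write_block('Block_BP.alc')                  # the default 10x4.5x0.5 base-pair block
write_block('Block_BP1.alc', thickness=1.0)  # a thicker variant, as mentioned above

Calling it with the defaults reproduces the 10x4.5x0.5 Å base-pair block; changing the thickness argument gives a thicker variant along the lines of Block_BP1.alc.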

Friday, May 28, 2010

Naming conventions: RasMol, PyMOL and Jmol

Over the years, I've used the following three molecular graphics programs the most: RasMol, PyMOL and Jmol. It is interesting to note their different naming conventions: while their names all end in m-o-l (for molecule, I believe), the capitalization differs: Mol as in RasMol, MOL as in PyMOL, and mol as in Jmol.

For Jmol, there is actually an FAQ item, "How do you write Jmol?", that reads "Capital J, lower case mol. Please do not write it any other way." For PyMOL, I am not aware of an explicit specification, but that is the spelling used on its home page, so I take it as the "official" name. As for RasMol, that is the most common way to write it, even though the title of its 1995 publication is "RASMOL: biomolecular graphics for all."

As for the prefix in the names, J in Jmol stands for Java; Py in PyMOL for Python; and Ras in RasMol could well be the initials of Roger A. Sayle, the original author of the program.

Sunday, May 23, 2010

MODEL/ENDMDL records in PDB X-ray crystal structures

Over the years, I have been using the PDB format for nucleic acid structures, and naturally I thought I knew the format well, especially the "Coordinate Section." According to the PDB documentation,
MODEL/ENDMDL records are used only when more than one structure is presented in the entry, as is often the case with NMR entries.
In my experience, I had always associated the MODEL/ENDMDL pair only with the different models in an NMR entry. However, I was recently bitten by a subtlety in the PDB format related to MODEL/ENDMDL records in X-ray crystal structures whose biological assembly contains more than one copy of the asymmetric unit.

As always, the point is best illustrated with an example – here I am using the four-stranded DNA Holliday junction X-ray crystal structure (1zf4/ud0061) solved by Ho and colleagues [PNAS 2005 May 17;102(20):7157-62]. As shown on the NDB website, the asymmetric unit of 1zf4/ud0061 contains only two chains; it is the biological assembly that forms the four-stranded DNA junction.

I downloaded the biological-assembly coordinates (in PDB format) from the NDB, named the file '1zf4.pdb', and ran blocview on it: blocview -i=1zf4.png 1zf4.pdb. However, the generated image [see Figure (A) below] shows only half of what I expected, as if the downloaded file contained only the coordinates of one asymmetric unit. The mystery was solved when I checked the PDB file and realized that MODEL/ENDMDL records now also apply to X-ray crystal structures, where they delineate the symmetry-related units. Since 3DNA is designed to handle only one structure, it stops processing whenever an END or ENDMDL record is encountered. Simply changing each 'ENDMDL' record to ' ENDMDL' (i.e., adding a leading space, or any character, for that matter) and running blocview again produces the expected image [Figure (B)].
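
For anyone who wants to script this tweak, here is a minimal Python sketch of the workaround (the file names are just placeholders; not part of 3DNA):

# pad the ENDMDL records so that 3DNA reads past them into the next model
with open('1zf4.pdb') as src, open('1zf4_fixed.pdb', 'w') as dst:
    for line in src:
        if line.startswith('ENDMDL'):
            line = ' ' + line  # ' ENDMDL' is no longer recognized as a record
        dst.write(line)

Running blocview -i=1zf4.png 1zf4_fixed.pdb on the padded file should then render the full four-stranded junction.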

Note that by default, RasMol also only displays the first model in a PDB file. To see multiple structures, the option '-nmrpdb' must be specified in the command line.

Saturday, May 15, 2010

3DNA forum registration -- your id

Over the past couple of months, I have received a few emails with a subject line, as in the title of this post, "3DNA forum registration -- your id". Presumably, such an email is from a (potential) 3DNA user requesting activation after registering on the 3DNA forum.

Clearly, "your id" should have been a specified user id. In a couple of cases, I tried to figure out the corresponding 3DNA forum ids based on email addresses and activated them. More recently, I've switched to ask the sender a very simple question: "So what is 'your id' specifically?" In at least two cases, I have never heard back from the original senders with 'your id' specified. To a certain extent, this is a surprising result: providing such information won't take more than a couple of minutes, and this will help me (and others) to help them better with 3DNA-related questions. I can certainly understand occasional negligence, but users must follow some common-sense rules. Thus those registrations are deleted, as junk ones, after some grace period.

The 3DNA forum is open to the public for browsing, without registration. With its current settings, no emails are sent from the forum; an online forum should not be confused with a mailing list. You only need to register if you want (and are certainly more than welcome) to get more actively involved, e.g., asking or answering questions and sharing tips and tricks with other users. These policies are in place to keep the 3DNA forum as free of spam as possible.

Saturday, May 8, 2010

Some key combinations in MacBook Pro (Snow Leopard)

One of the inconveniences I experienced when first switching to a MacBook Pro running Mac OS X (Snow Leopard) was the lack of dedicated PgUp, PgDn, Home and End keys. Over the past few months, I have learned that the functionality of these 'convenience' keys can be achieved by combining the arrow keys (bottom right) with the 'fn', 'option', or 'command' keys (bottom left), as follows:
  • fn + right-arrow – end of a document
  • fn + left-arrow – beginning of a document
  • fn + up-arrow – Page up
  • fn + down-arrow – Page down
  • command + right-arrow – end of a line (End)
  • command + left-arrow – beginning of a line (Home)
  • option + right-arrow – one word to the right
  • option + left-arrow – one word to the left
According to Wikipedia, the 'fn' key "is a modifier key on many keyboards, especially on laptops, used in a compact layout to combine keys which are usually kept separate."

In addition to modifying cursor movements as noted above, the 'fn' key in Mac OS X can also be used to switch F1 to F12 from their default functions (e.g., F1 and F2 control screen brightness) to the standard function keys. As an example, in MS Word the keyboard shortcut for toggling case is "shift + F3", a trick I recently learned. With the default setting, one must press "fn + shift + F3" to achieve the desired effect.

Any tricks to share? I'd like to hear them!

Saturday, May 1, 2010

One year of blogging

When I checked the date of my first blog post today, I was a bit surprised to find that it has been exactly one year since I began to blog on May 2, 2009. Altogether, I have written over 60 posts, slightly more than one per week. At this point, I feel it appropriate to summarize my thoughts on blogging in general, to provide a perspective for those who care to visit here and make comments.
  • Why? The initial motivation was to use the blog as a platform to express my personal views on issues I am interested in. As made clear in my first post, "this is Xiang-Jun's Corner on the Internet: all views are mine, and I am opinionated." Over time, the blog posts have also served as a convenient notebook (searchable and archived), either for my personal reference or for pointing others to a particular post (e.g., "On maintaining the 3DNA forum" when I am asked a 3DNA-related question via email).
  • What? "Random thoughts, mostly on scientific issues". I write only on issues I am familiar with and feel comfortable saying something about, to the extent that I can respond quickly and concretely to readers' comments. I am always open to suggestions and will be prompt in acknowledging errors and making corrections. So far, the largest portion of the posts has been devoted to nucleic acid structures in general, and 3DNA-related topics in particular.
  • How often? Due to time constraints, I will try to write at least one post per week, maybe two, or on rare occasions three. Exceptions are possible, but by and large, I will aim for ~100 posts per year.
  • Comment? Currently, the policy is set so that "Anyone - includes Anonymous Users" can make a comment, and comments are moderated. So far, I have approved every comment as soon as I see it in my Gmail alert, and I follow up where appropriate. Note that, due to global time differences, commenters may experience some delay before their comments appear.
  • Does it work? Not unexpectedly, my blog has gradually attracted attention from quite a broad audience, especially those interested in nucleic acid structures (and 3DNA), including leading scientists in the field (as I know from the emails I have received).
  • Hot posts? According to Google Analytics, the nine most frequently visited posts are as follows (with posting dates in parentheses):
    1. Curves+ vs 3DNA (Sunday, August 16, 2009)
    2. Does 3DNA work for RNA? (Friday, July 10, 2009)
    3. Two web-interfaces to 3DNA, and more (Sunday, July 5, 2009)
    4. Fit a least squares plane to a set of points (Saturday, August 22, 2009)
    5. Two 3DNA figures made into a textbook on structural biology (Sunday, May 3, 2009)
    6. Chemical diagram of Watson-Crick base-pairs (Saturday, January 23, 2010)
    7. What's special about the GpU dinucleotide platform? (Friday, April 2, 2010)
    8. Double helix groove width parameters from 3DNA (Saturday, September 5, 2009)
    9. How to calculate torsion angle? (Saturday, October 31, 2009)

    While some of the posts were well expected to be on the list, a few of them (e.g., least-squares plane fitting, torsion-angle calculation) may look a bit surprising. This does, however, confirm an observation based on my personal experience and intuition: there is a technical niche where my expertise can make a difference.
Again, as I wrote in my first blog post, "Now the ball is rolling, and only time can tell where the destination will be -- but surely it will no longer stand where it was!" One year later, I can confidently say that the ball is rolling in the right direction, as I had hoped. Of course, more time and effort are needed to move to the next level, and I value your feedback!

Friday, April 23, 2010

Life is complicated -- is there a way to make it simpler?

To mark the 10th anniversary of the completion of the draft sequence of the human genome, the April 1, 2010 issue of Nature [464 (7289)] published a series of interesting and revealing articles, including an Editorial and (historical) accounts from Francis Collins and Craig Venter. I browsed through the whole list, and I especially liked the News article titled "Life is complicated" by Erika Check Hayden, a senior reporter for Nature. I was attracted by its catchy title and brief summary: "The more biologists look, the more complexity there seems to be. Erika Check Hayden asks if there's a way to make life simpler." Obviously, the title of this blog post was inspired by these sources.

Over the past decade, the Human Genome Project has helped revise the estimated number of human genes from the previously assumed ~100,000 down to the "true" number of only ~21,000. The dramatically reduced gene count illustrates the crucial importance to biology of non-coding DNA (once dismissed as "junk"), yet what non-coding DNA actually does is still befuddling. Nowadays, we are faced with a data deluge from sequencing, gene expression, protein (transcription factor) binding, and other new technologies. The picture of biology has grown significantly more complex, rather than simpler, even for the most extensively studied protein, p53. As Jennifer Doudna put it, "The more we know, the more we realize there is to know."

The community has gradually realized that gathering information does not always bring a corresponding increase in meaningful biological insight. We are facing an age of "drowning in information, starved for knowledge." For example, in systems biology, a new discipline "supposed to help scientists make sense of the complexity", it has turned out that "In many cases, the models themselves quickly become so complex that they are unlikely to reveal insights about the system, degenerating instead into mazes of interactions that are simply exercises in cataloguing." I cannot agree more with the comment by Leonid Kruglyak that it is naive to think "you can simply take very large amounts of data and run a data-mining program and understand what is going on in a generic way."

Are we lost in the sea of biological data? Not necessarily. Eric Davidson's work is an excellent example of "taking smarter systems approaches" to reveal overarching biological rules. Instead of a "machine learning" type of top-down approach, the "insights have come when scientists systematically analyse the components of processes that are easily manipulated in the laboratory — largely in model organisms. They're still using a systems approach, but focusing it through a more traditional, bottom–up lens." Through this systematic bottom-up approach, Davidson's group has deciphered how gene expression is controlled through regulatory interactions to specify the construction of the sea urchin's skeleton.

Interestingly, Eric Davidson gave a seminar at C2B2 on Thursday, April 22, titled "Causal Systems Biology: the Sea Urchin Embryo Gene Regulatory Network". I was very impressed by his talk, and the points he made in the Nature News article.



Friday, April 16, 2010

3DNA in the June 2010 issue of JBSD on "Current Perspectives on Nucleosome Positioning"

While updating 3DNA citations this week, I noticed that five of them are from the same June 2010 issue of the Journal of Biomolecular Structure & Dynamics (JBSD), which is focused on "Current Perspectives on Nucleosome Positioning". Most of the papers in this issue are from well-known laboratories in computational structural biology. It is my pleasure to see 3DNA being widely used in the important research area of nucleosome positioning.

I browsed through the abstracts of all the papers in this JBSD issue to refresh my knowledge of the field. While DNA sequence surely plays some role in nucleosome positioning, I remain unconvinced of the existence of a nucleosome "code" in the sense of the universally applicable genetic code. Overall, DNA is so flexible and the signal so weak that data can be fitted in tailored ways to a specific analysis, which does not transfer to other situations. Clearly, the area is hot, yet still wide open.

Thursday, April 8, 2010

NSMB editorial: "Making your point-by-point"

In the April 2010 issue of Nature Structural & Molecular Biology [NSMB, 17(4)], there is another interesting editorial, titled "Making your point-by-point". It addresses an important issue in the process of publishing papers in peer-reviewed journals: how to make an effective point-by-point response to "those ever-demanding editors and reviewers".
Overall, it can be helpful to put yourself in the reviewer’s shoes and compose a response s/he would find appropriate, where the concerns raised are considered and fully addressed. In its ideal state, the review process is a positive and constructive back and forth, an intellectual discussion in which the manuscript is the ultimate beneficiary.

Here is my recap of the main points, as I understand them. I am also taking this opportunity to read this one-page editorial one more time.

What to do?

  • Keep to the point – "makes a series of [succinct] points in response [directly] to each point raised by the reviewers."
  • Keep it objective – be diplomatic in your point-by-point response to the reviewers, "even if the reviewer’s wording might have seemed overly strong." You could be forthright in your cover letter to the editors, though.
  • Keep things under control – "Know when to go to the bench and when to argue."
  • The scope of things – "Say clearly and succinctly" when "some requests might genuinely be beyond the scope of the manuscript or might simply be unfeasible." Also, "try not to salami-slice": one strong and solid paper is (much) better than two weak ones!

Some don'ts, especially:

  • Mentioning celebrity endorsements. "you never know—they could be moonlighting as your most critical anonymous reviewer."
  • Trying to guess who the reviewers are when communicating with the editors – it does not help. Besides, you could be plain wrong in your guess (again, you never know); they are anonymous, literally.
Generally speaking, I think authors should be appreciative of the work of the reviewers and editors. Occasionally I serve as a reviewer myself, and I know the time and effort it takes to make a fair and thorough assessment of a manuscript.

It is certainly not just out of politeness that, in our 2008 3DNA Nature Protocols paper, we acknowledged:
We also thank the editor and the anonymous reviewers whose comments helped to clarify the presentation of the protocols.
More recently, in our 2010 NAR GpU paper, we acknowledged:
They also thank the anonymous reviewers, whose comments helped clarify the presentation of the manuscript.