Xiang-Jun's Corner

Sunday, September 19, 2010

New NDB entries have IDs start with the prefix NA?

Recently, while reading the article titled "Designing Triple Helical Fragments: The Crystal Structure of the Undecamer d(TGGCCTTAAGG) Mimicking T.AT Base Triplets" by Van Hecke (Crystal Growth & Design), I noticed the following:

The atomic coordinates and structure factors have been deposited in the Protein Data Bank⁴¹ and Nucleic Acid Database⁴² (PDB and NDB entry codes 3L1Q and NA0392, respectively).

The NDB id NA0392 reminded me of an email communication I had with Dr. Olson early this year when she told me of the id change of new NDB entries. Occupied with other issues, I did not pay much attention to this point until recently.

Over the years, NDB has established itself prominently as "a repository of three-dimensional structural information about nucleic acids". Traditionally, an NDB id has some associated "meanings", with noticeably exceptions, as discussed one year ago in my blog post "PDB id vs NDB id". Thus, it is not surprising that NDB decided to make changes in naming new ids. However, I cannot find any announcement on the id change in the NDB website; a Google search on "NDB id" did not uncover anything new either. Luckily, NDB provides search by release date ("Released Since"). After a few tries, I traced that the new id policy began to be implemented from around March 2010.

The new NDB id convention appears to be NA (presumably standing for for Nucleic Acid) followed by 4 digits. As more entries available, it is not that hard to imagine that the number of digits must be expanded to 5, 6, or even more. Naturally, PDB id, with a fixed 4-character length (up to now), is (far) more consistent than the NDB id is.

Sunday, September 12, 2010

"Code Complete", a practical handbook of software construction

Over the summer, I read quite a few books on C programming, and more generally on software construction. Among them, the book titled "Code Complete" (2nd edition) by Steve McConnell is the most comprehensive: with over 900 pages, it certainly serves as "a practical handbook of software construction".

I have been writting scientific software applications for over twenty years, using a variety of programming languages. Gradually, writing code per se is taking less time; nowadays by far the most time-consuming parts have increasingly become (1) to understand a scientific topic thoroughly in order to implement it in new code, and (2) to maintain/refine/adapt existing codebase to meet the needs of its user community (e.g., 3DNA). It is for the benefit of the later that I've found the "Code Complete" book very helpful.

As an example, following the book's recommendations (1) "Use boolean variables to simplify complicated tests" (pp301-2) and (2) "Simplify complicated tests with boolean function calls" (p359), I've easily improved the readability of some parts of my code.

Overall, the book is well-written and full of practical advices; I enjoyed reading it.

Sunday, September 5, 2010

Identification of C-H...N/O H-honds using 3DNA

Recently, I came across an interesting article by Kiliszek et al., titled "Atomic resolution structure of CAG RNA repeats: structural insights and implications for the trinucleotide repeat expansion diseases". In addition to its biological implications, this paper uses 3DNA to deal with non-canonical-base-pair-containing helical structures in a sensible way:

The helical parameters were calculated using 3DNA (29). Sequence-independent measures were used, based on vectors connecting the C1' atoms of the paired residues, to avoid computational artefacts arising from non-canonical base pairing.

Another significant point, which is the focus of this post, is the observation that "All the adenosines are in the anti-conformation and the only interaction within each A-A pair is a single C2-H2...N1 hydrogen bond." (Figure below) Given the 0.95 Å ultrahigh resolution of structure 3nj6/na0608, it is likely that this type of A-A is real.

The find_pair program in the currently distributed versions of 3DNA (v2.0 and before), however, does not identify this A-A pair (thus the corresponding NDB list of "Base Pair Step Parameters" is incomplete) for the following two reasons:

No H-bond exists between N/O base atoms – currently a requirement for a base-pair.
By default, only N/O atoms are used in defining H-bonds (see tag hb_atoms in file "misc_3dna.par"). Nevertheless, by adding C as a possible atom in forming H-bond, and manually editing find_pair generated input file to analyze, 3DNA structural parameters can be calculated as usual.

I have updated find_pair in 3DNA to identify such C-H...N/O H-bond automatically (see below for entry 3nj6). Upon further refinements and validations, future releases of 3DNA will have this functionality available.

    1   95  #    1 | ...1>A:...1_:[..G]G-----C[..C]:..10_:A<...2
    2   94  #    2 | ...1>A:...2_:[..G]G-----C[..C]:...9_:A<...2
    3   93  #    3 | ...1>A:...3_:[..C]C-----G[..G]:...8_:A<...2
    4   92  #    4 | ...1>A:...4_:[..A]A-**--A[..A]:...7_:A<...2
    5   91  #    5 | ...1>A:...5_:[..G]G-----C[..C]:...6_:A<...2
    6   90  #    6 | ...1>A:...6_:[..C]C-----G[..G]:...5_:A<...2
    7   89  #    7 | ...1>A:...7_:[..A]A-**--A[..A]:...4_:A<...2
    8   88  #    8 | ...1>A:...8_:[..G]G-----C[..C]:...3_:A<...2
    9   87  #    9 | ...1>A:...9_:[..C]C-----G[..G]:...2_:A<...2
   10   86  #   10 | ...1>A:..10_:[..C]C-----G[..G]:...1_:A<...2
##### Criteria: 4.00  0.00  15.00  2.50  65.00  4.50  7.50   [ O N C]
##### 2 non-Watson-Crick base-pairs, and 1 helix (0 isolated bps)
##### Helix #1 (10): 1 - 10

The Jmol paper: a paradigm shift in crystallographic visualization

From the Jmol mailing list, I noticed Prof. Robert Hanson's announcement titled "Jmol and crystallography: the paper". Finally, an "official" paper has been published on Jmol, an increasing popular molecular visualization tool.

While the paper is categorized under "teaching and education" and summarized as below:

Recent advances in molecular and crystallographic visualization methods are allowing instructors unprecedented opportunities to enhance student learning using virtual models within a familiar web-browser context. In step with these advances, the latest versions of the Jmol molecular visualization applet offer capabilities that hold potential for revolutionizing the way students learn about symmetry, uncertainty and the overall enterprise of molecular structure determination.

Jmol is certainly as useful for research as other commonly used molecular visualization tools, such as PyMOL and RasMol. Specifically, Jmol applets have become dominated in web-based applications/resources (including RCSB PDB). In fact, the HTML version of the paper itself makes extensive use of Jmol applet for interactive manipulation of over 20 figures.

Over the past couple of years, I have witnessed the many new features added to Jmol, driven by an active and friendly user community. I have been particularly impressed by Prof. Hanson's quick and to-the-point responses to users' requests for new features and bug reports, including to my request for Jmol's "Support of 'alchemy' format for rectangular schematic base-pair geometry". Overall, the paper presents a nice overview of some of Jmol's new functionality, and could become highly cited.

Reading through the paper, I was just a bit surprised that PyMOL is not mentioned. Nowadays, Jmol and PyMOL are apparently the top two molecular graphics software tools, with common and complementary functionality.

Saturday, August 21, 2010

Calculation of the chi (χ) torsion angle of pseudouridines in RNA structures

In RNA structures, chi (χ) is defined as the torsion angle about the N-glycosidic bond between the ribose sugar and the canonical A/C/G/U bases or their modified variants. Specifically, for pyrimidines (C and U), χ is defined by O4'-C1'-N1-C2; and for purines (A and G) by O4'-C1'-N9-C4. See the following figure from the IMB Jena Image Library:

Pseudouridine (5-ribosyluracil, PSU) was the first identified modified nucleoside in RNA and is the most abundant. PSU is unique in that it has a C-glycosidic bond instead of the N-glycosidic bond common to all other nucleosides, canonical or modified. It thus poses a problem as to how to calculate the χ torsion angle: should it be O4'-C1'-C5-C4 reflecting the actual glycosidic bond connection, or should the conventional definition O4'-C1'-N1-C2 still be applied literally? As a concrete example, the figure below shows the (slightly) different numerical values, as given by the two definitions, for PSU 6 on chain A of PDB/NDB entry 3cgp/ar0093.

Needless to say, definition of the χ torsion angle of PSU in RNA structures is a very subtle/minor point, and I am not aware of any discussion on this issue in literature (I'd appreciate your sharing of related information in the comment). In 3DNA, PSU is identified explicitly, and χ is defined by O4'-C1'-C5-C4. In NDB and a couple of other tools (I've played with), χ for PSU is defined by O4'-C1'-N1-C2. Again using 3cgp/ar0093 (figure above) as an example, 3DNA gives -162.7°, whilst NDB gives -163.9°. Hopefully, this post will help clarify a confusion for those who care about such little details.

Saturday, August 14, 2010

PDB format, how many variants are there?

In the field of structural biology of macromolecules (proteins and nucleic acids), the PDB format is still the primary one, even though it has shown its age/limitations, and mmCIF and PDBML have been introduced as alternatives.

In 3DNA, PDB is currently the only accepted format for the analysis routines (analyze/cehs), the primary format for rebuilt structures (rebuild/fiber), and various other output structure files (e.g., bestpairs.pdb etc.). By and large, 3DNA follows the PDB format documented at the RCSB website. In my maintaining and supporting of 3DNA and the forum over the years , I have come across other variants of the PDB format. Software packages customize PDB in subtle, sometimes substantial, ways that cause problems with 3DNA. Some examples follow:

Recently, a user tried to analyze an RNA duplex using w3DNA, but found that "the server seems not to recognize the file". It turned out that the so-called PDB format does not have residue name/number and chain id aligned in columns properly. For example, among other issues, the chain id in the problematic PDB is not at column #22 as specified by the standard.
A couple of months back, another user asked how to handle AMBER-introduced ribonucleotide abreviations (RA/RC/RG/RU). Compared to the above "PDB" variant, the AMBER PDB is well-behavored, and can be easily recognized by 3DNA after introducing corresponding one letter matches in file baselist.dat. Interestingly, a couple of years ago, DA/DC/DG/DT was introduced into the PDB format to distinguish deoxyribonucleotide (DNA) from ribonucleotide A/C/G/U (RNA). The AMBER convention, by using RA/RC/RG/RU for RNA, appears to be more symmetric.
From my (limited) experience in helping a 3DNA user several years ago, I know that CHARMM uses three-letter residue names for DNA (e.g., ADE/CYT/GUA/THY). There is actually a short Perl utility program (x3dna2charmm_pdb) in 3DNA for converting 3DNA generated PDB file to that recognized by CHARMM.
As noted in another post, "PDB ATOM coordinates record", Babel/Open Babel converted PDB files (mostly of small molecules) do not follow atom naming conventions.

As widely used and standard as PDB format is, there are still many variants in common use. Thus, when you meet a problem related to a PDB file, it helps to check if it conforms to the standard. With 3DNA, I've always tried to incorporate as many PDB variants as possible, so users can run 3DNA on them transparently.

Saturday, August 7, 2010

Calculation of DNA bending angle

Once in a while, I receive the question (via email or on the 3DNA forum) on how to calculate DNA bending angle, mostly associated with DNA-protein complexes. This question certainly qualifies as an FAQ; Indeed, I did consider writing an entry on it in the 3DNA website. However, I've hesitated to do so because the topic is more subtle and complicated than it appears.

On its face, an angle is defined by two vectors; let's call them a and b, and if each is normalized, then the angle (in degrees) between them is: acos(dot(a, b)) * 180/pi. Geometrically, after moving the tails of the vectors into the same position (e.g., origin), the heads would normally define a plane, unless a and b are strictly parallel (0°) or anti-parallel (180°).

DNA structures are three-dimensional, normally far more complicated than a single number can quantify. The concept of DNA bending angle, as I understand it, is only applicable to DNA structures with two relatively straight fragments (as in CAP-DNA complexes). Under such situations, one can fit a least-squares (ls) linear helical axis to each of the two fragments, and calculate the angle between them. Towards this end, 3DNA outputs the following section when it judges that the input structure is not strongly curved. Using 355d/bdl084, which is distributed with 3DNA, as an example:

Global linear helical axis defined by equivalent C1' and RN9/YN1 atom pairs                                    
Deviation from regular linear helix: 3.30(0.52)
Helix:    -0.127  -0.275  -0.953
HETATM 9998  XS    X X 999      17.536  25.713  25.665
HETATM 9999  XE    X X 999      12.911  15.677  -9.080
Average and standard deviation of helix radius:
P: 9.42(0.82), O4': 6.37(0.85),  C1': 5.85(0.86)

Where the Helix: line gives the normalized vector along the "best-fit" helical axis. The two HETATM records provides the two end points on the helix, and they are directly related to the Helix: line by a simple equation. Following the above example, we have (Octave/Matlab code):



XE = [12.911  15.677  -9.080];

XS = [17.536  25.713  25.665];



dd = XE - XS

%   -4.6250  -10.0360  -34.7450



Helix = dd / norm(dd)

%  -0.12685  -0.27526  -0.95296  ==> [-0.127  -0.275  -0.953]

With the two HETATM records, one can easily add them into the original PDB file to display the helical axis using a molecular graphics programs (e.g., RasMol, Jmol, PyMol). Moreover, the two helix vectors can be used to reorient the original PDB structure into a view so that one helical fragment lies along the x-axis, and the other in the xy-plane. As documented in detail in recipes #4 on "Automatic identification of double-helical regions in a DNA–RNA junction" (2008 3DNA Nature Protocols paper), "The chosen view allows for easy visualization and protractor measurement of the overall bending angle between the two relatively straight helices."

The following points are well worth noting:

The ls fitting procedure used in 3DNA follows SCHNAaP, which was based on the algorithm in the well-known NewHelix program, maintained by Dr. Richard Dickerson upto the 1990s. While fitting a global linear helical axis to strongly curved DNA structures clearly makes no much sense with derived parameters (NewHelix itself has been replaced by FreeHelix, also from Dickerson), I do believe it is meaningful to fit a linear helix with relatively straight DNA fragment. That's why I have kept this functionality in SCHNArP and 3DNA; 3DNA bending angle calculation serves as an example illustrating the point – it provides an "intuitive" way for biologists to understand how the bending angle is calculated; it can actually be measured directly.
Instead of directly ls-fitting a linear helical axis with 3DNA, one can alternatively superimpose a regular fiber model into the DNA fragment, and then derive the straight helical axis from the fitted coordinates. The two approaches normally gives slightly different numerical values, as would be expected.
Overall, bending angle is (at most) an approximate measure of DNA curvature. In my opinion, the concept is only applicable for comparing a set of structures, each with two relatively straight fragments. Even in such cases, the relative spatial relationship between two helical fragments is more complicated than a simple (bending) angle could quantify. Specially, it is not meaningful to make a selling point in slight differences of DNA bending angles.

Friday, August 6, 2010

Genomic code for nucleosome positioning?

Over the past few years, the research on nucleosome positioning has been a hot topic: Segal et al., through a series of high-profile publications, have proposed that "the intrinsic nucleosome sequence preferences contribute substantially to nucleosome organization and chromatin function in vivo" (i.e., there exists a "genomic code") and developed bioinformatics tools to predict patterns of nucleosome positions based on genomic DNA sequence. However, as is clear from the title of a paper by Zhang et al., "Intrinsic histone-DNA interactions are not the major determinant of nucleosome positions in vivo", it is highly debatable if the nucleosome sequence preferences is strong enough to be called a "code", in the canonical sense as for the genetic code.

While I am (generally) aware of the controversy about the "genomic code", my understanding of the issue has become much clearer after reading the correspondences between both sides published in the 17(8), August 2010 issue of Nature Structural & Molecular Biology [NSMB]:

"Nucleosome sequence preferences influence in vivo nucleosome organization" (pp918 - 920) by Kaplan … and Segal
"Evidence against a genomic code for nucleosome positioning" Reply to “Nucleosome sequence preferences influence in vivo nucleosome organization” (pp920 - 923) by Zhang … and Struhl
"A preoccupied position on nucleosomes" (p923) by Pugh

Reading carefully through the text, it is clear how important the little details could be: 20-bp vs 40-bp windows? How significant the preference should be to be claimed as a "code"? Pugh clarifies another source of confusion, in the terminology of nucleosome ‘occupancy’ vs ‘positioning’. In my understanding, the dispute on such details itself speaks strongly against the existence of "genomic code".

It should be noted that early work by Drew & Travers (JMB85) and Satchwell et al. (JMB86) provided evidence that the rotational positioning of DNA about the histone octamer is regulated by DNA sequence-dependent bending preferences. Goodsell & Dickerson quantified the Satchwell trinucleotide model by assigning a set of roll angles to predict DNA bending from base sequence. This bending model, along with five others, was implemented by SCHNArP for building approximate 3D DNA structures (with fixed values for slide, twist etc other parameters). However, this functionality is dropped out from 3DNA, simply because of the arbitrariness/subtleties in picking up the parameters.

Sunday, August 1, 2010

Principle axes of inertia as a mean to reorient a molecule

In structural biology, it is often helpful to reorient a molecule in a specific view for easy visualization and comparison. With no a priori knowledge of a feature, the (objective) principle axes of inertia can be used as a mean to automatically reset a structure into its most extended orientation (and two other orthogonal views). In 3DNA, rotate_mol has such a functionality which is explored by blocview to generate the characteristic, simplified molecular images adopted by the RCSB PDB and NDB.

Recently, the thread (initiated by Dr. Pascal Auffinger) at the 3DNA forum titled "rotate_mol ambiguous orientations" has led me to think the topic deeper. As always, the issue is best illustrated with an example. The file 355d_A6_T19.pdb contains the coordinates of the base pair A6-T19 of PDB entry 355d. Let xyz be a variable in Octave (Matlab) containing the coordinates of all atoms in file 355d_A6_T19.pdb, the following m-file provides the detailed procedure:

xyz = [
   13.954  12.080  15.071
   13.092  11.246  15.953
   14.769  11.453  14.020
   13.056  13.202  14.406
   12.184  13.965  15.222
   11.308  14.846  14.376
   12.033  15.991  13.896
   10.654  14.188  13.168
    9.262  14.480  13.256
   11.373  14.820  11.974
   11.810  16.185  12.511
   13.051  16.742  11.953
   14.253  16.097  11.802
   15.197  16.853  11.305
   14.585  18.084  11.106
   15.066  19.322  10.605
   16.329  19.520  10.209
   14.198  20.357  10.535
   12.938  20.162  10.946
   12.370  19.048  11.444
   13.259  18.034  11.497
   14.865  29.619   5.710
   14.172  30.844   5.281
   16.107  29.695   6.515
   13.846  28.701   6.510
   12.684  28.174   5.864
   11.909  27.311   6.826
   12.626  26.090   7.129
   11.584  27.960   8.173
   10.189  27.833   8.395
   12.409  27.166   9.184
   12.544  25.808   8.521
   13.716  24.979   8.867
   13.496  23.688   9.310
   12.381  23.202   9.467
   14.631  22.971   9.551
   15.943  23.388   9.393
   16.852  22.613   9.632
   16.107  24.759   8.938
   17.488  25.309   8.745
   14.995  25.475   8.711 ]

cov_mtx = cov(xyz)

% cov_mtx =
%     3.4094    2.3860   -1.4222
%     2.3860   33.7089  -15.7175
%    -1.4222  -15.7175    7.8786

[v, lambda] = eig(cov_mtx)

% v =
%    0.094830   0.992835  -0.072709
%    0.419520  -0.106092  -0.901525
%    0.902779  -0.054989   0.426575

% lambda =
%     0.42535          0          0
%           0    3.23319          0
%           0          0   41.33836

x = v(:, 2);
y = v(:, 3);
z = cross(x, y);

% z =
%   -0.094830
%   -0.419520
%   -0.902779

num = size(xyz, 1);
xyz2 = (xyz - ones(num, 1) * mean(xyz)) * [x y z]

% xyz2 =
%     1.148136   10.032849   -0.514731
%     0.332293   11.223635   -0.879359
%     2.081609   10.090517    0.619843
%    .................................
%     2.277813   -4.170312   -0.501254
%     3.601180   -4.848890   -0.688714
%     1.110301   -4.831784   -0.491249

Among other subtitles, note that:

lambda 3 has the largest value (41.33836), i.e., with the biggest variance, and its corresponding eigenvector is set as the vertical y-axis.
lambda 2 shows intermediate variation (3.23319), and its corresponding eigenvector is set as the horizontal x-axis.
The z-axis is derived as the cross-product of x- and y-axes to ensure a right-handed frame. Note that it is the opposite of eigenvector 1.
The original xyz coordinates are moved to its geometric center [xyz - ones(num, 1) * mean(xyz)], and then projected onto the preferred frame to get the transformed coordinates (xyz2) with the molecule in the most extended view (vertical then horizontal).