Xiang-Jun's Corner

Saturday, August 14, 2010

PDB format, how many variants are there?

In the field of structural biology of macromolecules (proteins and nucleic acids), the PDB format is still the primary one, even though it has shown its age/limitations, and mmCIF and PDBML have been introduced as alternatives.

In 3DNA, PDB is currently the only accepted format for the analysis routines (analyze/cehs), the primary format for rebuilt structures (rebuild/fiber), and various other output structure files (e.g., bestpairs.pdb etc.). By and large, 3DNA follows the PDB format documented at the RCSB website. In my maintaining and supporting of 3DNA and the forum over the years , I have come across other variants of the PDB format. Software packages customize PDB in subtle, sometimes substantial, ways that cause problems with 3DNA. Some examples follow:

Recently, a user tried to analyze an RNA duplex using w3DNA, but found that "the server seems not to recognize the file". It turned out that the so-called PDB format does not have residue name/number and chain id aligned in columns properly. For example, among other issues, the chain id in the problematic PDB is not at column #22 as specified by the standard.
A couple of months back, another user asked how to handle AMBER-introduced ribonucleotide abreviations (RA/RC/RG/RU). Compared to the above "PDB" variant, the AMBER PDB is well-behavored, and can be easily recognized by 3DNA after introducing corresponding one letter matches in file baselist.dat. Interestingly, a couple of years ago, DA/DC/DG/DT was introduced into the PDB format to distinguish deoxyribonucleotide (DNA) from ribonucleotide A/C/G/U (RNA). The AMBER convention, by using RA/RC/RG/RU for RNA, appears to be more symmetric.
From my (limited) experience in helping a 3DNA user several years ago, I know that CHARMM uses three-letter residue names for DNA (e.g., ADE/CYT/GUA/THY). There is actually a short Perl utility program (x3dna2charmm_pdb) in 3DNA for converting 3DNA generated PDB file to that recognized by CHARMM.
As noted in another post, "PDB ATOM coordinates record", Babel/Open Babel converted PDB files (mostly of small molecules) do not follow atom naming conventions.

As widely used and standard as PDB format is, there are still many variants in common use. Thus, when you meet a problem related to a PDB file, it helps to check if it conforms to the standard. With 3DNA, I've always tried to incorporate as many PDB variants as possible, so users can run 3DNA on them transparently.

Saturday, August 7, 2010

Calculation of DNA bending angle

Once in a while, I receive the question (via email or on the 3DNA forum) on how to calculate DNA bending angle, mostly associated with DNA-protein complexes. This question certainly qualifies as an FAQ; Indeed, I did consider writing an entry on it in the 3DNA website. However, I've hesitated to do so because the topic is more subtle and complicated than it appears.

On its face, an angle is defined by two vectors; let's call them a and b, and if each is normalized, then the angle (in degrees) between them is: acos(dot(a, b)) * 180/pi. Geometrically, after moving the tails of the vectors into the same position (e.g., origin), the heads would normally define a plane, unless a and b are strictly parallel (0°) or anti-parallel (180°).

DNA structures are three-dimensional, normally far more complicated than a single number can quantify. The concept of DNA bending angle, as I understand it, is only applicable to DNA structures with two relatively straight fragments (as in CAP-DNA complexes). Under such situations, one can fit a least-squares (ls) linear helical axis to each of the two fragments, and calculate the angle between them. Towards this end, 3DNA outputs the following section when it judges that the input structure is not strongly curved. Using 355d/bdl084, which is distributed with 3DNA, as an example:

Global linear helical axis defined by equivalent C1' and RN9/YN1 atom pairs                                    
Deviation from regular linear helix: 3.30(0.52)
Helix:    -0.127  -0.275  -0.953
HETATM 9998  XS    X X 999      17.536  25.713  25.665
HETATM 9999  XE    X X 999      12.911  15.677  -9.080
Average and standard deviation of helix radius:
P: 9.42(0.82), O4': 6.37(0.85),  C1': 5.85(0.86)

Where the Helix: line gives the normalized vector along the "best-fit" helical axis. The two HETATM records provides the two end points on the helix, and they are directly related to the Helix: line by a simple equation. Following the above example, we have (Octave/Matlab code):



XE = [12.911  15.677  -9.080];

XS = [17.536  25.713  25.665];



dd = XE - XS

%   -4.6250  -10.0360  -34.7450



Helix = dd / norm(dd)

%  -0.12685  -0.27526  -0.95296  ==> [-0.127  -0.275  -0.953]

With the two HETATM records, one can easily add them into the original PDB file to display the helical axis using a molecular graphics programs (e.g., RasMol, Jmol, PyMol). Moreover, the two helix vectors can be used to reorient the original PDB structure into a view so that one helical fragment lies along the x-axis, and the other in the xy-plane. As documented in detail in recipes #4 on "Automatic identification of double-helical regions in a DNA–RNA junction" (2008 3DNA Nature Protocols paper), "The chosen view allows for easy visualization and protractor measurement of the overall bending angle between the two relatively straight helices."

The following points are well worth noting:

The ls fitting procedure used in 3DNA follows SCHNAaP, which was based on the algorithm in the well-known NewHelix program, maintained by Dr. Richard Dickerson upto the 1990s. While fitting a global linear helical axis to strongly curved DNA structures clearly makes no much sense with derived parameters (NewHelix itself has been replaced by FreeHelix, also from Dickerson), I do believe it is meaningful to fit a linear helix with relatively straight DNA fragment. That's why I have kept this functionality in SCHNArP and 3DNA; 3DNA bending angle calculation serves as an example illustrating the point – it provides an "intuitive" way for biologists to understand how the bending angle is calculated; it can actually be measured directly.
Instead of directly ls-fitting a linear helical axis with 3DNA, one can alternatively superimpose a regular fiber model into the DNA fragment, and then derive the straight helical axis from the fitted coordinates. The two approaches normally gives slightly different numerical values, as would be expected.
Overall, bending angle is (at most) an approximate measure of DNA curvature. In my opinion, the concept is only applicable for comparing a set of structures, each with two relatively straight fragments. Even in such cases, the relative spatial relationship between two helical fragments is more complicated than a simple (bending) angle could quantify. Specially, it is not meaningful to make a selling point in slight differences of DNA bending angles.

Friday, August 6, 2010

Genomic code for nucleosome positioning?

Over the past few years, the research on nucleosome positioning has been a hot topic: Segal et al., through a series of high-profile publications, have proposed that "the intrinsic nucleosome sequence preferences contribute substantially to nucleosome organization and chromatin function in vivo" (i.e., there exists a "genomic code") and developed bioinformatics tools to predict patterns of nucleosome positions based on genomic DNA sequence. However, as is clear from the title of a paper by Zhang et al., "Intrinsic histone-DNA interactions are not the major determinant of nucleosome positions in vivo", it is highly debatable if the nucleosome sequence preferences is strong enough to be called a "code", in the canonical sense as for the genetic code.

While I am (generally) aware of the controversy about the "genomic code", my understanding of the issue has become much clearer after reading the correspondences between both sides published in the 17(8), August 2010 issue of Nature Structural & Molecular Biology [NSMB]:

"Nucleosome sequence preferences influence in vivo nucleosome organization" (pp918 - 920) by Kaplan … and Segal
"Evidence against a genomic code for nucleosome positioning" Reply to “Nucleosome sequence preferences influence in vivo nucleosome organization” (pp920 - 923) by Zhang … and Struhl
"A preoccupied position on nucleosomes" (p923) by Pugh

Reading carefully through the text, it is clear how important the little details could be: 20-bp vs 40-bp windows? How significant the preference should be to be claimed as a "code"? Pugh clarifies another source of confusion, in the terminology of nucleosome ‘occupancy’ vs ‘positioning’. In my understanding, the dispute on such details itself speaks strongly against the existence of "genomic code".

It should be noted that early work by Drew & Travers (JMB85) and Satchwell et al. (JMB86) provided evidence that the rotational positioning of DNA about the histone octamer is regulated by DNA sequence-dependent bending preferences. Goodsell & Dickerson quantified the Satchwell trinucleotide model by assigning a set of roll angles to predict DNA bending from base sequence. This bending model, along with five others, was implemented by SCHNArP for building approximate 3D DNA structures (with fixed values for slide, twist etc other parameters). However, this functionality is dropped out from 3DNA, simply because of the arbitrariness/subtleties in picking up the parameters.

Sunday, August 1, 2010

Principle axes of inertia as a mean to reorient a molecule

In structural biology, it is often helpful to reorient a molecule in a specific view for easy visualization and comparison. With no a priori knowledge of a feature, the (objective) principle axes of inertia can be used as a mean to automatically reset a structure into its most extended orientation (and two other orthogonal views). In 3DNA, rotate_mol has such a functionality which is explored by blocview to generate the characteristic, simplified molecular images adopted by the RCSB PDB and NDB.

Recently, the thread (initiated by Dr. Pascal Auffinger) at the 3DNA forum titled "rotate_mol ambiguous orientations" has led me to think the topic deeper. As always, the issue is best illustrated with an example. The file 355d_A6_T19.pdb contains the coordinates of the base pair A6-T19 of PDB entry 355d. Let xyz be a variable in Octave (Matlab) containing the coordinates of all atoms in file 355d_A6_T19.pdb, the following m-file provides the detailed procedure:

xyz = [
   13.954  12.080  15.071
   13.092  11.246  15.953
   14.769  11.453  14.020
   13.056  13.202  14.406
   12.184  13.965  15.222
   11.308  14.846  14.376
   12.033  15.991  13.896
   10.654  14.188  13.168
    9.262  14.480  13.256
   11.373  14.820  11.974
   11.810  16.185  12.511
   13.051  16.742  11.953
   14.253  16.097  11.802
   15.197  16.853  11.305
   14.585  18.084  11.106
   15.066  19.322  10.605
   16.329  19.520  10.209
   14.198  20.357  10.535
   12.938  20.162  10.946
   12.370  19.048  11.444
   13.259  18.034  11.497
   14.865  29.619   5.710
   14.172  30.844   5.281
   16.107  29.695   6.515
   13.846  28.701   6.510
   12.684  28.174   5.864
   11.909  27.311   6.826
   12.626  26.090   7.129
   11.584  27.960   8.173
   10.189  27.833   8.395
   12.409  27.166   9.184
   12.544  25.808   8.521
   13.716  24.979   8.867
   13.496  23.688   9.310
   12.381  23.202   9.467
   14.631  22.971   9.551
   15.943  23.388   9.393
   16.852  22.613   9.632
   16.107  24.759   8.938
   17.488  25.309   8.745
   14.995  25.475   8.711 ]

cov_mtx = cov(xyz)

% cov_mtx =
%     3.4094    2.3860   -1.4222
%     2.3860   33.7089  -15.7175
%    -1.4222  -15.7175    7.8786

[v, lambda] = eig(cov_mtx)

% v =
%    0.094830   0.992835  -0.072709
%    0.419520  -0.106092  -0.901525
%    0.902779  -0.054989   0.426575

% lambda =
%     0.42535          0          0
%           0    3.23319          0
%           0          0   41.33836

x = v(:, 2);
y = v(:, 3);
z = cross(x, y);

% z =
%   -0.094830
%   -0.419520
%   -0.902779

num = size(xyz, 1);
xyz2 = (xyz - ones(num, 1) * mean(xyz)) * [x y z]

% xyz2 =
%     1.148136   10.032849   -0.514731
%     0.332293   11.223635   -0.879359
%     2.081609   10.090517    0.619843
%    .................................
%     2.277813   -4.170312   -0.501254
%     3.601180   -4.848890   -0.688714
%     1.110301   -4.831784   -0.491249

Among other subtitles, note that:

lambda 3 has the largest value (41.33836), i.e., with the biggest variance, and its corresponding eigenvector is set as the vertical y-axis.
lambda 2 shows intermediate variation (3.23319), and its corresponding eigenvector is set as the horizontal x-axis.
The z-axis is derived as the cross-product of x- and y-axes to ensure a right-handed frame. Note that it is the opposite of eigenvector 1.
The original xyz coordinates are moved to its geometric center [xyz - ones(num, 1) * mean(xyz)], and then projected onto the preferred frame to get the transformed coordinates (xyz2) with the molecule in the most extended view (vertical then horizontal).

Saturday, July 24, 2010

3DNA for the analysis of molecular dynamics simulations

To my surprise and delight, 3DNA has been increasingly used in the analysis of trajectories of molecular dynamics (MD) simulations of nucleic acid structures. While I have heard of CHARM/AMBER for a long time and know of the basic principles of MD simulations, I have never performed any simulations (yet). Over the years, I have been mostly interested in methods development and algorithms implementation for better understanding of nucleic acids structures, and have always used NDB/PDB entries for my tests/applications. Thus unlike Curves+, 3DNA lacks direct support for the analysis of MD trajectories.

From the MD-related questions I have received over emails and on the 3DNA forum, I sense that users do not have that much of a problem or inconvenience in applying 3DNA for the task. There is one issue that does stand out, however, i.e., the use of find_pair for each and every structure in the trajectories to prepare an input file that contains base-pairing information for analyze to calculate various parameters. As an example, a recent article by Ricci et al., "Molecular Dynamics of DNA: Comparison of Force Fields and Terminal Nucleotide Definitions" (J Phys Chem B. 2010 Jul 8. [Epub ahead of print]), says that:

It is noteworthy that, because the DNA structure endured strong deformations in these simulations, 3DNA could not perform the analysis of the angles for all the 5000 snapshots. Actually, for system 3 the calculation of the backbone parameters could be performed in only 59% of the snapshots, whereas in system 4 only 18% of the configurations allowed this calculation.

Dr. Netz, the corresponding author, confirmed to me that:

I don't think this is a 3DNA problem, but is rather an effect of the distortion caused by the problems in parameterization of the GROMOS forcefield.

Essentially, though, the problem lies in the coupled use of find_pair/analyze; for strongly deformed DNA structures along the simulation trajectories, some base pairs melt out, and are thus beyond the recognition of find_pair. On the other hand, one does not need to run find_pair for each structure; simply prepare a template input file for analyze, with the help of find_pair and probably some manual editing, then all the structures can be analyzed, no matter how distorted they are. Furthermore, saving the find_pair step should speed up the analysis of thousands of structures, typical of MD simulations.

In fact, 3DNA (from v1.5) has a Perl script named nmr_strs that implements the above idea in a preliminary form. The script is intended to serve as a starting point for the analysis of multiple similar structures, in an NMR ensemble or from MD simulations. See 3DNA wiki page on nmr_strs for details.

In the long run, a better integration between 3DNA and AMBER or other MD packages would solve this problem from the upstream, thus making users' life (downstream) much easier. Until that happens, I hope this post could help clarify the issue somewhat in this important application area of 3DNA. As always, I greatly value your comments.

Friday, July 16, 2010

PDB fingerprints make salient features of PDB entries obvious at a glance

In the past couple of days, I have followed with great interest the thread "PDBprints - salient, at-a-glance info about PDB entries" in CCP4BB. The idea behind PDBprints (short for PDB finger prints) is very simple: now with over 66 thousands entries, no one could possibly be intimately familiar with many of the structures in PDB. Yet "it would be very useful to have a way of obtaining at-a-glance information about the salient features of one or more PDB entries." In this scheme, key features (e.g., experimental method, species, protein/NA/ligand components, published or not) are represented by stylized icons called PDBlogos; a set of which organized in a specific order gives PDBprints.

The initial announcement, posted on Thursday by Gerard Kleywegt (the current Director of PDBe), has been well-received by the community. I am very impressed by the quality of the (over a dozen) feedbacks on icons coloring for presence/absence of a feature, and the proper icons for species etc. On the other hand, I am surprised by the quick and concrete response Gerard posted today, addressing the various issues.

Over the years, I have taken PDB/NDB mostly as data repositories; I regularly download (update) nucleic-acid containing structures for local analyses. Among the three wwPDB sites (RCSB, PDBe and PDBj), I've only visited the RCSB site on a regular basis. Now I find more features are available from PDBe, which I will browse more in the future.

Saturday, July 3, 2010

In 3DNA, analyze and rebuild are two sides of the same coin

Recently, I noticed the following two papers:

"A measure of bending in nucleic acids structures applied to A-tract DNA" by Lankas et al. (2010), Nucleic Acids Res.,38(10), 3414–3422.
"The chirality of DNA: elasticity cross-terms at base-pair level including A-tracts and the influence of ionic strength" by Noy and Golestanian (2010), J. Phys. Chem. B, 114, 8022–8031.

They both cite 3DNA, where the combined usage of its analyze/rebuild components has played a significant role.

When I first had access to the CEHS scheme, I was immediately attracted by its mathematical rigor which allows for complete reversibility in DNA structural analysis and model rebuilding. Historically, CEHS was the initial seed of the SCHNAaP/SCHNArP programs, and analyze/rebuild in 3DNA were directly derived from them.

Frequently, I think of analyze/rebuild as two sides of the same coin: starting from a nucleic acid structure (e.g., a DNA double helix), the analyze program gives a set of base-pair (propeller, buckle etc) and step (roll, slide etc) parameters. The structural parameters can be used to rebuild the structure, which is virtually identical in base geometry (i.e., without taking consideration of the sugar-phosphate backbone) to the original structure. Conversely, analyzing the rebuilt structure again gives exactly the same set of structural parameters describing the relative base geometry.

Reversibility is essentially a simple concept, nevertheless a very powerful one. Over the years, I am glad to see more people are taking advantage of this 3DNA unique feature in addressing real-world problems related to nucleic acid structures. I am confident to see more such applications.

Sunday, June 27, 2010

Get all torsion angles in a nucleic acid structure

In the field of nucleic acid structure analysis, a commonly calculated set of parameters is the torsion angles. Specifically, they include the main chain and chi (χ) torsion angles defined as follows:

α: O3'(i-1)-P-O5'-C5'
β: P-O5'-C5'-C4'
γ: O5'-C5'-C4'-C3'
δ: C5'-C4'-C3'-O3'
ε: C4'-C3'-O3'-P(i+1)
ζ: C3'-O3'-P(i+1)-O5'(i+1)
χ: for pyrimidines (Y, i.e., T, U, C): O4'-C1'-N1-C2; for purines (R, i.e., A, G): O4'-C1'-N9-C4

A related set of parameters characterizes the sugar conformation:

ν0: C4'-O4'-C1'-C2'
ν1: O4'-C1'-C2'-C3'
ν2: C1'-C2'-C3'-C4'
ν3: C2'-C3'-C4'-O4'
ν4: C3'-C4'-O4'-C1'
tm: amplitude of pseudorotation of the sugar ring
P: phase angle of pseudorotation of the sugar ring

These torsion angles are clearly defined and are readily available from various informatics programs. Not surprisingly, 3DNA also provides a complete set of DNA/RNA backbone torsions, calculated robustly and efficiently. The key is the "-s" option of find_pair" program, which is nevertheless little used, mostly because it is not the default.

Using the Haloarcula marismortui 50S large ribosomal subunit as an example (1jj2), the output file 1jj2.outs from the following command contains all the above mentioned main chain and sugar conformational parameters:

find_pair -s 1jj2.pdb stdout | analyze

Please see the Jena "Nucleic acid backbone parameters" website for a diagram (based on Saenger's book) defining the various backbone torsion angles.

Sunday, June 20, 2010

Subscribing to mailing lists in daily digest mode

I am currently on quite a few mailing lists that are relevant to my research topics in a broad sense, including Jmol, pdb-l and CCP4. Over the years, I've found it very handy to keep informed of a field by following in its (major) mailing list. However, I could easily get lost with so many posts each day on some active lists. So subscribing to the daily digest mode, provided by many mailing list management software, has become a norm. Thus, I would receive only one (or a few) email(s) from a mailing list. I could then easily browse through the subject lines to decide if to read further of a specific topic.

Among the three lists mentioned above, Jmol is quite active with several core contributors, most impressively from Prof. Robert Hanson. The PDB mailing list is of unbelievably low volume, with less than a post per day. CCP4 bullet board is overall the most well-organized; I am surprised by the knowledge-base from the posts there – it is a great resource in structural biology.

AMBER mailing list and Computational Chemistry List (CCL) are the other two lists I subscribed to before. However, I could not get into the daily digest mode; soon my mail box was flooded by many unrelated posts so I had to un-subscribe from them.