Saturday, August 14, 2010

PDB format, how many variants are there?

In the field of structural biology of macromolecules (proteins and nucleic acids), the PDB format is still the primary one, even though it has shown its age/limitations, and mmCIF and PDBML have been introduced as alternatives.

In 3DNA, PDB is currently the only accepted format for the analysis routines (analyze/cehs), the primary format for rebuilt structures (rebuild/fiber), and various other output structure files (e.g., bestpairs.pdb etc.). By and large, 3DNA follows the PDB format documented at the RCSB website. In my maintaining and supporting of 3DNA and the forum over the years , I have come across other variants of the PDB format. Software packages customize PDB in subtle, sometimes substantial, ways that cause problems with 3DNA. Some examples follow:
  • Recently, a user tried to analyze an RNA duplex using w3DNA, but found that "the server seems not to recognize the file". It turned out that the so-called PDB format does not have residue name/number and chain id aligned in columns properly. For example, among other issues, the chain id in the problematic PDB is not at column #22 as specified by the standard.
  • A couple of months back, another user asked how to handle AMBER-introduced ribonucleotide abreviations (RA/RC/RG/RU). Compared to the above "PDB" variant, the AMBER PDB is well-behavored, and can be easily recognized by 3DNA after introducing corresponding one letter matches in file baselist.dat. Interestingly, a couple of years ago, DA/DC/DG/DT was introduced into the PDB format to distinguish deoxyribonucleotide (DNA) from ribonucleotide A/C/G/U (RNA). The AMBER convention, by using RA/RC/RG/RU for RNA, appears to be more symmetric.
  • From my (limited) experience in helping a 3DNA user several years ago, I know that CHARMM uses three-letter residue names for DNA (e.g., ADE/CYT/GUA/THY). There is actually a short Perl utility program (x3dna2charmm_pdb) in 3DNA for converting 3DNA generated PDB file to that recognized by CHARMM.
  • As noted in another post, "PDB ATOM coordinates record", Babel/Open Babel converted PDB files (mostly of small molecules) do not follow atom naming conventions.
As widely used and standard as PDB format is, there are still many variants in common use. Thus, when you meet a problem related to a PDB file, it helps to check if it conforms to the standard. With 3DNA, I've always tried to incorporate as many PDB variants as possible, so users can run 3DNA on them transparently.

Saturday, August 7, 2010

Calculation of DNA bending angle

Once in a while, I receive the question (via email or on the 3DNA forum) on how to calculate DNA bending angle, mostly associated with DNA-protein complexes. This question certainly qualifies as an FAQ; Indeed, I did consider writing an entry on it in the 3DNA website. However, I've hesitated to do so because the topic is more subtle and complicated than it appears.

On its face, an angle is defined by two vectors; let's call them a and b, and if each is normalized, then the angle (in degrees) between them is: acos(dot(a, b)) * 180/pi. Geometrically, after moving the tails of the vectors into the same position (e.g., origin), the heads would normally define a plane, unless a and b are strictly parallel (0°) or anti-parallel (180°).

DNA structures are three-dimensional, normally far more complicated than a single number can quantify. The concept of DNA bending angle, as I understand it, is only applicable to DNA structures with two relatively straight fragments (as in CAP-DNA complexes). Under such situations, one can fit a least-squares (ls) linear helical axis to each of the two fragments, and calculate the angle between them. Towards this end, 3DNA outputs the following section when it judges that the input structure is not strongly curved. Using 355d/bdl084, which is distributed with 3DNA, as an example:
Global linear helical axis defined by equivalent C1' and RN9/YN1 atom pairs                                    
Deviation from regular linear helix: 3.30(0.52)
Helix:    -0.127  -0.275  -0.953
HETATM 9998  XS    X X 999      17.536  25.713  25.665
HETATM 9999  XE    X X 999      12.911  15.677  -9.080
Average and standard deviation of helix radius:
P: 9.42(0.82), O4': 6.37(0.85),  C1': 5.85(0.86)
Where the Helix: line gives the normalized vector along the "best-fit" helical axis. The two HETATM records provides the two end points on the helix, and they are directly related to the Helix: line by a simple equation. Following the above example, we have (Octave/Matlab code):

XE = [12.911 15.677 -9.080];
XS = [17.536 25.713 25.665];

dd = XE - XS
% -4.6250 -10.0360 -34.7450

Helix = dd / norm(dd)
% -0.12685 -0.27526 -0.95296 ==> [-0.127 -0.275 -0.953]

With the two HETATM records, one can easily add them into the original PDB file to display the helical axis using a molecular graphics programs (e.g., RasMol, Jmol, PyMol). Moreover, the two helix vectors can be used to reorient the original PDB structure into a view so that one helical fragment lies along the x-axis, and the other in the xy-plane. As documented in detail in recipes #4 on "Automatic identification of double-helical regions in a DNA–RNA junction" (2008 3DNA Nature Protocols paper), "The chosen view allows for easy visualization and protractor measurement of the overall bending angle between the two relatively straight helices."

The following points are well worth noting:
  • The ls fitting procedure used in 3DNA follows SCHNAaP, which was based on the algorithm in the well-known NewHelix program, maintained by Dr. Richard Dickerson upto the 1990s. While fitting a global linear helical axis to strongly curved DNA structures clearly makes no much sense with derived parameters (NewHelix itself has been replaced by FreeHelix, also from Dickerson), I do believe it is meaningful to fit a linear helix with relatively straight DNA fragment. That's why I have kept this functionality in SCHNArP and 3DNA; 3DNA bending angle calculation serves as an example illustrating the point it provides an "intuitive" way for biologists to understand how the bending angle is calculated; it can actually be measured directly.
  • Instead of directly ls-fitting a linear helical axis with 3DNA, one can alternatively superimpose a regular fiber model into the DNA fragment, and then derive the straight helical axis from the fitted coordinates. The two approaches normally gives slightly different numerical values, as would be expected.
  • Overall, bending angle is (at most) an approximate measure of DNA curvature. In my opinion, the concept is only applicable for comparing a set of structures, each with two relatively straight fragments. Even in such cases, the relative spatial relationship between two helical fragments is more complicated than a simple (bending) angle could quantify. Specially, it is not meaningful to make a selling point in slight differences of DNA bending angles.

Friday, August 6, 2010

Genomic code for nucleosome positioning?

Over the past few years, the research on nucleosome positioning has been a hot topic: Segal et al., through a series of high-profile publications, have proposed that "the intrinsic nucleosome sequence preferences contribute substantially to nucleosome organization and chromatin function in vivo" (i.e., there exists a "genomic code") and developed bioinformatics tools to predict patterns of nucleosome positions based on genomic DNA sequence. However, as is clear from the title of a paper by Zhang et al., "Intrinsic histone-DNA interactions are not the major determinant of nucleosome positions in vivo", it is highly debatable if the nucleosome sequence preferences is strong enough to be called a "code", in the canonical sense as for the genetic code.

While I am (generally) aware of the controversy about the "genomic code", my understanding of the issue has become much clearer after reading the correspondences between both sides published in the 17(8), August 2010 issue of Nature Structural & Molecular Biology [NSMB]:
  • "Nucleosome sequence preferences influence in vivo nucleosome organization" (pp918 - 920) by Kaplan … and Segal
  • "Evidence against a genomic code for nucleosome positioning" Reply to “Nucleosome sequence preferences influence in vivo nucleosome organization” (pp920 - 923) by Zhang … and Struhl
  • "A preoccupied position on nucleosomes" (p923) by Pugh
Reading carefully through the text, it is clear how important the little details could be: 20-bp vs 40-bp windows? How significant the preference should be to be claimed as a "code"? Pugh clarifies another source of confusion, in the terminology of nucleosome ‘occupancy’ vs ‘positioning’. In my understanding, the dispute on such details itself speaks strongly against the existence of "genomic code".

It should be noted that early work by Drew & Travers (JMB85) and Satchwell et al. (JMB86) provided evidence that the rotational positioning of DNA about the histone octamer is regulated by DNA sequence-dependent bending preferences. Goodsell & Dickerson quantified the Satchwell trinucleotide model by assigning a set of roll angles to predict DNA bending from base sequence. This bending model, along with five others, was implemented by SCHNArP for building approximate 3D DNA structures (with fixed values for slide, twist etc other parameters). However, this functionality is dropped out from 3DNA, simply because of the arbitrariness/subtleties in picking up the parameters.

Sunday, August 1, 2010

Principle axes of inertia as a mean to reorient a molecule

In structural biology, it is often helpful to reorient a molecule in a specific view for easy visualization and comparison. With no a priori knowledge of a feature, the (objective) principle axes of inertia can be used as a mean to automatically reset a structure into its most extended orientation (and two other orthogonal views). In 3DNA, rotate_mol has such a functionality which is explored by blocview to generate the characteristic, simplified molecular images adopted by the RCSB PDB and NDB.

Recently, the thread (initiated by Dr. Pascal Auffinger) at the 3DNA forum titled "rotate_mol ambiguous orientations" has led me to think the topic deeper. As always, the issue is best illustrated with an example. The file 355d_A6_T19.pdb contains the coordinates of the base pair A6-T19 of PDB entry 355d. Let xyz be a variable in Octave (Matlab) containing the coordinates of all atoms in file 355d_A6_T19.pdb, the following m-file provides the detailed procedure:
xyz = [
13.954 12.080 15.071
13.092 11.246 15.953
14.769 11.453 14.020
13.056 13.202 14.406
12.184 13.965 15.222
11.308 14.846 14.376
12.033 15.991 13.896
10.654 14.188 13.168
9.262 14.480 13.256
11.373 14.820 11.974
11.810 16.185 12.511
13.051 16.742 11.953
14.253 16.097 11.802
15.197 16.853 11.305
14.585 18.084 11.106
15.066 19.322 10.605
16.329 19.520 10.209
14.198 20.357 10.535
12.938 20.162 10.946
12.370 19.048 11.444
13.259 18.034 11.497
14.865 29.619 5.710
14.172 30.844 5.281
16.107 29.695 6.515
13.846 28.701 6.510
12.684 28.174 5.864
11.909 27.311 6.826
12.626 26.090 7.129
11.584 27.960 8.173
10.189 27.833 8.395
12.409 27.166 9.184
12.544 25.808 8.521
13.716 24.979 8.867
13.496 23.688 9.310
12.381 23.202 9.467
14.631 22.971 9.551
15.943 23.388 9.393
16.852 22.613 9.632
16.107 24.759 8.938
17.488 25.309 8.745
14.995 25.475 8.711 ]

cov_mtx = cov(xyz)

% cov_mtx =
% 3.4094 2.3860 -1.4222
% 2.3860 33.7089 -15.7175
% -1.4222 -15.7175 7.8786

[v, lambda] = eig(cov_mtx)

% v =
% 0.094830 0.992835 -0.072709
% 0.419520 -0.106092 -0.901525
% 0.902779 -0.054989 0.426575

% lambda =
% 0.42535 0 0
% 0 3.23319 0
% 0 0 41.33836

x = v(:, 2);
y = v(:, 3);
z = cross(x, y);

% z =
% -0.094830
% -0.419520
% -0.902779

num = size(xyz, 1);
xyz2 = (xyz - ones(num, 1) * mean(xyz)) * [x y z]

% xyz2 =
% 1.148136 10.032849 -0.514731
% 0.332293 11.223635 -0.879359
% 2.081609 10.090517 0.619843
% .................................
% 2.277813 -4.170312 -0.501254
% 3.601180 -4.848890 -0.688714
% 1.110301 -4.831784 -0.491249
Among other subtitles, note that:
  1. lambda 3 has the largest value (41.33836), i.e., with the biggest variance, and its corresponding eigenvector is set as the vertical y-axis.
  2. lambda 2 shows intermediate variation (3.23319), and its corresponding eigenvector is set as the horizontal x-axis.
  3. The z-axis is derived as the cross-product of x- and y-axes to ensure a right-handed frame. Note that it is the opposite of eigenvector 1.
  4. The original xyz coordinates are moved to its geometric center [xyz - ones(num, 1) * mean(xyz)], and then projected onto the preferred frame to get the transformed coordinates (xyz2) with the molecule in the most extended view (vertical then horizontal).

Saturday, July 24, 2010

3DNA for the analysis of molecular dynamics simulations

To my surprise and delight, 3DNA has been increasingly used in the analysis of trajectories of molecular dynamics (MD) simulations of nucleic acid structures. While I have heard of CHARM/AMBER for a long time and know of the basic principles of MD simulations, I have never performed any simulations (yet). Over the years, I have been mostly interested in methods development and algorithms implementation for better understanding of nucleic acids structures, and have always used NDB/PDB entries for my tests/applications. Thus unlike Curves+, 3DNA lacks direct support for the analysis of MD trajectories.

From the MD-related questions I have received over emails and on the 3DNA forum, I sense that users do not have that much of a problem or inconvenience in applying 3DNA for the task. There is one issue that does stand out, however, i.e., the use of find_pair for each and every structure in the trajectories to prepare an input file that contains base-pairing information for analyze to calculate various parameters. As an example, a recent article by Ricci et al., "Molecular Dynamics of DNA: Comparison of Force Fields and Terminal Nucleotide Definitions" (J Phys Chem B. 2010 Jul 8. [Epub ahead of print]), says that:
It is noteworthy that, because the DNA structure endured strong deformations in these simulations, 3DNA could not perform the analysis of the angles for all the 5000 snapshots. Actually, for system 3 the calculation of the backbone parameters could be performed in only 59% of the snapshots, whereas in system 4 only 18% of the configurations allowed this calculation.
Dr. Netz, the corresponding author, confirmed to me that:
I don't think this is a 3DNA problem, but is rather an effect of the distortion caused by the problems in parameterization of the GROMOS forcefield.
Essentially, though, the problem lies in the coupled use of find_pair/analyze; for strongly deformed DNA structures along the simulation trajectories, some base pairs melt out, and are thus beyond the recognition of find_pair. On the other hand, one does not need to run find_pair for each structure; simply prepare a template input file for analyze, with the help of find_pair and probably some manual editing, then all the structures can be analyzed, no matter how distorted they are. Furthermore, saving the find_pair step should speed up the analysis of thousands of structures, typical of MD simulations.

In fact, 3DNA (from v1.5) has a Perl script named nmr_strs that implements the above idea in a preliminary form. The script is intended to serve as a starting point for the analysis of multiple similar structures, in an NMR ensemble or from MD simulations. See 3DNA wiki page on nmr_strs for details.

In the long run, a better integration between 3DNA and AMBER or other MD packages would solve this problem from the upstream, thus making users' life (downstream) much easier. Until that happens, I hope this post could help clarify the issue somewhat in this important application area of 3DNA. As always, I greatly value your comments.

Friday, July 16, 2010

PDB fingerprints make salient features of PDB entries obvious at a glance

In the past couple of days, I have followed with great interest the thread "PDBprints - salient, at-a-glance info about PDB entries" in CCP4BB. The idea behind PDBprints (short for PDB finger prints) is very simple: now with over 66 thousands entries, no one could possibly be intimately familiar with many of the structures in PDB. Yet "it would be very useful to have a way of obtaining at-a-glance information about the salient features of one or more PDB entries." In this scheme, key features (e.g., experimental method, species, protein/NA/ligand components, published or not) are represented by stylized icons called PDBlogos; a set of which organized in a specific order gives PDBprints.

The initial announcement, posted on Thursday by Gerard Kleywegt (the current Director of PDBe), has been well-received by the community. I am very impressed by the quality of the (over a dozen) feedbacks on icons coloring for presence/absence of a feature, and the proper icons for species etc. On the other hand, I am surprised by the quick and concrete response Gerard posted today, addressing the various issues.

Over the years, I have taken PDB/NDB mostly as data repositories; I regularly download (update) nucleic-acid containing structures for local analyses. Among the three wwPDB sites (RCSB, PDBe and PDBj), I've only visited the RCSB site on a regular basis. Now I find more features are available from PDBe, which I will browse more in the future.

Saturday, July 3, 2010

In 3DNA, analyze and rebuild are two sides of the same coin

Recently, I noticed the following two papers:
They both cite 3DNA, where the combined usage of its analyze/rebuild components has played a significant role.

When I first had access to the CEHS scheme, I was immediately attracted by its mathematical rigor which allows for complete reversibility in DNA structural analysis and model rebuilding. Historically, CEHS was the initial seed of the SCHNAaP/SCHNArP programs, and analyze/rebuild in 3DNA were directly derived from them.

Frequently, I think of analyze/rebuild as two sides of the same coin: starting from a nucleic acid structure (e.g., a DNA double helix), the analyze program gives a set of base-pair (propeller, buckle etc) and step (roll, slide etc) parameters. The structural parameters can be used to rebuild the structure, which is virtually identical in base geometry (i.e., without taking consideration of the sugar-phosphate backbone) to the original structure. Conversely, analyzing the rebuilt structure again gives exactly the same set of structural parameters describing the relative base geometry.

Reversibility is essentially a simple concept, nevertheless a very powerful one. Over the years, I am glad to see more people are taking advantage of this 3DNA unique feature in addressing real-world problems related to nucleic acid structures. I am confident to see more such applications.

Sunday, June 27, 2010

Get all torsion angles in a nucleic acid structure

In the field of nucleic acid structure analysis, a commonly calculated set of parameters is the torsion angles. Specifically, they include the main chain and chi (χ) torsion angles defined as follows:
  • α: O3'(i-1)-P-O5'-C5'
  • β: P-O5'-C5'-C4'
  • γ: O5'-C5'-C4'-C3'
  • δ: C5'-C4'-C3'-O3'
  • ε: C4'-C3'-O3'-P(i+1)
  • ζ: C3'-O3'-P(i+1)-O5'(i+1)
  • χ: for pyrimidines (Y, i.e., T, U, C): O4'-C1'-N1-C2; for purines (R, i.e., A, G): O4'-C1'-N9-C4
A related set of parameters characterizes the sugar conformation:
  • ν0: C4'-O4'-C1'-C2'
  • ν1: O4'-C1'-C2'-C3'
  • ν2: C1'-C2'-C3'-C4'
  • ν3: C2'-C3'-C4'-O4'
  • ν4: C3'-C4'-O4'-C1'
  • tm: amplitude of pseudorotation of the sugar ring
  • P: phase angle of pseudorotation of the sugar ring
These torsion angles are clearly defined and are readily available from various informatics programs. Not surprisingly, 3DNA also provides a complete set of DNA/RNA backbone torsions, calculated robustly and efficiently. The key is the "-s" option of find_pair" program, which is nevertheless little used, mostly because it is not the default.

Using the Haloarcula marismortui 50S large ribosomal subunit as an example (1jj2), the output file 1jj2.outs from the following command contains all the above mentioned main chain and sugar conformational parameters:
find_pair -s 1jj2.pdb stdout | analyze
Please see the Jena "Nucleic acid backbone parameters" website for a diagram (based on Saenger's book) defining the various backbone torsion angles.

Sunday, June 20, 2010

Subscribing to mailing lists in daily digest mode

I am currently on quite a few mailing lists that are relevant to my research topics in a broad sense, including Jmol, pdb-l and CCP4. Over the years, I've found it very handy to keep informed of a field by following in its (major) mailing list. However, I could easily get lost with so many posts each day on some active lists. So subscribing to the daily digest mode, provided by many mailing list management software, has become a norm. Thus, I would receive only one (or a few) email(s) from a mailing list. I could then easily browse through the subject lines to decide if to read further of a specific topic.

Among the three lists mentioned above, Jmol is quite active with several core contributors, most impressively from Prof. Robert Hanson. The PDB mailing list is of unbelievably low volume, with less than a post per day. CCP4 bullet board is overall the most well-organized; I am surprised by the knowledge-base from the posts there – it is a great resource in structural biology.

AMBER mailing list and Computational Chemistry List (CCL) are the other two lists I subscribed to before. However, I could not get into the daily digest mode; soon my mail box was flooded by many unrelated posts so I had to un-subscribe from them.