Xiang-Jun's Corner

Sunday, June 26, 2011

PDB v3.3 and partial atom occupancy

From the PDB mailing list, I learned of the recent announcement "PDB Archive Version 4.0 to be Released July 13, 2011". This "ambitious" review of the PDB archive has resulted in a new set of corrected files in ten categories, including biological assemblies, residual B factors, peptide inhibitors/antibiotics, polymers containing nonstandard polymer linkages, and partial occupancy. "These data reflect the wwPDB's continuing commitment to providing accurate and detailed data to users worldwide."

I am interested in changes in the PDB format, and read the PDF document "Description of Changes and Corrections for PDB July 2011 Remediation Release":

PDB format files updated in this remediation release comply with PDB Format Version 3.30. PDBx and PDBML data files comply with the PDB Exchange Dictionary v.4.0, and PDBML XML Schema V4.0, respectively.

Specifically, I checked carefully the section on "Partial occupancy", which is quoted in full below:

Problem

In the 2009 remediation, occupancies were corrected in 490 X-ray
and neutron entries. A mistake was made in 104 of these entries:
for atoms with alternate conformer labels and with summed total
occupancy less than 1.0, the occupancies were re-scaled as 1.0/n,
where n is the number of conformers.

Approach

The originally deposited occupancies of the affected atoms were
restored and the remediation was then carried out properly, via:

Atoms with multiple conformations but identical coordinates and B-values were merged and their occupancies were summed.

Atoms which now have (total) occupancies <= 1.0 were left as deposited.

Atoms with (total) occupancies > 1 were rescaled proportionally to a sum of 1.0

Results

The occupancies have been corrected in these entries.
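
The three rules quoted under "Approach" can be sketched in a few lines of Python. This is only an illustration of the stated logic, not the wwPDB's actual remediation code; the function name `remediate_occupancies` and the tuple layout (altloc label, coordinates, B-value, occupancy) are assumptions made for this example.

```python
def remediate_occupancies(conformers, tol=1e-6):
    """Apply the three quoted rules to the alternate conformers of ONE atom.

    conformers: list of (altloc, (x, y, z), b_value, occupancy) tuples.
    Returns a new list in the same layout.
    """
    # Rule 1: merge conformers with identical coordinates and B-values,
    # summing their occupancies (the first altloc label is kept).
    merged = {}
    for altloc, xyz, b_value, occ in conformers:
        key = (xyz, b_value)
        if key in merged:
            first_altloc, total = merged[key]
            merged[key] = (first_altloc, total + occ)
        else:
            merged[key] = (altloc, occ)
    result = [(altloc, xyz, b, occ) for (xyz, b), (altloc, occ) in merged.items()]

    # Rule 2: a total occupancy <= 1.0 is left as deposited.
    total = sum(occ for _, _, _, occ in result)
    if total <= 1.0 + tol:
        return result

    # Rule 3: a total occupancy > 1.0 is rescaled proportionally to sum to 1.0.
    return [(altloc, xyz, b, occ / total) for altloc, xyz, b, occ in result]


# Hypothetical test atoms (not taken from any PDB entry)
under = [("A", (1.0, 2.0, 3.0), 20.0, 0.3), ("B", (1.5, 2.0, 3.0), 20.0, 0.4)]
over = [("A", (1.0, 2.0, 3.0), 20.0, 0.8), ("B", (1.5, 2.0, 3.0), 20.0, 0.8)]
dup = [("A", (1.0, 2.0, 3.0), 20.0, 0.5), ("B", (1.0, 2.0, 3.0), 20.0, 0.5)]

for case in (under, over, dup):
    print(remediate_occupancies(case))
```

In the first case the deposited 0.3/0.4 split is kept, whereas the 1.0/n rescaling described under "Problem" would have turned it into 0.5/0.5.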

This partial occupancy issue reminds me of an extensive and very informative discussion a few months back in CCP4BB, under the thread "what to do with disordered side chains" and its derivatives, about setting "zero" occupancy and/or high B values for PDB ATOM/HETATM records in disordered regions. Over the course of the discussion, Frances Bernstein made the following comment:

I am absolutely positive that there is software that does its voodoo on ATOM/HETATM records and pays absolutely no attention to anything beyond the x, y, z coordinates (i.e. beyond column 54).

3DNA does pay attention beyond column 54 (up to 80, actually) for ATOM/HETATM records, but internally it does not make use of the occupancy info. In future releases of 3DNA, I am planning to take occupancy/B-factor into consideration, probably through configurable parameters.
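
For readers less familiar with the fixed-column layout, here is a minimal Python sketch of reading the fields beyond column 54, i.e., occupancy (columns 55-60) and B-factor (columns 61-66). The record is the C4' atom of G4 (chain A) in PDB entry 1bna; the names `parse_atom_record` and `_float_or_none` are made up for this illustration, and the code is not 3DNA's actual parser.

```python
def _float_or_none(text):
    """Convert a fixed-width field to float; blank fields become None."""
    text = text.strip()
    return float(text) if text else None


def parse_atom_record(line):
    """Extract selected fixed-column fields from a PDB ATOM/HETATM record."""
    line = line.ljust(80)  # pad short records so every slice is valid
    return {
        "record": line[0:6].strip(),               # columns 1-6
        "name": line[12:16].strip(),               # columns 13-16
        "chain": line[21],                         # column 22
        "xyz": (float(line[30:38]),                # columns 31-38
                float(line[38:46]),                # columns 39-46
                float(line[46:54])),               # columns 47-54
        "occupancy": _float_or_none(line[54:60]),  # columns 55-60
        "b_factor": _float_or_none(line[60:66]),   # columns 61-66
    }


record = "ATOM     63  C4'  DG A   4      21.393  16.960  18.505  1.00 53.00"
atom = parse_atom_record(record)
print(atom["name"], atom["chain"], atom["xyz"], atom["occupancy"], atom["b_factor"])
```

Returning `None` for a blank occupancy or B-factor field is a design choice of this sketch; it keeps records that stop at column 54 readable instead of raising an error.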

Reading through the "Description of Changes and Corrections for PDB July 2011 Remediation Release", I noticed cases of re-corrections of previous corrections in wwPDB remediation efforts. A concrete example is about partial occupancy in the 2009 remediation: among the 490 corrected X-ray and neutron entries, a mistake was made in 104 of them. As a side note, there was a post by Morten Kjeldgaard, titled "PDB changes data in entries?" early this year in CCP4BB, where partial occupancy was used as an example.

Sunday, June 19, 2011

Sugar pucker correlates with phosphorus-related distance

The sugar puckers in DNA/RNA structures are predominantly either C3'-endo or C2'-endo (see Figure below), corresponding to the A- or B-form conformation in a DNA duplex.

Recently, I (re-)read a few articles related to the RNA backbone by Jane Richardson et al., including

"RNA backbone is rotameric" (PNAS 2003)
"RNA backbone: consensus all-angle conformers and modular string nomenclature" (RNA 2008)
"MolProbity: all-atom structure validation for macromolecular crystallography" (Acta Crystallogr D Biol Crystallogr. 2010)
"PHENIX: a comprehensive Python-based system for macromolecular structure solution" (Acta Crystallogr D Biol Crystallogr. 2010)

I somehow became interested in the correlation between sugar pucker and a simple distance parameter, as reported in these papers:

C3'-endo and C2'-endo sugar puckers are highly correlated to the perpendicular distance between the C1'–N1/9 glycosidic bond vector and the following phosphate: > 2.9 Å for C3'-endo and < 2.9 Å for C2'-endo. (p.16 of the MolProbity paper)

Out of curiosity and to get a better understanding of this correlation, I played around with some sample cases both visually in RasMol and numerically. Overall, this is a simple geometric problem, i.e., the shortest distance from a point to a line in 3-dimensional space. Given below is the Octave/Matlab script for calculating the distances for G175 and U176 of PDB entry 1JJ2 (the large ribosomal subunit of Haloarcula marismortui):

function d = get_p3_nc_dist(P3, C1, N)
    N_C1 = N - C1;                 # vector from C1' to N
    nv_N_C1 = N_C1 / norm(N_C1);   # normalized vector
    C1_P3 = P3 - C1;               # vector from C1 to P3
    proj = dot(C1_P3, nv_N_C1);
    d  = norm(C1_P3 - proj * nv_N_C1);
end

## G175 (1jj2)
P3 = [70.104 112.366  44.586];
C1 = [73.017 109.666  45.304];
N = [74.445 109.380  45.288];
d1 = get_p3_nc_dist(P3, C1, N)    # 2.2 Å -- C2'-endo

## U176 (1jj2)
P3 = [66.871 116.402  46.804];
C1 = [68.213 112.454  49.279];
N = [69.678 112.480  49.438];
d2 = get_p3_nc_dist(P3, C1, N)    # 4.6 Å -- C3'-endo

The GpU used in the above example forms a dinucleotide platform, where the sugar of G175 adopts a C2'-endo conformation, and that of U176 has C3'-endo. Indeed, the distance for the G175 nucleotide is 2.2 Å, less than 2.9 Å; whilst the value for U176 is 4.6 Å, greater than 2.9 Å.
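
The 2.9 Å criterion can also be wrapped into a small classifier. Below is a plain-Python sketch using the same G175 and U176 coordinates as above; the function names `p_to_glycosidic_distance` and `pucker_from_distance` are hypothetical, and the handling of a distance of exactly 2.9 Å is an arbitrary choice, since the quoted criterion leaves that case open.

```python
import math


def p_to_glycosidic_distance(p3, c1, n):
    """Perpendicular distance from the 3' phosphorus to the C1'-N1/9 line."""
    bond = [a - b for a, b in zip(n, c1)]    # vector from C1' to N1/N9
    to_p = [a - b for a, b in zip(p3, c1)]   # vector from C1' to P
    bond_len = math.sqrt(sum(x * x for x in bond))
    unit = [x / bond_len for x in bond]
    proj = sum(a * u for a, u in zip(to_p, unit))       # component along the bond
    perp = [a - proj * u for a, u in zip(to_p, unit)]   # component normal to it
    return math.sqrt(sum(x * x for x in perp))


def pucker_from_distance(d, cutoff=2.9):
    """Apply the quoted criterion; d == cutoff falls to C2'-endo by choice."""
    return "C3'-endo" if d > cutoff else "C2'-endo"


# G175 and U176 of 1jj2, coordinates as listed above
d_g175 = p_to_glycosidic_distance((70.104, 112.366, 44.586),
                                  (73.017, 109.666, 45.304),
                                  (74.445, 109.380, 45.288))
d_u176 = p_to_glycosidic_distance((66.871, 116.402, 46.804),
                                  (68.213, 112.454, 49.279),
                                  (69.678, 112.480, 49.438))

for label, d in (("G175", d_g175), ("U176", d_u176)):
    print(f"{label}: {d:.1f} A -> {pucker_from_distance(d)}")
```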

It is worth noting that the above-mentioned articles from Richardson et al. focus on the RNA backbone, without paying attention to the base (pair) geometry. The Zp parameter, which quantifies the z-coordinate of the phosphorus atom in the mean reference frame (see "A-form conformational motifs in ligand-bound DNA structures", JMB 2000), can be easily adapted to the analysis of single-stranded RNA structures. For example, the vertical distances of the 3' phosphorus atoms to the G175 and U176 base planes are 1.9 Å and 4.4 Å, respectively.

Since base planes and the phosphorus atoms are the most accurately located entities in a given nucleic acid structure, the nucleotide-based Zp variant presumably would have some advantage over the distance from phosphorus to the glycosidic bond. Naturally, this Zp parameter will be added in future releases of 3DNA.

Saturday, June 11, 2011

Conformation of the sugar ring in nucleic acid structures

The conformation of the five-membered sugar ring in DNA/RNA structures can be characterized using the five corresponding endocyclic torsion angles (see Figure below).

i.e.,

v0: C4'-O4'-C1'-C2'
v1: O4'-C1'-C2'-C3'
v2: C1'-C2'-C3'-C4'
v3: C2'-C3'-C4'-O4'
v4: C3'-C4'-O4'-C1'

Due to the ring constraint, the conformation can be characterized approximately by 5 - 3 = 2 parameters. Using the concept of pseudorotation of the sugar ring, the two parameters are the amplitude (τ_m) and phase angle (P).

One widely used set of formulas to convert the five torsion angles to the pseudorotation parameters is due to Altona & Sundaralingam: "Conformational Analysis of the Sugar Ring in Nucleosides and Nucleotides. A New Description Using the Concept of Pseudorotation" [J. Am. Chem. Soc., 1972, 94(23), pp 8205–8212]. As always, the concept is best illustrated with an example. Here I use the sugar ring of G4 (chain A) of the Dickerson-Drew dodecamer (1bna/bdl001), with Matlab/Octave code:

# xyz coordinates of the sugar ring: G4 (chain A), 1bna/bdl001
# ATOM     63  C4'  DG A   4      21.393  16.960  18.505  1.00 53.00
# ATOM     64  O4'  DG A   4      20.353  17.952  18.496  1.00 38.79
# ATOM     65  C3'  DG A   4      21.264  16.229  17.176  1.00 56.72
# ATOM     67  C2'  DG A   4      20.793  17.368  16.288  1.00 40.81
# ATOM     68  C1'  DG A   4      19.716  17.901  17.218  1.00 30.52

# endocyclic torsion angles:
v0 = -26.7; v1 = 46.3; v2 = -47.1; v3 = 33.4; v4 = -4.4
Pconst = sin(pi/5) + sin(pi/2.5)  # 1.5388
P0 = atan2(v4 + v1 - v3 - v0, 2.0 * v2 * Pconst);  # 2.9034
tm = v2 / cos(P0);  # amplitude: 48.469
P = 180/pi * P0;  # phase angle: 166.35 [P + 360 if P0 < 0]
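
The endocyclic torsion angles used above can be derived directly from the five ring-atom coordinates. The following is a minimal plain-Python sketch of the standard signed dihedral-angle calculation; the helper names (`_sub`, `_dot`, `_cross`, `torsion`) are made up for this example and it is not the 3DNA implementation. With the G4 coordinates listed above, the output should agree with the v0 to v4 values to within rounding.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def torsion(p1, p2, p3, p4):
    """Signed torsion angle p1-p2-p3-p4 in degrees, within [-180, 180]."""
    b1, b2, b3 = _sub(p2, p1), _sub(p3, p2), _sub(p4, p3)
    n1, n2 = _cross(b1, b2), _cross(b2, b3)   # normals of the two planes
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))


# Sugar ring of G4 (chain A), 1bna, coordinates as listed above
ring = {
    "C4'": (21.393, 16.960, 18.505),
    "O4'": (20.353, 17.952, 18.496),
    "C3'": (21.264, 16.229, 17.176),
    "C2'": (20.793, 17.368, 16.288),
    "C1'": (19.716, 17.901, 17.218),
}

definitions = {
    "v0": ("C4'", "O4'", "C1'", "C2'"),
    "v1": ("O4'", "C1'", "C2'", "C3'"),
    "v2": ("C1'", "C2'", "C3'", "C4'"),
    "v3": ("C2'", "C3'", "C4'", "O4'"),
    "v4": ("C3'", "C4'", "O4'", "C1'"),
}

torsions = {name: torsion(*(ring[atom] for atom in atoms))
            for name, atoms in definitions.items()}
for name, value in torsions.items():
    print(f"{name} = {value:6.1f}")
```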

The Altona & Sundaralingam (1972) pseudorotation parameters are what have been adopted in 3DNA. The Curves+ program, however, uses another set of formulas due to Westhof & Sundaralingam: "A Method for the Analysis of Puckering Disorder in Five-Membered Rings: The Relative Mobilities of Furanose and Proline Rings and Their Effects on Polynucleotide and Polypeptide Backbone Flexibility" [J. Am. Chem. Soc., 1983, 105(4), pp 970–976]. The two sets of formulas by Altona & Sundaralingam (1972) and Westhof & Sundaralingam (1983) give slightly different numerical values for the two pseudorotation parameters (amplitude τ_m and phase angle P).

Since 3DNA and Curves+ are currently the most commonly used programs for conformational analysis of nucleic acid structures, the subtle differences in pseudorotation parameters may cause confusion for users who use both programs. With the same G4 (chain A, 1bna) sugar ring, here is the Matlab/Octave script showing how Curves+ calculates the pseudorotation parameters:

# xyz coordinates of sugar ring G4 (chain A, 1bna/bdl001)

# endocyclic torsion angles, same as above
v0 = -26.7; v1 = 46.3; v2 = -47.1; v3 = 33.4; v4 = -4.4

v = [v2, v3, v4, v0, v1]; # reorder them into vector v[]
A = 0; B = 0;
for i = 1:5
    t = 0.8 * pi * (i - 1);
    A += v(i) * cos(t);
    B += v(i) * sin(t);
end
A *= 0.4;   # -48.476
B *= -0.4;  # 11.516

tm = sqrt(A * A + B * B);  # 49.825

c = A/tm; s = B/tm;
P = atan2(s, c) * 180 / pi;  # 166.64

For this specific example, i.e., the sugar ring G4 (chain A, 1bna/bdl001), the pseudorotation parameters as calculated by 3DNA following Altona & Sundaralingam (1972) and Curves+ following Westhof & Sundaralingam (1983) are as follows:

         amplitude (τ_m)     phase angle (P)
3DNA        48.469             166.35
Curves+     49.825             166.64

Needless to say, the differences are subtle, and few people will notice/bother at all. For those who do care about such little details, however, hopefully this post will help you understand where the differences actually come from.
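
One way to see where the difference comes from is to feed both sets of formulas with the torsion angles of an ideal pseudorotating ring, v_j = τ_m cos(P + 144° (j - 2)) for j = 0..4. In that case the two methods return identical values, so the small discrepancy for G4 reflects the deviation of the observed torsions from the ideal cosine form. Below is a Python sketch of that check; the function names are hypothetical, and the code is a port of the scripts above rather than the actual 3DNA or Curves+ source.

```python
import math


def altona_sundaralingam(v0, v1, v2, v3, v4):
    """(amplitude, phase in degrees) following the 1972 script above.

    Note: tm = v2 / cos(P) is ill-conditioned when P is near 90 or 270 degrees.
    """
    pconst = math.sin(math.pi / 5) + math.sin(math.pi / 2.5)
    p0 = math.atan2(v4 + v1 - v3 - v0, 2.0 * v2 * pconst)
    tm = v2 / math.cos(p0)
    return tm, math.degrees(p0) % 360.0


def westhof_sundaralingam(v0, v1, v2, v3, v4):
    """(amplitude, phase in degrees) following the 1983 script above."""
    v = [v2, v3, v4, v0, v1]  # same reordering as in the Octave version
    a = 0.4 * sum(x * math.cos(0.8 * math.pi * i) for i, x in enumerate(v))
    b = -0.4 * sum(x * math.sin(0.8 * math.pi * i) for i, x in enumerate(v))
    return math.hypot(a, b), math.degrees(math.atan2(b, a)) % 360.0


def ideal_torsions(tm, p_deg):
    """Torsions v0..v4 of an ideal ring: v_j = tm * cos(P + 144 * (j - 2))."""
    return [tm * math.cos(math.radians(p_deg + 144.0 * (j - 2))) for j in range(5)]


observed = (-26.7, 46.3, -47.1, 33.4, -4.4)   # G4 (chain A), 1bna
ideal = ideal_torsions(40.0, 160.0)           # arbitrary test values

for label, v in (("observed", observed), ("ideal", ideal)):
    tm_a, p_a = altona_sundaralingam(*v)
    tm_w, p_w = westhof_sundaralingam(*v)
    print(f"{label:9s} A&S: {tm_a:7.3f} {p_a:7.2f}   W&S: {tm_w:7.3f} {p_w:7.2f}")
```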

Sunday, June 5, 2011

Lower case chain identifiers in PDB format

First formulated in the early 1970s, the PDB format is rigid, with fixed columns for designated contents in its ATOM/HETATM records. Specifically, a single column, #22, is assigned for the chain identifier (id). Traditionally, the 26 upper case letters of the English alphabet (A-Z), space (i.e., ' '), and the single digits (0-9) have been used as chain ids. Up until the ribosomal structures came up, I guess, those 26 + 1 + 10 = 37 characters had been sufficient for the chain ids.

To the best of my knowledge, for a long time, most PDB parsers assumed upper case chain ids. Indeed, 3DNA v1.5 automatically converts each ATOM/HETATM record to upper case. The first time I became aware of lower case chain ids was when I saw a post in the 3DNA forum, titled "Small bug in find_pair", where a user reported the 'w' vs 'W' chain ids in PDB entry 1VSP. Then I refined 3DNA so that the case of chain ids can be preserved, through an undocumented command line option (as a feature for internal testing purposes).

My view that 3DNA chain ids should be case-sensitive was reinforced when I read the article "Crystal structures of CGG RNA repeats with implications for fragile X-associated tremor ataxia syndrome", recently published in Nucleic Acids Research. The asymmetric unit of the unmodified CGG-repeats-containing duplex (GCGGCGGC)₂, NDB entry NA1017 / PDB entry 3R1C, contains a total of 36 chains, designated A-Z plus a-j. Without distinguishing the case of the chain ids, the 3DNA output would become quite confusing.
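
The effect of folding chain ids to upper case is easy to demonstrate. The Python sketch below builds hypothetical, truncated ATOM records (not the actual coordinates of 3R1C) for chains A-Z plus a-j and reads the chain id back from column 22; the helper names `make_record` and `chain_id` are invented for this illustration.

```python
import string


def make_record(cid):
    """Build the first 26 columns of a hypothetical ATOM record for chain cid."""
    # columns: 1-6 record, 7-11 serial, 13-16 atom name, 17 altLoc,
    #          18-20 residue name, 22 chain id, 23-26 residue number
    return "{:<6}{:>5} {:<4}{:1}{:>3} {:1}{:>4}".format(
        "ATOM", 1, " P", "", "G", cid, 1)


def chain_id(record, preserve_case=True):
    """Return the chain id stored in column 22 (index 21) of a record."""
    cid = record[21]
    return cid if preserve_case else cid.upper()


ids = string.ascii_uppercase + "abcdefghij"   # A-Z plus a-j, 36 chains
records = [make_record(cid) for cid in ids]

preserved = {chain_id(r) for r in records}
folded = {chain_id(r, preserve_case=False) for r in records}
print(len(preserved), "distinct chains when case is preserved")
print(len(folded), "distinct chains after upper-casing")
```

With case preserved, all 36 chains remain distinct; after upper-casing, chains a-j collapse onto A-J and only 26 remain.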

Thus, in future releases of 3DNA, the default would be switched to preserve the case of chain ids. This chain id 'case' serves as an excellent example that scientific software products, unlike publications per se, are not fixed but need continuous care and maintenance to meet the challenges of an evolving world.

Sunday, May 22, 2011

NAR's top ten articles

Recently, I noticed a new feature on the website of Nucleic Acids Research (NAR), i.e., its selection of top ten articles:

NAR’s Top Ten Articles are updated monthly and show recent articles that have been most often accessed in HTML and PDF formats in the specified month.

In the age of information explosion, with a flood of scientific journals and articles, it is easy to get lost. NAR's pick of top ten and featured articles draws my attention to significant work I may otherwise overlook.

The current top ten articles (March 2011) are all selected from 2009/2010 publications in 'Database', 'Methods online', and 'Survey and Summary'. I am browsing the 2009 article by Thomas LaFramboise, "Single nucleotide polymorphism arrays: a decade of biological, computational and technological advances", to get a better understanding of SNPs.

Sunday, May 15, 2011

Posts in the 3DNA forum reach 600

As of May 6, the total number of posts in the 3DNA forum has reached 600. Created in March 2007, with my debut post titled "Welcome message from Xiang-Jun Lu", the forum is now over four years old. Overall, the forum has served its purpose pretty well. In answering questions, I've been increasingly referring to the posts in the forum. As a concrete example, see the thread of a recent question "Base pair step parameters with a missing base pair".

At less than three posts (about one question) per week on average, I've not felt too much stress in supporting the forum (and maintaining 3DNA) in my spare time. For the most part, I've enjoyed interacting with 3DNA users from everywhere in the world, and with diverse backgrounds. Following the Unix philosophy ("Write programs that do one thing and do it well. Write programs to work together."), 3DNA has proved to be robust and flexible in serving its ever-growing user community. As a matter of fact, few questions I received a couple of years ago were beyond my original consideration of the details while I wrote the code. It is this intimate knowledge of all the underlying algorithms and every bit of their implementations that allows me to answer users' questions quickly and concretely.

As time passes, however, it has become evident to me that 3DNA needs to be further refined and extended to meet the ever-changing needs of its user community. For example, over the past few months, several questions asked in the 3DNA forum have been directly relevant but clearly beyond 3DNA's current capabilities. While I'd be interested in implementing some of the requested functionality that makes sense to me, doing so is certainly beyond what my spare time allows. On the other hand, my increased understanding of nucleic acid structures and accumulated software expertise make it simply an issue of time and effort to move 3DNA to the next level, far beyond its current application scope and impact.

With posts in the 3DNA forum reaching 600, and citations to 3DNA articles over 600 (Google scholar), I am hopeful something good will happen to the 3DNA project. After all, 6 is a lucky number in traditional Chinese culture.

Fifty years of operon

In the latest issue of Science, there is a one-page editorial titled "The Birth of the Operon" by François Jacob, who won the Nobel Prize in Physiology or Medicine in 1965:

What is the operon, whose 50th anniversary is being celebrated this week? The word heralded the discovery of how genes are turned on and off, and it launched the now-immense field of gene regulation. ... we cannot presume to know how new ideas will arise and where scientific research will lead.

In the next three paragraphs, Jacob provides an insightful and vivid description of his research related to the discovery of the "operon" – a structural gene-regulatory gene ensemble. In consonance with his comment on scientific discovery, he concludes:

Our breakthrough was the result of “night science”: a stumbling, wandering exploration of the natural world that relies on intuition as much as it does on the cold, orderly logic of “day science.” In today’s vastly expanded scientific enterprise, obsessed with impact factors and competition, we will need much more night science to unveil the many mysteries that remain about the workings of organisms.

It is worth noting that the Journal of Molecular Biology (JMB) has recently published a special issue [Volume 409, Issue 1, Pages 1-88 (27 May 2011)], titled "The Operon Model and its Impact on Modern Molecular Biology", with historical accounts and reviews to celebrate the operon's 50th anniversary. It is this event that motivated me to read the Jacob and Monod 1961 JMB review article "Genetic regulatory mechanisms in the synthesis of proteins" – I have come across this paper so many times before, and should definitely have read it long ago!

Curves+ web server

Through Google Scholar, I became aware of the article online in Nucleic Acids Research (NAR), titled "CURVES+ web server for analyzing and visualizing the helical, backbone and groove parameters of nucleic acid structures" by Richard Lavery's group:

Curves+, a revised version of the Curves software for analyzing the conformation of nucleic acid structures, is now available as a web server. This version, which can be freely accessed at http://gbio-pbil.ibcp.fr/cgi/Curves_plus/, allows the user to upload a nucleic acid structure file, choose the nucleotides to be analyzed and after optionally setting a number of input variables, view the numerical and graphic results online or download files containing a set of helical, backbone and groove parameters that fully describe the structure. PDB format files are also provided for offline visualization of the helical axis and groove geometry.

The website looks quite streamlined, with required input information all in a single page, and the test page also ran smoothly. In less than two years following the publication of Curves+, it is nice to see the Curves+ web server version available, making this analysis tool more readily available to the nucleic acids community.

Nowadays, it seems safe (to the best of my knowledge) to say that only 3DNA and Curves+ conform to the 1999 Tsukuba convention for the description of nucleic acid base-pair geometry, and each of them provides a web interface: web 3DNA and web Curves+.

Sunday, May 1, 2011

Scientific journals on nucleic acids

To my knowledge, Nucleic Acids Research (NAR) is a highly respected scientific journal with a broad impact in the field of nucleic acids. Over the years, I have been browsing the NAR webpage on a regular basis to keep myself up to date with the latest developments in this area. It is thus no surprise that the initial 3DNA paper was submitted to and published in NAR in 2003. Among the 500+ citations to that 3DNA paper, over 1/5 (100+) are from NAR itself (as an example, please see my January 22, 2011 blog post titled "Three structural biology papers in the latest issue of NAR cite 3DNA"). My latest contribution to NAR is the GpU story, which was actually selected as a featured article.

Another related journal I am quite familiar with is RNA, a publication of the RNA society. As the "About" section of its webpage succinctly summarizes,

RNA serves as an international forum for publishing original reports on RNA research in the broadest sense. The journal aims to unify this field by cutting across established disciplinary lines and focusing on "RNA-centered" science.

RNA currently has an impact factor (IF) of 5.198 (2009), slightly lower than NAR's 7.479. It is, nevertheless, a very decent journal in RNA-related research, and I frequently visit its website. As a side note, the GpU paper was initially submitted to RNA for its RNA-specific content and as a way to diversify my publication spectrum (as mentioned above, 3DNA was initially published in NAR). Unfortunately, the GpU paper was rejected by the RNA journal after two rounds of review, spanning over 6 months.

Another journal closely related to RNA (name-wise) is RNA Biology, which even has a slightly higher IF of 5.56. Admittedly, I was not familiar with this journal at all. Browsing through its website, I was interested to see the journal's explicit policy to reconsider papers "rejected by high impact journals [CNS] for reasons of novelty and impact, rather than the importance of the study or the integrity of the data." By enclosing "the reviewers’ and/or editorial comments" from these high impact journals, "it is possible the article might be accepted [by RNA Biology] based on its previous review. This will allow the urgent and competitive research to be published on the day of submission."

I became aware of the journal DNA Research quite recently through an email. From its website, "DNA Research is an internationally peer-reviewed journal which aims at publishing papers of highest quality in broad aspects of DNA and genome-related research." The journal currently has an IF of 4.917. Browsing a couple of its online issues, I sense that the journal is more on genome- than structure-related research.

While following up 3DNA citations recently, I noticed the paper titled "Insights into the Structures of DNA Damaged by Hydroxyl Radical: Crystal Structures of DNA Duplexes Containing 5-Formyluracil" by Tsunoda and Takénaka. It was published in the Journal of Nucleic Acids, which I had never (but probably should have) heard of before. From its website, "Journal of Nucleic Acids is a peer-reviewed, open access journal that publishes original research articles as well as review articles in all areas of nucleic acids." By virtue of this structure paper and its citation to 3DNA, I think the journal is surely of personal interest, and I have added it to my watch list.

To sum up, there are currently four scientific journals (I know of) that are devoted to nucleic acids:

Am I still missing something? Please leave your suggestions in the comment area.

[revised on May 17, 2011 by adding RNA Biology]