Xiang-Jun's Corner

Sunday, October 30, 2011

3DNA base-mutation functionality for the study of pretein-DNA interactions

In my last blog post titled "mutate_bases, a new 3DNA tool for in silico base mutations of 3D nucleic acid structures", I outlined the availability of the mutate_bases program and some areas of its possible applications, including to "perform base-pair mutations in DNA-protein complexes".

In the September 6, 2011 issue of PNAS (vol. 108, no. 36), AlQuraishi and McAdamsa from Stanford University published a high-profile article "Direct inference of protein–DNA interactions using compressed sensing methods" (pp. 14819–14824). See also the thoughtful Commentary by Vijay Pande, "(Compressed) sensing and sensibility". Essentially, by combining concepts from compressed sensing and statistical mechanics, AlQuraishi and McAdamsa have developed a novel approach to determine the energy potential of protein–DNA complexes, leading to "an impressive advance in predictive capability" -- ~90% accuracy compared with ~60% for the best-performing alternative computational methods.

I am so glad to find that 3DNA (ref nos. 12 and 13) was used in the work:

In silico mutagenesis was carried out using the 3DNA software package (12, 13), which maintains the backbone atoms of the DNA molecule, but replaces the base pair atoms in a way that is consistent with the backbone orientation in the crystal. (p.14821, middle of the right column)

Hopefully, the new and novel base mutation functionality in 3DNA will find more applications and be in wider use. Surely, I'll respond promptly to users' feedback and refine related 3DNA tools as necessary.

Tuesday, October 25, 2011

mutate_bases, a new 3DNA tool for in silico base mutations of 3D nucleic acid structures

In response to repeated requests from 3DNA users over the years, I recently (well, a few months ago) released a new utility program, mutate_bases, to the 3DNA suite of software package. As its name implies, mutate_bases can be used for mutating bases in a nucleic acid structure, with two key and unique features: (1) the sugar-phosphate backbone conformation is untouched; (2) the base reference frame (position and orientation) is reserved, i.e., the mutated structure shares the same base-pair/step parameters as the original one.

mutate_bases is a standalone ANSI C program, on a par with other major 3DNA programs (e.g., find_pair, analyze, rebuild, fiber etc). As seen from the help message below, it can be used for any nucleic-acid-containing structures (DNA, RNA, or their complexes) in PDB format:

------------------------------------------------------------------------
Usage: mutate_bases mutinfo pdbfile outfile
    'mutinfo' can contain upto 5 fields for each mutation:
        [name=residue_name] [icode=insertion_code]
        chain=chain_id seqnum=residue_number
        mutation=residue_name
    Alternatively, 'mutinfo' can be specified with the '-l' option
        followed by a mutation file: '-l mutations.dat'
    o The five fields per mutation can be in any order or CaSe
    o Each field can be abbreviated to its first character
    o Multiple mutations are separated by ';'
    o Fields in [] (i.e., name and icode) are optional
    o Mutation info should be QUOTED to be taken as one entry

Examples:
    mutate_bases 'c=a s=2 m=DA' 355d.pdb 355d_G2A.pdb
    # mutate G2 in chain A of B-DNA 355d to A
    mutate_bases 'c=a s=2 m=DA; c=B s=23 m=DT' 355d.pdb 355d_mutfile.pdb
    # mutate base-pair G-C (C23 in chain B) to A-T
        # also create file 'mutations.dat', see below
    mutate_bases -l mutations.dat 355d.pdb 355d_mutfile.pdb
        # 355d_mutfile.pdb and 355d_mutfile.pdb would be identical
    mutate_bases 'c=A s=74 m=U' 1evv.pdb 1evv_C74U.pdb
    # mutate C74 in chain A of tRNA 1evv to U
------------------------------------------------------------------------

mutate_bases is designed to solve the base mutation problem in a practical sense: robust and efficient, getting its job done and then out of the way. The program can have many possible applications: in addition to perform base-pair mutations in DNA-protein complexes, it should also prove handy RNA modeling and in providing initial structures for QM/MM/MD energy calculations, and in DNA/RNA modeling studies.

Friday, July 29, 2011

Tracking 3DNA citations: Google scholar vs. Web of Science

Over the years, I have been following citations to 3DNA on a regular basis, using both Google Scholar and Web of Science. From a personal perspective, following the citations has turned out to be an excellent way to keep myself informed of progress related to nucleic acid structures.

Up to early this year, I had found 3DNA citations based on Google Scholar to be significantly larger than those from Web of Science. For example, for the 2003 Nucleic Acids Research (NAR) paper, the difference was well over one hundred. Over the past few months, however, I have noticed that citation numbers to the 2003 NAR paper fluctuating up and down in dozens. Moreover, I have been receiving significantly less email notifications to 3DNA citations from Google Scholar. Apparently, Google has been revising the algorithms for Google Scholar, leading to more conservative citation numbers that are now comparable to those from Web of Science.

Here are the current citation numbers to the three 3DNA publications, from both Google Scholar and Web of Science:

	Google Scholar	Web of Science
2003 NAR	544	473
2008 Nature Protocols	43	41
2009 NAR (w3DNA)	11	13

It is interesting to note that for the w3DNA paper, Web of Science gives a larger number (13) than Google Scholar. A careful check of the citation result from Web of Science shows that all of the 13 citations are legitimate journal articles. Clearly, Web of Science is missing something in this case.

Overall, I trust Web of Science more, which is why I have been using it to compile citations to the 2003 NAR paper. However, I use Google Scholar more frequently for quick reference simply because it is free and easy to access.

Thursday, July 14, 2011

GpU dinucleotide platform, the smallest unit with key RNA structural features

Compared to DNA, RNA has three salient structural features: it contains ribose sugar, uracil, and is normally single-standed. The O2'(G)...O2P(U) H-bond stabilized GpU dinucleotide platform may turn out to be the smallest unit with all those RNA hallmarks (see Figure below).

Firstly, it must have the guanosine ribose to form the O2'(G)...O2P(U) H-bond.

Secondly, the methyl group in position 5 of thymine would cause steric clash with guanosine, thus disrupting the N2(G)...O4(U) base-base H-bond to form the GpU dinucleotide platform.

Thirdly, a dinucleotide, by definition, is single-standed. The two H-bonds, plus the covalent linkage, makes the GpU platform extremely rigid (see Figure 1 of our 2010 NAR paper).

Moreover, the GpU platform is directional: swapping the two bases while keeping the sugar-phosphate backbone fixed does not allow for a base-base H-bond, thus no UpG dinucleotide platform.

Sunday, June 26, 2011

PDB v3.3 and partial atom occupancy

From the PDB mailing list, I know of the recent announcement "PDB Archive Version 4.0 to be Released July 13, 2011". This "ambitious" review of the PDB archive has resulted in a new set of corrected files in ten categories, including biological assemblies, residual B factors, peptide inhibitors/antibiotics, polymers containing nonstandard polymer linkages, and partial occupancy etc. "These data reflect the wwPDB's continuing commitment to providing accurate and detailed data to users worldwide."

I am interested in changes in the PDB format, and read the PDF document "Description of Changes and Corrections for PDB July 2011 Remediation Release":

PDB format files updated in this remediation release comply with PDB Format Version 3.30. PDBx and PDBML data files comply with the PDB Exchange Dictionary v.4.0, and PDBML XML Schema V4.0, respectively.

Specifically, I checked carefully the section on "Partial occupancy", which is quoted in full below:

Problem

In the 2009 remediation, occupancies were corrected in 490 X-ray
and neutron entries. A mistake was made in 104 of these entries:
for atoms with alternate conformer labels and with summed total
occupancy less than 1.0, the occupancies were re-scaled as 1.0/n,
where n is the number of conformers.

Approach

The originally deposited occupancies of the affected atoms were
restored and the remediation was then carried out properly, via:

Atoms with multiple conformations but identical coordinates and B-values were merged and their occupancies were summed.

Atoms which now have (total) occupancies <= 1.0 were left as deposited.

Atoms with (total) occupancies > 1 were rescaled proportionally to a sum of 1.0

Results

The occupancies have been corrected in these entries.

This partial occupancy issue reminds me of an extensive and very informative discussion a few months back in CCP4BB, under the thread "what to do with disordered side chains" and its derivatives, about setting "zero" occupancy and/or high B values for PDB ATOM/HETATM records in disordered regions. Over the course of the discussion, Frances Bernstein made the following comment:

I am absolutely positive that there is software that does its voodoo on ATOM/HETATM records and pays absolutely no attention to anything beyond the x, y, z coordinates (i.e. beyond column 54).

3DNA does pay attention beyond column 54 (up to 80, actually) for ATOM/HETATM records, but internally it does not make use of the occupancy info. In future releases of 3DNA, I am planning to take occupancy/B-factor into consideration, probably through configurable parameters.

Reading through the "Description of Changes and Corrections for PDB July 2011 Remediation Release", I noticed cases of re-corrections of previous corrections in wwPDB remediation efforts. A concrete example is about partial occupancy in the 2009 remediation: among the 490 corrected X-ray and neutron entries, a mistake was made in 104 of them. As a side note, there was a post by Morten Kjeldgaard, titled "PDB changes data in entries?" early this year in CCP4BB, where partial occupancy was used as an example.

Sunday, June 19, 2011

Sugar pucker correlates with phosphorus-related distance

The sugar puckers in DNA/RNA structures are predominately in either C3'-endo or C2'-endo (see Figure below), corresponding to the A- or B-form conformation in a DNA duplex.

Recently, I (re-)read a few articles related to the RNA backbone by Jane Richardson et al., including

"RNA backbone is rotameric" (PNAS 2003)
"RNA backbone: consensus all-angle conformers and modular string nomenclature" (RNA 2008)
"MolProbity: all-atom structure validation for macromolecular crystallography" (Acta Crystallogr D Biol Crystallogr. 2010)
"PHENIX: a comprehensive Python-based system for macromolecular structure solution" (Acta Crystallogr D Biol Crystallogr. 2010)

I somehow became interested in the correlation between sugar pucker and a simple distance parameter, as reported in these papers:

C3'-endo and C2'-endo sugar puckers are highly correlated to the perpendicular distance between the C1'–N1/9 glycosidic bond vector and the following phosphate: > 2.9 Å for C3'-endo and < 2.9 Å for C2'-endo. (p.16 of the MolProbity paper)

Out of curiosity and to get a better understanding of this correlation, I played around with some sample cases both visually in RasMol and numerically. Overall, this is a simple geometric problem, i.e., the shortest distance from a point to a line in 3-dimensional space. Given below is the Octave/Matlab script for calculating the distances for G175 and U176 of PDB entry 1JJ2 (the large ribosomal subunit of Haloarcula marismortui):

function d = get_p3_nc_dist(P3, C1, N)
    N_C1 = N - C1;                 # vector from N to C1'
    nv_N_C1 = N_C1 / norm(N_C1);   # normalized vector
    C1_P3 = P3 - C1;               # vector from C1 to P3
    proj = dot(C1_P3, nv_N_C1);
    d  = norm(C1_P3 - proj * nv_N_C1);
end

## G175 (1jj2)
P3 = [70.104 112.366  44.586];
C1 = [73.017 109.666  45.304];
N = [74.445 109.380  45.288];
d1 = get_p3_nc_dist(P3, C1, N)    # 2.2 Å -- C2'-endo

## U176 (1jj2)
P3 = [66.871 116.402  46.804];
C1 = [68.213 112.454  49.279];
N = [69.678 112.480  49.438];
d2 = get_p3_nc_dist(P3, C1, N)    # 4.6 Å -- C3'-endo

The GpU used in the above example forms a dinucleotide platform, where the sugar of G175 adopts a C2'-endo conformation, and that of U176 has C3'-endo. Indeed, the distance for the G175 nucleotide is 2.2 Å, less than 2.9 Å; whilst the value for U176 is 4.6 Å, greater than 2.9 Å.

It is worth noting the above mentioned articles from Richardson et al. are focused on RNA backbone, without paying attention to the base (pair) geometry. The Zp parameter, which quantifies the z-coordinate of the phosphorus atom in the mean reference frame (see "A-form conformational motifs in ligand-bound DNA structures", JMB 2000), can be easily adapted to the analysis of single stranded RNA structures. For example, the vertical distances of the 3' phosphorus atoms to the G175 and U176 base planes are 1.9 Å and 4.4 Å, respectively.

Since base planes and the phosphorus atoms are the most accurately located entities in a given nucleic acid structure, the nucleotide-based Zp variant presumably would have some advantage over the distance from phosphorus to the glycosidic bond. Naturally, this Zp parameter will be added in future releases of 3DNA.

Saturday, June 11, 2011

Conformation of the sugar ring in nucleic acid structures

The conformation of the five-membered sugar ring in DNA/RNA structure can be characterized using the five corresponding endocyclic torsion angles (see Figure below).

i.e.,

v0: C4'-O4'-C1'-C2'
v1: O4'-C1'-C2'-C3'
v2: C1'-C2'-C3'-C4'
v3: C2'-C3'-C4'-O4'
v4: C3'-C4'-O4'-C1'

Due to the ring constraint, the conformation can be characterized approximately by 5 - 3 = 2 parameters. Using the concept of pseudorotation of the sugar ring, the two parameters are the amplitude (τ_m) and phase angle (P).

One set of widely used formula to convert the five torsion angles to the pseudorotation parameters is due to Altona & Sundaralingam: "Conformational Analysis of the Sugar Ring in Nucleosides and Nucleotides. A New Description Using the Concept of Pseudorotation" [J. Am. Chem. Soc., 1972, 94(23), pp 8205–8212]. As always, the concept is best illustrated with an example. Here I use the sugar ring of G4 (chain A) of the Dickerson-Drew dodecamer (1bna/bdl001), with Matlab/Octave code:

# xyz coordinates of the sugar ring: G4 (chain A), 1bna/bdl001
ATOM     63  C4'  DG A   4      21.393  16.960  18.505  1.00 53.00
ATOM     64  O4'  DG A   4      20.353  17.952  18.496  1.00 38.79
ATOM     65  C3'  DG A   4      21.264  16.229  17.176  1.00 56.72
ATOM     67  C2'  DG A   4      20.793  17.368  16.288  1.00 40.81
ATOM     68  C1'  DG A   4      19.716  17.901  17.218  1.00 30.52

# endocyclic torsion angles:
v0 = -26.7; v1 = 46.3; v2 = -47.1; v3 = 33.4; v4 = -4.4
Pconst = sin(pi/5) + sin(pi/2.5)  # 1.5388
P0 = atan2(v4 + v1 - v3 - v0, 2.0 * v2 * Pconst);  # 2.9034
tm = v2 / cos(P0);  # amplitude: 48.469
P = 180/pi * P0;  # phase angle: 166.35 [P + 360 if P0 < 0]

The Altona & Sundaralingam (1972) pseudorotation parameters are what have been adopted in 3DNA. The Curves+ program, however, uses another set of formula due to Westhof & Sundaralingam: "A Method for the Analysis of Puckering Disorder in Five-Membered Rings: The Relative Mobilities of Furanose and Proline Rings and Their Effects on Polynucleotide and Polypeptide Backbone Flexibility" [J. Am. Chem. Soc., 1983, 105(4), pp 970–976]. The two sets of formula by Altona & Sundaralingam (1972) and Westhof & Sundaralingam (1983) give slightly different numerical values for the two pseudorotation parameters (amplitude τ_mand phase angle P).

Since 3DNA and Curves+ are currently the most commonly used programs for conformational analysis of nucleic acid structures, the subtle differences in pseudorotation parameters may cause confusions for users who use both programs. With the same G4 (chain A, 1bna) sugar ring, here is the Matlab/Octave script showing how Curve+ calculates the pseudorotation parameters:

# xyz coordinates of sugar ring G4 (chain A, 1bna/bdl001)

# endocyclic torsion angles, same as above
v0 = -26.7; v1 = 46.3; v2 = -47.1; v3 = 33.4; v4 = -4.4

v = [v2, v3, v4, v0, v1]; # reorder them into vector v[]
A = 0; B = 0;
for i = 1:5
    t = 0.8 * pi * (i - 1);
    A += v(i) * cos(t);
    B += v(i) * sin(t);
end
A *= 0.4;   # -48.476
B *= -0.4;  # 11.516

tm = sqrt(A * A + B * B);  # 49.825

c = A/tm; s = B/tm;
P = atan2(s, c) * 180 / pi;  # 166.64

For this specific example, i.e., the sugar ring G4 (chain A, 1bna/bdl001), the pseudorotation parameters as calculated by 3DNA following Altona & Sundaralingam (1972) and Curves+ following Westhof & Sundaralingam (1983) are as follows:

         amplitude (τ_m)     phase angle (P)
3DNA        48.469             166.35
Curves+     49.825             166.64

Needless to say, the differences are subtle, and few people will notice/bother at all. For those who do care about such little details, however, hopefully this post will help you understand where the differences actually come from.

Sunday, June 5, 2011

Lower case chain identifiers in PDB format

First formulated in early 1970s, the PDB format is rigid with fixed columns for designated contents in its ATOM/HETATM records. Specificlly, a single column, #22, is assigned for the chain identifier (id). Traditionally, the 26 upper case letters of English alphabet (A-Z), space (i.e., ' '), and the single digits (0-9) have been used as chain ids. Up until the ribosomal structures came up, I guess, those 26 + 1 + 10 = 37 characters had been sufficient for the chain ids.

To the best of my knowledge, for a long time, most PDB parsers assume upper case chain ids. Indeed, 3DNA v1.5 automatically converts each ATOM/HETATM records to upper cases. The first time I became aware of lower case chain ids was when I saw a post in the 3DNA forum, titled "Small bug in find_pair", where a user reported the 'w' vs 'W' chain ids in PDB entry 1VSP. Then I refined 3DNA so that the case of chain ids can be preserved, through an undocumented command line option (as a feature for internal testing purpose).

My view to make 3DNA chain ids case-sensitive has been reinforced when I read the article "Crystal structures of CGG RNA repeats with implications for fragile X-associated tremor ataxia syndrome" recently published in Nucleic Acids Research. The asymmetric unit of the unmodified CGG-repeats-containing duplex (GCGGCGGC)₂, NDB entry NA1017 / PDB entry 3R1C, contains a total of 36 chains: designated as A-Z, plus a-j. Without distinguishing cases of the chain ids, the 3DNA output would become quite confusing.

Thus, in future releases of 3DNA, the default would be switched to preserve the case of chain ids. This chain id 'case' serves as an excellent example that scientific software products, unlike publications per se, are not fixed but need continuous care and maintenance to meet the challenges of an evolving world.

Sunday, May 22, 2011

NAR's top ten articles

Recently, I noticed a new feature in the website of Nucleic Acids Research (NAR), i.e., its selection of top ten articles:

NAR’s Top Ten Articles are updated monthly and show recent articles that have been most often accessed in HTML and PDF formats in the specified month.

In the age of information explosion with flood of scientific journals and articles, it is easy get lost. NAR's pick of top ten and featured articles draws my attention to significant work I may otherwise overlook.

The current top ten articles (March 2011) are all selected from 2009/2010 publications in 'Database', 'Methods online', and 'Survey and Summary'. I am browsing the 2009 article by Thomas LaFramboise, "Single nucleotide polymorphism arrays: a decade of biological, computational and technological advances", to get a better understanding of SNPs.