Sunday, March 3, 2013

DSSR: Software for Defining the (Secondary) Structures of RNA

As the number of experimentally solved RNA-containing structures grows, it is becoming increasingly important to characterize the geometric features of the molecules in a consistent and efficient manner. Here comes DSSR, a software for Defining the (Secondary) Structures of RNA. See the post titled "DSSR, what's it and why bother?" for more info.

Starting from an RNA structure in PDB format, DSSR employs a set of simple rules to identify all existent base pairs (bp): both canonical Watson–Crick pairs and non-canonical pairs with one or more H-bonds, made up of normal or modified bases, regardless of tautomeric or protonation state. The classification is based on the six standard rigid-body bp parameters (shear, stretch, stagger, propeller, buckle, and opening), which together rigorously quantify the spatial disposition of the two interacting bases. The program characterizes each bp not only by common pattern names (Watson-Crick, wobble, Hoogsteen, sheared, dinucleotide platform), but also according to the Saenger classification scheme of 28 types and the Leontis-Westhof nomenclature of 12 basic geometric classes.

DSSR detects triplets, quadruplets, and even higher-order base associations by searching horizontally in the planes of the associated bases for further H-bonding interactions. The program determines helical regions by exploring each bp’s neighborhood vertically for base-stacking interactions regardless of backbone connection (e.g., coaxial stacking of helices). DSSR calculates commonly used backbone (including the virtual eta/theta) torsion angles, identifies A-minor interactions (type I and II), G quartets, and bulges, internal loops and multi-branch loops. It also detects the existence of pseudo-knots, and outputs RNA secondary structure in dot-bracket notation.

The program is implemented in ANSI C as a stand-alone command-line program with zero dependencies on third-party libraries. No setup is necessary; simply put it into a directory of your choice (preferably one in your command PATH), and it will run. DSSR is a new component of the 3DNA suite of programs (http://x3dna.org) for the analysis, rebuilding, and visualization of three‐dimensional nucleic acid structures.

DSSR is currently released in beta testing version; nevertheless, it should already be robust and efficient enough for real-world applications. I am actively improving the software, and would greatly appreciate your feedback. Give it a try, and report back on the 3DNA Forum any issues you have!

Friday, March 9, 2012

x3dna.org -- 3DNA's new home

It has been quite a while since my last blog post titled "3DNA base-mutation functionality for the study of pretein-DNA interactions" (dated Sunday, October 30, 2011). Over the past few months, I have been extremely busy laying the groundwork for 3DNA: both in software consolidation and hardware setup.

Here are some background details about what happened:
  • 3DNA now has a dedicated domain name x3dna.org. I tried to register with the obvious choice 3dna.org, but found that it has unfortunately been taken already. We thought about several alternatives, and finally settled with the domain name x3dna.org.

    This site is the new home of 3DNA, where -- instead of in the blog -- I will post related news and my personal views. Some of the 3DNA-related posts originally published here are being migrated into the "Highlights" section of x3dna.org. A side benefit of the switch is that 3DNA users from China and other regions where Google blogger is blocked can have access to the content.

    The site x3dna.org is powered by Textpattern, "a flexible, elegant and easy-to-use CMS". I shopped around for a lightweight CMS to host the new 3DNA homepage. I came across Textpatten and was impressed by the mere 484KB download size (v4.4.1). My positive experience over the past few months has made me love the CMS better.

  • The forum is now hosted under the subdomain forum.x3dna.org, powered by Simple Machines Forum (SMF). While I have been happy with phpBB3, I took the migration opportunity to try something new and found SMF a better option for the purpose of 3DNA. The forum site is where the documentation is located and  my interactions with the user community take place.

  • As a result of the new homepage x3dna.org and forum site forum.x3dna.org, the following three 3DNA-related sites previously hosted at Rutgers University have been retired; links to them are automatically redirected to the corresponding new locations:
    • The former v1.5 site at http://rutchem.rutgers.edu/~xiangjun/3DNA. Over the years, this site has received the most outside links.
    • The v2.0 site at http://3dna.rutgers.edu/, which accompanies the 2008 3DNA Nature Protocols paper.
    • The former forum site at http://3dna.rutgers.edu:8080/forum.
To summarize, now 3DNA has a new home at x3dna.org and forum site at forum.x3dna.org. With the infrastructure in shape, I can switch gears to focus on the scientific part of the 3DNA project: refinement, further development of its functionality, and better documentation.

Sunday, October 30, 2011

3DNA base-mutation functionality for the study of pretein-DNA interactions

In my last blog post titled "mutate_bases, a new 3DNA tool for in silico base mutations of 3D nucleic acid structures", I outlined the availability of the mutate_bases program and some areas of its possible applications, including to "perform base-pair mutations in DNA-protein complexes".

In the September 6, 2011 issue of PNAS (vol. 108, no. 36), AlQuraishi and McAdamsa from Stanford University published a high-profile article "Direct inference of protein–DNA interactions using compressed sensing methods" (pp. 14819–14824). See also the thoughtful Commentary by Vijay Pande, "(Compressed) sensing and sensibility". Essentially, by combining concepts from compressed sensing and statistical mechanics, AlQuraishi and McAdamsa have developed a novel approach to determine the energy potential of protein–DNA complexes, leading to "an impressive advance in predictive capability" -- ~90% accuracy compared with ~60% for the best-performing alternative computational methods.

I am so glad to find that 3DNA (ref nos. 12 and 13) was used in the work:
In silico mutagenesis was carried out using the 3DNA software package (12, 13), which maintains the backbone atoms of the DNA molecule, but replaces the base pair atoms in a way that is consistent with the backbone orientation in the crystal. (p.14821, middle of the right column)

Hopefully, the new and novel base mutation functionality in 3DNA will find more applications and be in wider use. Surely, I'll respond promptly to users' feedback and refine related 3DNA tools as necessary.

Tuesday, October 25, 2011

mutate_bases, a new 3DNA tool for in silico base mutations of 3D nucleic acid structures

In response to repeated requests from 3DNA users over the years, I recently (well, a few months ago) released a new utility program, mutate_bases, to the 3DNA suite of software package. As its name implies, mutate_bases can be used for mutating bases in a nucleic acid structure, with two key and unique features: (1) the sugar-phosphate backbone conformation is untouched; (2) the base reference frame (position and orientation) is reserved, i.e., the mutated structure shares the same base-pair/step parameters as the original one.

mutate_bases is a standalone ANSI C program, on a par with other major 3DNA programs (e.g., find_pair, analyze, rebuild, fiber etc). As seen from the help message below, it can be used for any nucleic-acid-containing structures (DNA, RNA, or their complexes) in PDB format:
------------------------------------------------------------------------
Usage: mutate_bases mutinfo pdbfile outfile
    'mutinfo' can contain upto 5 fields for each mutation:
        [name=residue_name] [icode=insertion_code]
        chain=chain_id seqnum=residue_number
        mutation=residue_name
    Alternatively, 'mutinfo' can be specified with the '-l' option
        followed by a mutation file: '-l mutations.dat'
    o The five fields per mutation can be in any order or CaSe
    o Each field can be abbreviated to its first character
    o Multiple mutations are separated by ';'
    o Fields in [] (i.e., name and icode) are optional
    o Mutation info should be QUOTED to be taken as one entry

Examples:
    mutate_bases 'c=a s=2 m=DA' 355d.pdb 355d_G2A.pdb
    # mutate G2 in chain A of B-DNA 355d to A
    mutate_bases 'c=a s=2 m=DA; c=B s=23 m=DT' 355d.pdb 355d_mutfile.pdb
    # mutate base-pair G-C (C23 in chain B) to A-T
        # also create file 'mutations.dat', see below
    mutate_bases -l mutations.dat 355d.pdb 355d_mutfile.pdb
        # 355d_mutfile.pdb and 355d_mutfile.pdb would be identical
    mutate_bases 'c=A s=74 m=U' 1evv.pdb 1evv_C74U.pdb
    # mutate C74 in chain A of tRNA 1evv to U
------------------------------------------------------------------------

mutate_bases is designed to solve the base mutation problem in a practical sense: robust and efficient, getting its job done and then out of the way. The program can have many possible applications: in addition to perform base-pair mutations in DNA-protein complexes, it should also prove handy RNA modeling and in providing initial structures for QM/MM/MD energy calculations, and in DNA/RNA modeling studies.

Friday, July 29, 2011

Tracking 3DNA citations: Google scholar vs. Web of Science

Over the years, I have been following citations to 3DNA on a regular basis, using both Google Scholar and Web of Science. From a personal perspective, following the citations has turned out to be an excellent way to keep myself informed of progress related to nucleic acid structures.

Up to early this year, I had found 3DNA citations based on Google Scholar to be significantly larger than those from Web of Science. For example, for the 2003 Nucleic Acids Research (NAR) paper, the difference was well over one hundred. Over the past few months, however, I have noticed that citation numbers to the 2003 NAR paper fluctuating up and down in dozens. Moreover, I have been receiving significantly less email notifications to 3DNA citations from Google Scholar. Apparently, Google has been revising the algorithms for Google Scholar, leading to more conservative citation numbers that are now comparable to those from Web of Science.

Here are the current citation numbers to the three 3DNA publications, from both Google Scholar and Web of Science:

Google ScholarWeb of Science
2003 NAR544473
2008 Nature Protocols4341
2009 NAR (w3DNA)1113

It is interesting to note that for the w3DNA paper, Web of Science gives a larger number (13) than Google Scholar. A careful check of the citation result from Web of Science shows that all of the 13 citations are legitimate journal articles. Clearly, Web of Science is missing something in this case.

Overall, I trust Web of Science more, which is why I have been using it to compile citations to the 2003 NAR paper. However, I use Google Scholar more frequently for quick reference simply because it is free and easy to access.

Thursday, July 14, 2011

GpU dinucleotide platform, the smallest unit with key RNA structural features

Compared to DNA, RNA has three salient structural features: it contains ribose sugar, uracil, and is normally single-standed. The O2'(G)...O2P(U) H-bond stabilized GpU dinucleotide platform may turn out to be the smallest unit with all those RNA hallmarks (see Figure below).


Firstly, it must have the guanosine ribose to form the O2'(G)...O2P(U) H-bond.

Secondly, the methyl group in position 5 of thymine would cause steric clash with guanosine, thus disrupting the N2(G)...O4(U) base-base H-bond to form the GpU dinucleotide platform.


Thirdly, a dinucleotide, by definition, is single-standed. The two H-bonds, plus the covalent linkage, makes the GpU platform extremely rigid (see Figure 1 of our 2010 NAR paper).

Moreover, the GpU platform is directional: swapping the two bases while keeping the sugar-phosphate backbone fixed does not allow for a base-base H-bond, thus no UpG dinucleotide platform.

Sunday, June 26, 2011

PDB v3.3 and partial atom occupancy

From the PDB mailing list, I know of the recent announcement "PDB Archive Version 4.0 to be Released July 13, 2011". This "ambitious" review of the PDB archive has resulted in a new set of corrected files in ten categories, including biological assemblies, residual B factors, peptide inhibitors/antibiotics, polymers containing nonstandard polymer linkages, and partial occupancy etc. "These data reflect the wwPDB's continuing commitment to providing accurate and detailed data to users worldwide."

I am interested in changes in the PDB format, and read the PDF document "Description of Changes and Corrections for PDB July 2011 Remediation Release":
PDB format files updated in this remediation release comply with PDB Format Version 3.30. PDBx and PDBML data files comply with the PDB Exchange Dictionary v.4.0, and PDBML XML Schema V4.0, respectively.
Specifically, I checked carefully the section on "Partial occupancy", which is quoted in full below:
Problem

In the 2009 remediation, occupancies were corrected in 490 X-ray
and neutron entries. A mistake was made in 104 of these entries:
for atoms with alternate conformer labels and with summed total
occupancy less than 1.0, the occupancies were re-scaled as 1.0/n,
where n is the number of conformers.

Approach

The originally deposited occupancies of the affected atoms were
restored and the remediation was then carried out properly, via:
  • Atoms with multiple conformations but identical coordinates and B-values were merged and their occupancies were summed.
  • Atoms which now have (total) occupancies <= 1.0 were left as deposited.
  • Atoms with (total) occupancies > 1 were rescaled proportionally to a sum of 1.0
Results

The occupancies have been corrected in these entries.

This partial occupancy issue reminds me of an extensive and very informative discussion a few months back in CCP4BB, under the thread "what to do with disordered side chains" and its derivatives, about setting "zero" occupancy and/or high B values for PDB ATOM/HETATM records in disordered regions. Over the course of the discussion, Frances Bernstein made the following comment:
I am absolutely positive that there is software that does its voodoo on ATOM/HETATM records and pays absolutely no attention to anything beyond the x, y, z coordinates (i.e. beyond column 54).
3DNA does pay attention beyond column 54 (up to 80, actually) for ATOM/HETATM records, but internally it does not make use of the occupancy info. In future releases of 3DNA, I am planning to take occupancy/B-factor into consideration, probably through configurable parameters.

Reading through the "Description of Changes and Corrections for PDB July 2011 Remediation Release", I noticed cases of re-corrections of previous corrections in wwPDB remediation efforts. A concrete example is about partial occupancy in the 2009 remediation: among the 490 corrected X-ray and neutron entries, a mistake was made in 104 of them. As a side note, there was a post by Morten Kjeldgaard, titled "PDB changes data in entries?" early this year in CCP4BB, where partial occupancy was used as an example.

Sunday, June 19, 2011

Sugar pucker correlates with phosphorus-related distance

The sugar puckers in DNA/RNA structures are predominately in either C3'-endo or C2'-endo (see Figure below), corresponding to the A- or B-form conformation in a DNA duplex.


Recently, I (re-)read a few articles related to the RNA backbone by Jane Richardson et al., including
I somehow became interested in the correlation between sugar pucker and a simple distance parameter, as reported in these papers:
C3'-endo and C2'-endo sugar puckers are highly correlated to the perpendicular distance between the C1'–N1/9 glycosidic bond vector and the following phosphate: > 2.9 Å for C3'-endo and < 2.9 Å for C2'-endo. (p.16 of the MolProbity paper)

Out of curiosity and to get a better understanding of this correlation, I played around with some sample cases both visually in RasMol and numerically. Overall, this is a simple geometric problem, i.e., the shortest distance from a point to a line in 3-dimensional space. Given below is the Octave/Matlab script for calculating the distances for G175 and U176 of PDB entry 1JJ2 (the large ribosomal subunit of Haloarcula marismortui):
function d = get_p3_nc_dist(P3, C1, N)
    N_C1 = N - C1;                 # vector from N to C1'
    nv_N_C1 = N_C1 / norm(N_C1);   # normalized vector
    C1_P3 = P3 - C1;               # vector from C1 to P3
    proj = dot(C1_P3, nv_N_C1);
    d  = norm(C1_P3 - proj * nv_N_C1);
end

## G175 (1jj2)
P3 = [70.104 112.366  44.586];
C1 = [73.017 109.666  45.304];
N = [74.445 109.380  45.288];
d1 = get_p3_nc_dist(P3, C1, N)    # 2.2 Å -- C2'-endo

## U176 (1jj2)
P3 = [66.871 116.402  46.804];
C1 = [68.213 112.454  49.279];
N = [69.678 112.480  49.438];
d2 = get_p3_nc_dist(P3, C1, N)    # 4.6 Å -- C3'-endo
The GpU used in the above example forms a dinucleotide platform, where the sugar of G175 adopts a C2'-endo conformation, and that of U176 has C3'-endo. Indeed, the distance for the G175 nucleotide is 2.2 Å, less than 2.9 Å; whilst the value for U176 is 4.6 Å, greater than 2.9 Å.

It is worth noting the above mentioned articles from Richardson et al. are focused on RNA backbone, without paying attention to the base (pair) geometry. The Zp parameter, which quantifies the z-coordinate of the phosphorus atom in the mean reference frame (see "A-form conformational motifs in ligand-bound DNA structures", JMB 2000), can be easily adapted to the analysis of single stranded RNA structures. For example, the vertical distances of the 3' phosphorus atoms to the G175 and U176 base planes are 1.9 Å and 4.4 Å, respectively.

Since base planes and the phosphorus atoms are the most accurately located entities in a given nucleic acid structure, the nucleotide-based Zp variant presumably would have some advantage over the distance from phosphorus to the glycosidic bond. Naturally, this Zp parameter will be added in future releases of 3DNA.

Saturday, June 11, 2011

Conformation of the sugar ring in nucleic acid structures

The conformation of the five-membered sugar ring in DNA/RNA structure can be characterized using the five corresponding endocyclic torsion angles (see Figure below).
i.e.,
v0: C4'-O4'-C1'-C2'
v1: O4'-C1'-C2'-C3'
v2: C1'-C2'-C3'-C4'
v3: C2'-C3'-C4'-O4'
v4: C3'-C4'-O4'-C1'
Due to the ring constraint, the conformation can be characterized approximately by 5 - 3 = 2 parameters. Using the concept of pseudorotation of the sugar ring, the two parameters are the amplitude (τm) and phase angle (P).

One set of widely used formula to convert the five torsion angles to the pseudorotation parameters is due to Altona & Sundaralingam: "Conformational Analysis of the Sugar Ring in Nucleosides and Nucleotides. A New Description Using the Concept of Pseudorotation" [J. Am. Chem. Soc., 1972, 94(23), pp 8205–8212]. As always, the concept is best illustrated with an example. Here I use the sugar ring of G4 (chain A) of the Dickerson-Drew dodecamer (1bna/bdl001), with Matlab/Octave code:
# xyz coordinates of the sugar ring: G4 (chain A), 1bna/bdl001
ATOM     63  C4'  DG A   4      21.393  16.960  18.505  1.00 53.00
ATOM     64  O4'  DG A   4      20.353  17.952  18.496  1.00 38.79
ATOM     65  C3'  DG A   4      21.264  16.229  17.176  1.00 56.72
ATOM     67  C2'  DG A   4      20.793  17.368  16.288  1.00 40.81
ATOM     68  C1'  DG A   4      19.716  17.901  17.218  1.00 30.52

# endocyclic torsion angles:
v0 = -26.7; v1 = 46.3; v2 = -47.1; v3 = 33.4; v4 = -4.4
Pconst = sin(pi/5) + sin(pi/2.5)  # 1.5388
P0 = atan2(v4 + v1 - v3 - v0, 2.0 * v2 * Pconst);  # 2.9034
tm = v2 / cos(P0);  # amplitude: 48.469
P = 180/pi * P0;  # phase angle: 166.35 [P + 360 if P0 < 0]
The Altona & Sundaralingam (1972) pseudorotation parameters are what have been adopted in 3DNA. The Curves+ program, however, uses another set of formula due to Westhof & Sundaralingam: "A Method for the Analysis of Puckering Disorder in Five-Membered Rings: The Relative Mobilities of Furanose and Proline Rings and Their Effects on Polynucleotide and Polypeptide Backbone Flexibility" [J. Am. Chem. Soc., 1983, 105(4), pp 970–976]. The two sets of formula by Altona & Sundaralingam (1972) and Westhof & Sundaralingam (1983) give slightly different numerical values for the two pseudorotation parameters  (amplitude τand phase angle P).

Since 3DNA and Curves+ are currently the most commonly used programs for conformational analysis of nucleic acid structures, the subtle differences in pseudorotation parameters may cause confusions for users who use both programs. With the same G4 (chain A, 1bna) sugar ring, here is the Matlab/Octave script showing how Curve+ calculates the pseudorotation parameters:
# xyz coordinates of sugar ring G4 (chain A, 1bna/bdl001)

# endocyclic torsion angles, same as above
v0 = -26.7; v1 = 46.3; v2 = -47.1; v3 = 33.4; v4 = -4.4

v = [v2, v3, v4, v0, v1]; # reorder them into vector v[]
A = 0; B = 0;
for i = 1:5
    t = 0.8 * pi * (i - 1);
    A += v(i) * cos(t);
    B += v(i) * sin(t);
end
A *= 0.4;   # -48.476
B *= -0.4;  # 11.516

tm = sqrt(A * A + B * B);  # 49.825

c = A/tm; s = B/tm;
P = atan2(s, c) * 180 / pi;  # 166.64

For this specific example, i.e., the sugar ring G4 (chain A, 1bna/bdl001), the pseudorotation parameters as calculated by 3DNA following Altona & Sundaralingam (1972) and Curves+ following Westhof & Sundaralingam (1983) are as follows:

         amplitude (τm)     phase angle (P)
3DNA        48.469             166.35
Curves+     49.825             166.64
Needless to say, the differences are subtle, and few people will notice/bother at all. For those who do care about such little details, however, hopefully this post will help you understand where the differences actually come from.