Saturday, May 16, 2009

Fiber, analyze and rebuild in 3DNA

While browsing the current issue of Nucleic Acids Research (May 2009, Vol. 37, No. 8), I was surprised to come across the article titled Reconstitution of ‘floral quartets’ in vitro involving class B and class E floral homeotic proteins by Rainer Melzer and Günter Theißen from Jena, Germany.

From the title and abstract, I would not have expected 3DNA to play a role here. Yet in the section Modeling a floral quartet, the authors used fiber to build a B-DNA model of floral quartets, using the sequence between the XhoI and XbaI sites of probe A. They then analyzed the structure, modified the resulting file 'bp_step.par' according to their specifications, and rebuilt a customized DNA model. In the following paragraph, the authors used the analyze/rebuild pair again for the CArG box bound to serum response factor.

This is the intended usage of the 3DNA software package that I have had in mind since its initial design. I emphasized this versatile, integrated approach, which is unique to 3DNA, in our 2008 Nature Protocols paper. As time goes by, I am confident that more people will come to appreciate this approach.

Friday, May 15, 2009

PDB ATOM coordinates record

PDB format is one of the standard formats for biological macromolecular structures (proteins, DNA/RNA, their complexes, etc.). It came into existence when the initial Brookhaven PDB was established in 1971. Over the years, PDBML and mmCIF have been proposed as substitutes to allow for more flexibility, yet PDB format is still the most commonly used one. Each piece of software dealing with PDB structures has its own parser, at least for the Coordinate Section (especially the ATOM and HETATM records).

Overall, PDB format is simple and very well documented. Its simplicity lies precisely in its 'rigidity': fixed-width columns, in FORTRAN 77 style. The ATOM/HETATM record description is excerpted below for easy reference:

COLUMNS    DATA TYPE      FIELD        DEFINITION
-------------------------------------------------------------------------------------
 1 -  6    Record name    "ATOM  "
 7 - 11    Integer        serial       Atom serial number.
13 - 16    Atom           name         Atom name.
17         Character      altLoc       Alternate location indicator.
18 - 20    Residue name   resName      Residue name.
22         Character      chainID      Chain identifier.
23 - 26    Integer        resSeq       Residue sequence number.
27         AChar          iCode        Code for insertion of residues.
31 - 38    Real(8.3)      x            Orthogonal coordinates for X in Angstroms.
39 - 46    Real(8.3)      y            Orthogonal coordinates for Y in Angstroms.
47 - 54    Real(8.3)      z            Orthogonal coordinates for Z in Angstroms.
55 - 60    Real(6.2)      occupancy    Occupancy.
61 - 66    Real(6.2)      tempFactor   Temperature factor.
77 - 78    LString(2)     element      Element symbol, right-justified.
79 - 80    LString(2)     charge       Charge on the atom.

It won't take much time, or many lines, in a scripting language such as Perl, Python or Ruby to extract the specific information one is interested in, e.g., atomic coordinates. However, there are some subtleties that are beyond simple script parsers. On top of that, one needs to understand that not all self-proclaimed PDB files are standards-compliant.
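As a rough illustration of how few lines are needed, here is a minimal Python sketch that slices one ATOM/HETATM line according to the column table above. The function name parse_atom_record, the helper _opt_float and the sample line are made up for this example; this is not how 3DNA does it. It assumes that coordinates are always present, while occupancy and temperature factor may be blank.

```python
def _opt_float(text):
    """Return a float, or None when the field is blank (hypothetical helper)."""
    text = text.strip()
    return float(text) if text else None


def parse_atom_record(line):
    """Slice one ATOM/HETATM line by fixed columns (1-based columns in comments)."""
    line = line.rstrip("\n").ljust(80)      # pad short lines so every slice exists
    record = line[0:6].strip()              # cols  1-6
    if record not in ("ATOM", "HETATM"):
        raise ValueError("not an ATOM/HETATM record: %r" % line[0:6])
    return {
        "record": record,
        "serial": int(line[6:11]),          # cols  7-11
        "name": line[12:16],                # cols 13-16, kept as 4 characters
        "altLoc": line[16],                 # col  17
        "resName": line[17:20],             # cols 18-20
        "chainID": line[21],                # col  22
        "resSeq": int(line[22:26]),         # cols 23-26
        "iCode": line[26],                  # col  27
        "x": float(line[30:38]),            # cols 31-38
        "y": float(line[38:46]),            # cols 39-46
        "z": float(line[46:54]),            # cols 47-54
        "occupancy": _opt_float(line[54:60]),   # cols 55-60
        "tempFactor": _opt_float(line[60:66]),  # cols 61-66
        "element": line[76:78].strip(),     # cols 77-78
        "charge": line[78:80].strip(),      # cols 79-80
    }


# A made-up sample record, assembled field by field so the columns are explicit.
sample = (
    "ATOM  "      # cols  1-6   record name
    "    1"       # cols  7-11  serial
    " "           # col  12     unused
    " N1 "        # cols 13-16  atom name
    "A"           # col  17     altLoc
    "  A"         # cols 18-20  resName
    " "           # col  21     unused
    "A"           # col  22     chainID
    "  10"        # cols 23-26  resSeq
    " "           # col  27     iCode
    "   "         # cols 28-30  unused
    "  -0.500"    # cols 31-38  x
    "   5.400"    # cols 39-46  y
    "   0.100"    # cols 47-54  z
    "  1.00"      # cols 55-60  occupancy
    " 20.00"      # cols 61-66  tempFactor
    "          "  # cols 67-76  unused
    " N"          # cols 77-78  element
    "  "          # cols 79-80  charge
)

atom = parse_atom_record(sample)
print(atom["name"], atom["x"], atom["y"], atom["z"])
```

Slicing by column position, rather than splitting on whitespace, is what keeps such a parser correct when neighbouring fields run together or a field is blank.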

More specifically, a decent PDB format parser must take the following into consideration:
  • The four-character atom name specified in columns 13 to 16. Each type of biological molecule has its own convention for naming atoms. For example, the two H-bonds of the A-T pair are between " N1 " (A) and " N3 " (T), and between " N6 " (A) and " O4 " (T). In this regard, it is worth noting that PDB files converted by babel/openbabel do not follow this naming convention.
  • The one-character alternate location indicator (altLoc) in column 17.
  • The one-character residue insertion code (iCode) in column 27.
  • Other details to follow ...
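
To make the three points above concrete, here is a small Python sketch under stated assumptions. The names make_atom_line, residue_key and group_atoms are hypothetical, the sample records are invented, and keeping the first alternate location encountered is just one possible policy, not necessarily the one used by 3DNA or any other program.

```python
def make_atom_line(record, serial, name, alt_loc, res_name, chain,
                   res_seq, i_code, x, y, z):
    """Build a fixed-column test record (hypothetical helper for sample data)."""
    return "%-6s%5d %4s%1s%3s %1s%4d%1s   %8.3f%8.3f%8.3f%6.2f%6.2f" % (
        record, serial, name, alt_loc, res_name, chain, res_seq, i_code,
        x, y, z, 1.0, 20.0)


def residue_key(line):
    """Identify a residue by chainID, resSeq AND iCode (cols 22, 23-26, 27)."""
    return (line[21], int(line[22:26]), line[26])


def group_atoms(lines):
    """Map residue key -> {4-character atom name: (x, y, z)}.

    When an atom occurs with several altLoc values, only the first one
    encountered is kept (an assumed policy for this sketch).
    """
    residues = {}
    for line in lines:
        if line[0:6] not in ("ATOM  ", "HETATM"):
            continue
        atoms = residues.setdefault(residue_key(line), {})
        name = line[12:16]              # not stripped: " CA " differs from "CA  "
        if name not in atoms:           # later altLoc copies are skipped
            atoms[name] = (float(line[30:38]),
                           float(line[38:46]),
                           float(line[46:54]))
    return residues


lines = [
    # Same atom in two alternate locations, A and B.
    make_atom_line("ATOM", 1, " N1 ", "A", "A", "A", 10, " ", 1.0, 2.0, 3.0),
    make_atom_line("ATOM", 2, " N1 ", "B", "A", "A", 10, " ", 1.1, 2.1, 3.1),
    # Residue 10A: same resSeq as above, told apart only by the insertion code.
    make_atom_line("ATOM", 3, " N1 ", " ", "G", "A", 10, "A", 4.0, 5.0, 6.0),
    # Alpha carbon " CA " versus calcium ion "CA  ".
    make_atom_line("ATOM", 4, " CA ", " ", "GLY", "B", 1, " ", 7.0, 8.0, 9.0),
    make_atom_line("HETATM", 5, "CA  ", " ", " CA", "B", 201, " ", 0.0, 0.0, 0.0),
]

residues = group_atoms(lines)
for key in residues:
    print(key, residues[key])
```

A parser that strips the atom name, ignores column 17, or keys residues on the sequence number alone would merge or duplicate atoms in this small sample.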
If you are serious about your PDB format parser, there is a very simple thing to do: run it against all the NDB entries if you are working on nucleic acid structures (that's what I did for 3DNA), or against all the PDB entries if you are interested in protein structures. If it does not crash and does what it was designed for, then congratulate yourself!