Friday, May 15, 2009

PDB ATOM coordinates record

The PDB format is one of the standard formats for biological macromolecular structures (proteins, DNA/RNA, their complexes, etc.). It came into existence when the original Brookhaven PDB was established in 1971. Over the years, PDBML and mmCIF have been proposed as alternatives that allow more flexibility, yet the PDB format is still the most commonly used one. Each software package that deals with PDB structures has its own parser, at least for the Coordinate Section (especially the ATOM and HETATM records).

Overall, the PDB format is simple and very well documented. Its simplicity lies in its 'rigidity': every field occupies fixed columns, in FORTRAN 77 style. The ATOM/HETATM record description is excerpted below for easy reference:

COLUMNS    DATA TYPE       FIELD        DEFINITION
-------------------------------------------------------------------------------------
 1 -  6    Record name     "ATOM  "
 7 - 11    Integer         serial       Atom serial number.
13 - 16    Atom            name         Atom name.
17         Character       altLoc       Alternate location indicator.
18 - 20    Residue name    resName      Residue name.
22         Character       chainID      Chain identifier.
23 - 26    Integer         resSeq       Residue sequence number.
27         AChar           iCode        Code for insertion of residues.
31 - 38    Real(8.3)       x            Orthogonal coordinates for X in Angstroms.
39 - 46    Real(8.3)       y            Orthogonal coordinates for Y in Angstroms.
47 - 54    Real(8.3)       z            Orthogonal coordinates for Z in Angstroms.
55 - 60    Real(6.2)       occupancy    Occupancy.
61 - 66    Real(6.2)       tempFactor   Temperature factor.
77 - 78    LString(2)      element      Element symbol, right-justified.
79 - 80    LString(2)      charge       Charge on the atom.
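
The column table above maps directly onto string slices. Below is a minimal Python sketch of that idea, not a complete parser: the names `Atom`, `parse_atom_line`, and `_to_float` are hypothetical, and falling back to 1.00 and 0.00 when the occupancy or temperature factor field is blank is my own assumption for non-compliant files, not something the format prescribes.

```python
from typing import NamedTuple


class Atom(NamedTuple):
    record: str
    serial: int
    name: str
    alt_loc: str
    res_name: str
    chain_id: str
    res_seq: int
    i_code: str
    x: float
    y: float
    z: float
    occupancy: float
    temp_factor: float
    element: str
    charge: str


def _to_float(field: str, default: float) -> float:
    """Convert a fixed-width field, using `default` when it is blank (assumption)."""
    return float(field) if field.strip() else default


def parse_atom_line(line: str) -> Atom:
    """Slice one ATOM/HETATM line by fixed columns.

    The table counts columns from 1, so columns M-N become line[M-1:N].
    """
    # Pad short lines so that single-character lookups such as line[16]
    # never raise IndexError.
    line = line.rstrip("\r\n").ljust(80)
    return Atom(
        record=line[0:6].strip(),
        serial=int(line[6:11]),
        name=line[12:16],            # kept verbatim: the padding is significant
        alt_loc=line[16],
        res_name=line[17:20].strip(),
        chain_id=line[21],
        res_seq=int(line[22:26]),
        i_code=line[26],
        x=float(line[30:38]),
        y=float(line[38:46]),
        z=float(line[46:54]),
        occupancy=_to_float(line[54:60], 1.00),
        temp_factor=_to_float(line[60:66], 0.00),
        element=line[76:78].strip(),
        charge=line[78:80].strip(),
    )
```

Calling `parse_atom_line` on each line that starts with "ATOM  " or "HETATM" yields one `Atom` per record. Note that the atom name is deliberately not stripped, so " N1 " stays a four-character string.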

It takes little time and few lines in a scripting language such as Perl, Python, or Ruby to extract the specific information one is interested in, e.g., atomic coordinates. However, there are subtleties that simple script parsers miss. On top of that, one needs to understand that not all files claiming to be PDB files are standard-compliant.

More specifically, a decent PDB format parser must take the following into consideration:
  • The four-character atom name specified in columns 13 to 16. Each type of biological molecule has its own convention for naming atoms. For example, the two H-bonds of the A-T pair are between " N1 " (A) and " N3 " (T), and between " N6 " (A) and " O4 " (T). In this regard, it is worth noting that PDB files converted by babel/openbabel do not follow this naming convention.
  • The one-character alternate location indicator (altLoc) in column 17.
  • The one-character residue insertion code (iCode) in column 27.
  • Other details to follow ...
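
To make the three points above concrete, here is a hedged Python sketch of how a parser might respect them. The function name `select_atoms` and the policy of keeping only blank or "A" alternate locations are assumptions for illustration; other policies (for example, keeping the location with the highest occupancy) are equally defensible.

```python
def select_atoms(lines, preferred_alt_loc="A"):
    """Group ATOM/HETATM coordinates by residue, keeping one alternate location.

    Returns {(chainID, resSeq, iCode): {atom_name: (x, y, z)}}.
    """
    residues = {}
    for line in lines:
        if not line.startswith(("ATOM  ", "HETATM")):
            continue
        line = line.rstrip("\r\n").ljust(80)

        # altLoc (column 17): keep atoms without alternates, plus one chosen set.
        if line[16] not in (" ", preferred_alt_loc):
            continue

        # iCode (column 27) is part of the residue identity: residues 52 and
        # 52A on the same chain are different residues.
        key = (line[21], int(line[22:26]), line[26])

        # Atom name (columns 13-16) is kept as the full four characters.
        # Stripping it would make " CA " (alpha carbon) and "CA  " (calcium)
        # indistinguishable.
        name = line[12:16]
        xyz = (float(line[30:38]), float(line[38:46]), float(line[46:54]))
        residues.setdefault(key, {})[name] = xyz
    return residues
```

A parser that splits lines on whitespace loses all three pieces of information at once, because blank altLoc and iCode fields vanish and the atom name padding is discarded.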
If you are serious about your PDB format parser, there is one very simple thing to do: run it against all NDB entries if you work on nucleic acid structures (that's what I did for 3DNA), or against the entire PDB if you are interested in protein structures. If it does not crash and does what it was designed to do, then congratulate yourself!
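
Such a bulk run could look roughly like the Python sketch below. Everything here is an assumption for illustration: `stress_test` is a hypothetical name, `parse_line` stands for whatever per-line parser you wrote, and the entries are assumed to sit uncompressed in one local directory with a `.pdb` extension. It only catches crashes; checking that the parser "does what it was designed to do" still needs format-specific assertions of your own.

```python
from pathlib import Path


def stress_test(parse_line, pdb_dir, pattern="*.pdb"):
    """Run `parse_line` on every ATOM/HETATM line of every matching file.

    Returns a list of (file name, line number, error) for lines that raised.
    """
    failures = []
    for path in sorted(Path(pdb_dir).glob(pattern)):
        # errors="replace" keeps stray non-ASCII bytes from aborting the run.
        with open(path, errors="replace") as handle:
            for lineno, line in enumerate(handle, start=1):
                if not line.startswith(("ATOM  ", "HETATM")):
                    continue
                try:
                    parse_line(line)
                except Exception as exc:  # record the failure and keep going
                    failures.append((path.name, lineno, repr(exc)))
    return failures
```

An empty list from `stress_test(my_parser, "/path/to/entries")` means no line crashed the parser; each tuple otherwise points at the exact file and line to inspect.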
