Sunday, June 5, 2011

Lower case chain identifiers in PDB format

First formulated in the early 1970s, the PDB format is rigid, with fixed columns for designated contents in its ATOM/HETATM records. Specifically, a single column, #22, is assigned to the chain identifier (id). Traditionally, the 26 upper case letters of the English alphabet (A-Z), the space (i.e., ' '), and the single digits (0-9) have been used as chain ids. Until the ribosomal structures came along, I guess, those 26 + 1 + 10 = 37 characters were sufficient for chain ids.
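To make the fixed-column layout concrete, here is a minimal Python sketch of reading the chain id from column 22. It is not 3DNA code; the function name chain_id and the sample record are made up for illustration, and the sample is assembled field by field only so that the columns line up.

```python
from typing import Optional


def chain_id(record: str) -> Optional[str]:
    """Return the chain id (column 22) of an ATOM/HETATM record, else None."""
    # Only coordinate records carry a chain id in this position.
    if not record.startswith(("ATOM  ", "HETATM")):
        return None
    # Guard against truncated lines.
    if len(record) < 22:
        return None
    # Column 22 (1-based) is index 21 in 0-based Python; the case is kept as-is.
    return record[21]


# Hypothetical record: record name, serial, atom name, residue name,
# chain id 'w', and residue number, each padded to its fixed width.
sample = (
    "ATOM  " + f"{1:>5}" + " " + f"{' P':<4}" + " "
    + f"{'G':>3}" + " " + "w" + f"{10:>4}"
)

print(repr(chain_id(sample)))          # chain id as written in the record
print(repr(chain_id(sample.upper())))  # what a parser that upper-cases the record sees
```

The second call shows the side effect of upper-casing a whole record before parsing: the lower case chain id is lost.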

To the best of my knowledge, most PDB parsers have long assumed upper case chain ids. Indeed, 3DNA v1.5 automatically converts each ATOM/HETATM record to upper case. The first time I became aware of lower case chain ids was when I saw a post on the 3DNA forum, titled "Small bug in find_pair", where a user reported the 'w' vs 'W' chain ids in PDB entry 1VSP. I then refined 3DNA so that the case of chain ids can be preserved through an undocumented command line option (a feature for internal testing purposes).

My view that 3DNA chain ids should be case-sensitive was reinforced when I read the article "Crystal structures of CGG RNA repeats with implications for fragile X-associated tremor ataxia syndrome", recently published in Nucleic Acids Research. The asymmetric unit of the unmodified CGG-repeats-containing duplex (GCGGCGGC)2, NDB entry NA1017 / PDB entry 3R1C, contains a total of 36 chains, designated A-Z plus a-j. Without distinguishing the case of the chain ids, the 3DNA output would become quite confusing.
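The confusion is easy to quantify. The short Python sketch below is illustrative only, not taken from 3DNA; it uses just the chain designations quoted above (A-Z plus a-j) and compares the number of distinct chain ids with and without case folding.

```python
import string

# Chain ids as designated for 3R1C in the text: A-Z plus a-j.
chain_ids = list(string.ascii_uppercase) + list("abcdefghij")

# Case-preserving view: every chain keeps its own id.
preserved = set(chain_ids)

# Case-folding view: what a parser sees after upper-casing each id.
folded = {cid.upper() for cid in chain_ids}

# Lower case ids that collide with an upper case id once folded.
collisions = sorted(cid for cid in preserved if cid.islower())

print(len(preserved), "distinct ids when case is preserved")
print(len(folded), "distinct ids after upper-casing")
print("ids merged into their upper case twins:", "".join(collisions))
```

With case folding, each of the chains a-j becomes indistinguishable from the chain A-J of the same letter, so ten pairs of chains would share a label in the output.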

Thus, in future releases of 3DNA, the default will be switched to preserve the case of chain ids. This chain id 'case' serves as an excellent example that scientific software products, unlike publications per se, are not fixed but need continuous care and maintenance to meet the challenges of an evolving world.
