Saturday, July 25, 2009

PDB id vs NDB id

For nucleic-acid-containing structures, PDB and NDB are the two most widely used databases (databanks). Both are maintained at Rutgers University. Of the two, PDB is the primary one; NDB is essentially a subset of it, with extra derived parameters describing base-pair geometry.

As is always the case, each entry in a database is uniquely identified by an id. Interestingly, PDB and NDB have adopted radically different approaches to choosing their ids.
  • PDB id is (currently) 4 characters long: the first character is a numeral in the range 1-9, while the rest can be either numerals or letters. Early PDB ids could be acronyms: for example, 1bna for the famous Dickerson-Drew B-DNA dodecamer with sequence CGCGAATTCGCG, the first full-turn B-DNA duplex; and 1mbn for myoglobin, the first solved protein structure. More recently, due to the rapid increase in deposited macromolecular structures, the PDB ids "are automatically assigned and do not have any meaning." (page 9)
  • NDB id by design seems to contain more information, even though I could not locate a detailed specification through online search. For example, A-DNA, B-DNA and Z-DNA ids start with AD, BD, and ZD, respectively; protein-DNA complexes start with PD; ribosomal RNAs start with RR; and so on. Furthermore, the third letter also has a meaning in the NDB code. E.g., the L in BDL084 means 12, since it is the 12th letter of the English alphabet; thus we know BDL084 is a B-DNA dodecamer. Similarly, the H in ADH026 means 8, thus ADH026 is an A-DNA octamer.
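
The two id patterns described in the list above can be sketched in a few lines of Python. This is only an illustration of the conventions as noted in this post, not an official specification from PDB or NDB: the function names, the regular expression and the prefix table are my own assumptions, and the third-letter rule is only illustrated here for the DNA examples.

```python
import re
import string

# PDB id as described above: 4 characters, first a numeral 1-9,
# the rest numerals or letters (an assumption drawn from this post).
PDB_ID = re.compile(r"[1-9][0-9A-Za-z]{3}")

# NDB prefixes mentioned in this post only; the real list is longer.
NDB_PREFIXES = {
    "AD": "A-DNA",
    "BD": "B-DNA",
    "ZD": "Z-DNA",
    "PD": "protein-DNA complex",
    "RR": "ribosomal RNA",
}


def looks_like_pdb_id(text):
    """Return True if text matches the 4-character PDB pattern."""
    return PDB_ID.fullmatch(text) is not None


def decode_ndb_id(ndb_id):
    """Return (category, length) guessed from the literal NDB id.

    length is the alphabet position of the third character
    (L -> 12, H -> 8), or None when that character is not a letter.
    Whether this rule holds outside the DNA examples is unconfirmed.
    """
    ndb_id = ndb_id.upper()
    category = NDB_PREFIXES.get(ndb_id[:2])
    third = ndb_id[2:3]
    if third and third in string.ascii_uppercase:
        length = string.ascii_uppercase.index(third) + 1
    else:
        length = None
    return category, length


print(looks_like_pdb_id("1bna"))   # True
print(decode_ndb_id("BDL084"))     # ('B-DNA', 12)
print(decode_ndb_id("ADH026"))     # ('A-DNA', 8)
```

Note that the decoder only reads the literal id; it says nothing about what the structure actually looks like.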
Since NDB (1992) appeared much later than PDB (1971) and was developed as a better database for macromolecular structures than the PDB (at that time), it is conceivable that its id scheme was part of the initial NDB design. However, even though the NDB id serves its purpose well (up to now, and in a broad sense), users need to be aware of one fundamental flaw inherent in the literal meaning of the NDB ids. As a concrete example, for the Ng et al. (1999) crystal structure of an A/B-DNA intermediate, PDB assigned it an id of 1dc0 -- no intuitive meaning and nothing misleading, just an identifier. In contrast, NDB assigned it an id of BD0026, which reads as B-DNA following the pattern noted above and is clearly misleading. Moreover, the structure is actually more similar to A-DNA than to B-DNA, as far as the characteristic parameters distinguishing A- and B-DNA -- slide, chi torsion angle, and sugar conformation -- are concerned.

I have no idea how many such mis-picked ids exist in the NDB. What is clear is that as more and more unusual structures (especially RNA) are deposited (or extracted from the PDB), it will become even harder to pick an id in its 'canonical' sense. Inconsistency will then become a big issue. In contrast, PDB ids do not have such a problem by design, whether an id is an acronym or a random, automatic pick by a software program.

Over the years, NDB has served me no other purpose than as a pre-selected subset of PDB entries containing nucleic acid structures. It has become clear to me that starting directly from PDB would be a better choice, if for nothing else than to remove a level of redundancy and to avoid possibly misleading ids.