Xiang-Jun's Corner

Sunday, August 9, 2009

On third-party interfaces to 3DNA

One clear sign of 3DNA's acceptance by the nucleic-acid structure-related scientific community is its integration into a wide range of GUI/web interfaces (e.g., NDB, w3DNA, and 3D-DART) and many other packages. As the author and maintainer of the 3DNA software package, I am (of course) very happy to see this happen: I can easily imagine that better accessibility would open 3DNA to an even larger community, including non-experts in nucleic acid structures (e.g., for educational purposes). It should be emphasized, however, that point-and-click interfaces to 3DNA (and to any other serious scientific software, for that matter), while convenient for occasional users and for routine jobs, have their limitations.

Overall, a point-and-click interface is sensible only for well-defined, routine tasks. In the context of 3DNA, the following tasks (the list is not necessarily complete) are perfectly suitable:

Generate one of the 55 fiber models, where everything (model number, sequence or number of repeats) can be unambiguously defined through a web-form
Build an arbitrary DNA model with a user-specified parameter set
Create a blocview image for a nucleic-acid containing structure specified in PDB format
Analyze a regular (i.e., not deviating too much) double helix structure, be it in A-, B-, or Z-form
Calculate a list of backbone torsion angles and other parameters (with the "-s" option of find_pair and then analyze)
Find a list of all possible (RNA) base-pairs fulfilling a specific set of geometric criteria (with the "-p" option of find_pair).

Other than those (and other carefully selected/qualified functionality), users should be very careful about what they get from running a third-party tool that outputs 3DNA-related parameters. I know of the many subtleties in using 3DNA to solve a specific problem, and most of the time the command-line driven style is the most effective, as it allows for easy automation of trial-and-error iterations. Software developers who integrate 3DNA into their packages should make the limitations clear to their users, and direct any 3DNA-specific problems to the 3DNA forum.

In today's informatics world, there are so many "easy-to-use" tools available, claiming to be able to solve all sorts of problems (well, that's understandable -- otherwise, how could one get published, especially in the big journals/magazines?). Any serious scientist, however, should know what he/she is doing: it is easy to get some (magic) numbers by clicking a button, but understanding clearly what the numbers mean is yet another story. I cannot overemphasize the importance of knowing one's tools, including their limitations.

Saturday, August 8, 2009

Prefatory articles in Annual Review of Biochemistry make good reading

Over time, I have read several (mostly structural biology related) prefatory articles published in the Annual Review of Biochemistry and found them quite interesting. Some examples:

In Vol. 78 (July 2009), James Wang's article, titled "A Journey in the World of DNA Rings and Beyond", told a story behind the discovery of DNA topoisomerase, an enzyme that converts one form of DNA ring to another.
In Vol. 73 (July 2004), Alexander Rich wrote about "The Excitement of Discovery" as a scientist. Rich's research has often been "on the question how molecular structure leads to biological function". I am especially impressed by his vivid description of the work with Crick on the structure of collagen (p.10-12) and its "strong positive effect" on his psyche:
For one thing, I began to develop some self-assurance in my ability to carry out research and make discoveries. I believe that a form of “scientific maturation” is an important component in developing a confident thrust into research work.
In Vol. 40 (July 1971), John Edsall provided interesting insights about his editorial work in his article titled "Some Personal History and Reflections from the Life of a Biochemist".

Overall, such articles are well worth reading -- they provide background information and the context on the discoveries made by those leading scientists.

Sunday, July 26, 2009

Meaning of nucleotide IUPAC codes

Today, anyone with some basic knowledge of biochemistry should be familiar with A, C, G and T, the four bases of DNA, and probably the A-T and G-C Watson-Crick base-pairs as well. The meaning of the one-letter abbreviations is very clear: A for Adenine, C for Cytosine, G for Guanine, and T for Thymine. Of course, for RNA, there is the U (for Uracil) in place of T of DNA.

In the early days when I entered the field of DNA structure, I also learned that R stands for puRine, i.e., A and G, and Y for pYrimidine, i.e., C and T (U). Trained as a chemist, I had no difficulty at all in understanding and remembering them. While processing base sequences in bioinformatics projects, I have come across the IUPAC degeneracy codes of nucleotides, such as S, W, M, K, D, V, etc. Except for N (A, C, G, T), I had never really been able to memorize what they represent.

My confusion has been cleared up completely, however, by a web document I happened to find: "Nomenclature for Incompletely Specified Bases in Nucleic Acid Sequences" (1984) by the Nomenclature Committee of the International Union of Biochemistry (NC-IUB). This is the document I wish I had known of from the very beginning. For completeness of this post, here is a summary table of all the IUPAC codes. It is based on Table 1 in the above document, except for the uracil and gap rows:

Symbol  Meaning           Origin of designation
-----------------------------------------------------------
G       G                 Guanine
A       A                 Adenine
T       T                 Thymine
C       C                 Cytosine
U       U                 Uracil
R       G or A            puRine
Y       T or C            pYrimidine
M       A or C            aMino
K       G or T            Keto
S       G or C            Strong interaction (3 H bonds)
W       A or T            Weak interaction (2 H bonds)
H       A or C or T       not-G, H follows G in the alphabet
B       G or T or C       not-A, B follows A
V       G or C or A       not-T (not-U), V follows U
D       G or A or T       not-C, D follows C
N       G or A or T or C  aNy
. or -  gap
-----------------------------------------------------------
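For use in scripts, the table can be written down as a simple lookup. The following is only a minimal sketch in Python: the names IUPAC_CODES, expand and matches are my own choices for illustration, not part of 3DNA or of any standard library, and the gap symbols are deliberately left out.

```python
# IUPAC nucleotide codes, transcribed from the table above.
# Gap symbols ('.' and '-') are intentionally not handled in this sketch.
IUPAC_CODES = {
    "G": "G", "A": "A", "T": "T", "C": "C", "U": "U",
    "R": "GA", "Y": "TC", "M": "AC", "K": "GT", "S": "GC", "W": "AT",
    "H": "ACT", "B": "GTC", "V": "GCA", "D": "GAT",
    "N": "GATC",
}


def expand(symbol):
    """Return the set of bases that a single IUPAC symbol stands for."""
    return set(IUPAC_CODES[symbol.upper()])


def matches(pattern, sequence):
    """Check whether a concrete sequence is compatible with a degenerate pattern.

    The comparison is position by position; an unknown symbol in the
    pattern raises KeyError.
    """
    if len(pattern) != len(sequence):
        return False
    return all(base.upper() in expand(symbol)
               for symbol, base in zip(pattern, sequence))
```

For example, expand("R") gives the two purines G and A, and matches("RY", "AT") holds because A is a purine and T is a pyrimidine.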

Saturday, July 25, 2009

PDB id vs NDB id

For nucleic-acid-containing structures, PDB and NDB are the two most widely used databases (databanks). Both PDB and NDB are maintained at Rutgers University. Of the two, PDB is primary; NDB is essentially a subset of it, with extra derived parameters regarding base-pair geometry.

As is always the case, each entry in a database is uniquely identified by an id. Interestingly, PDB and NDB have adopted radically different approaches in choosing their ids.

PDB id is (currently) 4 characters long: the first character is a numeral in the range 1-9, while the rest can be either numerals or letters. Early PDB entries could be acronyms. For example, 1bna for the famous Dickerson-Drew B-DNA dodecamer with sequence CGCGAATTCGCG, the first full turn B-DNA duplex; and 1mbn for myoglobin, the first solved protein structure. Recently, due to the quick increase of deposited macromolecular structures, the PDB ids "are automatically assigned and do not have any meaning." (page 9)
NDB id by design seems to contain more information, even though detailed specifications cannot be located from online search. For example, A-DNA, B-DNA and Z-DNA ids start with AD, BD, and ZD, respectively; protein-DNA complexes start with PD; and ribosomal RNAs start with RR, etc. Furthermore, the third letter also has a meaning in the NDB code. E.g., L in BDL084 means 12 since it is the 12th letter in the English alphabet, thus we know BDL084 is a B-DNA dodecamer. Similarly, the H in ADH026 means 8, thus ADH026 is an A-DNA octamer.
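The two id schemes described above can be sketched in a few lines of Python. This is only an illustration of the patterns noted in this post, not an official specification of either database: the function names are hypothetical, the prefix table covers only the examples given here, and restricting the length code to the AD/BD/ZD prefixes is my own assumption.

```python
import re
import string

# PDB id as described above: 4 characters, a digit 1-9 followed by
# three letters or digits.
PDB_ID = re.compile(r"[1-9][0-9A-Za-z]{3}")

# NDB prefixes mentioned in this post only; the real list is longer.
NDB_PREFIXES = {
    "AD": "A-DNA",
    "BD": "B-DNA",
    "ZD": "Z-DNA",
    "PD": "protein-DNA complex",
    "RR": "ribosomal RNA",
}

# Assumption: the third-letter length code is applied only to these prefixes.
LENGTH_CODED_PREFIXES = {"AD", "BD", "ZD"}


def looks_like_pdb_id(text):
    """True if the text has the shape of a (current) 4-character PDB id."""
    return PDB_ID.fullmatch(text) is not None


def describe_ndb_id(ndb_id):
    """Return (category, oligomer_length) read literally from an NDB id.

    Either item is None when it cannot be read from the id.
    """
    prefix = ndb_id[:2].upper()
    third = ndb_id[2:3].upper()
    category = NDB_PREFIXES.get(prefix)
    length = None
    if (prefix in LENGTH_CODED_PREFIXES and len(third) == 1
            and third in string.ascii_uppercase):
        # 'A' -> 1, 'B' -> 2, ..., 'L' -> 12
        length = string.ascii_uppercase.index(third) + 1
    return category, length
```

With these definitions, describe_ndb_id("BDL084") gives ("B-DNA", 12) and describe_ndb_id("ADH026") gives ("A-DNA", 8). Such a literal reading is only as reliable as the id itself.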

Since NDB (1992) appeared much later than PDB (1971) and was developed as a better database for macromolecular structures than the PDB (at that time), it is conceivable that its id scheme was part of the initial NDB design. However, even though the NDB id serves its purpose well (up to now, and in a broad sense), users need to be aware of one fundamental flaw inherent in the literal meaning of the NDB ids. As a concrete example, for the Ng et al. (1999) crystal structure of an A/B-DNA intermediate, PDB assigned it an id of 1dc0 -- no intuitive meaning and nothing misleading, just an identifier. In contrast, NDB assigned it an id of BD0026, meaning B-DNA, following the pattern noted above, which is clearly misleading. Moreover, the structure is actually more similar to A-DNA than to B-DNA, as far as the characteristic parameters distinguishing A- and B-DNA -- slide, chi torsion angle, and sugar conformation -- are concerned.

I have no idea how many such mis-picked ids exist in the NDB. What is clear is that as more and more unusual structures (especially RNA) are deposited (or extracted from the PDB), it will become even harder to pick an id in its 'canonical' sense. Inconsistency will then become a big issue. In contrast, PDB ids do not have such a problem by design, whether an id is an acronym or a random, automatic pick by a software program.

Over the years, NDB has served me no other purpose than as a pre-selected subset of PDB entries containing nucleic acid structures. It has become clear to me that starting directly from PDB would be a better choice, if for nothing else than to remove a level of redundancy and to avoid possibly misleading ids.

Friday, July 17, 2009

Does open access increase citation?

In the July 17, 2009 issue of Science, there are several letters discussing the brevia titled "Open Access and Global Participation in Science" by Evans and Reimer who reported that:

The influence of OA [open access] is more modest than many have proposed, at ~8% for recently published research, but our work provides clear support for its ability to widen the global circle of those who can participate in science and benefit from it.

On one hand, Philip Davis from Cornell University argued that "Open Access: Increased Citations Not Guaranteed" in the title of his letter. On the other hand, Michael Eisen (HHMI and UC Berkeley) and Steven Salzberg (University of Maryland) stated that "Open Access: The Sooner the Better" -- in their opinion, "the 8% statistic that Evans and Reimer highlight is misleading. .... In particular, when articles were made freely available within 2 years of publication, their citations increased by almost 20%." Interestingly, the same issue also published Evans' response, addressing comments and criticisms from other scientists.

Another very interesting point: Eisen and Salzberg also expressed their concern about the unavailability of the raw citation information used in the Evans and Reimer report, saying this "is an astonishing violation of the norms of science, and the explicitly stated publication policies of Science." In response to this point, Science's Editor noted that "[Science] do not preclude our authors from obtaining data from commercial sources when those are the only sources of the data and when those data are available to the scientific community."

Overall, the discrepancies about the influence of open access on citations, and how to possibly resolve them in relation to the availability of the original data, are typical in science. Presumably, this is a relatively simple case. Yet, there are still many variables in data selections and interpretations etc. Even if the "raw data" are made available, which certainly would be a big help, I still doubt that the discrepancies could be resolved. On the other hand, the "raw data" are only secondary in the sense that they were collected by Evans and Reimer using some specific criteria. If the detailed steps are made available such that the reported figures and tables can be reproduced by those who have access to the commercial sources, then things would become clear. In other words, it is not just the data nor the numbers (in published figures and tables), but the exact procedures, of how the numbers have been produced from the original data, that could provide a convincing resolution (if there is one).

Saturday, July 11, 2009

On maintaining the 3DNA forum

Over the past few years, maintaining the 3DNA forum (i.e., answering questions, performing administrative tasks) has taken up a significant amount of my spare time. Sometimes it could be quite demanding, especially because I need to pay great attention to details. Overall, though, it is a valuable experience, and I feel that the time is well-spent: 3DNA has been continuously refined and more widely used; my knowledge of nucleic acid structures (especially RNA) has been significantly sharpened; I have stayed aware of progress in related research fields and seen more of the world; and I feel great pleasure in being of help to the community.

Some basic facts/statistics:

I tell everyone who sends me an email to ask a 3DNA-related question to register and re-post in the 3DNA forum, but less than 50% actually do this. However, if you want to be helped, you have to follow the rules.

Most of the forum registrations (over 90% at times) are spam. To forestall this, I must continually update phpBB3 to its latest version.

Over 50% of legitimate registrations end up being deleted instead of being activated due to the users' failure to send me an email for activation, as required. Among those who send me an email, most use a subject line of "Re: 3DNA forum registration — 'user-id'", as suggested. Only a few volunteer to share with me their real name, address, etc. (e.g., via a signature). Furthermore, some do not post back in the forum after their account is activated.

With very few exceptions, questions posted in the forum are normally addressed within a couple of days, or even sooner.

Except for one case, communicating with users has mostly been a pleasant experience. Some (though not many) users even posted back a summary and/or a thank-you note.

I would like to compliment esguerra, yrxin and tgaillar for sharing their tips and tricks in the section Users' contributions. Apart from me, ghzheng has contributed the most in the forum.

Overall, the forum is low volume (which is just fine) and spam-free (which is very important).

To make the forum policy upfront and explicit, in order to avoid misunderstandings or surprises, I have been enclosing the following note in each new registration confirmation message:

Your 3DNA forum registration has been activated — welcome aboard! See
http://xiang-jun.blogspot.com/2009/07/on-maintaining-3dna-forum.html

---------------------------------------------------------------------
I am so pleased that you have come thus far! To make the 3DNA forum a
more pleasant virtual community for all of us to learn from and
contribute to, please be considerate and practice good netiquette
(http://www.albion.com/netiquette/). More specifically, I would like
to reemphasize the following:

0. Do your homework; read the FAQ and browse the forum.

1. Ask your questions in the 3DNA forum instead of sending me emails.

2. Be specific with your questions; provide a minimal, reproducible
  example if possible; use attachments where appropriate.

3. Do not ask for or expect immediate responses to your questions.
  Lower your expectations and you will more likely end up feeling
  happier.

4. Respond to requests for clarifications.

5. Summarize the solution to your problem(s) from a user's
  perspective by providing details, for the benefit of other users.

6+ Contribute back to 3DNA if you can:
    o Report bugs — including typos
    o Make constructive suggestions — anything to make 3DNA better
    o Answer other users' questions
    o Share your use cases in the "Users' contributions" section

In a nutshell, you are welcome to participate and should not hesitate
to ask questions, but remember to play nice and preferably share what
you've learned!
---------------------------------------------------------------------

Thus, intentionally or otherwise, the forum has also acted as a filter to make my life easier. Whenever possible, though, I have tried my best to reward those who follow the simple, common sense rules. After all, nothing should be taken for granted, and no one likes to be taken advantage of. I am glad that through my contributions and user involvement, the forum has survived and 3DNA has thrived (evident from citations, numerous web links to its homepage, other services/tools — including NDB and PDB — taking advantage of parts of its functionality, and more recently, two dedicated web-interfaces), serving as a valuable resource to the community.

PS: Two related posts in the 3DNA forum:

Welcome message from Xiang-Jun Lu

Activating your newly registered 3DNA forum account

Friday, July 10, 2009

Does 3DNA work for RNA?

At the C2B2 party this afternoon, I was asked the question: "Does 3DNA work for RNA?" Well, a good question, indeed. The short answer is definitely, YES. However, a detailed explanation is needed to address the underlying intuitive assumption: 3DNA is only for DNA.

The name 3DNA is due to Dr. Olson, after we had struggled for quite a while. Initially, we played with NuStar (which was actually cited once by Richard Dickerson et al.), Carnival, etc. I still remember the day when Dr. Olson asked me "How about 3DNA?" We immediately reached an agreement: that's it -- what a cute name! Another advantage (as became clear later): since 3DNA starts with '3', it (mostly) shows up right at the top of many on-line lists of bioinformatics tools.
Interpreted literally, 3DNA could mean 3-DNA, i.e., the three most common types of DNA: A-, B- and Z-form. That may be one of the reasons behind the misconception that 3DNA is only for DNA. Another reason could be that structural work on DNA is what the Olson lab is best known for.
The number '3' in 3DNA should also be associated with its three key components: analysis, rebuilding and visualization. In a sense, this is my favorite.
Of course, 3DNA stands for 3D-NA, 3-Dimensional Nucleic Acids, as expressed explicitly in the titles of our two 3DNA papers (2003 NAR and 2008 NP).

The applications of 3DNA to RNA structures can be broadly categorized as follows:

Automatically detect all existing base-pairs, Watson-Crick (A-U, G-C, wobble G-U) or non-canonical, using a set of simple geometric criteria. Furthermore, it has a unique base-pair classification system based on the six numerical structural parameters, suitable for database storage and search.
Automatically detect all triplets or higher-order base-associations.
Automatically detect double helical regions, regardless of backbone connection, thus ideal for finding pseudo-continuous coaxial stacking.
The above three features are seamlessly integrated with the visualization component to allow for easy generation of publication quality images. See the 3DNA 2008 NP paper for detailed examples.

As further examples, the following two RNA publications take advantage of find_pair from 3DNA:

R. Tyagi & D. H. Mathews (2007). Predicting Helical Coaxial Stacking in RNA Multibranch Loops. RNA. 13: 939 - 951. See the note from the authors' webpage for clarification of mis-citation to find_pair.
R. Capriotti & M. A. Marti-Renom (2009). SARA: a Server for Function Annotation of RNA Structures.

It is well worth noting that the base-pair detecting algorithm in RNAView is based on an earlier version of find_pair, a basic fact ignored in the RNAView publication.

In summary, 3DNA works for RNA as well as for DNA, and more.

Sunday, July 5, 2009

Errors in PDB entries

In the June 24, 2009 issue of Nature (v459, pp.1038-1039), there is a news item titled "New protein structures replace the old" by Katharine Sanderson, on a 'Dutch software to weed out errors in Protein Data Bank'.

In my experience with software development and using the PDB/NDB, it is certainly not a surprise that there are errors of various types in the macromolecular databases: whenever I apply an algorithm consistently to all the entries in the NDB (which is part of PDB, consisting of only nucleic-acid containing structures), I always notice some inconsistencies. As a more concrete example, blocview, a visualization tool initially developed as a by-product of another project while I was still at Rutgers (and partially involved with the NDB), was once used for correcting errors in the NDB as well.

Pure 're-refinement' of existing structures with software is surely helpful in catching obvious, systemic errors. However, it is impossible to catch all problems, no matter how sophisticated the software could be. Moreover, as put in the comment by yet another phd: "i have hard enough time getting the RCSB to change four atoms in a structure for me." and "scientists must remember that when they click the re-refine button, you read the paper where the structure was reported."

Errors will always be there -- that's just a basic fact of life. It is thus crucial for those who perform structural analysis to draw their conclusions based on not one or just a few purposely selected structures, but on a more objective and extensive ground.

Two web-interfaces to 3DNA, and more

As a nice surprise, I found in the 2009 web-server issue of NAR, published on July 1, two back-to-back articles on web interfaces to 3DNA:

"3D-DART: a DNA structure modelling server" by van Dijk and Bonvin from Utrecht University, The Netherlands. Excerpt from the abstract:
As a response to the demand for 3D-structural models reflecting the intrinsic plasticity of DNA we present the 3D-DART server (3DNA-Driven DNA Analysis and Rebuilding Tool). The server provides an easy interface to a powerful collection of tools for the generation of DNA-structural models in custom conformations. The computational engine beyond the server makes use of the 3DNA software suite together with a collection of home-written python scripts.
"Web 3DNA—a web server for the analysis, reconstruction, and visualization of three-dimensional nucleic-acid structures" by Zheng, me and Olson. Excerpt from the abstract:
The w3DNA (web 3DNA) server is a user-friendly web-based interface to the 3DNA suite of programs for the analysis, reconstruction, and visualization of three-dimensional (3D) nucleic-acid-containing structures, including their complexes with proteins and other ligands.

While I was aware of 3D-DART prior to its publication, I certainly did not expect it to appear in the same issue as w3DNA, of which I am a co-author. There is nothing more compelling to illustrate 3DNA's value to the community than a third-party web-interface to it! Combined together, these two web-servers make 3DNA much more accessible to an even wider audience. Specifically, they could well serve educational purposes, e.g., to conveniently build a DNA model of A-, B- or C-form, with user-supplied sequences. Users would be glad to have a choice that better fits their needs. In the long run, the one which provides the best user support will survive.

It is also worth noting that in the same 2009 NAR web-server issue, another paper titled "SARA: a server for function annotation of RNA structures" by Capriotti and Marti-Renom from Spain also makes use of 3DNA. This serves to emphasize the point that 3DNA is not just for DNA, but for RNA as well -- 3DNA has unique features for RNA that are not found in other currently available software tools that I am aware of.