Xiang-Jun's Corner

Saturday, January 23, 2010

Chemical diagram of Watson-Crick base-pairs

Once in a while, I need to refresh my memory about the chemical identities of the most common nucleobases: A, C, G, T/U. Sometimes, it is also necessary to explain to non-(bio)chemists the concepts of H-bond donor vs. acceptor, and the major vs. minor groove of the DNA double helix. In such cases, I use the following (customized) chemical diagram of Watson-Crick base-pairs (WC-bp):

Before making the effort to create my own version of the Watson-Crick base-pair diagram, I googled around and found many illustrations (like the one in Wikipedia). However, none of them suited my needs perfectly:

Trained as a chemist, I would like to see chemical bond types (double vs. single bond);
Working extensively with PDB format, I want to have the atom numbering information as well.

So I ended up (re)creating my own version of the WC-bp diagram: I used Chemtool to sketch the framework, and Xfig to fine-tune it. Overall, the diagram serves my purpose quite well, and hopefully others will find it useful as well.

Thursday, January 7, 2010

Requests for SCHNAaP/SCHNArP source code

Recently, I received several requests for the source code of SCHNAaP/SCHNArP, a software package for the analysis and rebuilding of double helical nucleic acid structures. This suite of programs was developed ten years ago during my PhD work on DNA base-stacking interactions with Dr. Chris Hunter at the University of Sheffield, England.

Users become interested in SCHNAaP/SCHNArP mostly because of 3DNA, which can be taken as its successor and more popular version. Due to Rutgers' policy of not releasing the source code of 3DNA, users who would like to understand the details of the underlying algorithms turn back to SCHNAaP/SCHNArP. The interface is a bit aged, but the mathematics is still valid: it could serve well as a starting point for those who really want to get into the world of nucleic acid structures.

Overall, though, I have mixed feelings about this situation. On one hand, I am happy to see people becoming interested in my (previous) work. On the other hand, it also becomes clear that Rutgers' current licensing policy has blocked 3DNA's further circulation and adoption by the scientific community. Given the current trend of open-source software development, I see no reason to continue keeping 3DNA closed source. Making 3DNA open source (under a proper license term, of course) would allow interested users to get more directly involved in the project, and thus move the software to the next level.

Saturday, December 19, 2009

ORCID -- an international research identification system?

From the Nature news article titled "Credit where credit is due" (462:7275, p. 825, December 17, 2009), I came across the ORCID initiative:

Name ambiguity and attribution are persistent, critical problems imbedded in the scholarly research ecosystem. The ORCID Initiative represents a community effort to establish an open, independent registry that is adopted and embraced as the industry’s de facto standard. Our mission is to resolve the systemic name ambiguity, by means of assigning unique identifiers linkable to an individual's research output, to enhance the scientific discovery process and improve the efficiency of funding and collaboration.

Overall, I think it is a good idea. If properly implemented and widely adopted, ORCID could help solve lots of issues associated with various ways of spelling a person's name due to, e.g., cultural differences. For example, in the Chinese convention, one's family name comes before one's given name, just the opposite of the western convention. Additionally, when a given name has two characters (quite common), there could be a space or a hyphen (as I normally put in Xiang-Jun) or nothing in between. Combined with possible first name initials, there are already many ways to spell out a Chinese name.

The above Nature article, “Credit Where Credit is Due”, helps introduce the ORCID Initiative. As a specific example, it points to another article on page 843, where Nature profiles a research group trying to "complete the reference human genome sequence, which is still full of errors nearly a decade after the first draft was announced in 2000." Nature acknowledges that "It is essential work", "But it is also work that offers few academic rewards beyond the satisfaction of a job well done — it is unlikely to result in a high profile publication." Hopefully, by adopting the ORCID system, contributions of such types (e.g., software support and maintenance) will be more properly acknowledged by the scientific community.

Given the high profile of the founding parties, I am hopeful that the ORCID initiative will move forward as promised. I will keep an eye on it and see how it evolves.

Friday, December 18, 2009

Ribosomal structure: it helps to know some background information

This year's Nobel Prize in Chemistry has been awarded to Venki Ramakrishnan, Tom Steitz, and Ada Yonath "for studies of the structure and function of the ribosome."

My connection with ribosomal structure began with the 50S large subunit of Haloarcula marismortui solved by Tom Steitz's group. Ever since the fully refined crystal structure at 2.4 Å resolution was published in 2001 (PDB entry: 1jj2; NDB code: rr0033), I have been using it to check 3DNA's applicability. In the two 3DNA papers (2003 NAR and 2008 NP), 1jj2 was used as an example to illustrate how find_pair can identify higher-order base associations in complicated RNA-containing structures. At the time, though, my understanding of the ribosomal RNA structure was purely geometrical: for quite a while, I was overwhelmed by the biological terminology, including the various S-es: 50S large ribosomal subunit vs. the 23S and 5S rRNA; and of course, the 30S small subunit vs. 16S rRNA.

Over the past year or so, I have become more interested in RNA structures. After reading a lot of related articles, I feel things are gradually becoming clearer than before. Nevertheless, something was still missing, since my focus has (mostly) been on recent X-ray crystal structure-related work. My understanding of the ribosomal structure was finally put into context, thanks to the following two recent publications:

One in Cell by James Williamson, titled "The Ribosome at Atomic Resolution".
Another one in Mol. Cell (in parallel and at the same time) by Joseph Puglisi, titled "Resolving the Elegant Architecture of the Ribosome".

These two papers not only summarized the significance of the work of the three Nobel laureates ("the atomic resolution structures of the ribosomal subunits provide an extraordinary context for understanding one of the most fundamental aspects of cellular function: protein synthesis") but also provided background information on decades of work by other players, including Harry Noller, Peter Moore, and Joachim Frank. Solving the ribosomal structure serves as a good example of how scientific research is both cooperative and competitive in nature.

Friday, December 11, 2009

Not all PDB entries are reliable; some could be plain fake

With interest, I have browsed the recent thread in the PDB mailing list (pdb-l), "Retraction of 12 Structures", posted by Michael Sadowski and followed up by Kevin Karplus et al. The story is about Krishna Murthy, a former scientist at the University of Alabama at Birmingham (UAB), who is alleged to have fabricated protein structures and published papers on them. Here is an informative comment by firebug36 from the above link:

I am a protein crystallographer myself, so just trust me - the results this gentleman [Murthy] published were falsified, and not in a smart way. The structures [for C3b] deposited in the Protein Data Bank made no physical sense.
Allegations against UAB group were first brought to light by several prominent people in the field, and not UAB officials:

http://www.nature.com/nature/journal/v448/n7154/full/nature06102.html

According to the post of Kevin Karplus, "several of the PDB files by Krishna Murthy's group were identified as problematic in the RosettaHoles paper". Naturally, then, comes the question, "should we remove ALL the PDB files from Krishna Murthy's group as suspect?"

The way Murthy's case came into the spotlight may represent an exception rather than the norm. Had he not published his C3b structure in Nature, where it caught the attention of leading crystallographers (Bert Janssen, Randy Read, Axel Brünger and Piet Gros), Murthy might still be publishing on protein structures today. In a sense, it is hard to believe that Murthy could falsify 12 protein structures and publish 9 papers in prestigious journals (including Nature, Cell, PNAS, JMB, Biochemistry, JBC, etc.) which have been cited 449 times.

PDB contains the state-of-the-art experimental data of bio-macromolecular structures. Yet, the archive is certainly full of inconsistencies/errors of various types. It would be helpful to know how many PDB entries are largely or partially wrong, and which can be taken as "gold standard" as far as data quality is concerned.

This case gives an excellent lesson for those performing data-mining on macromolecular structures. Nowadays, PDB structures are many and keep increasing rapidly, but they are clearly of varying quality. Structural bioinformatics is about solving biology problems using informatics tools. Thus knowing the caveats of your data (how reliable are they?) and tools (what are their limitations?) is a prerequisite for drawing sound scientific conclusions.

Sunday, December 6, 2009

3DNA in the PCCP nucleic acid simulations themed issue

While checking 3DNA-related citations through Web of Science for this past week, I found a total of nine citations, as follows:

Five times to the 3DNA 2003 NAR paper
Once to the 3DNA 2008 NP paper
Three times to the 2001 standard base reference frame paper

Most interestingly, all the citations are from the same nucleic acid simulations themed issue of Physical Chemistry Chemical Physics 11 (45). Honestly, I was quite (pleasantly) surprised by this, so I browsed the articles online. Edited by Charles Laughton and Modesto Orozco, the 2009 PCCP "themed issue exemplifies the rich diversity of cutting-edge research in the field of nucleic acids simulation." Indeed, quite a few well-known experts are among the authors of the two perspectives and 16 papers.

While not an "energetic" person myself, over the years I have been keeping an eye on MD simulations and MM calculations of nucleic acid structures. It is my pleasure to see that 3DNA is being widely used (certainly more than I originally expected) by the nucleic acid simulations community. Given time, and with a suitable collaborator, I am open to adapting 3DNA to currently available MD simulation packages to make life easier for practitioners in this "dynamic" field.

Saturday, December 5, 2009

See the effect of C preprocessing with the -E option of gcc

Recently, I wanted to better understand one part of a C program which is very generic, covering a lot of ground with preprocessing options (#if ... #endif). However, I only wanted to see the minimum that covers the section I cared about. I vaguely remembered, from reading the book "An Introduction to GCC" a while ago, that there is an option in the gcc compiler that stops the process right after the preprocessing step (just as the -c option stops after the compilation step). A quick check of "man gcc" revealed that it is -E:

-E Stop after the preprocessing stage; do not run the compiler proper. The output is in the form of preprocessed source code, which is sent to the standard output.

Input files which don't require preprocessing are ignored.

After setting the proper macros and running gcc with -E, I could immediately focus on the component to get what I wanted.

I have been using gcc for more than ten years, yet there are still more handy tricks to uncover.

Here is a simplified example showing the effect of the -E option. The following C function is saved in file "check_gcc_E.c".

--------------------------------------------------------------------------
#define GREETING "Hello World"

void check_gcc_E(void) {
#ifdef VERBOSE
   printf("Macro VERBOSE is defined.\n");
   printf("So you see: '%s'\n", GREETING);
#endif
   printf("Hello everyone!\n");
}
--------------------------------------------------------------------------

gcc -E -DVERBOSE check_gcc_E.c

--------------------------------------------------------------------------
# 1 "check_gcc_E.c"
# 1 "<built-in>"
# 1 "<command-line>"
# 1 "check_gcc_E.c"

void check_gcc_E(void) {

   printf("Macro VERBOSE is defined.\n");
   printf("So you see: '%s'\n", "Hello World");

   printf("Hello everyone!\n");
}
--------------------------------------------------------------------------

gcc -E check_gcc_E.c

--------------------------------------------------------------------------
# 1 "check_gcc_E.c"
# 1 "<built-in>"
# 1 "<command-line>"
# 1 "check_gcc_E.c"

void check_gcc_E(void) {
   printf("Hello everyone!\n");
}
--------------------------------------------------------------------------
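
The file above has no #include <stdio.h>, which keeps the -E output short but means it is not a complete translation unit for a strict compiler. As a minimal sketch (not the original file), here is a compilable variant; the nlines counter and the int return value are my own additions for illustration, so that the effect of defining VERBOSE can also be observed at run time:

```c
#include <stdio.h>

#define GREETING "Hello World"

/* Same logic as check_gcc_E.c above, but returns the number of lines
 * printed (an addition for illustration, not in the original file). */
int check_gcc_E(void)
{
    int nlines = 0;

#ifdef VERBOSE
    /* This branch survives preprocessing only when built with -DVERBOSE. */
    printf("Macro VERBOSE is defined.\n");
    printf("So you see: '%s'\n", GREETING);
    nlines += 2;
#endif
    printf("Hello everyone!\n");
    nlines += 1;

    return nlines;
}
```

Note that running gcc -E on this variant also dumps the expanded contents of stdio.h, so the function of interest appears only at the very end of the output.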

Friday, November 20, 2009

Registered COPPA Users in phpBB3

Recently, I received an email from a 3DNA user who had registered at the 3DNA forum, but could not see anything at all once logged in. The 3DNA forum is based on phpBB3, and has been running for over three years now. So at the very beginning, I thought: how could that be? I had never heard of any such problem/complaint from 3DNA forum registrants before. I even created a temporary test login account and found no problem. So I communicated with the user and asked her/him to log in using my test account, and again everything was fine!

To reproduce the problem, I logged in as the user, and found one thing suspicious: the user was in the group of "Registered COPPA Users", not the normal "Registered Users". I did not know what COPPA stood for. So I googled the phrase "Registered COPPA users" and the top hit led me to the phpBB3 documentation on Group Management, where the section I was interested in reads as follows:

Registered COPPA users are basically the same as registered users, except that they fall under the COPPA, or Child Online Privacy Protection Act, law, meaning that they are under the age of 13 in the U.S.A. Managing the permissions this usergroup has is important in protecting these users. COPPA doesn't apply to users living outside of the U.S.A. and can be disabled altogether.

So a registered COPPA user is, by definition, under the age of 13. By default phpBB3 does not even allow such a child to read any content! In the context of the 3DNA forum, this policy simply does not make any sense: the contents (in the public section) are viewable by anyone without registration.

It turned out that at the registration stage, the first question is: "To continue with the registration procedure please tell us when you were born." Two dynamically generated choices are given: one is "Before" a cutoff date, meaning an age over 13, and the other is "On or after" it, meaning below 13. Obviously the 3DNA user mentioned above clicked the wrong button.

Once I knew where the problem was and how it was created, fixing it was straightforward. Interestingly, when I then checked the 3DNA forum registered users, I found five of them were in the "Registered COPPA Users" group. Obviously, the previous (wrongly registered) users did not complain, possibly having lost interest in pursuing it further, so this issue did not surface until recently.

In the real world we live in, what seems simple may not be. Nothing should be taken for granted.

3DNA in PDB

As mentioned previously, PDB makes use of blocview (part of 3DNA) to generate the simple yet effective images for nucleic-acid-containing structures. That's the connection I knew of between 3DNA and PDB. By pure chance, however, I recently noticed the 3DNA entry in PDB — it is actually a protein structure, completely unrelated to the 3DNA software package!

Just out of curiosity, I browsed the abstract of the Liu et al. article, titled "Halogenated benzenes bound within a non-polar cavity in T4 lysozyme provide examples of I...S and I...Se halogen-bonding" [J Mol Biol. 2009 Jan 16;385(2):595-605]. I then downloaded the full PDF version of the paper and read it through carefully. This work studied binding interactions of benzenes with the internal cavity of the L99A mutant of T4 lysozyme. The authors demonstrated that the center of the phenyl ring can be shifted by more than one angstrom due to different halogen substitutions (where the 3DNA entry corresponds to C6H5I), and (further) proved the concept that "the protein is flexible and adapts to the size and shape of the ligand". At better than 2.0 Å resolution, they also observed the I...S and I...Se halogen-bonds.

I became interested in this paper not just because of the name of 3DNA, for which a quick browsing over the abstract would be sufficient. This paper also reminded me of an early article I published with the title "Influence of fluorine on aromatic interactions":

Non-covalent interactions between aromatic ligands influence the conformations of metal complexes, and the system [M(OAr)₂L₂] has been used to investigate the difference between phenyl–phenyl, phenyl–pentafluorophenyl and pentafluorophenyl–pentafluorophenyl interactions. X-Ray crystal structures show that pentafluorophenyl groups adopt partially stacked orientations with the two aromatic rings close to parallel and with significant π overlap. In contrast, phenyl groups are skewed away from each other with only edge-to-face contacts. Phenyl–pentafluorophenyl interactions adopt a coplanar fully stacked geometry. These results have been rationalised on the basis of energy calculations (carried out blind) using a variety of empirical models for treating weak non-covalent interactions. The major cause of the different behaviour of the three systems lies in the electrostatic interactions between the π systems.

Knowing the pattern of a PDB id (4 characters long: the first character is a numeral in the range 0-9, while the rest can be either numerals or letters), I played around with some other possible ids with my name initials in them. Indeed I found one, 1XJL, a protein structure of human annexin A2 in the presence of calcium ions. If you are biomolecular structure-oriented, why not try some ids of special meaning to you? You might be related to PDB in some unexpected way!
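
For those who want to screen candidate ids before looking them up, the pattern described above can be sketched in a few lines of C. This is only a syntactic check under the rule as stated in this post (a numeral first, then three numerals or letters); is_pdb_id is a hypothetical helper name, not part of any PDB tool, and passing the check says nothing about whether an entry with that id actually exists.

```c
#include <ctype.h>
#include <string.h>

/* Returns 1 if s looks like a PDB id as described in the post:
 * exactly 4 characters, a digit first, then 3 letters or digits.
 * Returns 0 otherwise. (Hypothetical helper, for illustration only.) */
int is_pdb_id(const char *s)
{
    size_t i;

    if (s == NULL || strlen(s) != 4)
        return 0;
    if (!isdigit((unsigned char) s[0]))
        return 0;
    for (i = 1; i < 4; i++)
        if (!isalnum((unsigned char) s[i]))
            return 0;
    return 1;
}
```

With this check, ids such as 1XJL, 3DNA and 1jj2 pass, while a string like XJL1 (letter first) or 1XJ (too short) is rejected.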