Saturday, December 19, 2009

ORCID -- an international research identification system?

From the Nature news article titled "Credit where credit is due" (462:7275, p. 825, December 17, 2009), I came across the ORCID initiative:
Name ambiguity and attribution are persistent, critical problems imbedded in the scholarly research ecosystem. The ORCID Initiative represents a community effort to establish an open, independent registry that is adopted and embraced as the industry’s de facto standard. Our mission is to resolve the systemic name ambiguity, by means of assigning unique identifiers linkable to an individual's research output, to enhance the scientific discovery process and improve the efficiency of funding and collaboration.
Overall, I think it is a good idea. If properly implemented and widely adopted, ORCID could help solve many of the issues that arise from the various ways of spelling a person's name due to, e.g., cultural differences. For example, in the Chinese convention, one's family name comes before one's given name, just the opposite of the Western convention. Additionally, when a given name has two characters (quite common), there could be a space, a hyphen (as I normally put in Xiang-Jun), or nothing in between. Combined with possible first-name initials, there are already many ways to spell out a Chinese name.

The above Nature article, “Credit Where Credit is Due”, helps introduce the ORCID Initiative. As a specific example, it points to another article on page 843, where Nature profiles a research group trying to "complete the reference human genome sequence, which is still full of errors nearly a decade after the first draft was announced in 2000." Nature acknowledges that "It is essential work", "But it is also work that offers few academic rewards beyond the satisfaction of a job well done — it is unlikely to result in a high profile publication." Hopefully, once the ORCID system is adopted, contributions of this type (e.g., software support and maintenance) will be more properly acknowledged by the scientific community.

Given the high profile of the founding parties, I am hopeful that the ORCID initiative will move forward as promised. I will keep an eye on it and see how it evolves.

Friday, December 18, 2009

Ribosomal structure: it helps to know some background information

This year's Nobel Prize in Chemistry has been awarded to Venki Ramakrishnan, Tom Steitz, and Ada Yonath "for studies of the structure and function of the ribosome."

My connection with ribosomal structure began with the 50S large subunit of Haloarcula marismortui solved by Tom Steitz's group. Ever since the fully refined crystal structure at 2.4 Å resolution was published in 2001 (PDB entry: 1jj2; NDB code: rr0033), I have been using it to check 3DNA's applicability. In the two 3DNA papers (2003 NAR and 2008 NP), 1jj2 was used as an example to illustrate how find_pair can identify higher-order base associations in complicated RNA-containing structures. At the time, though, my understanding of the ribosomal RNA structure was purely geometrical: for quite a while, I was overwhelmed by the biological terminology, including the various S-es: the 50S large ribosomal subunit vs. the 23S and 5S rRNA; and of course, the 30S small subunit vs. the 16S rRNA.

Over the past year or so, I have become more interested in RNA structures. After reading many related articles, I feel things are gradually becoming clearer. Nevertheless, something was still missing, since my focus has (mostly) been on recent X-ray crystal structure-related work. My understanding of the ribosomal structure was finally put into context, thanks to the following two recent publications:
These two papers not only summarized the significance of the work of the three Nobel laureates ("the atomic resolution structures of the ribosomal subunits provide an extraordinary context for understanding one of the most fundamental aspects of cellular function: protein synthesis") but also provided background information on decades of work by other players, including Harry Noller, Peter Moore, and Joachim Frank. Solving the ribosomal structure serves as a good example of how scientific research is both cooperative and competitive in nature.

Friday, December 11, 2009

Not all PDB entries are reliable; some could be plain fake

With interest, I have browsed the recent thread on the PDB mailing list (pdb-l), "Retraction of 12 Structures", posted by Michael Sadowski and followed up by Kevin Karplus et al. The story is about Krishna Murthy, a former scientist at the University of Alabama at Birmingham (UAB), who is alleged to have fabricated protein structures and published papers on them. Here is an informative comment by firebug36 from the above link:
I am a protein crystallographer myself, so just trust me - the results this gentleman [Murthy] published were falsified, and not in a smart way. The structures [for C3b] deposited in the Protein Data Bank made no physical sense.

Allegations against UAB group were first brought to light by several prominent people in the field, and not UAB officials:

http://www.nature.com/nature/journal/v448/n7154/full/nature06102.html

According to the post by Kevin Karplus, "several of the PDB files by Krishna Murthy's group were identified as problematic in the RosettaHoles paper". Naturally, then, comes the question, "should we remove ALL the PDB files from Krishna Murthy's group as suspect?"

The way Murthy's case came into the spotlight may be the exception rather than the norm. Had he not published his C3b structure in Nature, where it caught the attention of leading crystallographers (Bert Janssen, Randy Read, Axel Brünger and Piet Gros), Murthy might still be publishing protein structures today. In a sense, it is hard to believe that Murthy could falsify 12 protein structures and publish 9 papers in prestigious journals (including Nature, Cell, PNAS, JMB, Biochemistry, JBC, etc.), which have been cited 449 times.

PDB contains the state-of-the-art experimental data on bio-macromolecular structures. Yet the archive is certainly full of inconsistencies and errors of various types. It would be helpful to know how many PDB entries are largely or partially wrong, and which can be taken as the "gold standard" as far as data quality is concerned.

This case offers an excellent lesson for those performing data-mining on macromolecular structures. Nowadays, PDB structures are numerous and their number keeps increasing rapidly, but they are clearly of varying quality. Structural bioinformatics is about solving biological problems using informatics tools. Thus knowing the caveats of your data (how reliable are they?) and tools (what are their limitations?) is a prerequisite for drawing sound scientific conclusions.

Sunday, December 6, 2009

3DNA in the PCCP nucleic acid simulations themed issue

While checking 3DNA-related citations through Web of Science for this past week, I found a total of nine new ones, as follows:
  1. Five times to the 3DNA 2003 NAR paper
  2. Once to the 3DNA 2008 NP paper
  3. Three times to the 2001 standard base reference frame paper
Most interestingly, all the citations are from the same nucleic acid simulations themed issue of Physical Chemistry Chemical Physics 11 (45). Honestly, I was quite (pleasantly) surprised by this, so I browsed the articles online. Edited by Charles Laughton and Modesto Orozco, the 2009 PCCP "themed issue exemplifies the rich diversity of cutting-edge research in the field of nucleic acids simulation." Indeed, quite a few well-known experts are among the authors of the two perspectives and 16 papers.

While not an "energetic" person myself, over the years I have been keeping an eye on MD simulations and MM calculations of nucleic acid structures. It is my pleasure to see that 3DNA is being widely used (certainly more than I originally expected) by the nucleic acid simulations community. Given time, and with a suitable collaborator, I am open to adapting 3DNA to currently available MD simulation packages to make life easier for practitioners in this "dynamic" field.

Saturday, December 5, 2009

See the effect of C preprocessing with the -E option of gcc

Recently, I wanted to better understand one part of a C program that is very generic, covering a lot of ground with preprocessing options (#if ... #endif). However, I only wanted to see the minimum that covers the section I cared about. I vaguely remembered, from reading the book "An Introduction to GCC" a while ago, that the gcc compiler has an option that stops the process right after the preprocessing step (just as the -c option stops before the linking step). A quick check of "man gcc" revealed that it is -E:
-E Stop after the preprocessing stage; do not run the compiler proper. The output is in the form of preprocessed source code, which is sent to the standard output.

Input files which don't require preprocessing are ignored.

After setting the proper macros and running gcc with -E, I could immediately focus on the component to get what I wanted.

I have been using gcc for more than ten years, yet there are still handy tricks to uncover.


Here is a simplified example showing the effect of the -E option. The following C function is saved in file "check_gcc_E.c".

--------------------------------------------------------------------------
#define GREETING "Hello World"

void check_gcc_E(void) {
#ifdef VERBOSE
printf("Macro VERBOSE is defined.\n");
printf("So you see: '%s'\n", GREETING);
#endif
printf("Hello everyone!\n");
}
--------------------------------------------------------------------------

gcc -E -DVERBOSE check_gcc_E.c

--------------------------------------------------------------------------
# 1 "check_gcc_E.c"
# 1 "<built-in>"
# 1 "<command-line>"
# 1 "check_gcc_E.c"

void check_gcc_E(void) {

printf("Macro VERBOSE is defined.\n");
printf("So you see: '%s'\n", "Hello World");

printf("Hello everyone!\n");
}
--------------------------------------------------------------------------

gcc -E check_gcc_E.c

--------------------------------------------------------------------------
# 1 "check_gcc_E.c"
# 1 "<built-in>"
# 1 "<command-line>"
# 1 "check_gcc_E.c"

void check_gcc_E(void) {
printf("Hello everyone!\n");
}
--------------------------------------------------------------------------

Friday, November 20, 2009

Registered COPPA Users in phpBB3

Recently, I received an email from a 3DNA user who had registered at the 3DNA forum, but could not see anything at all once logged in. The 3DNA forum is based on phpBB3, and has been running for over three years now. So at the very beginning, I thought: how could that be? I had never heard of any such problem or complaint from 3DNA forum registrants before. I even created a temporary test login account and found no problem. So I communicated with the user and asked her/him to log in using my test account, and again everything was fine!

To reproduce the problem, I logged in as the user, and found one thing suspicious: the user was in the group of "Registered COPPA Users", not the normal "Registered Users". I did not know what COPPA stood for. So I googled the phrase "Registered COPPA users" and the top hit led me to the phpBB3 documentation on Group Management, where the section I was interested in reads as follows:
Registered COPPA users are basically the same as registered users, except that they fall under the COPPA, or Child Online Privacy Protection Act, law, meaning that they are under the age of 13 in the U.S.A. Managing the permissions this usergroup has is important in protecting these users. COPPA doesn't apply to users living outside of the U.S.A. and can be disabled altogether.

So a registered COPPA user is, by definition, under the age of 13. By default phpBB3 does not even allow such a child to read any content! In the context of the 3DNA forum, this policy simply does not make any sense: the contents (in the public section) are viewable by anyone without registration.

It turned out that at the registration stage, the first question is: "To continue with the registration procedure please tell us when you were born." Two dynamically generated choices are given: "Before" a date that defines an age over 13, and "On or after" that date for an age below 13. Evidently the 3DNA user mentioned above clicked the wrong button.

Once I knew where the problem was and how it had been created, fixing it was straightforward. Interestingly, when I then checked the 3DNA forum registered users, I found five of them in the "Registered COPPA Users" group. Apparently, the previous (wrongly registered) users did not complain, possibly having lost interest in pursuing it further, so this issue did not surface until recently.

In the real world we live in, what seems simple may not be. Nothing should be taken for granted.

3DNA in PDB

As mentioned previously, PDB makes use of blocview (part of 3DNA) to generate the simple yet effective images for nucleic-acid-containing structures. That's the connection I knew of between 3DNA and PDB. By pure chance, however, I recently noticed the 3DNA entry in PDB — it is actually a protein structure, completely unrelated to the 3DNA software package!

Just out of curiosity, I browsed the abstract of the Liu et al. article, titled "Halogenated benzenes bound within a non-polar cavity in T4 lysozyme provide examples of I...S and I...Se halogen-bonding" [J Mol Biol. 2009 Jan 16;385(2):595-605]. I then downloaded the full PDF version of the paper and read it through carefully. This work studied binding interactions of benzenes with the internal cavity of the L99A mutant of T4 lysozyme. The authors demonstrated that the center of the phenyl ring can be shifted by more than one angstrom due to different halogen substitutions (where the 3DNA entry corresponds to C6H5I), and (further) proved the concept that "the protein is flexible and adapts to the size and shape of the ligand". At better than 2.0 Å resolution, they also observed the I...S and I...Se halogen bonds.

I became interested in this paper not just because of the name 3DNA, for which a quick browse of the abstract would have been sufficient. This paper also reminded me of an early article I published with the title "Influence of fluorine on aromatic interactions":
Non-covalent interactions between aromatic ligands influence the conformations of metal complexes, and the system [M(OAr)2L2] has been used to investigate the difference between phenyl–phenyl, phenyl–pentafluorophenyl and pentafluorophenyl–pentafluorophenyl interactions. X-Ray crystal structures show that pentafluorophenyl groups adopt partially stacked orientations with the two aromatic rings close to parallel and with significant π overlap. In contrast, phenyl groups are skewed away from each other with only edge-to-face contacts. Phenyl–pentafluorophenyl interactions adopt a coplanar fully stacked geometry. These results have been rationalised on the basis of energy calculations (carried out blind) using a variety of empirical models for treating weak non-covalent interactions. The major cause of the different behaviour of the three systems lies in the electrostatic interactions between the π systems.

Knowing the pattern of a PDB id (4 characters long: the first character is a numeral in the range 0-9, while the rest can be either numerals or letters), I played around with some other possible ids with my name initials in them. Indeed I found one, 1XJL, a protein structure of human annexin A2 in the presence of calcium ions. If you are biomolecular-structure oriented, why not try some ids of special meaning to you? You might be related to PDB in some unexpected way!

Saturday, November 14, 2009

How does shear affect the twist angle of a dinucleotide step?

A recent post in the 3DNA forum, titled "NUPARM vs X3DNA twist values", made me rethink the issue of how or why shear affects the twist angle of a dinucleotide step.

To me, this problem has long been solved as demonstrated by the following two well-cited publications:
  1. The Tsukuba report, a.k.a. "A Standard Reference Frame for the Description of Nucleic Acid Base-pair Geometry". When Dr. Olson and I were drafting this report, I clearly felt the need to caution the community about the intrinsic correlations between base-pair parameters and the associated step parameters (Figure 3 there) to avoid possible misinterpretations in structural analysis. This is especially the case for the effect of shear on twist, since the G–U wobble base-pair is common in RNA and it has a ~2.0 Å shear.

  2. The 3DNA 2003 NAR paper. There is a subsection on the "Treatment of non-Watson–Crick base pairing motifs", and Figure 3 specifically addresses the issue:
    "Large Shear of the G–U wobble base pair influences the calculated but not the ‘observed’ Twist. The 3DNA numerical values of Twist, 20° (top) and 43° (bottom), differ from the visualization of nearly equivalent Twist suggested by the angle between successive C1'···C1' vectors (finely dotted lines)."
It was thus a bit surprising that such a question is still popping up. On second thought, however, it is quite understandable: one cannot expect everyone to read those two papers, not to mention remember such details. So I am glad that this question was brought to my attention, and it made me think about possible ways to document more thoroughly the many 3DNA-related "technical details" that are crucial for a better understanding of nucleic acid structures.

Coming back to the effect of shear on the twist angle, the figure at the left shows a G–U wobble pair example (top), and a simple rationale: the base-pair is approximately 10 Å by 5 Å (as defined in SCHNArP/3DNA), so a 2 Å shift will lead to an angle:
atan2(2, 10) * 180 / pi = 11.3 degrees
(i.e., the red dotted line relative to the bottom horizontal line).

To a first-order approximation, that is the difference between using RC8–YC6 (or C1'–C1') and using the base-centered mean y-axis of the pair for calculating the twist angle. So whenever one has a G–U wobble pair next to a normal Watson–Crick pair, there would be a ~11 degree difference in the "calculated" twist angle between the two approaches (NewHelix/CEHS/SCHNAaP/NUPARM vs 3DNA/Curves+). Moreover, when a G–U wobble is next to a U–G wobble pair, the difference would be doubled to ~23 degrees!

It is worth mentioning that the issue here (as in other similar cases) is not which number is "correct" or which is "wrong": a number is a number. It is its interpretation that matters, and it is here that "details" do count.

Sunday, November 8, 2009

It's sad to hear that Warren DeLano, author of PyMOL, passed away

From a couple of mailing lists, I heard the sad news that Warren DeLano, author of PyMOL, passed away on Tuesday morning, November 3rd. He was only 37!

I never met Dr. DeLano personally, nor did I even communicate with him by email, but I am very aware of PyMOL, the de facto standard nowadays for molecular graphics. In writing the 3DNA Nature Protocols paper, I dug more deeply into PyMOL. I was impressed by its interactive interface to .r3d files (Raster3D) and the high-quality ray-traced images it produced. So I came up with a Perl script (x3dna_r3d2png) to convert automatically from a 3DNA-generated .r3d file to a PNG image through the PyMOL engine.

Through his seminal contributions to PyMOL, Dr. DeLano achieved something very few others in computational chemistry/biology can match: he successfully mobilized literally thousands of software programmers and ordinary users from multiple disciplines to join him in producing phenomenal pictures, each of which is worth a thousand words!

It was due to Dr. DeLano's vision that PyMOL was made open source, so the community now has the opportunity to continue supporting and further improving the software. At this stage, however, no one is likely to know the PyMOL code to the depth Dr. DeLano did, not to mention the leadership and enthusiasm that he brought to the project. Whatever the case, the community will undoubtedly appreciate Dr. DeLano's valuable contributions.

Thanks, Dr. DeLano, for bringing PyMOL to the world!