Xiang-Jun's Corner

Friday, November 20, 2009

Registered COPPA Users in phpBB3

Recently, I received an email from a 3DNA user who had registered at the 3DNA forum but could not see anything at all once logged in. The 3DNA forum is based on phpBB3 and has been running for over three years now. My first reaction was: how could that be? I had never heard of any such problem or complaint from 3DNA forum registrants before. I even created a temporary test account and found no problem. So I asked the user to log in using my test account, and again everything was fine!

To reproduce the problem, I logged in as the user and found one suspicious thing: the user was in the "Registered COPPA Users" group, not the normal "Registered Users" group. I did not know what COPPA stood for, so I googled the phrase "Registered COPPA users". The top hit led me to the phpBB3 documentation on Group Management, and the relevant section reads as follows:

Registered COPPA users are basically the same as registered users, except that they fall under the COPPA, or Child Online Privacy Protection Act, law, meaning that they are under the age of 13 in the U.S.A. Managing the permissions this usergroup has is important in protecting these users. COPPA doesn't apply to users living outside of the U.S.A. and can be disabled altogether.

So a registered COPPA user is, by definition, under the age of 13. By default, phpBB3 does not even allow such a child to read any content! In the context of the 3DNA forum, this policy simply does not make any sense, since the contents (in the public section) are viewable by anyone without registration.

It turned out that at the registration stage, the first question is: "To continue with the registration procedure please tell us when you were born." Two dynamically generated options are given: "Before" a date that defines an age over 13, and "On or after" that date for an age below 13. Obviously the 3DNA user mentioned above clicked the wrong button.

Once I knew where the problem was and how it was created, fixing it was straightforward. Interestingly, when I then checked the 3DNA forum registered users, I found that five of them were in the "Registered COPPA Users" group. Apparently, the earlier (mis-registered) users did not complain, possibly having lost interest in pursuing it further, so this issue did not surface until recently.

In the real world we live in, what seems simple may not be. Nothing should be taken for granted.

3DNA in PDB

As mentioned previously, PDB makes use of blocview (part of 3DNA) to generate the simple yet effective images for nucleic-acid-containing structures. That's the connection I knew of between 3DNA and PDB. By pure chance, however, I recently noticed the 3DNA entry in PDB — it is actually a protein structure, completely unrelated to the 3DNA software package!

Just out of curiosity, I browsed the abstract of the Liu et al. article, titled "Halogenated benzenes bound within a non-polar cavity in T4 lysozyme provide examples of I...S and I...Se halogen-bonding" [J Mol Biol. 2009 Jan 16;385(2):595-605]. I then downloaded the full PDF version of the paper and read it through carefully. This work studied the binding interactions of benzenes with the internal cavity of L99A mutated T4 lysozyme. The authors demonstrated that the center of the phenyl ring can be shifted by more than one angstrom due to different halogen substitutions (where the 3DNA entry corresponds to C6H5I), and (further) proved the concept that "the protein is flexible and adapts to the size and shape of the ligand". At better than 2.0 Å resolution, they also observed the I...S and I...Se halogen-bonds.

I became interested in this paper not just because of the name 3DNA, for which a quick browse of the abstract would have been sufficient. The paper also reminded me of an early article I published, titled "Influence of fluorine on aromatic interactions":

Non-covalent interactions between aromatic ligands influence the conformations of metal complexes, and the system [M(OAr)₂L₂] has been used to investigate the difference between phenyl–phenyl, phenyl–pentafluorophenyl and pentafluorophenyl–pentafluorophenyl interactions. X-Ray crystal structures show that pentafluorophenyl groups adopt partially stacked orientations with the two aromatic rings close to parallel and with significant π overlap. In contrast, phenyl groups are skewed away from each other with only edge-to-face contacts. Phenyl–pentafluorophenyl interactions adopt a coplanar fully stacked geometry. These results have been rationalised on the basis of energy calculations (carried out blind) using a variety of empirical models for treating weak non-covalent interactions. The major cause of the different behaviour of the three systems lies in the electrostatic interactions between the π systems.

Knowing the pattern of a PDB id (4 characters long: the first character is a numeral in the range 0-9, while the rest can be either numerals or letters), I played around with some other possible ids containing my name initials. Indeed I found one, 1XJL, a protein structure of human annexin A2 in the presence of calcium ions. If you are biomolecular-structure oriented, why not try some ids of special meaning to you? You might be related to PDB in some unexpected way!
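The id pattern described above is easy to express as a regular expression. The following is only a minimal Python sketch: the names `PDB_ID_PATTERN`, `looks_like_pdb_id` and `candidate_ids` are hypothetical, and the check is purely syntactic, following the rule as stated in this post. It says nothing about whether an entry with that id actually exists in PDB.

```python
import re

# Pattern as described in the post: one numeral followed by three
# characters that can each be a numeral or a letter.
PDB_ID_PATTERN = re.compile(r"[0-9][0-9A-Za-z]{3}")


def looks_like_pdb_id(text):
    """Return True if text has the shape of a PDB id (syntax only)."""
    return PDB_ID_PATTERN.fullmatch(text) is not None


def candidate_ids(initials):
    """List the ids formed by one numeral followed by three-character initials."""
    initials = initials.upper()
    ids = [digit + initials for digit in "0123456789"]
    # Keep only well-formed ids (drops everything if initials are not 3 alphanumerics).
    return [pdb_id for pdb_id in ids if looks_like_pdb_id(pdb_id)]


print(looks_like_pdb_id("3DNA"))  # True
print(looks_like_pdb_id("XJL1"))  # False: first character is not a numeral
print(candidate_ids("xjl"))       # ten candidates, 0XJL through 9XJL
```

Each candidate would still need to be looked up in PDB by hand to see whether it corresponds to a deposited structure.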

Saturday, November 14, 2009

How does shear affect the twist angle of a dinucleotide step?

A recent post in the 3DNA forum, titled "NUPARM vs X3DNA twist values", made me rethink the issue of how or why shear affects the twist angle of a dinucleotide step.

To me, this problem has long been solved as demonstrated by the following two well-cited publications:

The Tsukuba report, a.k.a. "A Standard Reference Frame for the Description of Nucleic Acid Base-pair Geometry". When Dr. Olson and I were drafting this report, I clearly felt the need to caution the community about the intrinsic correlations between base-pair parameters and the associated step parameters (Figure 3 there) to avoid possible misinterpretations in structural analysis. This is especially the case for the effect of shear on twist, since the G–U wobble base-pair is common in RNA and it has a ~2.0 Å shear.

The 3DNA 2003 NAR paper. There is a subsection on the "Treatment of non-Watson–Crick base pairing motifs", and Figure 3 specifically addresses the issue:
"Large Shear of the G–U wobble base pair influences the calculated but not the ‘observed’ Twist. The 3DNA numerical values of Twist, 20° (top) and 43° (bottom), differ from the visualization of nearly equivalent Twist suggested by the angle between successive C1'···C1' vectors (finely dotted lines)."

It was thus a bit surprising that such a question is still popping up. On second thought, however, it is quite understandable: one cannot expect everyone to read those two papers, not to mention remembering such details. So I am glad that this question was brought to my attention, and it got me thinking about possible ways to document more thoroughly the many 3DNA-related "technical details" that are crucial for a better understanding of nucleic acid structures.

Coming back to the issue of shear and twist angle, the figure at the left shows a G–U wobble pair example (top), and a simple rationale: the base-pair is approximately 10 Å by 5 Å (as defined in SCHNArP/3DNA), so a 2 Å shear will lead to an angle:

atan2(2, 10) * 180 / pi = 11.3 degrees

(i.e., the red dotted line relative to the bottom horizontal line).

To a first-order approximation, that is the difference between using RC8–YC6 (or C1'–C1') and using the base-centered mean y-axis of the pair for calculating the twist angle. So whenever one has a G–U wobble pair next to a normal Watson-Crick pair, there would be ~11 degrees difference in the "calculated" twist angle between the two approaches (NewHelix/CEHS/SCHNAaP/NUPARM vs 3DNA/Curves+). Moreover, when a G–U wobble is next to a U–G wobble pair, the difference would be doubled to ~23 degrees!
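The back-of-the-envelope estimate above can be reproduced in a few lines. This is only a rough Python sketch of the first-order geometric argument, not of how any of the programs mentioned actually compute twist. The function name `shear_twist_offset_deg` and the default 10 Å base-pair length are assumptions taken from the simplified rectangle picture.

```python
import math


def shear_twist_offset_deg(shear, bp_length=10.0):
    """First-order angular offset (degrees) caused by a shear (in Å)
    across a base-pair block of the given length (in Å)."""
    return math.degrees(math.atan2(shear, bp_length))


# One G-U wobble pair (~2 Å shear) next to a Watson-Crick pair.
single = shear_twist_offset_deg(2.0)
# A G-U wobble next to a U-G wobble: the two offsets add up.
double = 2 * single

print(f"{single:.1f}")  # 11.3
print(f"{double:.1f}")  # 22.6, i.e. roughly 23 degrees
```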

It is worth mentioning that the issue here (as in other similar cases) is not which number is "correct" or which is "wrong": a number is a number. It is its interpretation that matters, and it is here that "details" do count.

Sunday, November 8, 2009

It's sad to hear that Warren DeLano, author of PyMOL, passed away

From a couple of mailing lists, I heard the sad news that Warren DeLano, author of PyMOL, passed away on Tuesday morning, November 3rd. He was only 37!

I never met Dr. DeLano personally, nor did I ever communicate with him by email, but I am very aware of PyMOL, the de facto standard nowadays for molecular graphics. In writing the 3DNA Nature Protocols paper, I dug more deeply into PyMOL. I was impressed by its interactive interface to .r3d files (Raster3D) and the high-quality ray-traced images it produced. So I came up with a Perl script (x3dna_r3d2png) to convert automatically from a 3DNA-generated .r3d file to a PNG image through the PyMOL engine.

Through his seminal contributions to PyMOL, Dr. DeLano achieved something very few others in computational chemistry/biology can match: he successfully mobilized literally thousands of software programmers and ordinary users from multiple disciplines to join him in producing phenomenal pictures, each of which is worth a thousand words!

It was due to Dr. DeLano's vision that he made PyMOL open source, so the community now has the opportunity to continue supporting and further improving the software. At this stage, however, no one is likely to know the PyMOL code to the depth Dr. DeLano did, not to mention the leadership and enthusiasm that he brought to the project. Whatever the case, the community will undoubtedly appreciate Dr. DeLano's valuable contributions.

Thanks, Dr. DeLano, for bringing PyMOL to the world!

Saturday, October 31, 2009

How to calculate torsion angle?

Given the x-, y-, and z-coordinates of four points in 3-dimensional (3D) space, how does one calculate the torsion angle? Overall, this is a well-solved problem in structural biology, and one can find detailed descriptions of the algorithm in textbooks and on-line documents. The algorithm for calculating the torsion angle is implemented in virtually every software package in structural chemistry or biology.

Basic as it is, however, in my experience, it is very important to have a detailed appreciation of how the method works in order to really get into the 3D world. Here is a worked example using Octave/Matlab of my simplified, geometry-based implementation of how to calculate torsion angle, including how to determine its sign. No theory or (complicated) mathematical formula, just a step-by-step illustration of how I solve this problem.

Coordinates of four points in variable abcd:

abcd = [ 21.350  31.325  22.681
22.409  31.286  21.483
22.840  29.751  21.498
23.543  29.175  22.594 ];

Two auxiliary functions: norm_vec() to normalize a vector; get_orth_norm_vec() to get the orthogonal component (normalized) of a vector with reference to another vector, which should have been normalized.
```
function ovec = norm_vec(vec)
  ovec = vec / norm(vec);
endfunction

function ovec = get_orth_norm_vec(vec, vref)
  temp = vec - vref * dot(vec, vref);
  ovec = norm_vec(temp);
endfunction
```

Get three vectors: b_c is the normalized vector b→c; b_a_orth is the orthogonal component (normalized) of vector b→a with reference to b→c; c_d_orth is similarly defined, as the orthogonal component (normalized) of vector c→d with reference to b→c.

b_c = norm_vec(abcd(3, :) - abcd(2, :))
  % [0.2703158  -0.9627257   0.0094077]
b_a_orth = get_orth_norm_vec(abcd(1, :) - abcd(2, :), b_c)
  % [-0.62126  -0.16696   0.76561]
c_d_orth = get_orth_norm_vec(abcd(4, :) - abcd(3, :), b_c)
  % [0.41330   0.12486   0.90199]

Now the torsion angle is the angle between the two vectors, b_a_orth and c_d_orth, and can be easily calculated from their dot product. The sign of the torsion angle is determined by the orientation of the cross product of the same two vectors relative to the middle vector b→c. Here they point in opposite directions, thus the torsion angle is negative.
```
angle_deg = acos(dot(b_a_orth, c_d_orth)) * 180 / pi
  % 65.609
% named sign_val to avoid shadowing the built-in sign() function
sign_val = dot(cross(b_a_orth, c_d_orth), b_c)
  % -0.91075
if (sign_val < 0)
  angle_deg = -angle_deg  % -65.609
endif
```
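For readers without Octave/Matlab at hand, the same steps can be sketched in plain Python using only the standard library. This is a minimal port of the worked example above, not a reference implementation. The helper names (`sub`, `dot`, `cross`, `normalize`, `orth_component`, `torsion_deg`) are mine, and the clamping of the cosine is an added safeguard against rounding error that is not in the Octave version.

```python
import math


def sub(u, v):
    return tuple(a - b for a, b in zip(u, v))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cross(u, v):
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])


def normalize(v):
    length = math.sqrt(dot(v, v))
    return tuple(a / length for a in v)


def orth_component(v, ref):
    """Normalized component of v orthogonal to the unit vector ref."""
    proj = dot(v, ref)
    return normalize(tuple(a - proj * r for a, r in zip(v, ref)))


def torsion_deg(a, b, c, d):
    """Signed torsion angle a-b-c-d in degrees."""
    b_c = normalize(sub(c, b))
    b_a_orth = orth_component(sub(a, b), b_c)
    c_d_orth = orth_component(sub(d, c), b_c)
    # Clamp to [-1, 1] so rounding error cannot break acos().
    cos_angle = max(-1.0, min(1.0, dot(b_a_orth, c_d_orth)))
    angle = math.degrees(math.acos(cos_angle))
    # Negative when the cross product points against b->c.
    if dot(cross(b_a_orth, c_d_orth), b_c) < 0:
        angle = -angle
    return angle


abcd = [(21.350, 31.325, 22.681),
        (22.409, 31.286, 21.483),
        (22.840, 29.751, 21.498),
        (23.543, 29.175, 22.594)]

print(f"{torsion_deg(*abcd):.3f}")  # -65.609
```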

A related concept is the so-called dihedral angle, or more generally the angle between two planes. As long as the normal vectors to the two corresponding planes are defined, the angle between them is easy to work out.
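To illustrate the normal-vector route, here is a minimal Python sketch with hypothetical helper names (`plane_normal`, `angle_between_planes_deg`). It assumes each plane is defined by three points. It takes the two planes a-b-c and b-c-d from the torsion example above, so the unsigned result should agree with the magnitude of the torsion angle.

```python
import math


def sub(u, v):
    return tuple(a - b for a, b in zip(u, v))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cross(u, v):
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])


def plane_normal(p, q, r):
    """Unit normal of the plane through points p, q, r (in that order)."""
    n = cross(sub(q, p), sub(r, q))
    length = math.sqrt(dot(n, n))
    return tuple(x / length for x in n)


def angle_between_planes_deg(n1, n2):
    """Unsigned angle (0 to 180 degrees) between two unit normals."""
    cos_angle = max(-1.0, min(1.0, dot(n1, n2)))
    return math.degrees(math.acos(cos_angle))


a = (21.350, 31.325, 22.681)
b = (22.409, 31.286, 21.483)
c = (22.840, 29.751, 21.498)
d = (23.543, 29.175, 22.594)

n_abc = plane_normal(a, b, c)
n_bcd = plane_normal(b, c, d)
plane_angle = angle_between_planes_deg(n_abc, n_bcd)
print(f"{plane_angle:.3f}")  # 65.609
```

Note that the value depends on how the normals are oriented: flipping one normal turns the angle into its supplement, so the order of the points defining each plane matters.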

Moreover, the method to calculate twist angles of helical nucleic acid structures in SCHNAaP and 3DNA is essentially the same.

Friday, October 30, 2009

Upgrade to Ubuntu 9.10

I have now upgraded to Ubuntu 9.10 Karmic Koala, released on October 29 by Canonical Ltd. The system is now up and running, even though there are still some (minor) issues to be resolved. Overall, it was an exciting exploration, and the Internet and Google search were essential for solving most of the problems.

Normally, I am not that quick to catch up with a new software release; I wait a few weeks until most initial bugs have been fixed. I was quick this time mainly because I had been trapped in an earlier 9.10 alpha release (in the development branch) when I tried to fix a printing problem (without success). Worse yet, my VirtualBox for running Windows XP suddenly did not start, SCIM Chinese input stopped functioning, etc. It cost me quite some time and trouble to get my Linux box back to work. So over the past few months, I did not perform any updates, but eagerly waited for the stable release. It is a relief that the new official 9.10 release does solve my problems.

With Ubuntu 9.10, I now have access to OpenOffice 3.1.1 and LyX 1.6.4 for text processing, GNU Emacs 22.2.1, and GCC 4.4.1, among other things. It is nice to stay current, not only in science, but also in IT.

It is worth mentioning that Ubuntu is "an ethical concept of African origin emphasizing community, sharing and generosity." It is amazing how useful and robust the free, open-source software can be!

Friday, October 23, 2009

Sharing published data?

A letter in the recent issue of Science [VOL 326 23 OCTOBER 2009] titled "The Antidote to Bias in Research" by Allison mentioned the Nature Genetics article on the widespread non-repeatability of published microarray gene expression studies. It also touched on another recent study titled "Empirical Study of Data Sharing by Authors Publishing in PLoS Journals" by Savage and Vickers [PLoS ONE 4(9): e7078]:

In another study, refusal to share data despite policies requiring sharing was nearly ubiquitous among authors publishing in Public Library of Science journals.

I then read the Savage and Vickers article, and was surprised to find that only one author out of 10 sent them an original data set. Therefore, the authors wrote:

In conclusion, our findings suggest that explicit journal policies requiring data sharing do not lead to authors making their data sets available to independent investigators.

I can understand why the repeatability rate reported in the Nature Genetics paper is so low, given the many details associated with a published table or figure. However, it is a bit hard to explain why it is so difficult to share just the original dataset associated with a published work. I would imagine that authors would be honored that others care about their publications and contact them directly.

In my experience, I have also frequently been simply ignored when asking for clarifications of some details in published work, or when requesting datasets. The highest success rate is in asking corresponding authors for PDF reprints.

If the basic principles of scientific publication were strictly enforced, even just by the big journals (so many, nowadays), a lot of unfounded big claims would be gone or could be easily seen through. Well, I know that's only possible in an ideal world.

How to quantify the relative geometry between two base-pairs? — Part 2

In Part 1 of this series, and indeed on other occasions in my blog, the 3DNA home page and forum, I have credited several related programs from which 3DNA has benefited. Before going into details on how the various parameters are calculated in 3DNA, however, I feel it is in order to make clear my philosophy on creating scientific software in general, and 3DNA in particular.

Some underlying considerations

Whenever possible/applicable, I always try to get a thorough understanding of the related software available. By thorough I mean at the source code level, to see how an algorithm is implemented. The benefits of doing this are severalfold:
1. Not to reinvent the wheel, but to build upon previous work.
2. It is the most effective way to learn: many math formulae or text descriptions of algorithms, while useful for getting general ideas, are lacking in detail or vague in nature. In bioinformatics, most of the fundamental mathematics is not new. It is more about a novel combination of various known parts, applied to a specific problem. Reading the source code is the only way to see the implementation details unambiguously.
3. With a clear understanding of how the parameters are actually calculated, one can make better use of a software tool, know its limitations and thus avoid misinterpretations.
4. To make objective, convincing (even to the original authors) comparisons/comments on existing tools, and importantly, to create something better if needed.

Specifically, 3DNA has benefited the most from my SCHNAaP/SCHNArP programs. The underlying algorithms in 3DNA for calculating the various structural parameters (propeller, buckle, shear, roll, slide, x-displacement, inclination etc) are exactly the same as there: analyze and rebuild are direct derivatives of SCHNAaP (a for analysis) and SCHNArP (r for rebuilding/reconstruction), respectively. The ideas of standard stacking diagrams, the Zp parameter, building atomic models with sugar-phosphate backbone, and the base/bp rectangular block representations are also from SCHNAaP/SCHNArP. I also borrowed ideas from Babcock's RNA, Bansal's NUPARM, Dickerson's NewHelix/FreeHelix etc., which will be mentioned explicitly in the following sections.
Code-wise, 3DNA was created from scratch in strict ANSI C (over 25K lines). In addition to implementing a unique combination of existing algorithms, I have added many significant novel methods, including the find_pair program. However, being feature-rich has never been my goal. Instead, I only consider adding new functions that I understand clearly and find useful. On the other hand, I am always quick to fix bugs.

Overall, it seems that I am an outlier rather than the norm in scientific programming. The reasons could be as follows:

1. It is hard to understand other people's implementations of algorithms, since one needs to read between the lines. This is especially the case with an unfamiliar computer language, or with undocumented or bad code. It is easier to create one's own program than to understand and modify third-party software.
2. For publication purposes, it is more attractive to work on something "new" instead of building on others' work.

Thus, in the bioinformatics field, there are far more "new" methods papers published than refinements of previous ones. However, what is claimed or appears to be novel may not be so under the hood, due to the different terminologies used in different fields. Moreover, while it is easy to create a self-claimed "new" program, others could simply do the same. In my experience, it is well worth the effort to really understand the leading programs if one is serious about getting into an informatics field: it was really revealing and exciting when I went through CEHS, NewHelix/FreeHelix, RNA, Curves etc. Then I knew clearly what "the state of the art" of the field was and how I could do better: that's exactly how 3DNA came about!

In the following sections, I will provide details on how 3DNA calculates the two sets of parameters to quantify the relative geometry of a dinucleotide step.

Sunday, October 18, 2009

Duplicate tab function in Adobe Reader 9.1 is very handy

In reading a scientific paper, one often needs to jump back and forth to the tables, figures and references referred to in the text. These days, PDF is the standard way to share an e-document, and Acroread is the norm for viewing its content. I used to open two copies of the same document in two instances of Acroread (or one Acroread, one xpdf) so I could read continuously in one while moving around in the other, mostly at the end for references. This works, but is obviously less than ideal.

One day, purely by chance, I (mouse) right-clicked the tab for the document I was viewing, and noticed that it popped up with two options: Detach Tab and Duplicate Tab. Clicking Duplicate Tab led to duplication of the same document in another tab. Actually, this process can be repeated more than once, thus allowing for multiple views of the same document simultaneously. Very neat!

Nowadays, whenever I read a scientific publication in Acroread, I often duplicate the tab to have two views. It is only a click away to switch between the two. Thus when a citation is referred to in the main text, I can immediately see in the reference section what it is about.

A word of caution: I am using Ubuntu 9.04, with Acroread v9.1.2 05/25/2009. I have no idea in which version and on what platform this function was added.