Xiang-Jun's Corner

Friday, April 16, 2010

3DNA in the June 2010 issue of JBSD on "Current Perspectives on Nucleosome Positioning"

While updating 3DNA citations this week, I noticed five of them are from the same June 2010 issue of JOURNAL OF BIOMOLECULAR STRUCTURE & DYNAMICS (JBSD), which is focused on "Current Perspectives on Nucleosome Positioning". Most of the papers in this issue are from well-known laboratories in computational structural biology. It is my pleasure to see 3DNA being widely used in the important research area of nucleosome positioning.

I browsed through the abstracts of all the papers in this JBSD issue to refresh my knowledge of this field. While DNA sequence surely plays some role in nucleosome positioning, I remain to be convinced of the existence of a nucleosome "code" (yet) in the sense of the generally applicable "genetic code". Overall, DNA is so flexible and the signal is so weak that data fitting can be tailored to a specific analysis, with results that are not transferable to other situations. Clearly, the area is hot, yet still wide open.

Thursday, April 8, 2010

NSMB editorial: "Making your point-by-point"

In the April 2010 issue of Nature Structural & Molecular Biology [NSMB, 17(4)], there is another interesting editorial, titled "Making your point-by-point". This editorial addresses an important issue in the process of publishing papers in peer-reviewed journals, that is: how to make an effective point-by-point response to "those ever-demanding editors and reviewers"?

Overall, it can be helpful to put yourself in the reviewer’s shoes and compose a response s/he would find appropriate, where the concerns raised are considered and fully addressed. In its ideal state, the review process is a positive and constructive back and forth, an intellectual discussion in which the manuscript is the ultimate beneficiary.

Here is my recap of the main points, as I understand them. I am also taking this opportunity to read this one-page editorial one more time.

What to do?

Keep to the point – "makes a series of [succinct] points in response [directly] to each point raised by the reviewers."
Keep it objective – be diplomatic in your point-by-point response to the reviewers, "even if the reviewer’s wording might have seemed overly strong." You could be forthright in your cover letter to the editors, though.
Keep things under control – "Know when to go to the bench and when to argue."
The scope of things – "Say clearly and succinctly" when "some requests might genuinely be beyond the scope of the manuscript or might simply be unfeasible." "Try not to salami-slice": one strong and solid paper is (much) better than two weak ones!

Some don'ts, especially:

Mentioning celebrity endorsements. "you never know—they could be moonlighting as your most critical anonymous reviewer."
Trying to guess who the reviewers are when communicating to the editors – it does not help. Additionally, you could be plain wrong in your guess (again, you never know) – they are anonymous, literally.

Generally speaking, I think authors should be appreciative of the work of the reviewers and editors. Occasionally, I serve as a reviewer and I know the time and effort it takes to make a fair and thorough assessment of a manuscript.

It is certainly not just because of politeness that in our 2008 3DNA Nature Protocols paper, we acknowledged:

We also thank the editor and the anonymous reviewers whose comments helped to clarify the presentation of the protocols.

More recently, in our 2010 NAR GpU paper, we acknowledged:

They also thank the anonymous reviewers, whose comments helped clarify the presentation of the manuscript.

Friday, April 2, 2010

What's special about the GpU dinucleotide platform?

Recently, I (together with Drs. Wilma Olson and Harmen Bussemaker – a team with a unique combination of complementary expertise) published a new article in Nucleic Acids Research (NAR): "The RNA backbone plays a crucial role in mediating the intrinsic stability of the GpU dinucleotide platform and the GpUpA/GpA miniduplex". The key findings of this work are summarized in the abstract:

The side-by-side interactions of nucleobases contribute to the organization of RNA, forming the planar building blocks of helices and mediating chain folding. Dinucleotide platforms, formed by side-by-side pairing of adjacent bases, frequently anchor helices against loops. Surprisingly, GpU steps account for over half of the dinucleotide platforms observed in RNA-containing structures. Why GpU should stand out from other dinucleotides in this respect is not clear from the single well-characterized H-bond found between the guanine N2 and the uracil O4 groups. Here, we describe how an RNA-specific H-bond between O2'(G) and O2P(U) adds to the stability of the GpU platform. Moreover, we show how this pair of oxygen atoms forms an out-of-plane backbone ‘edge’ that is specifically recognized by a non-adjacent guanine in over 90% of the cases, leading to the formation of an asymmetric miniduplex consisting of ‘complementary’ GpUpA and GpA subunits. Together, these five nucleotides constitute the conserved core of the well-known loop-E motif. The backbone-mediated intrinsic stabilities of the GpU dinucleotide platform and the GpUpA/GpA miniduplex plausibly underlie observed evolutionary constraints on base identity. We propose that they may also provide a reason for the extreme conservation of GpU observed at most 5'-splice sites.

As a nice surprise, this publication was selected by NAR as a featured article! According to the NAR website:

Featured Articles highlight the best papers published in NAR. These articles are chosen by the Executive Editors on the recommendation of Editorial Board Members and Referees. They represent the top 5% of papers in terms of originality, significance and scientific excellence.

I feel very gratified by the "extra" recognition. From my own perspective, I can easily rank this paper as the top one in my publication list: from the very beginning, I have been struck by the simplicity and elegance of the GpU story. Hopefully, time will verify the validity of this scientific contribution.

Under the hood, though, there is a long, complex (sometimes perplexing), yet interesting story associated with this work. Here is how it got started. While writing the 3DNA 2008 Nature Protocols (NP) paper, I selected the (previously undocumented) "-p" option of "find_pair" to showcase its capability to identify higher-order base associations, using the large ribosomal subunit (1JJ2) as an example. I noticed the unexpected O2'(G)⋅⋅⋅O2P(U) H-bond within the GpU dinucleotide platform in the pentaplet shown left in Figure A below. I was well aware of Leontis-Westhof's pioneering work on "Geometric nomenclature and classification of RNA base pairs", which involves three distinct edges – the Watson-Crick edge, the Hoogsteen edge, and the Sugar edge – yet does not take into consideration possible sugar-phosphate backbone interactions (Figure B below). So I decided to double-check, just to be sure that the H-bond was not spurious due to defects in the H-bond detecting scheme of "find_pair", and the results were very surprising.

The following section was re-added into the 3DNA NP paper in the very last revision:

It is also worth noting that the G1971–U1972 platform is stabilized not only by the well-characterized G(N2)⋅⋅⋅U(O4) H-bond interaction, but also by a little-noticed G(O2’)⋅⋅⋅U(O2P) sugar-phosphate backbone interaction (Fig. 6a). Examination of the 50S large ribosomal unit (1JJ2) alone reveals ten such double H-bonded G–U platforms, far more occurrences than those registered by any other dinucleotide platform (including A–A) in this structure. Apparently, the G–U platform is more stable than other platforms with only a single base–base H-bond interaction. We are currently investigating this overrepresented G–U dinucleotide platform in other RNA structures. (p.1226)

Friday, March 26, 2010

What find_pair in 3DNA can do

Structural analysis of nucleic acids used to be a rather tedious process, especially for irregular, complicated RNA structures and nucleic-acid/protein complexes [e.g., the large ribosomal subunit of H. marismortui (1JJ2)]. Without valid base-pairing information arranged properly in a duplex fragment as input, analysis programs such as Curves+ and analyze/cehs in 3DNA would produce meaningless results. The program find_pair in 3DNA was originally created to solve this specific problem, i.e., to generate an input file to 3DNA analysis routines directly from a nucleic-acid containing structure in PDB format. It is what makes nucleic acids structural analysis a routine process — running through thousands of structures from NDB/PDB can be fully automated.

Overall, find_pair has more than fulfilled the goal of its initial design (as stated above). Over the past few years, its functionality has been expanded and continuously refined (kaizen; 改善), making find_pair itself a full-featured application. Now, it is efficient, robust, and its simple command line interface allows for easy integration with other bioinformatics tools. Properly acknowledged or otherwise, find_pair has served (at least) as one of the key components in many other applications (RNAView, BPS, SwS, ARTS, to name just a few). Indeed, find_pair is by far the single program in 3DNA that has received the most questions (as evident from the 3DNA forum).

While I still have to write a method paper to describe the underlying algorithms of find_pair in detail — i.e., for identifying nucleotides, H-bonds, base pairs, high-order base associations, and double helical regions — the basic idea is very intuitive and easy to understand: as summarized in our recent GpU paper, find_pair is purely geometric based (with user adjustable parameters) and allows for the identification of canonical Watson–Crick as well as non-canonical base pairs, made up of normal or modified bases, regardless of tautomeric or protonation state. For example, in the GpU paper, we chose the following set of stringent parameters to ensure that the geometry of each identified base pair is nearly planar and supports at least one inter-base H-bond: (i) a vertical distance (stagger) between base planes ≤ 1.5 Å; (ii) an angle between base normal vectors ≤ 30°; and (iii) a pair of nitrogen and/or oxygen base atoms at a distance ≤ 3.3 Å. Other criteria (documented or otherwise), such as the distance between the origins of the two standard base reference frames, are just filters to speed up the calculations.
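To make the three criteria quoted above concrete, here is a minimal Python sketch. It is not the find_pair implementation: the function name, the dictionary layout of a "base" (origin, unit normal, and coordinates of its N/O base atoms), the choice to measure stagger along the mean of the two normals, and the choice to treat normals as unoriented (so that anti-parallel normals count as 0°) are all assumptions made for illustration.

```python
import math


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _unit(v):
    n = math.sqrt(_dot(v, v))
    return tuple(c / n for c in v)


def is_candidate_pair(base1, base2, max_stagger=1.5,
                      max_normal_angle=30.0, max_hbond_dist=3.3):
    """Apply the three geometric criteria to two bases.

    Each base is a dict (hypothetical layout) with keys:
      'origin'      -- (x, y, z) of the base reference-frame origin
      'normal'      -- (x, y, z) vector normal to the base plane
      'polar_atoms' -- list of (x, y, z) for base N and O atoms
    Distances are in Angstroms, the angle in degrees.
    """
    n1 = _unit(base1["normal"])
    n2 = _unit(base2["normal"])
    # Assumption: normals are unoriented, so flip n2 if it points the other way.
    if _dot(n1, n2) < 0:
        n2 = tuple(-c for c in n2)

    # (ii) angle between the base normal vectors
    cos_angle = min(1.0, _dot(n1, n2))
    angle = math.degrees(math.acos(cos_angle))

    # (i) stagger: displacement between origins projected on the mean normal
    mean_normal = _unit(tuple(a + b for a, b in zip(n1, n2)))
    shift = tuple(b - a for a, b in zip(base1["origin"], base2["origin"]))
    stagger = abs(_dot(shift, mean_normal))

    # (iii) at least one N/O ... N/O contact within the H-bond distance cut-off
    has_hbond = any(
        math.dist(p, q) <= max_hbond_dist
        for p in base1["polar_atoms"]
        for q in base2["polar_atoms"]
    )

    return stagger <= max_stagger and angle <= max_normal_angle and has_hbond


# Made-up coordinates, only to exercise the function.
base_a = {"origin": (0.0, 0.0, 0.0), "normal": (0.0, 0.0, 1.0),
          "polar_atoms": [(2.0, 0.0, 0.0)]}
base_b = {"origin": (8.0, 0.0, 0.5), "normal": (0.0, 0.0, -1.0),
          "polar_atoms": [(4.9, 0.0, 0.3)]}
print(is_candidate_pair(base_a, base_b))
```

Because the thresholds are keyword arguments, loosening one of them (for instance a larger `max_hbond_dist`) is a one-argument change, which mirrors the "user adjustable parameters" mentioned above.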

In a nutshell, find_pair has the following two core functionalities:

The default is to generate input to the analysis routines in 3DNA (analyze/cehs) for double helices. However, there is much more work under the hood than just identifying base pairs: the base pairs must be in proper sequential order, and each strand must be in 5' to 3' direction, for the calculated step parameters (twist, roll etc) to make sense. Moreover, with the "-c" option, one gets an input file to Curves (but not Curves+, yet); with the "-s" or "-1" option, find_pair treats the whole structure as one single strand, which is useful for getting all backbone torsion angles.

Detect all base pairs (regardless of whether they are in double helical regions) and higher-order (3+) base associations with the "-p" option. This feature (in its preliminary form) was there starting from at least v1.5, which was released at the end of 2002 (just before I left Rutgers), but it was intentionally not documented. The source code of find_pair (as part of 3DNA) was tested and shared within Rutgers (NDB and Dr. Olson's laboratory) before any 3DNA paper was published, and served as the basis for several other projects. We also offered 3DNA (with source code) to a few RNA experts for comments; but we received either no responses or politely-worded negative ones. Things did not work out as (I thought) they should have, but that's life and I have learned my lessons. The "-p" option was first explicitly mentioned in the 3DNA 2008 Nature Protocols paper, to illustrate how to identify the two pentaplets in the large ribosomal subunit of H. marismortui (1JJ2).
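Since the two modes above differ only by a command-line option, they are easy to drive from a script. The following Python sketch merely builds the argument list; it does not run anything. The option letters ("-p", "-c", "-s") are the ones named in this post, whereas the helper name, the mode labels, the file names, and the assumed argument order (options, then the PDB file, then the output file) are illustrative assumptions that should be checked against the 3DNA documentation.

```python
# Mode labels are hypothetical; only the option letters come from the post.
MODE_OPTIONS = {
    "helix": None,    # default: input for analyze/cehs
    "pairs": "-p",    # all base pairs and higher-order associations
    "curves": "-c",   # input file for Curves
    "single": "-s",   # whole structure as one single strand
}


def find_pair_command(pdb_file, out_file, mode="helix"):
    """Return an argv list for find_pair (assumed layout: options, PDB, output)."""
    if mode not in MODE_OPTIONS:
        raise ValueError(f"unknown mode: {mode!r}")
    cmd = ["find_pair"]
    option = MODE_OPTIONS[mode]
    if option is not None:
        cmd.append(option)
    cmd.extend([pdb_file, out_file])
    return cmd


# Hypothetical file names, for illustration only.
for pdb in ["1jj2.pdb", "bdl084.pdb"]:
    out = pdb.replace(".pdb", ".out")
    print(" ".join(find_pair_command(pdb, out, mode="pairs")))
```

With 3DNA installed, each list could be passed to `subprocess.run(cmd, check=True)`, which is one way to automate a pass over many structures.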

It is interesting to mention two papers I've recently come across: the first is on DNA-protein interactions and the second on RNA base-pairing, where new algorithms were developed to detect base pairs and their performance was compared with that of find_pair. In each of the two cases, it was claimed that find_pair missed certain pairs where the new methods succeeded. As it turned out, however, in the first case, simply relaxing find_pair's default H-bond distance cut-off from 4.0 Å to 4.5 Å, as used by the authors, recovered virtually all the missing pairs. In the second case, the "-p" option, which should have been used, was simply not specified.

After nearly a decade of extensive real-world applications and refinements, it is safe to say that find_pair is now a versatile and practical tool for nucleic acids structure analysis. Of course, I will continue to support and further refine find_pair as I see fit. Once in a while, I cannot help but think that find_pair is to nucleic acids what DSSP is to proteins: simple and elegant. As more people become aware of its existence, I would expect find_pair to gain even more widespread usage, especially in RNA-structure related research areas.

Saturday, March 20, 2010

One computer, three operating systems

While so far I have been quite happy with my new MacBook Pro, running Mac OS X 10.6 (Snow Leopard), I still feel more comfortable with the Ubuntu Linux programming environment I have been using for the past few years. Moreover, to make sure that my software (e.g., 3DNA) is strictly ANSI C compliant, and compiles without changes on the most commonly used operating systems (OSes), I need to have direct access to Linux and Windows. Luckily, the Intel-based hardware architecture of MacBook Pro and the free VirtualBox software make it possible to have the three OSes – Mac OS X, Ubuntu Linux, and Windows – in one computer.

Installing VirtualBox on Mac OS X was a snap. Specifically, I added the following two guest OSes:

Windows XP, with 1 GB RAM and 70 GB (virtual) hard disk
Ubuntu 9.10, with 2 GB RAM and 90 GB disk space

For seamless integration between each of the two guest OSes and the host Mac OS X, and for improved performance, I also created shared folders and installed guest additions for Windows and Linux. For Windows XP, the process was quite straightforward. For the Linux guest additions, however, I had some problems and solved them by following the instructions on "How To Install VirtualBox Guest Additions in Linux".

Now in Fullscreen Mode (command-F), I can run Ubuntu Linux or Windows XP as if each were running natively. Very cool!

Saturday, March 13, 2010

Hoogsteen base-pair

The A·U (or A·T) Hoogsteen pair is a well-known base pair (bp), named after the scientist who discovered it. As shown in the Figure below (left), in the Hoogsteen bp scheme, adenine uses its N7 and N6 atoms (at the major groove edge) to form two H-bonds with the N3 and O4 atoms from uracil, respectively. Interestingly, if the uracil base ring is flipped around the N7(A)…N3(U) H-bond by 180 degrees, N6(A) can also form an H-bond with O2(U), i.e., N6(A)…O2(U): this pairing scheme is called the reverse Hoogsteen bp (right).

I first came to know about the Hoogsteen bp from Saenger's book ("Principles of Nucleic Acid Structure"). Over the years, I have read many articles mentioning the Hoogsteen bp and touched on this topic myself in the 2003 3DNA NAR publication. However, I had never read Hoogsteen's two original publications on this topic until recently:

The two-page preliminary report, titled "The structure of crystals containing a hydrogen-bonded complex of 1-methylthymine and 9-methyladenine", was published in Acta Cryst. (1959). 12, 822-3. The paper contained only a single reference, to the Watson-Crick DNA structure paper published in Nature in 1953. I found it very revealing to understand why Hoogsteen used the methylated derivatives of thymine and adenine, and how the failed initial interpretation of the experimental "vector-density map" using the Watson-Crick A-T bp led to the discovery of the new base-pairing scheme:
The fact that the first trial structure could not be refined led to a more critical scrutiny of the generalized projection and a greater emphasis on the significance of certain spurious peaks and on relatively large variations in the heights of peaks that were assumed to represent atoms. The correct structure was finally discovered by changing the positions of a few atoms in the 9-methyladenine portion of the asymmetric unit.
The more extensive account of the Hoogsteen bp story, titled "The Crystal and Molecular Structure of a Hydrogen-Bonded Complex Between 1-Methylthymine and 9-Methyladenine", was published in Acta Cryst. (1963) 16, 907-16.

I like these two papers, and more generally such focused articles, where the authors get directly to the point and address it thoroughly and clearly. Most publications nowadays are very ambitious, trying to solve "big problems": the papers are generally far more complicated and often have "reproducibility" problems.

As a side note, the term Hoogsteen "edge" appears quite frequently in today's publications of RNA structures: in the Leontis-Westhof bp classification scheme, the term simply means the major groove edge in what would be a Watson-Crick bp geometry.

Saturday, March 6, 2010

MacMost, a valuable resource to help you get the most out of your Mac OS X

Upon receiving my new Mac OS X Snow Leopard, I googled around, trying to find some tutorials on the web. Somehow, I came across a video clip by Gary Rosenzweig. I then visited MacMost.com and watched more video clips over there during the past week.

Overall, I like the videos quite a bit: ~5 minutes long each, these podcasts show various Mac-related tips and tricks in an easy to follow fashion. Specifically, I like the following:

#347: "Quick Look" – a functionality of looking at the contents of a file without opening it. "Quick Look" seems to be unique to Mac OS X since I am not familiar with it in Linux and Windows. So far, I have found it especially handy in Mail for quickly checking contents of attachments.
#357: "Do Macs Need Anti-Virus Software?" – it is reassuring to know that "There are currently no active Mac viruses", and helpful to be aware that "anti-virus software could cause unexpected problems."
#363: "Learning to Program with Scratch" – it is from this video clip that I came to know the Scratch programming language from the MIT Media Lab. Unlike professional computer languages such as C, C++, Java, Ruby, Perl etc, Scratch targets the general public, especially kids, who can learn mathematical and computational ideas by programming with a simple drag-and-drop interface. Using Scratch, it is really easy and cool to create interactive stories and animations, and to share them on the web.

I will certainly keep visiting MacMost.com and watch more videos as they become available. Little by little, I will learn new tricks to make my Mac life more enjoyable.

Friday, February 26, 2010

Mac OS X Snow Leopard -- I'm loving it (mostly)!

Recently, when it was time for a new laptop, I decided to buy a MacBook Pro (Intel-based with Mac OS X 10.6.2 -- Snow Leopard). Over the past few days, I have been playing around with it, migrating files from my Ubuntu Linux box. So far, things have gone smoothly, so by and large, I am enjoying my new Mac.

Over the years, I have been using Ubuntu Linux and I have been very happy with it, especially for software development. LyX and OpenOffice are handy for writing technical documents. However, I have realized that when it comes to writing a manuscript for publication, and to communicating effectively with non-Linux collaborators, MS Word (with EndNote) is the standard. So I set up a Windows XP virtual machine via VirtualBox on my Ubuntu Linux box, which avoids the problem of dual booting and allows for easy file sharing between Linux and Windows.

Mac OS X is Unix-based (so it feels familiar coming from Linux) but has native support for MS Office and the Adobe suite of programs, so it seems an ideal choice for a new laptop. Mac OS X 10.6 (Snow Leopard) is claimed to be "The world's most advanced operating system. Finely tuned." Other things aside, I do appreciate the fact that 10.6 (Snow Leopard) is a refinement of 10.5 (Leopard) from installation to shutdown -- "In ways big and small, Mac OS X Snow Leopard makes your Mac faster, more reliable, and easier to use."

So far, I have configured Mail to access my Columbia emails. I must say that Mail is way better than Columbia's CubMail web-interface, and I like Mail's native integration with iCal and Address Book. Safari still needs some getting used to, given my mostly Firefox experience. However, it is nice to find that some websites, which do not work in Firefox but do in IE, display properly in Safari. Preview appears to be powerful for PDF and image viewing and manipulation. I have installed Xcode, and may explore it more, if for nothing else than to see what an IDE has to offer. Of course, it is nice to have direct access to MS Office (mostly for Word and PowerPoint, so no need to play around with OpenOffice), EndNote, Adobe (Acrobat, Photoshop, Illustrator), etc.

Some nuisances up to this point:

Keyboard missing numeric keypad and Home/End/PgUp/PgDn
Ctrl-C/V etc keyboard shortcuts I am used to now become Command-C/V etc
File and directory names are not case-sensitive -- most surprising!

Overall, my new MacBook Pro is a very nice toy to play with. As I become more familiar with it, I may like it more, hopefully.

Saturday, February 6, 2010

NSMB editorial: "Scientific writing 101"

In the February 2010 issue of Nature Structural & Molecular Biology [NSMB, 17(2), p.139], there is a nice Editorial titled "Scientific writing 101". This short one-page essay is a good example of (scientific) writing that is "a pleasure to read".

"Less is more when it comes to writing a good scientific paper. Tell a story in clear, simple language and keep in mind the importance of the ‘big picture’."

Specifically, the editorial makes the following points:

Tell a story. A scientific paper is not a chronology; the data should be presented and interpreted in context.
Be clear. "Clear, simple language allows the data and their interpretation to come through."
Provide an informative title and abstract. "Make the abstract clear and try to get the ‘big picture’ across."
Make the introduction short and concise.
Clearly distinguish Results from Discussion. "Discussion should put those results in a broader context." It "should be an interpretation of those results..."
The cover letter is important. You should also spell check your manuscript, number the pages, etc.

In this blog post, I am just recapping the key points of the editorial, and taking the opportunity to re-read it. Following the simple principles outlined in the editorial would be beneficial to everyone in the scientific community.