Xiang-Jun's Corner

Saturday, May 8, 2010

Some key combinations in MacBook Pro (Snow Leopard)

One of the inconveniences I experienced when first switching to a MacBook Pro running Mac OS X (Snow Leopard) was the lack of PgUp, PgDn, Home and End keys. Over the past few months, I have learned that the functionality of such 'convenience' keys can be achieved by combining the arrow keys (bottom right) with the 'fn', 'option', or 'command' keys (bottom left), as follows:

fn + right-arrow – end of a document
fn + left-arrow – beginning of a document
fn + up-arrow – Page up
fn + down-arrow – Page down
command + right-arrow – end of a line (End)
command + left-arrow – beginning of a line (Home)
option + right-arrow – one word to the right
option + left-arrow – one word to the left

According to Wikipedia, the 'fn' key "is a modifier key on many keyboards, especially on laptops, used in a compact layout to combine keys which are usually kept separate."

In addition to modifying cursor movements as noted above, the 'fn' key in Mac OS X can also be used to switch F1 to F12 from their default functions (e.g., F1 and F2 for screen brightness control) to the standard function keys. As an example, in MS Word, the keyboard shortcut for toggling case is "shift + f3", a trick I recently learned. In the default setting, one must press "fn + shift + f3" to achieve the desired effect.

Any tricks to share? I'd like to hear them!

Saturday, May 1, 2010

One year of blogging

When I checked the date of my first blog post today, I was a bit surprised to find that it has been exactly one year since I began to blog on May 2, 2009. Altogether, I have written over 60 posts, slightly more than one per week. At this time, I feel it appropriate to summarize my thoughts on blogging in general, to provide some perspective to those who care to visit here and make comments.

Why? The initial motivation was to use the blog as a platform to express my personal views on issues I am interested in. As made clear in my first post, "this is Xiang-Jun's Corner on the Internet: all views are mine, and I am opinionated." Over time, the blog posts have served as a convenient notebook (searchable and archived), either for my personal reference or to refer others to a particular post (e.g., "On maintaining the 3DNA forum" when being asked a 3DNA-related question via email).
What? "Random thoughts, mostly on scientific issues". I write only on issues I am familiar with and feel comfortable commenting on, so that I can respond quickly and concretely to users' comments. I am always open to suggestions and will be prompt in acknowledging errors and making corrections. So far, the largest portion of the posts has been devoted to nucleic acid structures in general, and 3DNA-related topics in particular.
How often? Due to time constraints, I will try to write one post per week at the minimum, maybe two, or on rare occasions three. Exceptions are possible, but by and large, I will aim for ~100 posts per year.
Comment? Currently, the policy is set such that "Anyone - includes Anonymous Users" can make a comment, and the comments are moderated. So far, I have always approved all the comments as soon as I see them in my gmail alert, and followed up where appropriate. Note that due to global time differences, commenters may experience some delay.
Does it work? Not unexpectedly, my blog has gradually attracted attention from quite a broad audience, especially those interested in nucleic acid structures (3DNA), including leading scientists in the field (I know from emails I have received).
Hot posts? According to Google Analytics, the most frequently visited nine posts are as follows (with posting date in parentheses):
1. Curves+ vs 3DNA (Sunday, August 16, 2009)
2. Does 3DNA work for RNA? (Friday, July 10, 2009)
3. Two web-interfaces to 3DNA, and more (Sunday, July 5, 2009)
4. Fit a least squares plane to a set of points (Saturday, August 22, 2009)
5. Two 3DNA figures made into a textbook on structural biology (Sunday, May 3, 2009)
6. Chemical diagram of Watson-Crick base-pairs (Saturday, January 23, 2010)
7. What's special about the GpU dinucleotide platform? (Friday, April 2, 2010)
8. Double helix groove width parameters from 3DNA (Saturday, September 5, 2009)
9. How to calculate torsion angle? (Saturday, October 31, 2009)
While some of the posts were expected to be on the list, a few of them (e.g., ls-plane fitting, calculation of torsion angle) may look a bit surprising. Their presence does, however, confirm an observation based on my personal experience and intuition: there is a technical niche where my expertise can make a difference.

Again, as I wrote in my first blog post, "Now the ball is rolling, and only time can tell where the destination will be -- but surely it will no longer stand where it was!" One year later, I can confidently say that the ball is rolling in the right direction, as I had hoped. Of course, I know for sure that more time and effort are needed to move to the next level, and I value your feedback!

Friday, April 23, 2010

Life is complicated -- is there a way to make it simpler?

To mark the 10th anniversary of the completion of the draft sequence of the human genome, the April 1, 2010 issue of Nature [464 (7289)] published a series of interesting and revealing articles, including an Editorial and (historical) accounts from Francis Collins and Craig Venter. I browsed through the whole list, and I especially liked the News article titled "Life is complicated" by Erika Check Hayden, a senior reporter for Nature. I was attracted by its catchy title and brief summary: "The more biologists look, the more complexity there seems to be. Erika Check Hayden asks if there's a way to make life simpler." Obviously, the title of this blog post was inspired by that article.

Over the past decade, the Human Genome Project has helped to revise the number of genes from the previously assumed ~100,000 to the "true" number of only ~21,000. The dramatically reduced number of genes illustrates the crucial importance of non-coding DNA (formerly called "junk" DNA) to biology, yet what non-coding DNA does is still befuddling. Nowadays, we are faced with a data deluge from sequencing, gene expression, protein (transcription factor) binding, and other new technologies. The complexity of biology has grown significantly, instead of being simplified, even for the most extensively studied protein, p53. As put by Jennifer Doudna, “The more we know, the more we realize there is to know.”

The community has gradually realized that information gathering does not always bring a corresponding increase in meaningful biological insights. We are facing an age of “drowning in information, starved for knowledge.” For example, in systems biology, a new discipline “supposed to help scientists make sense of the complexity”, it has turned out that “In many cases, the models themselves quickly become so complex that they are unlikely to reveal insights about the system, degenerating instead into mazes of interactions that are simply exercises in cataloguing.” I cannot agree more with the comment by Leonid Kruglyak: it is naive to think that “you can simply take very large amounts of data and run a data-mining program and understand what is going on in a generic way.”

Are we lost in the sea of biological data? Not necessarily. Eric Davidson's work is an excellent example of “taking smarter systems approaches” to reveal overarching biological rules. Instead of a “machine learning” type top-down approach, the “insights have come when scientists systematically analyse the components of processes that are easily manipulated in the laboratory – largely in model organisms. They’re still using a systems approach, but focusing it through a more traditional, bottom–up lens.” Through this systematic bottom-up approach, Davidson's group has deciphered how gene expression is controlled through regulatory interactions that specify the construction of the sea urchin’s skeleton.

Interestingly, Eric Davidson gave a seminar at C2B2 on Thursday, April 22, titled "Causal Systems Biology: the Sea Urchin Embryo Gene Regulatory Network". I was very impressed by his talk, and the points he made in the Nature News article.

Friday, April 16, 2010

3DNA in the June 2010 issue of JBSD on "Current Perspectives on Nucleosome Positioning"

While updating 3DNA citations this week, I noticed five of them are from the same June 2010 issue of JOURNAL OF BIOMOLECULAR STRUCTURE & DYNAMICS (JBSD), which is focused on "Current Perspectives on Nucleosome Positioning". Most of the papers in this issue are from well-known laboratories in computational structural biology. It is my pleasure to see 3DNA being widely used in the important research area of nucleosome positioning.

I browsed through the abstracts of all the papers in this JBSD issue to refresh my knowledge of this field. While DNA sequence surely plays some role in nucleosome positioning, I remain to be convinced of the existence of a nucleosome "code" (yet) in the sense of the generally applicable "genetic code". Overall, DNA is so flexible and the signal is so weak that data fitting can be tailored to a specific analysis, yielding results that are not transferable to other situations. Clearly, the area is hot, yet still wide open.

Thursday, April 8, 2010

NSMB editorial: "Making your point-by-point"

In the April 2010 issue of Nature Structural & Molecular Biology [NSMB, 17(4)], there is another interesting editorial, titled "Making your point-by-point". This editorial addresses an important issue in the process of publishing papers in peer-reviewed journals, namely: how to make an effective point-by-point response to "those ever-demanding editors and reviewers"?

Overall, it can be helpful to put yourself in the reviewer’s shoes and compose a response s/he would find appropriate, where the concerns raised are considered and fully addressed. In its ideal state, the review process is a positive and constructive back and forth, an intellectual discussion in which the manuscript is the ultimate beneficiary.

Here is my recap of the main points, as I understand them. I am also taking this opportunity to read this one-page editorial one more time.

What to do?

Keep to the point – "makes a series of [succinct] points in response [directly] to each point raised by the reviewers."
Keep it objective – be diplomatic in your point-by-point response to the reviewers, "even if the reviewer’s wording might have seemed overly strong." You could be forthright in your cover letter to the editors, though.
Keep things under control – "Know when to go to the bench and when to argue."
The scope of things – "Say clearly and succinctly" when "some requests might genuinely be beyond the scope of the manuscript or might simply be unfeasible." "Try not to salami-slice", one strong and solid paper is (much) better than two weak ones!

Some don'ts, especially:

Mentioning celebrity endorsements. "you never know—they could be moonlighting as your most critical anonymous reviewer."
Trying to guess who the reviewers are when communicating to the editors – it does not help. Additionally, you could be plain wrong in your guess (again, you never know) – they are anonymous, literally.

Generally speaking, I think authors should be appreciative of the work of the reviewers and editors. Occasionally, I serve as a reviewer, and I know the time and effort it takes to make a fair and thorough assessment of a manuscript.

It is certainly not just because of politeness that in our 2008 3DNA Nature Protocols paper, we acknowledged:

We also thank the editor and the anonymous reviewers whose comments helped to clarify the presentation of the protocols.

More recently, in our 2010 NAR GpU paper, we acknowledged:

They also thank the anonymous reviewers, whose comments helped clarify the presentation of the manuscript.

Friday, April 2, 2010

What's special about the GpU dinucleotide platform?

Recently, I (together with Drs. Wilma Olson and Harmen Bussemaker – a team with a unique combination of complementary expertise) published a new article in Nucleic Acids Research (NAR): "The RNA backbone plays a crucial role in mediating the intrinsic stability of the GpU dinucleotide platform and the GpUpA/GpA miniduplex". The key findings of this work are summarized in the abstract:

The side-by-side interactions of nucleobases contribute to the organization of RNA, forming the planar building blocks of helices and mediating chain folding. Dinucleotide platforms, formed by side-by-side pairing of adjacent bases, frequently anchor helices against loops. Surprisingly, GpU steps account for over half of the dinucleotide platforms observed in RNA-containing structures. Why GpU should stand out from other dinucleotides in this respect is not clear from the single well-characterized H-bond found between the guanine N2 and the uracil O4 groups. Here, we describe how an RNA-specific H-bond between O2'(G) and O2P(U) adds to the stability of the GpU platform. Moreover, we show how this pair of oxygen atoms forms an out-of-plane backbone ‘edge’ that is specifically recognized by a non-adjacent guanine in over 90% of the cases, leading to the formation of an asymmetric miniduplex consisting of ‘complementary’ GpUpA and GpA subunits. Together, these five nucleotides constitute the conserved core of the well-known loop-E motif. The backbone-mediated intrinsic stabilities of the GpU dinucleotide platform and the GpUpA/GpA miniduplex plausibly underlie observed evolutionary constraints on base identity. We propose that they may also provide a reason for the extreme conservation of GpU observed at most 5'-splice sites.

As a nice surprise, this publication was selected by NAR as a featured article! According to the NAR website:

Featured Articles highlight the best papers published in NAR. These articles are chosen by the Executive Editors on the recommendation of Editorial Board Members and Referees. They represent the top 5% of papers in terms of originality, significance and scientific excellence.

I feel very gratified by the "extra" recognition. From my own perspective, I can easily rank this paper as the top one in my publication list: from the very beginning, I have been struck by the simplicity and elegance of the GpU story. Hopefully, time will verify the validity of this scientific contribution.

Behind the scenes, though, there is a long, complex (sometimes perplexing), yet interesting story associated with this work. Here is how it got started. While writing the 3DNA 2008 Nature Protocols (NP) paper, I selected the (previously undocumented) "-p" option of "find_pair" to showcase its capability to identify higher-order base associations, using the large ribosomal subunit (1JJ2) as an example. I noticed the unexpected O2'(G)⋅⋅⋅O2P(U) H-bond within the GpU dinucleotide platform in the pentaplet shown left in Figure A below. I was well aware of Leontis and Westhof's pioneering work on "Geometric nomenclature and classification of RNA base pairs", which involves three distinct edges – the Watson-Crick edge, the Hoogsteen edge, and the Sugar edge – yet does not take possible sugar-phosphate backbone interactions into consideration (Figure B below). So I decided to double-check, just to be sure that the H-bond was not spurious due to defects in the H-bond detecting scheme of "find_pair", and the results were very surprising.

The following section was re-added into the 3DNA NP paper in the very last revision:

It is also worth noting that the G1971–U1972 platform is stabilized not only by the well-characterized G(N2)⋅⋅⋅U(O4) H-bond interaction, but also by a little-noticed G(O2’)⋅⋅⋅U(O2P) sugar-phosphate backbone interaction (Fig. 6a). Examination of the 50S large ribosomal unit (1JJ2) alone reveals ten such double H-bonded G–U platforms, far more occurrences than those registered by any other dinucleotide platform (including A–A) in this structure. Apparently, the G–U platform is more stable than other platforms with only a single base–base H-bond interaction. We are currently investigating this overrepresented G–U dinucleotide platform in other RNA structures. (p.1226)

Friday, March 26, 2010

What find_pair in 3DNA can do

Structural analysis of nucleic acids used to be a rather tedious process, especially for irregular, complicated RNA structures and nucleic-acid/protein complexes [e.g., the large ribosomal subunit of H. marismortui (1JJ2)]. Without valid base-pairing information arranged properly in a duplex fragment as input, analysis programs such as Curves+ and analyze/cehs in 3DNA would produce meaningless results. The program find_pair in 3DNA was originally created to solve this specific problem, i.e., to generate an input file to 3DNA analysis routines directly from a nucleic-acid containing structure in PDB format. It is what makes nucleic acids structural analysis a routine process — running through thousands of structures from NDB/PDB can be fully automated.

Overall, find_pair has more than fulfilled the goal of its initial design (as stated above). Over the past few years, its functionality has been expanded and continuously refined (kaizen; 改善), making find_pair itself a full-featured application. Now, it is efficient, robust, and its simple command line interface allows for easy integration with other bioinformatics tools. Properly acknowledged or otherwise, find_pair has served (at least) as one of the key components in many other applications (RNAView, BPS, SwS, ARTS, to name just a few). Indeed, find_pair is by far the single program in 3DNA that has received the most questions (as evident from the 3DNA forum).

While I still have to write a method paper to describe the underlying algorithms of find_pair in detail — i.e., for identifying nucleotides, H-bonds, base pairs, high-order base associations, and double helical regions — the basic idea is very intuitive and easy to understand: as summarized in our recent GpU paper, find_pair is purely geometric based (with user adjustable parameters) and allows for the identification of canonical Watson–Crick as well as non-canonical base pairs, made up of normal or modified bases, regardless of tautomeric or protonation state. For example, in the GpU paper, we chose the following set of stringent parameters to ensure that the geometry of each identified base pair is nearly planar and supports at least one inter-base H-bond: (i) a vertical distance (stagger) between base planes ≤ 1.5 Å; (ii) an angle between base normal vectors ≤ 30°; and (iii) a pair of nitrogen and/or oxygen base atoms at a distance ≤ 3.3 Å. Other criteria (documented or otherwise), such as the distance between the origins of the two standard base reference frames, are just filters to speed up the calculations.
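The three criteria quoted above can be illustrated with a minimal Python sketch. This is not find_pair's actual code: the function name `looks_like_base_pair`, the dictionary layout, and the way stagger is computed (projecting the origin-to-origin vector onto the mean base normal) are all assumptions made for illustration, and the coordinates in the usage lines are made up.

```python
import math


def _unit(v):
    """Return v scaled to unit length."""
    length = math.sqrt(sum(c * c for c in v))
    return tuple(c / length for c in v)


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def looks_like_base_pair(base1, base2,
                         max_stagger=1.5, max_angle=30.0, max_hbond=3.3):
    """Apply the three geometric tests to two bases (hypothetical data layout).

    Each base is a dict with:
      "origin": (x, y, z) of its reference frame origin, in Angstroms
      "normal": (x, y, z) vector perpendicular to the base plane
      "atoms":  list of (element, (x, y, z)) for the base atoms
    """
    n1, n2 = _unit(base1["normal"]), _unit(base2["normal"])
    # The two normals may point in opposite directions; align them first.
    if _dot(n1, n2) < 0:
        n2 = tuple(-c for c in n2)

    # (ii) angle between base normals, in degrees
    cos_angle = max(-1.0, min(1.0, _dot(n1, n2)))
    angle = math.degrees(math.acos(cos_angle))

    # (i) stagger, assumed here to be the origin-to-origin vector
    #     projected onto the mean normal
    mean_normal = _unit(tuple(a + b for a, b in zip(n1, n2)))
    offset = tuple(b - a for a, b in zip(base1["origin"], base2["origin"]))
    stagger = abs(_dot(offset, mean_normal))

    # (iii) at least one N/O ... N/O contact within the H-bond cut-off
    has_hbond = any(
        e1 in ("N", "O") and e2 in ("N", "O")
        and math.dist(p1, p2) <= max_hbond
        for e1, p1 in base1["atoms"]
        for e2, p2 in base2["atoms"]
    )
    return stagger <= max_stagger and angle <= max_angle and has_hbond


# Made-up coordinates, for illustration only.
g = {"origin": (0.0, 0.0, 0.0), "normal": (0.0, 0.0, 1.0),
     "atoms": [("N", (0.0, 0.0, 0.0))]}
u = {"origin": (5.0, 0.0, 0.5), "normal": (0.0, 0.0, -1.0),
     "atoms": [("O", (2.9, 0.0, 0.2))]}
print(looks_like_base_pair(g, u))
```

Keeping the three thresholds as keyword arguments mirrors the "user adjustable parameters" idea: loosening or tightening one cut-off does not require touching the rest of the logic.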

In a nutshell, find_pair has the following two core functionalities:

The default is to generate input to the analysis routines in 3DNA (analyze/cehs) for double helices. However, there is much more work under the hood than just identifying base pairs: the base pairs must be in proper sequential order, and each strand must be in the 5' to 3' direction, for the calculated step parameters (twist, roll, etc.) to make sense. Moreover, with the "-c" option, one gets an input file to Curves (but not Curves+, yet); with the "-s" or "-1" option, find_pair treats the whole structure as one single strand, which is useful for getting all backbone torsion angles.

Detect all base pairs (regardless of whether they are in double helical regions or not) and higher-order (3+) base associations with the "-p" option. This feature (in its preliminary form) was there starting from at least v1.5, which was released at the end of 2002 (just before I left Rutgers), but it was intentionally not documented. The source code of find_pair (as part of 3DNA) was tested and shared within Rutgers (NDB and Dr. Olson's laboratory) before any 3DNA paper was published, and served as the basis for several other projects. We also offered 3DNA (with source code) to a few RNA experts for comments, but we received either no responses or politely worded negative ones. Things did not work out as I thought they should have, but that's life and I have learned my lessons. The "-p" option was first explicitly mentioned in the 3DNA 2008 Nature Protocols paper, to illustrate how to identify the two pentaplets in the large ribosomal subunit of H. marismortui (1JJ2).

It is interesting to mention two papers I have recently come across: the first is on DNA-protein interactions and the second on RNA base-pairing. In both, new algorithms were developed to detect base pairs, and their performance was compared with that of find_pair. In each case, it was claimed that find_pair missed certain pairs that the new method found. As it turned out, however, in the first case, simply relaxing find_pair's default H-bond distance cut-off from 4.0 Å to 4.5 Å, the value used by the authors, recovered virtually all the missing pairs. In the second case, the "-p" option, which should have been specified, simply was not.

After nearly a decade of extensive real-world applications and refinements, it is safe to say that find_pair is now a versatile and practical tool for nucleic acid structure analysis. Of course, I will continue to support and further refine find_pair as I see fit. Once in a while, I cannot help but think that find_pair is to nucleic acids what DSSP is to proteins: simple and elegant. As more people become aware of its existence, I would expect find_pair to gain even more widespread usage, especially in RNA-structure related research areas.

Saturday, March 20, 2010

One computer, three operating systems

While so far I have been quite happy with my new MacBook Pro, running Mac OS X 10.6 (Snow Leopard), I still feel more comfortable with the Ubuntu Linux programming environment I have been using for the past few years. Moreover, to make sure that my software (e.g., 3DNA) is strictly ANSI C compliant, and compiles without changes on the most commonly used operating systems (OSes), I need to have direct access to Linux and Windows. Luckily, the Intel-based hardware architecture of the MacBook Pro and the free VirtualBox software make it possible to have the three OSes – Mac OS X, Ubuntu Linux, and Windows – on one computer.

Installing VirtualBox on Mac OS X was a snap. Specifically, I added the following two guest OSes:

Windows XP, with 1 GB RAM and 70 GB (virtual) hard disk
Ubuntu 9.10, with 2 GB RAM and 90 GB (virtual) hard disk

For seamless integration between each of the two guest OSes and the host Mac OS X, and for improved performance, I also created shared folders and installed guest additions for Windows and Linux. For Windows XP, the process was quite straightforward. For the Linux guest additions, however, I had some problems and solved them by following the instructions on "How To Install VirtualBox Guest Additions in Linux".

Now in Fullscreen Mode (command-F), I can run Ubuntu Linux or Windows XP as if each were running natively. Very cool!

Saturday, March 13, 2010

Hoogsteen base-pair

The A·U (or A·T) Hoogsteen pair is a well-known base pair (bp), named after the scientist who discovered it. As shown in the Figure below (left), in the Hoogsteen bp scheme, adenine uses its N7 and N6 atoms (at the major groove edge) to form two H-bonds with the N3 and O4 atoms from uracil, respectively. Interestingly, if the uracil base ring is flipped around the N7(A)…N3(U) H-bond by 180 degrees, N6(A) can also form an H-bond with O2(U), i.e., N6(A)…O2(U): this pairing scheme is called the reverse Hoogsteen bp (right).
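The 180-degree flip described above is a simple geometric operation, and a short Python sketch may make it concrete. The function name `rotate_180_about_axis` is hypothetical, and the coordinates in the usage lines are made up rather than taken from any real structure; in practice the axis would run from N7(A) to N3(U) and the function would be applied to every uracil atom.

```python
import math


def rotate_180_about_axis(point, axis_start, axis_end):
    """Rotate `point` by 180 degrees about the line through two points."""
    axis = [e - s for s, e in zip(axis_start, axis_end)]
    length = math.sqrt(sum(c * c for c in axis))
    u = [c / length for c in axis]                      # unit axis direction
    rel = [p - s for p, s in zip(point, axis_start)]    # point relative to axis
    along = sum(r * c for r, c in zip(rel, u))          # component along axis
    # A half-turn keeps the component along the axis and negates the
    # perpendicular component: p' = s + 2 (rel . u) u - rel
    return tuple(s + 2.0 * along * c - r
                 for s, c, r in zip(axis_start, u, rel))


# Made-up coordinates: an axis along z and an atom sitting off the axis.
n7 = (0.0, 0.0, 0.0)
n3 = (0.0, 0.0, 2.9)
atom = (1.2, 0.5, 3.4)
flipped = rotate_180_about_axis(atom, n7, n3)
print(flipped)
```

Atoms lying on the axis stay where they are, which is why the N7(A)…N3(U) contact is preserved while the O4 and O2 sides of the uracil ring trade places.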

I first came to know about the Hoogsteen bp from Saenger's book ("Principles of Nucleic Acid Structure"). Over the years, I have read many articles mentioning the Hoogsteen bp and touched on this topic myself in the 2003 3DNA NAR publication. However, I had never read Hoogsteen's two original publications on this topic until recently:

The two-page preliminary report, titled "The structure of crystals containing a hydrogen-bonded complex of 1-methylthymine and 9-methyladenine", was published in Acta Cryst. (1959). 12, 822-3. The paper contained only a single reference, to the Watson-Crick DNA structure paper published in Nature in 1953. I found it very revealing to understand why Hoogsteen used the methylated derivatives of thymine and adenine, and how the failed initial interpretation of the experimental "vector-density map" using the Watson-Crick A-T bp led to the discovery of the new base-pairing scheme:
The fact that the first trial structure could not be refined led to a more critical scrutiny of the generalized projection and a greater emphasis on the significance of certain spurious peaks and on relatively large variations in the heights of peaks that were assumed to represent atoms. The correct structure was finally discovered by changing the positions of a few atoms in the 9-methyladenine portion of the asymmetric unit.
The more extensive account of the Hoogsteen bp story, titled "The Crystal and Molecular Structure of a Hydrogen-Bonded Complex Between 1-Methylthymine and 9-Methyladenine", was published in Acta Cryst. (1963) 16, 907-16.

I like these two papers, and more generally such focused articles, where authors get directly to a point and address it thoroughly and clearly. Most publications nowadays are very ambitious, trying to solve "big problems": the papers are generally far more complicated and often have "reproducibility" problems.

As a side note, the term Hoogsteen "edge" appears quite frequently in today's publications of RNA structures: in the Leontis-Westhof bp classification scheme, the term simply means the major groove edge in what would be a Watson-Crick bp geometry.