Xiang-Jun's Corner

Friday, April 8, 2011

Tips and tricks from "The Geek Stuff"

As a devoted command line user, I am always interested in learning new tricks to make my life more enjoyable. Recently, I came across Ramesh Natarajan’s blog “The Geek Stuff” which is full of “instruction guides, how-to, troubleshooting tips and tricks on Linux, database, hardware, security and web” to help solve practical problems.

For example, in the section “Best of the Blog”, I recently benefitted quite a bit by reading the following posts:

There are many other helpful tips/tricks as well; since I have bookmarked the site, I will surely come back!

Sunday, April 3, 2011

Scripting in Ruby is fun

Over the years, I have played around with various scripting languages, including awk, bash, Perl, Python and Ruby. By far, I have enjoyed Ruby the most; nowadays, I write scripts nearly exclusively in Ruby.

Created by Yukihiro "Matz" Matsumoto in Japan during the mid-1990s, Ruby became popular worldwide in the mid-2000s with the Rails web application framework. Indeed, I first dug into Ruby through Rails, and by reading David Black's book "Ruby for Rails: Ruby techniques for Rails developers". As an exercise, I implemented the current 3DNA v2.0 website with Rails v1.x. Then I quickly realized that I had neither the time nor the interest to keep up with the rapidly evolving Rails framework. However, I did begin to appreciate Ruby's simplicity, consistency and expressiveness. Over the past few years, I have collected over a dozen Ruby-related (e)books, including "The Well-Grounded Rubyist" (David Black, covering v1.9), "The Ruby Programming Language" (David Flanagan and Yukihiro Matsumoto), and "Metaprogramming Ruby: Program Like the Ruby Pros" (Paolo Perrotta). Just as in my experience with (ANSI) C, I feel Ruby "wears well as one's experience with it grows" (K&R, in the preface of "The C Programming Language"). The better I know Ruby, the more I enjoy using it.

I recently wrote two Ruby scripts for the analysis of molecular dynamics (MD) simulation trajectories using 3DNA. Honestly, I would not have bothered with Perl for the task (otherwise, it would have been done a long time ago), given the sideline nature of my support of 3DNA. Yet, writing and refining the Ruby scripts (with the help of git and rake) has turned out to be a pleasant experience. Another reason scripting in Ruby is fun is its large, active and friendly user community; there are many user-contributed libraries (gems) that serve common programming needs well. As an example, in the 3DNA-MD scripts, I took advantage of the elegant Trollop command-line option parser by William Morgan. I picked Trollop among many other choices because it is self-contained in a single file, simple to use, and "gets out of your way".

In the Ruby community, exciting new developments are happening all the time. Recently, I was drawn to thor, "a simple and efficient tool for building self-documenting command line utilities". Over the past couple of years, I have browsed Sinatra and Sequel – they also look brilliant! Of course, for bioinformatics, there is the BioRuby project.

Overall, in my experience, scripting in Ruby is fun and exciting. Are you a Rubyist yet?

Saturday, March 26, 2011

DNA fiber models ABC

Among the 55 fiber models available in 3DNA, the A-, B- and C-DNA types are the most generic – they can be built with bases A, C, G and T in any combination (see table below). Moreover, in addition to the well-known Arnott fiber models (#1, #4 and #7, all from calf thymus), there are newer variants from van Dam & Levitt (#46 and #47) and Premilat & Albiser (#53 to #55).

 #   Twist  Rise   Description   (twist in degrees, rise in Å)
 1   32.7   2.548  A-DNA (calf thymus)
 4   36.0   3.375  B-DNA (calf thymus)
 7   38.6   3.310  C-DNA (calf thymus)
46   36.0   3.38   B-DNA (BI-type nucleotides)
47   40.0   3.32   C-DNA (BII-type nucleotides)
53  -38.7   3.29   C-DNA (depreciated)
54   32.73  2.56   A-DNA [cf. #1]
55   36.0   3.39   B-DNA [cf. #4]

As shown in Figure 9 of the 3DNA 2003 NAR paper (linked below), the A-, B- and C-DNA fiber models are all right-handed regular straight helices, yet each has distinctive features.
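Because each fiber model is a regular helix, its repeat follows directly from two numbers. The sketch below is only an illustration, not part of 3DNA: it reads the second and third columns of the table as helical twist (degrees) and rise (Å), which matches the values quoted from the Gossett and Harvey article further down, and derives base pairs per turn (360 / twist) and pitch (base pairs per turn × rise). The names FIBER_MODELS and helix_repeat are made up for this example.

```python
# Twist (degrees per base-pair step) and rise (angstroms per step),
# copied from the table for the three Arnott models (#1, #4 and #7).
FIBER_MODELS = {
    "A-DNA (#1)": (32.7, 2.548),
    "B-DNA (#4)": (36.0, 3.375),
    "C-DNA (#7)": (38.6, 3.310),
}


def helix_repeat(twist_deg, rise_angstrom):
    """Return (base pairs per turn, pitch in angstroms) for a regular helix."""
    # abs() lets a negative twist (as listed for model #53) give a positive repeat.
    bp_per_turn = 360.0 / abs(twist_deg)
    pitch = bp_per_turn * rise_angstrom
    return bp_per_turn, pitch


for name, (twist, rise) in FIBER_MODELS.items():
    bp_per_turn, pitch = helix_repeat(twist, rise)
    print(f"{name}: {bp_per_turn:.2f} bp/turn, pitch {pitch:.2f} A")
```

For the B-DNA model this gives exactly 10 base pairs per turn and a pitch of 33.75 Å; the A- and C-DNA models come out at roughly 11 and a little over 9 base pairs per turn, respectively.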

While I could easily envision possible applications of the fiber models, especially in connection with the analysis and rebuilding routines in 3DNA, it was still a nice surprise to see a recent article by Gossett and Harvey, titled "Computational Screening and Design of DNA-Linked Molecular Nanowires" [Nano Lett., 2011, 11 (2), pp 604–608]. The abstract is quoted below:

DNA can be used as a structural component in the process of making conductive polymers called nanowires. Accurate molecular models could lead to a better understanding of how to prepare these types of materials. Here we present a computational tool that allows potential DNA-linked polymer designs to be screened and evaluated. The approach involves an iterative procedure that adjusts the positions of DNA-linked monomers in order to obtain reasonable molecular geometry compatible with normal DNA conformations and with the properties of the polymer being formed. This procedure has been used to evaluate designs already reported experimentally, as well as to suggest a new design based on pyrrylene vinylene (PV) monomers.

In the article, 3DNA (the web interface version w3DNA) was cited as follows:

The selection of DNA structures is important because the DNA remains fixed throughout the procedure. To reduce the risk of an incorrect result, one should choose a subset of DNA structures that are in some sense representative of DNA conformational space. The DNA structures (A-, B-, and C-form DNA) were obtained using the Web 3DNA web server. We used a poly(dG)-poly(dC) sequence with ideal geometry for each DNA structure. A-DNA was constructed with rise = 2.548 Å and twist = 32.7˚ , B-DNA was constructed with rise = 3.375 Å and twist = 36.0 ˚, and C-DNA was constructed with rise = 3.310 Å and twist = 38.6 ˚.

Indeed, this is a novel application of fiber DNA ABC models!

Sunday, March 20, 2011

3DNA citations reach over 500

On Friday, June 5, 2009, I blogged on the topic titled "3DNA citations reach over 300". At that time, I wrote (towards the end):

I still remember that the number of citations to 3DNA was less than 150 nearly two years ago [~ summer 2007], when I started to wrote the first draft of our 2008 Nature Protocols paper. Now it is more than doubled! I would blog on this topic again when the number reaches 500.

When I checked Google Scholar for 3DNA citations just now, the citation number was already over 500 for the initial 2003 3DNA NAR paper alone. Combined with the two direct follow-ups – the 2008 Nature Protocols paper and the 2009 NAR web server paper – the three 3DNA publications have been cited a total of 550 times.

Again, as noted in that blog post,

In my opinion, some of 3DNA features are still (heavily) underused. Now that we have a sizable user community, 3DNA could only become better and would be more widely used. I have every reason to believe that in the not-so-distant-future, the citations to 3DNA would reach over 1000.

A decade after its initial humble release, 3DNA has been successfully applied to many real-world problems. As spare time permits, I have actively maintained and continuously refined 3DNA, based largely on user feedback. Over time, I have also come to see clearly that 3DNA can be moved to the next level in both functionality and usability, to have an even larger and broader impact.

Now more than halfway there, it won't be long before citations to 3DNA reach 1000, and then go beyond.

Sunday, March 13, 2011

Review article on NMR analysis of protein–DNA interactions by Milon et al

Through Google Scholar, I became aware of a recent review article by Milon et al., titled "Nuclear magnetic resonance analysis of protein–DNA interactions", in the journal J. R. Soc. Interface:

This review focuses on the experimental strategies currently employed to solve structures of protein–DNA complexes and to analyse their dynamics. It highlights how these approaches can help in understanding detailed molecular mechanisms of target recognition.

I browsed through the text to become more familiar with the NMR methodology and its applications in protein-DNA recognition. I was surprised that 3DNA was cited in the article, especially with respect to its unique analyze/rebuild complementarity:

In addition, several software programs have been developed to model DNA bending such as the 3DNA program, which allows analysis of DNA structural parameters and enables it to be rebuilt with customized DNA models [76]. Several Web servers have been created recently and provide interesting tools to analyse and rebuild DNA models [77,78].

I only wish that 3DNA's neat features could be more widely recognized; hopefully I will have the opportunity to further refine 3DNA and move it to the next level.

Saturday, March 5, 2011

Retraction of scientific publications

Once in a while, I come across retraction notices of scientific publications in leading journals/magazines. Even for cases not directly related to my research areas, I normally browse through them.

In the March 3, 2011 issue of Nature, there is a retraction of the Letter "Mediation of pathogen resistance by exudation of antimicrobials from roots" [Nature 434, 217–221 (2005)]. I am intrigued by the first sentence of the note:

The authors wish to retract this Letter after a key reference by Walker et al. (ref. 9 in this Letter) was retracted from the scientific literature.

It turns out that the 2003 Walker et al. J. Agric. Food Chem. paper (withdrawn in October 2009) and the 2005 Nature Letter were from the same group. Overall, it took ~6 years each for the two papers to be retracted. As of today, they have been cited 76 and 84 times, respectively, according to Google Scholar.

Sunday, February 27, 2011

Evidence for transient Hoogsteen base pairs in canonical duplex DNA

In the February 24, 2011 issue of Nature, there is an interesting article by Nikolova et al., titled "Transient Hoogsteen base pairs in canonical duplex DNA". Its main discovery is succinctly summarized in the abstract:

By using nuclear magnetic resonance relaxation dispersion spectroscopy in concert with steered molecular dynamics simulations, we have observed transient sequence-specific excursions away from Watson–Crick base-pairing at CA and TA steps inside canonical duplex DNA towards low-populated and short-lived A•T and G•C Hoogsteen base pairs. The observation of Hoogsteen base pairs in DNA duplexes specifically bound to transcription factors and in damaged DNA sites implies that the DNA double helix intrinsically codes for excited state Hoogsteen base pairs as a means of expanding its structural complexity beyond that which can be achieved based on Watson–Crick base-pairing.

Geometrically, the Hoogsteen base pair is related to the Watson-Crick base pair by a 180-degree rotation of the purine about its glycosidic bond (N9–C1'). While the A•T Hoogsteen base pair is classic, the similar G•C+ Hoogsteen pair (with protonation of cytosine N3) is equally possible. The A•T and G•C Hoogsteen base pairs have two perfect H-bonds, so they are energetically stable. As for their existence in duplex DNA, the most direct evidence comes from the "trap" experiments (see Fig. 3 of the paper). In the News & Views section, Honig and Rohs provide a nice recap of the main point and implications of this work.
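The geometric relation mentioned above, a 180-degree rotation about the glycosidic bond, can be sketched as a rotation of a point about an arbitrary axis using Rodrigues' formula. This is only a toy illustration: the function name rotate_about_axis and the three coordinates are made up for the example, they are not taken from any real structure, and a real purine flip would apply the same rotation to every base atom.

```python
import math


def rotate_about_axis(point, axis_point, axis_dir, angle_deg):
    """Rotate `point` by `angle_deg` about the line through `axis_point`
    with direction `axis_dir`, using Rodrigues' rotation formula."""
    norm = math.sqrt(sum(c * c for c in axis_dir))
    k = [c / norm for c in axis_dir]                 # unit vector along the axis
    v = [p - a for p, a in zip(point, axis_point)]   # point relative to the axis
    theta = math.radians(angle_deg)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    k_dot_v = sum(kc * vc for kc, vc in zip(k, v))
    k_cross_v = [
        k[1] * v[2] - k[2] * v[1],
        k[2] * v[0] - k[0] * v[2],
        k[0] * v[1] - k[1] * v[0],
    ]
    rotated = [
        v[i] * cos_t + k_cross_v[i] * sin_t + k[i] * k_dot_v * (1.0 - cos_t)
        for i in range(3)
    ]
    return [r + a for r, a in zip(rotated, axis_point)]


# Hypothetical coordinates (angstroms), chosen only to make the flip easy to see.
c1_prime = (0.0, 0.0, 0.0)   # sugar C1' atom
n9 = (1.5, 0.0, 0.0)         # purine N9 atom; C1' -> N9 defines the rotation axis
base_atom = (2.5, 1.0, 0.0)  # some other atom of the purine base

axis = [b - a for a, b in zip(c1_prime, n9)]
flipped = rotate_about_axis(base_atom, c1_prime, axis, 180.0)
print([round(c, 3) for c in flipped])
```

With the axis along x, the flipped atom keeps its position along the bond while its offset from the bond changes sign, which is the sense in which the rotation presents the other edge of the purine to its partner.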

As also observed in another recent publication, "Replication infidelity via a mismatch with Watson–Crick geometry", the base sequence has a subtle role in influencing the base-pairing schemes, three-dimensional structures and biological functions of DNA. However, we should not forget that only the Watson-Crick base pairs, and to a lesser extent the G-U wobble pair, have the correct symmetry to ensure a "regular" double helical structure.

Sunday, February 20, 2011

Canned responses in Gmail make it easy to send common messages

Through Gary Rosenzweig's MacMost Now video #509 "Gmail Labs" (January 28, 2011), I first heard of "Canned Responses" in Gmail Labs:

Email for the truly lazy. Save and then send your common messages using a button next to the compose form.

This is a truly handy feature that I have long been waiting for! Yet even though I was aware of Gmail Labs and enabled quite a few experimental features a while ago, I had not searched Gmail Labs for new features since.

Over the past few weeks, I have found "Canned Responses" increasingly indispensable in my support of the 3DNA forum (as a sideline project):

When I activate a new 3DNA forum registration, I've always included a "standard" message to "make the forum policy upfront and explicit, in order to avoid misunderstandings or surprises." Previously, I had to copy-and-paste, e.g., from a specifically created text file or elsewhere. Surely, this worked, but I had felt intuitively that there must be a better way to get the job done. Well, that's exactly where "Canned Responses" fit in!
Over the past few months, I have been ever more bothered by spam registrations. So as a further filter, I have been sending the following enquiry message to each suspicious registration for activation:

Thanks for your registration at the 3DNA forum. Please tell me a little bit about yourself and elaborate on how 3DNA could be useful to your project; we would like to make the forum spam-free.

See also "Further notes on forum registration and posting" -- you may not need to register.
Here once again, the "Canned Responses" feature makes my life much easier! Moreover, this step turns out to be extremely effective; a large percentage of registrations is filtered out at this final stage.

Are you using Gmail? If so, you may also want to give "Canned Responses" a try.

Sunday, February 13, 2011

Making data maximally available?

In the February 11, 2011 issue of Science (Vol. 331 no. 6018 p. 649), there is an editorial, titled "Making Data Maximally Available". Indeed, the issue contains a special section on "Dealing with Data".

Science is driven by data. New technologies have vastly increased the ease of data collection and consequently the amount of data collected, while also enabling data to be independently mined and reanalyzed by others ... It is obvious that making data widely available is an essential element of scientific research.

Especially, I like the following two (proposed) new policies:

To extend the data access requirement "to include computer codes involved in the creation or analysis of data." If properly implemented and enforced, this policy could significantly improve the repeatability and assessment of published results. In my experience, I have observed too many times that secrets are hidden in the seemingly "little" subtle details.
"To produce a single list that combines references from the main paper and the SOM" (supporting online material) to "provide credit and reveal data sources more clearly". Potentially, this will also increase the citation of method papers.

Hopefully, other journals will follow Science's lead to make data maximally available, and to present data more transparently.