How Perl Saved the Human Genome Project, continued...

Here is where Perl again came to the rescue. The Cambridge summit meeting that introduced this article was called in part to deal with the data interchange problem. Despite the fact that the two groups involved were close collaborators and superficially seemed to be using the same tools to solve the same problems, on closer inspection nothing they were doing was exactly the same.

The main software components in a DNA sequencing project are:

  • a trace editor, to analyze, display and edit the short DNA read chromatograms from DNA sequencing machines.
  • a read assembler, to find overlaps between the reads and assemble them together into long contiguous sections.
  • an assembly editor, to view the assemblies and make changes in places where the assembler went wrong.
  • a database to keep track of it all.

Over the course of a few years, the two groups had developed suites of software that worked well in their hands. Following the familiar genome center model, some of the components were developed in-house while others were imported from outside. As shown in Figure 2, Perl was used for the glue to make these pieces of software fit together. Between each pair of interacting modules were one or more Perl scripts responsible for massaging the output of one module into the expected input for another.

When the time came to interchange data, however, the two groups hit a snag. Between them they were now using two trace editors, three assemblers, two assembly editors and (thankfully) one database. If two Perl scripts were required for each pair of components (one for each direction), one would need as many as 56 different scripts (eight components, each paired with seven others) to handle all the possible interconversion tasks. Every time the input or output format of one of these modules changed, 14 scripts might need to be examined and fixed.

The solution that was worked out during this meeting is the obvious one shown in Figure 3. The two groups decided to adopt a common data exchange format known as CAF (an acronym whose exact meaning was forgotten during the course of the meeting). CAF would contain a superset of the data that each of the analysis and editing tools needed. For each module, two Perl scripts would be responsible for converting from CAF into whatever format Module A expects ("CAF2ModuleA") and converting Module A's output back into CAF ("ModuleA2CAF"). This simplified the programming and maintenance task considerably. Now there were only 16 Perl scripts to write; when one module changed, only two scripts would need to be examined.
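The script counts for the two designs can be checked with a few lines of Perl. This is only a sketch: it assumes exactly one script per direction for every pair of modules in the pairwise design, and the subroutine names are made up for illustration.

```perl
use strict;
use warnings;

# Component counts as given in the text: two trace editors, three
# assemblers, two assembly editors and one database.
my %components = (
    trace_editor    => 2,
    assembler       => 3,
    assembly_editor => 2,
    database        => 1,
);

my $n = 0;
$n += $_ for values %components;    # total number of modules

# Pairwise design: one script per direction for every pair of modules.
sub pairwise_total      { my ($count) = @_; return $count * ($count - 1) }

# Scripts that touch any single module in the pairwise design.
sub pairwise_per_module { my ($count) = @_; return 2 * ($count - 1) }

# Hub design: one CAF2Module and one Module2CAF script per module.
sub hub_total           { my ($count) = @_; return 2 * $count }

printf "modules: %d\n", $n;
printf "pairwise scripts: %d (%d affected when one module changes)\n",
    pairwise_total($n), pairwise_per_module($n);
printf "hub scripts: %d (2 affected when one module changes)\n",
    hub_total($n);
```

The pairwise count grows with the square of the number of modules, while the hub count grows linearly, which is why the common format pays off more with every tool added.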

This episode is not unique. Perl has been the solution of choice for genome centers whenever they need to exchange data, or to retrofit one center's software module to work with another center's system.

So Perl has become the software mainstay for computation within genome centers as well as the glue that binds them together. Although genome informatics groups are constantly tinkering with other "high level" languages such as Python, Tcl and recently Java, nothing comes close to Perl's popularity. How has Perl achieved this remarkable position?

I think several factors are responsible:

1. Perl is remarkably good for slicing, dicing, twisting, wringing, smoothing, summarizing and otherwise mangling text. Although the biological sciences do involve a good deal of numeric analysis now, most of the primary data is still text: clone names, annotations, comments, bibliographic references. Even DNA sequences are textlike. Interconverting incompatible data formats is a matter of text mangling combined with some creative guesswork. Perl's powerful regular expression matching and string manipulation operators simplify this job in a way that isn't equalled by any other modern language.

2. Perl is forgiving. Biological data is often incomplete: fields can be missing, or a field that is expected to be present once occurs several times (because, for example, an experiment was run in duplicate), or the data was entered by hand and doesn't quite fit the expected format. Perl doesn't particularly mind if a value is empty or contains odd characters. Regular expressions can be written to pick up and correct a variety of common errors in data entry. Of course this flexibility can also be a curse. I talk more about the problems with Perl below.

3. Perl is component-oriented. Perl encourages people to write their software in small modules, either using Perl library modules or with the classic Unix tool-oriented approach. External programs can easily be incorporated into a Perl script using a pipe, system call or socket. The dynamic loader introduced with Perl5 allows people to extend the Perl language with C routines or to make entire compiled libraries available for the Perl interpreter. An effort is currently under way to gather all the world's collected wisdom about biological data into a set of modules called "bioPerl" (discussed at length in an article to be published later in the Perl Journal).

4. Perl is easy to write and fast to develop in. The interpreter doesn't require you to declare all your function prototypes and data types in advance; new variables spring into existence as needed, and calls to undefined functions cause an error only when the function is actually called. The debugger works well with Emacs and allows a comfortable interactive style of development.

5. Perl is a good prototyping language. Because Perl is quick and dirty, it often makes sense to prototype new algorithms in Perl before moving them to a fast compiled language. Sometimes it turns out that Perl is fast enough that the algorithm doesn't have to be ported; more frequently one can write a small core of the algorithm in C, compile it as a dynamically loaded module or external executable, and leave the rest of the application in Perl (for an example of a complex genome mapping application implemented in this way, see http://waldo.wi.mit.edu/ftp/distribution/software/rhmapper/).

6. Perl is a good language for Web CGI scripting, and is growing in importance as more labs turn to the Web for publishing their data.
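The first two points, text mangling and tolerance of messy input, can be illustrated with a short sketch. The "key: value; key = value" record layout below is a hypothetical stand-in for hand-entered data, not a real genome center format, and parse_record() is a name invented for this example.

```perl
use strict;
use warnings;

# Parse one loosely formatted record into a hash of array references.
# Assumed (hypothetical) layout: fields separated by ";", each field
# written as "key: value" or "key = value" with arbitrary spacing.
sub parse_record {
    my ($line) = @_;
    my %record;
    for my $field (split /\s*;\s*/, $line) {
        # Accept either ":" or "=" as the separator; skip malformed fields.
        next unless $field =~ /^\s*(\w+)\s*[:=]\s*(.*?)\s*$/;
        my ($key, $value) = (lc($1), $2);   # keys are case-insensitive
        next if $value eq '';               # tolerate empty fields
        push @{ $record{$key} }, $value;    # keep repeated fields
    }
    return \%record;
}

my $rec = parse_record("Clone: yWXD123 ; LENGTH=1500; length = 1620 ; note=");
print "clone:   $rec->{clone}[0]\n";
print "lengths: @{ $rec->{length} }\n";
```

Storing every value in a list means a field that appears twice, as when an experiment was run in duplicate, is kept rather than overwritten, and the empty note field is simply skipped.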

My experience in using Perl in a genome center environment has been extremely favorable overall. However I find that Perl has its problems too. Its relaxed programming style leads to many errors that more uptight languages would catch. For example, Perl lets you use a variable before it's been assigned to, a useful feature when that's what you intend but a disaster when you've simply mistyped a variable name. Similarly, it's easy to forget to declare a variable used in a subroutine as local, inadvertently modifying a global variable.
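The mistyped-name hazard is easy to reproduce. The sketch below runs the same snippet twice through a string eval, once with Perl's strict pragma switched off and once with it on; the variable names are invented for illustration.

```perl
use strict;
use warnings;

# With strictness off, the misspelled $totl silently becomes a new,
# undefined global variable, and the result is 0 instead of 6.
my $lax = eval q{
    no strict;
    no warnings;
    my $total = 0;
    $total += $_ for (1, 2, 3);
    $totl + 0;          # typo: should be $total
};
print "lax result: $lax\n";

# With strictness on, the same typo is rejected when the code is compiled,
# so the eval returns undef and the message is left in $@.
my $strict = eval q{
    use strict;
    my $total = 0;
    $total += $_ for (1, 2, 3);
    $totl + 0;          # same typo
};
my $err = $@;
print "strict error: $err" unless defined $strict;
```

The string eval is used here only so that both variants fit in one runnable file; in an ordinary script the strict version would simply refuse to start.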

If one uses the -w switch religiously and turns on the "use strict vars" pragma, Perl will catch these problems and others. However, there are more subtle gotchas in the language that are not so easy to fix. A major one is Perl's lack of type checking. Strings, floats and integers all interchange easily. While this greatly speeds up development, it can cause major headaches. Consider a typical genome center Perl script that's responsible for recording information about short named subsequences within a larger DNA sequence. When the script was written, the data format was expected to consist of tab-delimited fields: a string followed by two integers representing the name, starting position and length of a DNA subsequence within a larger sequence. An easy way to parse this would be to split() each line into a list like this:

($name,$start_position,$length) = split("\t");  

Later on in this script some arithmetic is performed with the two integer values and the result written to a database or to standard output for further processing.

Then one day the input file's format changes without warning. Someone bumps the field count up by one by sticking a comment field between the name and the first integer. Now the unknowing script assigns a string to a variable that's expected to be numeric and silently discards the last field on the line. Rather than crashing or returning an error code, the script merrily performs integer arithmetic on a string, assuming a value of zero for the string (unless it happens to start with a digit). Although the calculation is meaningless, the output may look perfectly good, and the error may not be caught until some point well downstream in the processing.
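One way to guard against this kind of silent failure is to check the field count and the numeric fields before doing any arithmetic. The following is a minimal sketch of that idea, not the script described above; parse_subsequence() and the sample records are hypothetical.

```perl
use strict;
use warnings;

# Parse "name<TAB>start<TAB>length" and refuse anything else.
sub parse_subsequence {
    my ($line) = @_;
    chomp $line;
    # A negative limit keeps trailing empty fields, so the count is honest.
    my @fields = split /\t/, $line, -1;
    die "expected 3 fields, got " . scalar(@fields) . ": $line\n"
        unless @fields == 3;
    my ($name, $start, $length) = @fields;
    for my $value ($start, $length) {
        die "not a non-negative integer: '$value' in: $line\n"
            unless $value =~ /^\d+$/;
    }
    return ($name, $start, $length);
}

my @good = parse_subsequence("exon1\t100\t250\n");
print "end position: ", $good[1] + $good[2] - 1, "\n";

# The changed format (a comment inserted after the name) is now rejected
# instead of being silently misread.
my @bad = eval { parse_subsequence("exon1\tsome comment\t100\t250\n") };
my $error = $@;
print "rejected: $error" if $error;
```

The checks cost a few lines, but they turn a meaningless result that surfaces far downstream into an error at the point where the format changed.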

The final Perl deficiency has been the lack of a way to create graphical user interfaces. Although Unix True Believers know that anything worth doing can be done on the command line, most end-users don't agree. Windows, menus and bouncing icons have become de rigueur for programs intended for use by mere mortals.

Until recently, GUI development in Perl was awkward to impossible. However the work of Nick Ing-Simmons and associates on perlTK (pTK) has made Perl-driven GUIs possible on X-windows systems. My associates and I have written several pTK-based applications for internal use at the MIT genome center, and it's been a satisfying experience overall. Other genome centers make much more extensive use of pTK, and in some places it's become a mainstay of production.

Unfortunately, I'm sad to confess that a few months ago when I needed to put a graphical front end on a C++ image analysis program I'd written, I turned to the standard Tcl/Tk library rather than to pTK. I made this choice because I intended the application for widespread distribution. I find pTK still too unstable for export: new releases of pTK discover lurking bugs in Perl, and vice-versa. Further, I find that even seasoned system administrators run into glitches when compiling and installing Perl modules, and I worried that users would hit some sort of road block while installing either pTK or the modules needed to support my application, and would give up. In contrast, many systems have the Tcl/Tk libraries preinstalled; if they don't, installation is quick and painless.

In short, when the genome project was foundering in a sea of incompatible data formats, rapidly-changing techniques, and monolithic data analysis programs that were already antiquated on the day of their release, Perl saved the day. Although it's not perfect, Perl seems to fill the needs of the genome centers remarkably well, and is usually the first tool we turn to when we have a problem to solve.

When he's not rushing to meet a deadline, Lincoln sometimes goes out for a coffee.

 

