Reprinted courtesy of the Perl Journal, http://www.tpj.com/
Lincoln Stein's website is
http://stein.cshl.org/
DATE: Early February, 1996
LOCATION: Cambridge, England, in the conference room of the largest
DNA sequencing center in Europe.
OCCASION: A high-level meeting between the computer scientists of this
center and the largest DNA sequencing center in the United States.
THE PROBLEM: Although the two centers use almost identical laboratory
techniques, almost identical databases, and almost identical data analysis
tools, they still can't interchange data or meaningfully compare results.
THE SOLUTION: Perl.
The human genome project was inaugurated at the beginning of the decade as an
ambitious international effort to determine the complete DNA sequence of human
beings and several experimental animals. The justification for this undertaking
is both scientific and medical. By understanding the genetic makeup of an
organism in excruciating detail, it is hoped that we will be better able to
understand how organisms develop from single eggs into complex multicellular
beings, how food is metabolized and transformed into the constituents of the
body, and how the nervous system assembles itself into a smoothly functioning
ensemble. From the medical point of view, the wealth of knowledge that will come
from knowing the complete DNA sequence will greatly accelerate the process of
finding the causes of (and potential cures for) human diseases.
Six years after its birth, the genome project is ahead of schedule. Detailed
maps of the human and all the experimental animals have been completed (mapping
out the DNA using a series of landmarks is an obligatory first step before
determining the complete DNA sequence). The sequence of the smallest model
organism, yeast, is nearly completed, and the sequence of the next smallest, a
tiny soil-dwelling worm, isn't far behind. Large-scale sequencing efforts for
human DNA started at several centers a number of months ago and will be in full
swing within the year.
The scale of the human DNA sequencing project is enough to send your average
Unix system administrator running for cover. From the information-handling point
of view, DNA is a very long string consisting of the four letters G, A, T and C
(the letters are abbreviations for the four chemical units that form the "rungs"
of the DNA double helix ladder). The goal of the project is to determine the
order of letters in the string. The size of the string is impressive but not
particularly mind-boggling: 3 x 10^9 letters long, or some 3 gigabytes of
storage space if you use 1 byte to store each letter and don't use any
compression techniques.
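
The storage figure above is easy to check. The sketch below repeats the one-byte-per-letter arithmetic and, as an assumption of this example rather than anything the project is said to do, also shows what a fixed two-bit code for the four letters would need. The function name storage_bytes is made up for illustration, and the two-bit figure ignores ambiguous letters.

```python
# Back-of-envelope storage for the finished sequence.
GENOME_LENGTH = 3 * 10**9  # letters, the figure quoted in the text

def storage_bytes(n_letters: int, bits_per_letter: int) -> int:
    """Bytes needed to hold n_letters at the given width, rounded up."""
    return (n_letters * bits_per_letter + 7) // 8

# One byte per letter, as in the text.
one_byte_each = storage_bytes(GENOME_LENGTH, 8)

# Assumption: G, A, T and C packed into two bits each, no ambiguity codes.
two_bits_each = storage_bytes(GENOME_LENGTH, 2)

print(one_byte_each / 10**9, "GB at one byte per letter")
print(two_bits_each / 10**6, "MB at two bits per letter")
```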
Three gigabytes is substantial but certainly manageable by today's standards.
Unfortunately, this is just the storage space required to store the finished
data. The storage requirements for the experimental data needed to determine
this sequence are far greater. The essential problem is that DNA sequencing
technology is currently limited to reading stretches of at most 500 contiguous
letters. In order to determine sequences longer than that, the DNA must be
sequenced as small overlapping fragments called "reads" and the jigsaw puzzle
reassembled by algorithms that look for areas where the sequences match. Because
the DNA sequence is nonrandom (similar but not-entirely-identical motifs appear
many times throughout the genome), and because DNA sequencing technology is
noisy and error-prone, one ends up having to sequence each region of DNA five to
ten times in order to reliably assemble the reads into the true sequence. This
increases the amount of data to manage by an order of magnitude. On top of
this is all the associated information that goes along with laboratory work: who
performed the experiment, when it was performed, which section of the genome was
sequenced, the identity and version of the software used to assemble the
sequence, any comments someone wants to attach to the experiment, and so forth.
In addition, one generally wants to store the raw output from the machine that
performs the sequencing itself. Each 500 letters of sequence generates a data
file that's 20-30 kilobytes long!
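
To make the jigsaw idea concrete, here is a toy sketch of joining two reads where the end of one matches the start of the other. It is not the algorithm any sequencing center used: it demands an exact match, whereas real assembly software has to tolerate sequencing errors and repeated motifs. The function names, the min_overlap parameter and the two example reads are all invented for illustration.

```python
def overlap_length(left: str, right: str, min_overlap: int = 3) -> int:
    """Length of the longest suffix of `left` that equals a prefix of `right`.

    Returns 0 if no overlap of at least min_overlap letters exists.
    """
    longest = min(len(left), len(right))
    # Try the longest candidate first so the first hit is the best one.
    for size in range(longest, min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0

def merge_reads(left: str, right: str, min_overlap: int = 3) -> str:
    """Join two reads on their overlap, or raise if they do not overlap."""
    size = overlap_length(left, right, min_overlap)
    if size == 0:
        raise ValueError("reads do not overlap by at least min_overlap letters")
    return left + right[size:]

# Two made-up reads that share the five letters CAGGC.
read_a = "GATTACAGGC"
read_b = "CAGGCTTAGA"
contig = merge_reads(read_a, read_b)
print(contig)
```

Requiring a minimum overlap is a crude guard against chance matches. With noisy data and near-identical motifs it is not enough on its own, which is why each region has to be sequenced several times.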
That's not the whole of it. It's not enough just to determine the sequence of
the DNA. Within the sequence are functional areas scattered among long stretches
of nonfunctional areas. There are genes, control regions, structural regions,
and even a few viruses that got entangled in human DNA long ago and persist as
fossilized remnants. Because the genes and control regions are responsible for
health and disease, one wants to identify and mark them as the DNA sequence is
assembled. This type of annotation generates yet more data that needs to be
tracked and stored.
Altogether, people estimate that some 1 to 10 terabytes of information will
need to be stored in order to see the human genome project to its conclusion.
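
The figures quoted earlier give a rough feel for where an estimate of this size comes from. The sketch below is a back-of-envelope calculation, not the derivation behind the published estimate. It counts only the raw sequencer output, treats every read as a full 500 letters (the text gives 500 as a maximum, so the real read count would be higher), and uses decimal kilobytes and terabytes. The function name raw_trace_bytes is hypothetical.

```python
GENOME_LENGTH = 3 * 10**9   # letters
READ_LENGTH = 500           # letters per read, taken as the best case

def raw_trace_bytes(coverage: int, kilobytes_per_read: int) -> int:
    """Bytes of raw sequencer output for a given fold coverage."""
    reads = GENOME_LENGTH * coverage // READ_LENGTH
    return reads * kilobytes_per_read * 1000

# Five-fold coverage at 20 KB per read, ten-fold at 30 KB per read.
low = raw_trace_bytes(5, 20)
high = raw_trace_bytes(10, 30)

print(f"{low / 10**12:.1f} to {high / 10**12:.1f} TB of raw sequencer output")
```

Under these assumptions the raw output alone comes to roughly 0.6 to 1.8 terabytes, before any laboratory records or annotation are added.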
So what's Perl got to do with it? From the beginning researchers realized
that informatics would have to play a large role in the genome project. An
informatics core formed an integral part of every genome center that was
created. The mission of these cores was two-fold: to provide computer support
and databasing services for their affiliated laboratories, and to develop data
analysis and management software for use by the genome community as a whole.
It's fair to say that the initial results of the informatics groups' efforts
were mixed. Things were slightly better on the laboratory management side of the
coin. Some groups attempted to build large monolithic systems on top of complex
relational databases; they were thwarted time and again by the highly dynamic
nature of biological research. By the time a system that could deal with the ins
and outs of a complex laboratory protocol had been designed, implemented and
debugged, the protocol had been superseded by new technology and the software
engineers had to go back to the drawing board.
Most groups, however, learned to build modular, loosely-coupled systems whose
parts could be swapped in and out without retooling the whole system. In my
group, for example, we discovered that many data analysis tasks involve a
sequence of semi-independent steps. Consider the steps that one may want to
perform on a bit of DNA that has just been sequenced (Figure 1). First there's a
basic quality check on the sequence: is it long enough? Is the number of
ambiguous letters below the maximum limit? Then there's the "vector check". For
technical reasons, the human DNA must be passed through a bacterium before it
can be sequenced (this is the process of "cloning"). Not infrequently, the human
DNA gets lost somewhere in the process and the sequence that's read consists
entirely of the bacterial vector. The vector check ensures that only human DNA
gets into the database. Next there's a check for repetitive sequences. Human DNA
is full of repetitive elements that make fitting the sequencing jigsaw puzzle
together challenging. The repetitive sequence check tries to match the new
sequence against a library of known repetitive elements. A penultimate step is
to attempt to match the new sequence against other sequences in a large
community database of DNA sequences. Often a match at this point will provide a
clue to the function of the new DNA sequence. After performing all these checks,
the sequence along with the information that's been gathered about it along the
way is loaded into the local laboratory database.
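
The steps above can be sketched as a list of small, independent functions that each examine one record. This is only an illustration of the loosely-coupled idea, not a reconstruction of the author's software. The thresholds, the reference sequences and every function name are invented, and plain substring tests stand in for the real vector, repeat and database comparisons.

```python
# Hypothetical reference data; real checks would use curated libraries.
VECTOR = "GGATCCGAATTCAAGCTTGGATCCGAATTC"
REPEATS = {"toy_repeat": "ACACACAC"}
COMMUNITY_DB = {"known_gene_fragment": "GATTACAGATTACA"}
MIN_LENGTH = 12     # assumed threshold
MAX_AMBIGUOUS = 2   # assumed threshold; "N" marks an ambiguous letter

# Each step takes a record (a dict) and returns a rejection reason or None.

def quality_check(rec):
    seq = rec["sequence"]
    if len(seq) < MIN_LENGTH:
        return "too short"
    if seq.count("N") > MAX_AMBIGUOUS:
        return "too many ambiguous letters"
    return None

def vector_check(rec):
    # Reject reads that consist entirely of vector sequence.
    return "vector only" if rec["sequence"] in VECTOR else None

def repeat_check(rec):
    # Annotate rather than reject: note which known repeats appear.
    rec["repeats"] = [n for n, motif in REPEATS.items() if motif in rec["sequence"]]
    return None

def database_search(rec):
    # Annotate with any community database entries found in the read.
    rec["matches"] = [n for n, s in COMMUNITY_DB.items() if s in rec["sequence"]]
    return None

PIPELINE = [quality_check, vector_check, repeat_check, database_search]

def process(sequence, database, steps=PIPELINE):
    """Run each step in order; load the record only if none rejects it."""
    rec = {"sequence": sequence}
    for step in steps:
        reason = step(rec)
        if reason is not None:
            rec["rejected"] = f"{step.__name__}: {reason}"
            return rec
    database.append(rec)  # stands in for loading the laboratory database
    return rec

lab_db = []
accepted = process("TTGATTACAGATTACACACACACGT", lab_db)
rejected = process("GAATTCAAGCTTGG", lab_db)
print(accepted)
print(rejected)
```

Because the steps share nothing but the record they are handed, one can be replaced or reordered by editing the PIPELINE list without touching the others.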