|
The process of passing a DNA sequence through these
independent analytic steps looks kind of like a
pipeline, and it didn't take us long to realize that a
Unix pipe could handle the job. We developed a simple
Perl-based data exchange format called "boulderio"
that allowed loosely coupled programs to add information
to a pipe-based I/O stream. Boulderio is based on
tag/value pairs. A Perl module makes it easy for
programs to reach into the input stream, pull out only
the tags they're interested in, do something with them,
and drop new tags into the output stream. Any tags that
the program isn't interested in are just passed through
to standard output so that other programs in the
pipeline can get to them.
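To make the pass-through idea concrete, here is a minimal sketch of a single pipeline stage written in plain Perl, without the Boulder module. It is not how the module itself is implemented. It assumes one tag/value pair per line with "=" between the tag and the value, and it ignores any escaping rules the real format may have. The LENGTH tag, the sub names, and the sample record are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Split one "TAG=value" line into (tag, value). Only the first "=" separates
# the two, so a value may itself contain "=" characters.
sub parse_line {
    my ($line) = @_;
    chomp $line;
    return $line =~ /^([^=]+)=(.*)$/ ? ($1, $2) : ();
}

# Copy every line from $in to $out unchanged, then append a hypothetical
# LENGTH tag computed from the SEQUENCE tag, if one was seen.
sub add_length_tag {
    my ($in, $out) = @_;
    my $length;
    while (my $line = <$in>) {
        print {$out} $line;                      # pass everything through
        my ($tag, $value) = parse_line($line);
        $length = length $value if defined $tag && $tag eq 'SEQUENCE';
    }
    print {$out} "LENGTH=$length\n" if defined $length;
}

# Demo on an in-memory record; the name and sequence are invented.
my $input = "NAME=demo.1\nSEQUENCE=GATTACA\nQUALITY_CHECK=OK\n";
open my $in, '<', \$input or die "cannot open in-memory input: $!";
add_length_tag($in, \*STDOUT);
```

In a real pipeline the stage would be called as add_length_tag(\*STDIN, \*STDOUT). The stage never needs to know what QUALITY_CHECK means. It simply copies that line along for whichever program comes next.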
Using this type of scheme, the process of analyzing a
new DNA sequence looks something like this (this is not
exactly the set of scripts that we use, but it's close
enough):
name_sequence.pl < new.dna |
quality_check.pl |
vector_check.pl |
find_repeats.pl |
search_big_database.pl |
load_lab_database.pl
A file containing the new DNA sequence is processed
by a Perl script named "name_sequence.pl",
whose only job is to give the sequence a new unique name
and to put it into boulder format. Its output looks like
this:
NAME=L26P93.2
SEQUENCE=GATTTCAGAGTCCCAGATTTCCCCCAGGGGGTTTCCAGAGAGCCC......
The output from name_sequence.pl is next passed to
the quality checking program, which looks for the
SEQUENCE tag, runs the quality checking algorithm, and
writes its conclusion to the data stream. The data
stream now looks like this:
NAME=L26P93.2
SEQUENCE=GATTTCAGAGTCCCAGATTTCCCCCAGGGGGTTTCCAGAGAGCCC......
QUALITY_CHECK=OK
Now the data stream enters the vector checker. It pulls the
SEQUENCE tag out of the stream and runs the vector
checking algorithm. The data stream now looks like this:
NAME=L26P93.2
SEQUENCE=GATTTCAGAGTCCCAGATTTCCCCCAGGGGGTTTCCAGAGAGCCC......
QUALITY_CHECK=OK
VECTOR_CHECK=OK
VECTOR_START=10
VECTOR_LENGTH=300
This continues down the pipeline, until at last the
"load_lab_database.pl"
script collates all the data collected, makes some final
conclusions about whether the sequence is suitable for
further use, and enters all the results into the
laboratory database.
One of the nice features of the boulderio format is
that multiple sequence records can be processed
sequentially in the same Unix pipeline. An "="
sign marks the end of one record and the beginning of
the next:
NAME=L26P93.2
SEQUENCE=GATTTCAGAGTCCCAGATTTCCCCCAGGGGGTTTCCAGAGAGCCC......
=
NAME=L26P93.3
SEQUENCE=CCCCTAGAGAGAGAGAGCCGAGTTCAAAGTCAAAACCCATTCTCTCTCCTC...
=
There's also a way to create subrecords within records,
allowing for structured data types.
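The record marker is easy to handle by hand, too. The following is a rough sketch in plain Perl, not the Boulder module's own code, of reading records that end with a lone "=" line, adding a tag to each one, and writing them back out. It leaves subrecords out entirely. The sub names next_record and emit_record, the GC_COUNT tag, and the sample records are all invented for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Read one record from $in as an ordered list of [tag, value] pairs. A record
# ends at a line holding a lone "=" or at end of input. Returns undef once
# the input is exhausted.
sub next_record {
    my ($in) = @_;
    my ($saw_line, @pairs);
    while (defined(my $line = <$in>)) {
        $saw_line = 1;
        chomp $line;
        last if $line eq '=';                    # end-of-record marker
        push @pairs, [$1, $2] if $line =~ /^([^=]+)=(.*)$/;
    }
    return $saw_line ? \@pairs : undef;
}

# Write a record back out in its original tag order, followed by the marker.
sub emit_record {
    my ($out, $pairs) = @_;
    print {$out} "$_->[0]=$_->[1]\n" for @$pairs;
    print {$out} "=\n";
}

# Demo: two made-up records held in memory rather than read from a pipe.
my $input = <<'END';
NAME=demo.1
SEQUENCE=GATTACA
=
NAME=demo.2
SEQUENCE=CCGGAT
=
END
open my $in, '<', \$input or die "cannot open in-memory input: $!";
while (my $record = next_record($in)) {
    my ($seq) = map { $_->[1] } grep { $_->[0] eq 'SEQUENCE' } @$record;
    if (defined $seq) {
        my $gc = ($seq =~ tr/GC//);              # count G and C bases
        push @$record, ['GC_COUNT', $gc];        # GC_COUNT is a made-up tag
    }
    emit_record(\*STDOUT, $record);
}
```

Keeping each record as an ordered list of pairs, rather than a hash, means that tags come out in the order they went in and that a repeated tag is not silently overwritten.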
Here's an example of a script that processes
boulderio format. It uses an object-oriented style, in
which records are pulled out of the input stream,
modified, and dropped back in:
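use Boulder::Stream;
# With no arguments, the stream reads standard input and writes standard output.
my $stream = Boulder::Stream->new;
# Pull the NAME and SEQUENCE tags out of each record in turn.
while (my $record = $stream->read_record('NAME','SEQUENCE')) {
    my $name = $record->get('NAME');
    my $sequence = $record->get('SEQUENCE');
    # ...[continue processing]...
    $record->add(QUALITY_CHECK=>"OK");
    $stream->write_record($record);
}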
(If you're interested, more information about the boulderio
format and the Perl libraries to manipulate it can be
found at http://stein.cshl.org/software/boulder/).
The interesting thing is that multiple informatics
groups independently converged on solutions that were
similar to the boulderio idea. For example, several
groups involved in the worm sequencing project began
using a data exchange format called ".ace".
Although this format was initially designed as the data
dump and reload format for the ACE database (a database
specialized for biological data), it happens to use a
tag/value format that's very similar to boulderio. Soon
.ace files were being processed by Perl script pipelines
and loaded into the ACE database at the very last step.
Perl found uses in other aspects of laboratory
management. For example, many centers, including my own,
use Web-based interfaces for displaying the status of
projects and allowing researchers to take actions. Perl
scripts are the perfect engine for Web CGI scripts.
Similarly, Perl scripts run e-mail database query
servers, supervise cron jobs, prepare nightly reports
summarizing laboratory activity, create instruction
files to control robots, and handle almost every other
information management task that a busy genome center
needs.
So as far as laboratory management went, the
informatics cores were reasonably successful. As far as
developing and distributing generally useful software went,
however, things were not so rosy.
The problem will be familiar to anyone who has worked
in a large, loosely organized software project. Despite
best intentions, the project begins to drift.
Programmers go off to work on ideas that interest them,
modules that need to interface with one another are
designed independently, and the same problems get solved
several times in different, mutually incompatible ways.
When the time comes to put all the parts together,
nothing works.
This is what happened in the genome project. Despite
the fact that everyone was working on the same problems,
no two groups took exactly the same approach. Programs
to solve a given problem were written and rewritten
multiple times. While a given piece of software wasn't
guaranteed to work better than its counterpart developed
elsewhere, you could always count on it to sport its own
idiosyncratic user interface and data format. A typical
example is the central algorithm that assembles
thousands of short DNA reads into an ordered set of
overlaps. At last count there were at least six
different programs in widespread use, and no two of them
use the same data input or output formats.
This lack of interchangeability presents a terrible
dilemma for the genome centers. Without
interchangeability, an informatics group is locked into
using the software that it developed in-house. If
another genome center has come up with a better software
tool to attack the same problem, the first center must
make a tremendous effort to retool its system in order to
use that tool.
The long-range solution to this problem is to come up
with uniform data interchange standards that genome
software must adhere to. This would allow common modules
to be swapped in and out easily. However, standards
require time to agree on, and while the various groups
are involved in discussion and negotiation, there is
still an urgent need to adapt existing software to the
immediate needs of the genome centers.
|