Teaching machines to do science beyond comprehension

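[Editor's note] Biological data, especially genomic data, has been growing exponentially. What kind of power can handle such enormous volumes of data? Can supercomputers? IBM designed Blue Gene with the aim of strengthening its strategic investment in biology. But perhaps the real way forward does not lie in raw computing power: applying mathematical methods such as SVM and ANN algorithms, and researching new algorithms, may be the more meaningful pursuit.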
by Philippa Maister

Peter J. Farley, a physician with a long and intriguing history in
the biotechnology industry, has made a career out of backing powerful
ideas. Now he's putting his money and effort into a new venture, one he
predicts will be the next revolution in biology.

IBM may be pursuing a $100 million initiative to build Blue
Gene, a supercomputer 500 times more powerful than today's
most powerful model. But Farley contends that the future of biology lies
not in supercomputers but in pure mathematics. The only way to
comprehend the immense masses of data arising from genomics and
proteomics, he believes, is to create algorithms that teach machines to
learn how to analyze the data for us.

Farley's company, BIOwulf
Genomics of Savannah, Georgia, has put together a team of
skilled mathematicians and immunologists. Many of the mathematicians are
students of Vladimir
Vapnik, a renowned expert in machine learning. He is a
technology leader at Bell Labs in New Jersey and a professor at London
University's Royal Holloway College. Vapnik also serves as cochair of
BIOwulf's scientific advisory board.

"I am accustomed to the
care and feeding of geniuses," Farley boasts. He bases this claim
on his experience as the founder of Cetus Corporation, which was
acquired by Chiron
Corporation of Emeryville, California, in 1991. Farley served
as CEO of Cetus from 1971 to 1977 and its president from 1977 to 1983.
In 1979, Cetus hired a young chemist named Kary Mullis, who in 1983 -
the year Farley left Cetus - came up with the concept of polymerase
chain reaction (PCR) and went on to win the 1993
Nobel Prize in chemistry.

Now, Farley's new firm intends
its resident mathematicians to capitalize on the power of machine
learning, and more specifically on mathematical techniques called
support vector machines (SVMs). The immediate goal is to identify genes
associated with disease.

Vapnik developed the concept of
SVMs, and wrote the standard treatise on the subject, Statistical
Learning Theory.

SVMs are a new-generation learning system
based on recent advances in statistical learning theory. They are valued
for their ability to classify data and to identify incorrect
classifications - qualities that are ideal for working with gene
expression datasets that contain measurements for thousands of genes.
SVMs fall within the broader category of machine learning, a technique
for developing algorithms that can learn to identify relationships
within data.

In an interview, Vapnik
emphasized that BIOwulf is not the only company or institution using
SVMs to identify and classify genes linked to diseases. However, he
believes it is ahead of the game because it uses state-of-the-art models
for generalization and has hired some of the best mathematicians in the
field from all over the world.

For Farley, SVMs make a lot
more sense than supercomputers in biology. "The amount of data that
will be generated by genomics is so far beyond human comprehension that
you don't solve it with hardware. You solve it with the new
mathematics," Farley told a symposium audience in Atlanta recently.
"If you start calculating the permutations of all the data and the
complexity of their interactions, it gets unbelievably complex."

Farley also contends that
mathematics is able to make sense of data that otherwise registers as
junk or background noise. "With machine learning approaches, no
data is lost," he said. "For us, nothing is noise."

According to Farley, BIOwulf's
mathematicians have already demonstrated the power of the method by
distinguishing Stage 1 prostate cancer from Stage 4 - and Stage 3 from
Stage 4. "We did this on a reasonably fast computer in about three
minutes," he said. Farley claims the team has also been able to
identify seven previously unknown genes associated with colon cancer
(each of them a potential therapeutic target) and to discover that
patients who have been infected with trypanosomiasis are resistant to
colon cancer - "an obvious target for a vaccine," he said.

Another research group is using
SVM techniques to sort clinical information on microarrays. Terrence
S. Furey at the University of California at Santa Cruz, and
colleagues at the University of Bristol in the UK and the University of
Washington in Seattle used the method to classify ovarian cancer
tissues, normal ovarian tissues, and other normal tissues based on
microarray gene expression data. The team
was also able to show that SVMs can identify mislabeled data on
microarrays.

SVMs can teach individual
computers to grapple with the complexities of masses of genomic data.
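
To make the idea concrete, here is a minimal sketch of how a linear SVM might be trained on expression data and then used to flag samples whose labels look suspect. It is an illustration only, not the method used by BIOwulf or by the Furey group: the two-gene toy dataset, the function names `train_linear_svm` and `decision`, and the settings `lam`, `eta`, and `epochs` are all assumptions made for this example, and real microarray data would carry thousands of genes per sample.

```python
def decision(w, b, x):
    """Signed score: positive suggests class +1 (tumor), negative class -1 (normal)."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b


def train_linear_svm(X, y, lam=0.01, eta=0.01, epochs=2000):
    """Batch subgradient descent on the L2-regularized hinge loss (assumed settings)."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w = [lam * wj for wj in w]  # gradient of the regularization term
        grad_b = 0.0
        for xi, yi in zip(X, y):
            # Only samples inside the margin (or misclassified) contribute.
            if yi * decision(w, b, xi) < 1:
                for j in range(d):
                    grad_w[j] -= yi * xi[j] / n
                grad_b -= yi / n
        w = [wj - eta * gj for wj, gj in zip(w, grad_w)]
        b -= eta * grad_b
    return w, b


# Hypothetical expression levels for two genes (gene A, gene B) per sample.
X = [
    (2.0, 0.5), (3.0, 0.0), (4.0, 0.5), (3.0, 1.5),  # tumor samples
    (0.5, 2.0), (0.0, 3.0), (0.5, 4.0), (1.5, 3.0),  # normal samples
    (3.0, 0.6),                                      # tumor-like, but labeled normal
]
y = [1, 1, 1, 1, -1, -1, -1, -1, -1]

w, b = train_linear_svm(X, y)

# A sample whose score disagrees in sign with its label is a candidate mislabel.
flagged = [i for i, (xi, yi) in enumerate(zip(X, y)) if yi * decision(w, b, xi) < 0]
print("weights:", w, "bias:", b)
print("possibly mislabeled sample indices:", flagged)
```

On this toy data the only sample flagged should be the last one, which sits among the tumor samples but carries a "normal" label. Real work would normally rely on an established SVM library and on held-out data rather than on training scores alone.
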
Another method of machine learning, artificial neural networks (ANNs),
links simple processing units in an approach that mimics how neurons
function in the brain. ANNs derive their power from their ability to be
"trained" to learn by example. They recognize patterns and are
used to solve complex classification problems, such as voice recognition
and fingerprint identification.

In a study described as a
first, scientists at the US National
Institutes of Health and Lund University in Sweden used ANN
to distinguish between microarray expression patterns for four types of
pediatric cancers: neuroblastoma, rhabdomyosarcoma, non-Hodgkin's
lymphoma, and Ewing's sarcoma. They did this on ordinary desktop
computers. The study
appeared earlier this month in Nature Medicine.

"Gene expression data is
just numbers, and the computer software does not need to know more than
that to recognize very complex patterns," said lead author Javed
Khan, a pediatric oncologist in the Cancer
Genetics Branch of the National
Human Genome Research Institute (NHGRI), in an interview.

Using ANN and a blinded
experimental design, the team classified tumor biopsy materials and cell
lines from the four types of cancer with 100 percent accuracy, according
to their gene expression signatures. The finding opens the way for the
method to be used eventually for diagnostic purposes, Khan said.

Khan noted that the four
cancers studied are particularly difficult to differentiate. The study
started with cDNA microarrays containing 6,567 genes - all that were
available when the study began in 1997, according to Khan. By its
conclusion, the team had identified 61 genes specifically expressed in
one of the cancers, 41 of which have never before been associated with
the disease.

"We also identified, in
ranked order, the genes that contributed to this classification, and we
were able to define a minimal set that can correctly classify our
samples into their diagnostic categories," he noted. "If it is
easy to do in children's cancers, it can be done in adult cancers
too."

The next step, Khan said, is to
determine whether the cancers depend on the genes the team has
identified, and whether those genes can be inhibited.

The ultimate goal is to assess
whether, for a particular patient, a treatment will work. Stage 1
neuroblastoma, for example, has a 90 percent cure rate. Stage 4 is
currently incurable. "Can we find out why these genes are different
and can we predict which patients will do well on a treatment?"
Khan said.

In an interview, Vapnik argued
that the SVM method has advantages over ANNs. In particular, SVMs are
able to make accurate findings based on a small amount of data.
"The smaller the number of examples, the better," Vapnik said.
"The relationship is strong. A couple of examples are sufficient to
know what is going on." In addition, SVM is able to generalize from
limited data.

Although Farley disparages
supercomputers, Javed Khan of the NHGRI disagrees. "Mathematical
modeling is very important, but there are certain things that
supercomputers can do well, like looking at gene interactions such as
how gene A affects gene B but not gene C," Khan noted.

Perhaps not surprisingly, the
senior manager of the Computational
Biology Center at IBM Research, Joe Jasinski, defends the
value of supercomputers in biology. He noted that IBM's Blue Gene will
be able to perform more than one quadrillion operations per second (one
petaflop). By comparison, the two fastest computers today perform only
two trillion operations (two teraflops) per second. Designing a computer
of this scale presented a formidable challenge in itself, given its size
and electrical and cooling requirements. Blue Gene's first use will be
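
The scale Jasinski describes can be checked with simple arithmetic. The short calculation below uses only the two figures quoted above; the variable names are illustrative.

```python
PETAFLOP = 10**15  # one quadrillion operations per second
TERAFLOP = 10**12  # one trillion operations per second

blue_gene_ops = 1 * PETAFLOP      # Blue Gene's stated target
fastest_today_ops = 2 * TERAFLOP  # figure quoted for today's fastest machines

# Ratio of the target to the quoted present-day figure.
speedup = blue_gene_ops // fastest_today_ops
print(f"Blue Gene target is {speedup}x the quoted figure for today's fastest machines")
```

The ratio of 500 matches the earlier description of Blue Gene as a supercomputer 500 times more powerful than today's most powerful model.
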
to study not disease diagnosis but protein folding. The goal is to
simulate the molecular processes by which protein folding occurs, the
interaction of proteins, and the bonding of small molecules. To do this
requires massive computing power, Jasinski said. The five-year Blue Gene
project is still three years from completion.

However, IBM is also putting
its muscle behind disease diagnosis by backing NuTec
Sciences, a young Atlanta-based company that has assembled a
supercluster of 1,250 IBM eServer systems with a processing capacity of
7.5 trillion calculations per second. The system is said to be the
fastest outside a government agency and one of the most powerful in the
world.

NuTec has formed a partnership
with the Winship
Cancer Institute at Emory University in Atlanta to develop an
integrated information system to identify genes and gene combinations
that cause cancer in an individual patient.

NuTec's system will compare a
patient's genetic fingerprint to thousands of different genetic profiles
from various public and private databases. It will run algorithms to
analyze gene combinations associated with cancer and determine the most
effective treatments available for the individual patient. The system,
which will accelerate diagnosis, is to be up and running by the end of
the year.

Will there ultimately be a
partnership between the brute force of supercomputers and the intricate
workings of mathematical brainpower? Or will it remain a contest? Only
the future will tell.

1999-2005 Bioinformatics Center, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences