Teaching machines to do science beyond comprehension

 

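[Editor's note]

Biological data, especially genomic data, has been growing exponentially. What kind of power can handle such massive volumes of data? Can supercomputers? IBM designed Blue Gene with the aim of strengthening its strategic investment in biology. But perhaps the real way forward lies not in raw computing power. Applying mathematical methods such as SVMs and ANNs, and researching new algorithms, may be the more meaningful pursuit.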

 

by Philippa Maister

Peter J. Farley, a physician with a long and intriguing history in the biotechnology industry, has made a career out of backing powerful ideas. Now he's putting his money and effort into a new venture, one he predicts will be the next revolution in biology.

IBM may be pursuing development of a $100 million initiative to build Blue Gene, a supercomputer 500 times more powerful than today's most powerful model. But Farley contends that the future of biology lies not in supercomputers but in pure mathematics. The only way to comprehend the immense masses of data arising from genomics and proteomics, he believes, is to create algorithms that teach machines to learn how to analyze the data for us.

Farley's company, BIOwulf Genomics of Savannah, Georgia, has put together a team of skilled mathematicians and immunologists. Many of the mathematicians are students of Vladimir Vapnik, a renowned expert in machine learning who is a technology leader at Bell Labs in New Jersey and a professor at London University's Royal Holloway College. Vapnik also serves as cochair of BIOwulf's scientific advisory board.

"I am accustomed to the care and feeding of geniuses," Farley boasts. He bases this claim on his experience as the founder of Cetus Corporation, which was acquired by Chiron Corporation of Emeryville, California, in 1991. Farley served as CEO of Cetus from 1971 to 1977 and its president from 1977 to 1983. In 1979, Cetus hired a young chemist named Kary Mullis, who in 1983 - the year Farley left Cetus - came up with the concept of polymerase chain reaction (PCR) and went on to win the 1993 Nobel Prize in chemistry.

Now, Farley's new firm intends its resident mathematicians to capitalize on the power of machine learning, and more specifically on mathematical techniques called support vector machines (SVMs). The immediate goal is to identify genes associated with disease.

Vapnik developed the concept of SVMs, and wrote the standard treatise on the subject, Statistical Learning Theory. SVMs are a new-generation learning system based on recent advances in statistical learning theory. They are valued for their ability to classify data and to identify incorrect classifications - qualities that are ideal for working with gene expression datasets that contain measurements for thousands of genes. SVMs fall within the broader category of machine learning, a technique for developing algorithms that can learn to identify relationships within data.
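
As a rough illustration of the kind of classifier described above, the sketch below trains a linear SVM on a handful of made-up two-gene expression profiles. It is a minimal sketch, not BIOwulf's method: the data, the function names, and the parameter values (`lam`, `lr`, `epochs`) are all assumptions chosen for illustration, and training uses plain subgradient descent on the hinge loss rather than the quadratic-programming solvers commonly used for SVMs.

```python
# Toy linear SVM trained by subgradient descent on the regularized hinge loss.
# All data and parameter values are invented for illustration.

def decision(w, b, x):
    """Signed score of sample x relative to the separating hyperplane."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def train_linear_svm(samples, labels, lam=0.01, lr=0.1, epochs=200):
    """Return (weights, bias) for labels in {-1, +1}."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            margin = y * decision(w, b, x)
            # L2 regularization shrinks the weights a little on every step.
            w = [wi * (1.0 - lr * lam) for wi in w]
            if margin < 1.0:
                # Sample is misclassified or inside the margin: move the boundary.
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b

def predict(w, b, x):
    return 1 if decision(w, b, x) >= 0 else -1

# Hypothetical log-ratio expression values for two genes per sample.
samples = [[1.2, 0.1], [0.9, -0.3], [1.5, 0.4],     # labeled tumor (+1)
           [-1.1, 0.2], [-0.8, -0.1], [-1.4, 0.3]]  # labeled normal (-1)
labels = [1, 1, 1, -1, -1, -1]

w, b = train_linear_svm(samples, labels)
predictions = [predict(w, b, x) for x in samples]
print(predictions)
```

Real expression datasets have thousands of genes per sample rather than two, but the same `decision` function applies unchanged: each gene simply contributes one more weighted term to the score.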

In an interview, Vapnik emphasized that BIOwulf is not the only company or institution using SVMs to identify and classify genes linked to diseases. However, he believes it is ahead of the game because it uses state-of-the-art models for generalization and has hired some of the best mathematicians in the field from all over the world.

For Farley, SVMs make a lot more sense than supercomputers in biology. "The amount of data that will be generated by genomics is so far beyond human comprehension that you don't solve it with hardware. You solve it with the new mathematics," Farley told a symposium audience in Atlanta recently. "If you start calculating the permutations of all the data and the complexity of their interactions, it gets unbelievably complex."

Farley also contends that mathematics is able to make sense of data that otherwise registers as junk or background noise. "With machine learning approaches, no data is lost," he said. "For us, nothing is noise."

According to Farley, BIOwulf's mathematicians have already demonstrated the power of the method by distinguishing Stage 1 prostate cancer from Stage 4 - and Stage 3 from Stage 4. "We did this on a reasonably fast computer in about three minutes," he said. Farley claims the team has also been able to identify seven previously unknown genes associated with colon cancer (each of them a potential therapeutic target) and to discover that patients who have been infected with trypanosomiasis are resistant to colon cancer - "an obvious target for a vaccine," he said.

Another research group is using SVM techniques to sort clinical information on microarrays. Terrence S. Furey at the University of California at Santa Cruz, and colleagues at the University of Bristol in the UK and the University of Washington in Seattle used the method to classify ovarian cancer tissues, normal ovarian tissues, and other normal tissues based on microarray gene expression data. The team was also able to show that SVMs can identify mislabeled data on microarrays.
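
One way to picture how a classifier can flag mislabeled samples is a hold-one-out check: train on every sample except one, then see whether the held-out sample's label agrees with the prediction. The sketch below is an assumption-laden illustration, not the procedure the researchers used. To stay short it swaps in a nearest-centroid classifier in place of an SVM, and the data, labels, and function names are hypothetical.

```python
import math

def centroid(rows):
    """Mean expression profile of a list of samples."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def distance(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def suspect_labels(samples, labels):
    """Return indices whose label disagrees with a classifier built without them."""
    suspects = []
    for i, (x, y) in enumerate(zip(samples, labels)):
        # Hold sample i out and rebuild each class centroid from the rest.
        rest = [(s, l) for j, (s, l) in enumerate(zip(samples, labels)) if j != i]
        centroids = {}
        for cls in set(l for _, l in rest):
            centroids[cls] = centroid([s for s, l in rest if l == cls])
        nearest = min(centroids, key=lambda cls: distance(x, centroids[cls]))
        if nearest != y:
            suspects.append(i)
    return suspects

# Hypothetical two-gene profiles; index 3 is deliberately given the wrong label.
samples = [[1.2, 0.1], [0.9, -0.3], [1.5, 0.4], [1.3, 0.2],
           [-1.1, 0.2], [-0.8, -0.1], [-1.4, 0.3]]
labels = ["tumor", "tumor", "tumor", "normal",
          "normal", "normal", "normal"]

print(suspect_labels(samples, labels))  # [3]
```

A flagged sample is only a candidate for review. It may be mislabeled, or it may be a genuinely unusual tissue.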

SVMs can teach individual computers to grapple with the complexities of masses of genomic data. Another method of machine learning, the artificial neural network (ANN), links many simple processing units in an approach that mimics how neurons function in the brain. ANNs derive their power from their ability to be "trained" to learn by example. They recognize patterns and are used to solve complex classification problems, such as voice recognition and fingerprint identification.

In a study described as a first, scientists at the US National Institutes of Health and Lund University in Sweden used ANNs to distinguish between microarray expression patterns for four types of pediatric cancers: neuroblastoma, rhabdomyosarcoma, non-Hodgkin's lymphoma, and Ewing's sarcoma. They did this on ordinary desktop computers. The study appeared earlier this month in Nature Medicine.

"Gene expression data is just numbers, and the computer software does not need to know more than that to recognize very complex patterns," said lead author Javed Khan, a pediatric oncologist in the Cancer Genetics Branch of the National Human Genome Research Institute (NHGRI), in an interview.

Using ANNs and a blinded experimental design, the team classified tumor biopsy materials and cell lines from the four types of cancer with 100 percent accuracy, according to their gene expression signatures. The finding opens the way for the method to be used eventually for diagnostic purposes, Khan said.

Khan noted that the four cancers studied are particularly difficult to differentiate. The study started with cDNA microarrays containing 6,567 genes - all that were available when the study began in 1997, according to Khan. By its conclusion, the team had identified 61 genes specifically expressed in one of the cancers, 41 of which have never before been associated with the disease.

"We also identified, in ranked order, the genes that contributed to this classification, and we were able to define a minimal set that can correctly classify our samples into their diagnostic categories," he noted. "If it is easy to do in children's cancers, it can be done in adult cancers too."

The next step, Khan said, is to determine whether the cancers depend on the genes the team has identified, and whether those genes can be inhibited.
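
The ranking Khan describes, ordering genes by how much they contribute to the classification and keeping a minimal set, can be sketched in a few lines. The article does not say how the team computed its ranking, so the example below substitutes a simple signal-to-noise score (difference of class means divided by the sum of class standard deviations). The gene names, class labels, and expression values are all hypothetical.

```python
import statistics

def rank_genes(expression, labels, positive):
    """Rank genes by |difference of class means| / (sum of class std devs)."""
    scores = {}
    for gene, values in expression.items():
        in_class = [v for v, l in zip(values, labels) if l == positive]
        others = [v for v, l in zip(values, labels) if l != positive]
        spread = statistics.stdev(in_class) + statistics.stdev(others)
        gap = abs(statistics.mean(in_class) - statistics.mean(others))
        scores[gene] = gap / spread
    # Highest score first: the genes that best separate the two classes.
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical expression values for three genes across six samples.
labels = ["A", "A", "A", "B", "B", "B"]
expression = {
    "gene_a": [2.0, 2.2, 1.8, 0.1, 0.3, -0.1],   # clearly higher in class A
    "gene_b": [1.0, 0.2, 1.5, 0.5, 1.2, 0.1],    # overlaps heavily between classes
    "gene_c": [-1.0, -1.4, -0.9, 0.4, 0.8, 0.3], # clearly lower in class A
}

ranked = rank_genes(expression, labels, positive="A")
print(ranked)      # ['gene_a', 'gene_c', 'gene_b']
print(ranked[:2])  # a candidate minimal set: the two most informative genes
```

In practice the size of the minimal set would be chosen by checking how few top-ranked genes still classify held-out samples correctly, rather than fixed in advance.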

The ultimate goal is to assess whether, for a particular patient, a treatment will work. Stage 1 neuroblastoma, for example, has a 90 percent cure rate. Stage 4 is currently incurable. "Can we find out why these genes are different and can we predict which patients will do well on a treatment?" Khan said.

In an interview, Vapnik argued that the SVM method has advantages over ANNs. In particular, SVMs are able to make accurate findings based on a small amount of data. "The smaller the number of examples, the better," Vapnik said. "The relationship is strong. A couple of examples are sufficient to know what is going on." In addition, SVMs are able to generalize from limited data.

Farley disparages supercomputers, but Javed Khan of the NHGRI disagrees. "Mathematical modeling is very important, but there are certain things that supercomputers can do well, like looking at gene interactions such as how gene A affects gene B but not gene C," Khan noted.

Perhaps not surprisingly, the senior manager of the Computational Biology Center at IBM Research, Joe Jasinski, defends the value of supercomputers in biology. He noted that IBM's Blue Gene will be able to perform more than one quadrillion operations per second (one petaflop). By comparison, the two fastest computers today perform only two trillion operations (two teraflops) per second. Designing a computer of this scale presented a formidable challenge in itself, given its size and electrical and cooling requirements.
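
The figures Jasinski cites line up with the 500-fold claim made earlier in the article. A quick check, taking exactly one petaflop as the Blue Gene target (the article says "more than") and two teraflops as the current benchmark:

```python
PETAFLOP = 10**15  # one quadrillion operations per second
TERAFLOP = 10**12  # one trillion operations per second

blue_gene_target = 1 * PETAFLOP  # assumed exact; the article says "more than"
fastest_today = 2 * TERAFLOP

speedup = blue_gene_target // fastest_today
print(speedup)  # 500
```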

Blue Gene's first use will be to study not disease diagnosis but protein folding. The goal is to simulate the molecular processes by which protein folding occurs, the interaction of proteins, and the bonding of small molecules. To do this requires massive computing power, Jasinski said. The five-year Blue Gene project is still three years from completion.

However, IBM is also putting its muscle behind disease diagnosis by backing NuTec Sciences, a young Atlanta-based company that has assembled a supercluster of 1,250 IBM eServer systems with a processing capacity of 7.5 trillion calculations per second. The system is said to be the fastest outside a government agency and one of the most powerful in the world.

NuTec has formed a partnership with the Winship Cancer Institute at Emory University in Atlanta to develop an integrated information system to identify genes and gene combinations that cause cancer in an individual patient.

NuTec's system will compare a patient's genetic fingerprint to thousands of different genetic profiles from various public and private databases. It will run algorithms to analyze gene combinations associated with cancer and determine the most effective treatments available for the individual patient. The system, which will accelerate diagnosis, is to be up and running by the end of the year.

Will there ultimately be a partnership between the brute force of supercomputers and the intricate workings of mathematical brainpower? Or will it remain a contest? Only the future will tell.


Copyright 1999-2005 Bioinformatics Center, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences. All rights reserved.