|
After the draft sequence |
|
[编者的话] 两大组织都各自发表自己的人类基因组测序结果,然而即使在测序本身仍有许多工作要做,例如确保所有的基因都被找到,所有的片段都被正确拼装. 更为重要的是,现在的工作仍然停留在表层(first look),更为深入系统的工作则需要进一步的大规模的开展。
Sydney Brenner We now have the two reports on
the human genome sequence, one, in Nature (2001, 409:860-921), by the
International Human Genome Sequencing Consortium (GC), the other, in
Science (2001, 291:1304-1351), by Celera Genomics (CG). Each is
accompanied by a flurry of secondary papers analyzing different aspects
of the sequence, and these will no doubt be followed in the future by
more analyses as, at least, the public sequence is available with no
restrictions whatsoever to those who wish to examine it. The amount of
information is enormous and all we have here is the surface. I have
spent some time on both papers but the material will require much deeper
reading in months to come. I will begin by saying
something about the two approaches used. GC based their approach on
sequencing ordered, large-insert bacterial artificial chromosome (BAC)
libraries which had previously been shown to produce data with 99.99%
accuracy and no gaps. It makes sense to talk about such levels of
accuracy when dealing with single cloned stretches of DNA. A collection
of such BACs would not provide a total human genome sequence with that
accuracy, however, but instead a singular mosaic, not representing any
existing human genome. With a level of polymorphism of 0.1%, it is
clearly not possible to talk about this level of accuracy for the genome
sequence. The insistence on high accuracy may have carried with it
certain costs, and it is clear that the switch to a draft sequence
because of CG's entry into the field speeded up the work considerably.
In retrospect it might have been more reasonable to have aimed at that
from the start. There were some heretics who thought that this would be
an important first step rather than wasting resources on the precise
sequencing of all those Alu repeats. By October 2000, when the GC data
were assembled, 900 megabases had been sequenced with 20-25 X coverage
(each base sequenced an average of 20-25 times) and were considered
finished, 3000 megabases were in draft form (12 X coverage) and a
minority, 270 megabases, were in pre-draft form (6 X coverage). All of
these data were available to CG. CG's assembly was based on a
'mate-pair' strategy. This is not completely random, as many believe,
but provides results in which one half of the data is spatially
correlated with the other half, since two sequences are collected from
the two ends of clones with inserts of several sizes (2,10 and 50
kilobases). CG used two approaches in their assembly. In the first, the
publicly available sequence from more than 30,000 BACs was shredded,
pooled with CG's own data and assembled. In the second, the known BAC
clustering was preserved and these clusters were added to those
generated from CG's data together with an additional set of more than
104,078 BAC end sequences. The two methods, it is stated, gave
essentially the same results. CG tested their computational assembler by
shredding the sequences of chromosomes 21 and 22, both of which had been
sequenced to high accuracy, and found that they assembled to give much
the same structure but there were gaps. About half of the gaps contained
sequences with a large fraction of repetitive elements, which could
account for the failure of the assembly. The question of whether the CG
assembly would have worked without the public data is already being
hotly debated. It is a great pity that CG did not attempt an assembly of
their own data first, to see how far they could do it without recourse
to the public data. In the interests of scientific objectivity this
would have been the wisest procedure even if it failed to provide
results up to expectation. It is also clear that mapping data were also
used to order the segments on the larger scale. According to the public press,
the most significant result of both groups is the small number of genes
(better called gene loci). About 26,000 were found by CG while the
estimate of GC ranges from 30,000 to 40,000. This is very far from the
100,000 many people expected and far from the 120,000 predicted from
cDNA sequencing. The latter number can be explained by alternatively
spliced products, but I think that the number of gene loci will turn out
to be at the higher end of the estimates, possibly as high as 50,000.
The low numbers could be accounted for by the inadequacy of the
gene-finding programs in regions which are gene-poor, and where a few
kilobases of exons might be scattered through hundreds of kilobases of
intron sequences. Time will tell. In any event, when we have the loci
defined we will still need to characterize the gene products for each
one, and to analyze the repertoire in the many different cell types.
This will be a task even greater than that of sequencing the genome. I found Table 19 in the CG
paper very interesting, as it compared the numbers of genes of different
types in the various genomes sequenced. In particular, there are a large
number of genes encoding regulatory proteins, including 607
zinc-finger-containing proteins, as opposed to 232 in flies. Of course,
we do not know how complexity scales with this increase, and clearly
understanding this will be central to formulating theories of the
complexity of organisms. The GC report has used the global information
arising from having most of the sequence to analyze the evolution of
transposable elements, mutation rates, CpG islands and many other
sequence features. Both reports discuss single-nucleotide polymorphism
(SNP) data, and the availability of large numbers of these should enable
a large number of association studies in man, as well as the measurement
of linkage disequilibrium over the entire genome. It is clear that a massive
amount of work must still be done on the sequence itself. It has to be
finished, to make sure that all the genes have been found and that all
the pieces are correctly ordered. I hope that there will be no
relaxation from this task. I also hope that these hastily assembled
papers will not be taken as anything but a first look. When all of that
is done, we will have really finished sequencing the human genome, and
we can look forward to the even more daunting task of understanding the
regulation of genes, and the host of novel problems that will be
uncovered by the sequencing of other mammals and other vertebrates.
|
|
|
|
1999-2005 中国科学院上海生命科学研究院生物信息中心 |