Genome Sequences and Evolutionary Biology, a Two-Way Interaction

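[Editor's note] Comparative genomics is an important topic in bioinformatics research. Although the article below was published before the human genome sequencing results were announced, its view of the directions and priorities of comparative genomics research is full of real insight.
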
by Deborah Charlesworth, Brian Charlesworth, and Gilean A.T. McVean

Complete genome sequences are
accumulating rapidly, culminating with the announcement of the human
genome sequence in February 2001. In addition to cataloguing the
diversity of genes and other sequences, genome sequences will provide
the first detailed and complete data on gene families and genome
organization, including data on evolutionary changes. Reciprocally,
evolutionary biology will make important contributions to the efforts to
understand functions of genes and other sequences in genomes.
Large-scale, detailed, and unbiased comparisons between species will
illuminate the evolution of genes and genomes, and population genetics
methods will enable detection of functionally important genes or
sequences, including sequences that have been involved in adaptive
changes.

Perhaps the most spectacular
recent development in genetics has been the sequencing of complete
genomes of prokaryotes and yeast and, recently, several further
eukaryote species. The first aim of such projects is to catalogue the
diversity of genes, regulatory sequences, and intervening regions. By
classifying sequences into similar "families," and by assuming
that conservation of sequence implies some degree of conservation of
function, informed guesses can be made about the possible functional
roles of newly discovered genes [2].
These can be fitted into our growing understanding of the genetic
control of cellular and developmental processes and, ultimately, of
biological diversity. Such a procedure rests implicitly on the theory of
evolution, which interprets sequence similarity as generally implying
shared ancestry, and views gene duplications as the ultimate source of
evolutionary novelty and increased genome complexity.

Another major aim of genomics
research is therefore to identify differences between genomes of species
or individuals. One of the biggest surprises from whole genome
sequencing is the high proportion of genes in each new species studied
that cannot be recognized from knowledge of other genomes, and the
differences in the repertoire of genes from species to species [1-4].
As a first step towards characterizing the genetic basis of adaptations,
comparing complete genomes of related species should allow us to
identify the genetic basis of particularly interesting diverged traits.
Closely related to this is the identification of regions of functional
significance by looking for conserved genomic regions, on the reasonable
assumption that natural selection usually removes harmful mutations and
preserves functionally important sequences [5].
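
As a rough illustration of looking for conserved regions, the sketch below slides a fixed-size window along two pre-aligned sequences and reports windows with high identity. The function name `conserved_windows`, the window size, and the identity threshold are assumptions made for illustration only; real analyses typically compare several species and use model-based measures of constraint.

```python
def conserved_windows(seq_a, seq_b, window=10, min_identity=0.9):
    """Return (start, identity) for alignment windows meeting the identity threshold.

    seq_a and seq_b are assumed to be already aligned (same length, '-' for gaps).
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    hits = []
    for start in range(len(seq_a) - window + 1):
        pairs = zip(seq_a[start:start + window], seq_b[start:start + window])
        # A gap aligned to a gap is not counted as a match.
        matches = sum(1 for a, b in pairs if a == b and a != "-")
        identity = matches / window
        if identity >= min_identity:
            hits.append((start, identity))
    return hits


# Hypothetical toy alignment: the first ten columns are identical.
species_1 = "AAAAAAAAAACCCCCCCCCC"
species_2 = "AAAAAAAAAAGGGGGGGGGG"
print(conserved_windows(species_1, species_2, window=10, min_identity=1.0))
```

High identity alone is only a crude proxy for functional importance, since recently diverged or slowly mutating regions can also look conserved.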
No approach other than complete genome sequencing provides more than
fragmentary data on conserved and diverged sequences. It is, however, a
major evolutionary problem to distinguish differences between species
that are caused by selection pressures from changes irrelevant to
organismal fitness that have been fixed by genetic drift.

The importance of evolutionary
ideas for making use of genome sequences is already clear, and an
increasing proportion of papers in journals devoted to genomic research
mention evolution (21% of papers in the journal Genome Research in the
year 2000). Evolutionary theory provides a framework for understanding
the relationship between sequence divergence and change in gene
function, but many evolutionary questions require study of only small
samples of genes [6].
Here, we concentrate on questions that will benefit from the
availability of entire genome sequences, touching on only a few
important issues, and giving only a selection of recent references.
Prokaryote genomes have been reviewed recently [7],
so we focus on three areas where evolutionary approaches are
particularly relevant to understanding eukaryote genomes and gene
function: the evolution of gene families; the molecular basis of
adaptation [8];
and estimating the genomic deleterious mutation rate [8].
Table 1 lists many other questions also illuminated by the new sequence
data. For instance, no other approach can provide detailed information
about genome rearrangements, such as inversions, which have become
established over evolutionary time in the genomes of both eukaryotes and
prokaryotes. Cytogenetic methods and genetic mapping either detect only
large rearrangements, or are restricted to small genome regions [9].
Complete genome sequencing gives a view that is both high-magnification
and wide in scope, providing a much more detailed picture of the
evolutionary dynamics of whole genome organization. Eventually, this
should lead to an understanding of the evolutionary forces affecting
genome rearrangements.

Gene Families

A major part of understanding
the evolution of genomes concerns gene families. These range in size
from a few genes to many hundreds [10],
and are known in genomes of all kinds of organisms. However, our
knowledge of the abundance and scale of gene families is fragmentary,
and we have only a poor understanding of the relative importance in real
genomes of different evolutionary processes theoretically expected to
affect non-single-copy genes. Only by studying complete genome sequences
will we get complete information about how many genes of a given gene
family exist in a sequenced genome, at all degrees of divergence from
one another [11,12]. It is already clear that copy
numbers of members of gene families are often underestimated by
non-genomic approaches, because further genes are often found when
detailed studies are done [12,13],
and in eukaryote genomes that have been sequenced, such as
Caenorhabditis elegans [11]
and Arabidopsis thaliana [14],
many duplicate genes are detectable [15].
Such data on the "natural history of genomes" are needed
before we can ask about the rate and cause of gene turnover. For these
studies, reliable evidence about which genes are orthologous (figure 1)
must be obtained [1].
It is very difficult to be sure of orthology, because birth and death of
genes occurs at a significant rate in many gene families. Well-studied
examples include the mammalian major histocompatibility complex (MHC)
genes [16],
plant disease resistance genes [17]
and Adh genes [18].
We have little understanding of why some gene families undergo birth and
death processes more than others. This process, one of several genome
evolutionary processes that obscure gene relationships [14], greatly
complicates sequence comparisons, because one cannot be sure that
sequences are orthologous.

In addition, gene
conversion between different members of gene families often causes
them to have similar sequences ("concerted evolution" [12];
see glossary).
It can also accelerate the sequence evolution of one particular member
of a gene family, because information from a different member is
incorporated [12].
Phylogenetic inference may therefore need to be restricted to sequences
that are distinctive enough to be treated as single-copy loci, and that
persist over long evolutionary time scales.

Genome sequences are providing
unexpected data about one source of gene family expansion, large-scale
duplications [14,19].
Duplications of large genome segments, or of entire genomes, can be
detected when two or more sets of similar genes are found that are
genetically linked, or physically arranged, in similar ways. In yeast, such findings
suggest that the entire genome was duplicated [19].
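
The detection idea just described can be sketched as follows, assuming each genome region is represented as an ordered list of gene family labels. The function name `collinear_blocks`, the family labels, and the minimum block size are hypothetical; this simplified version only finds uninterrupted runs in the same orientation, whereas real analyses must tolerate gene loss, insertions, and inversions.

```python
def collinear_blocks(region_a, region_b, min_genes=3):
    """Find runs of at least min_genes consecutive gene families that occur
    in the same order in two regions.

    Returns (start_in_a, start_in_b, length) for each maximal run.
    """
    rows, cols = len(region_a), len(region_b)
    # run[i][j] = length of the shared run ending at region_a[i-1], region_b[j-1]
    run = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            if region_a[i - 1] == region_b[j - 1]:
                run[i][j] = run[i - 1][j - 1] + 1

    blocks = []
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            length = run[i][j]
            # A run is maximal if the next diagonal cell does not extend it.
            extends = i < rows and j < cols and run[i + 1][j + 1] > 0
            if length >= min_genes and not extends:
                blocks.append((i - length, j - length, length))
    return blocks


# Hypothetical gene family labels along two chromosome regions.
region_1 = ["kinase", "x1", "actin", "tubulin", "hsp70", "x2"]
region_2 = ["y1", "actin", "tubulin", "hsp70", "y2", "kinase"]
print(collinear_blocks(region_1, region_2))
```

An isolated shared family (such as the single kinase match here) is ignored, because a lone duplicate gives no evidence that a whole segment was copied.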
The A. thaliana genome contains extensive duplications between different
nuclear regions [13,20],
and the second chromosome contains a large duplicated portion of the
mitochondrial genome [21].
This confirms earlier evidence for duplication of organelle genome
sequences in the nuclear genome [22],
but it was not previously known how long these sequences could be. Such
unusual genomic evolutionary events, and the relative frequencies of
duplications producing tandem versus dispersed repeats of genes, are
hard to determine other than from complete genome sequences and, even
then, clear inferences are difficult to make [23].
Knowledge of the frequency and nature of such events will significantly
advance our understanding of genome evolution.

Genome sequences also provide
information about the evolutionary events following the appearance of
new members of gene families. Only 8% of the yeast genes present after
the duplication about 100 million years ago seem to have been retained,
mostly as very similar,
functionally redundant duplicates [18].
Why certain duplicate genes are retained is unknown. Sometimes,
advantageous mutations may lead to new functions [23].
Sometimes, high activity of the proteins encoded by duplicate genes is
beneficial, but duplicate genes may not always perform redundant, or
even largely equivalent, functions. In yeast, the only organism where
this has been tested, sequence similarity does not guarantee similar
expression patterns [24].
Whole genome sequences also provide information about loss of duplicated
genes, and have provided many examples of functionally silent pseudogenes,
with mutations such as stop codons, frame shifts, or non-conservative
amino acid changes in functionally important residues. As just one
example, the well-studied AAD gene family in yeast includes several
examples of both pseudogenes and genes with unknown functions, despite
their sequence similarity to aryl alcohol dehydrogenases [26].
Instead of discovering pseudogenes serendipitously, data collected
systematically from sequenced genomes, together with studies of
transcription patterns [27]
(and much laborious follow-up work to investigate the function of
duplicate genes) will transform our knowledge of gene family histories.

These primary kinds of natural
history data are fascinating evolutionary information in themselves, but
genome sequences also allow tests of hypotheses about the evolutionary
causes of the events that are revealed by the primary observations (table
1), using comparisons of genes in related species. We should be able
to estimate the approximate time scale of processes such as changes in
gene family sizes, or loss of duplicate gene function, which can be
rapid. For instance, 40% of primate olfactory receptor genes seem to
have become pseudogenes in the past ten million years [9].
Provided that orthologous sequences can be identified (which may,
however, require detailed physical mapping, particularly if concerted
evolution happens [12]),
comparisons between species can also allow tests of whether sequences
remain functional. A much lower rate of substitution at replacement
sites (those causing amino acid changes) than at silent
sites suggests that the protein encoded by a sequence maintains its
function. In contrast, more rapid amino acid than silent site
divergence may indicate a new or altered function after duplication. A
few such cases are detectable in yeast [18],
and this phenomenon is already well known from non-genomic studies of
gene families in animals [27]
and plants [16,28].

Selection and Adaptation at the Molecular Level

Large-scale comparisons of
sequences between species will give much improved information about the
frequency of adaptive changes. Humans and chimpanzees are estimated to
have roughly 250,000 amino acid differences between homologous genes [30],
plus many more in non-coding regions, but we have almost no idea which
differences were important in the evolution of modern humans or
chimpanzees. We do not know how much gain or loss of genes was involved,
nor whether regulatory interactions between genes have greatly changed.
The extent of differences between genomes can be estimated from samples
of genes, but only complete genome sequences give estimates free from
any biases and preconceived ideas about which sequences will differ.
Much greater sequence differences may sometimes be found [31]
than in the conserved regions that, until now, have been the only
sequences that could readily be studied. Most comparisons between
genomes have so far involved distantly related species (e.g. humans and
nematodes [32]),
but large samples of genes can now be compared between a few pairs of
closer, though still distant, relatives (e.g. mouse vs. human among the
mammals [32]
and different yeasts [34]).

The hope is to distinguish,
among the many differences between genomes of species, those established
by selection from neutral differences fixed by genetic drift. This
cannot usually be done by direct study of gene functions or expression
patterns. The phenotypic effects of genes are currently largely
unpredictable, and selection may often be weaker than could be detected
experimentally, but sequence data can provide evidence for selection
during the evolutionary history of species, and between different
individuals within species (box
1). In genomes that are well enough studied for orthologues to be
identified confidently, genes with unusually high divergence (compared
with other genes in the same genomes) may be particularly interesting [28,35-37].
However, sequence divergence does not necessarily imply adaptation,
except in the rare cases when amino acids have diverged between
homologous sequences significantly more, per replacement
site, than silent sites [17,38].
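
The comparison of replacement and silent divergence can be sketched as below. This is a minimal illustration using uncorrected proportions of differing sites; the function names and the example counts are hypothetical, and a real analysis would correct for multiple substitutions at the same site and test whether the ratio differs significantly from one.

```python
def replacement_to_silent_ratio(repl_diffs, repl_sites, silent_diffs, silent_sites):
    """Ratio of per-site replacement divergence to per-site silent divergence."""
    if repl_sites <= 0 or silent_sites <= 0:
        raise ValueError("site counts must be positive")
    repl_divergence = repl_diffs / repl_sites
    silent_divergence = silent_diffs / silent_sites
    if silent_divergence == 0:
        raise ValueError("no silent divergence; the ratio is undefined")
    return repl_divergence / silent_divergence


def interpret(ratio):
    """Qualitative reading of the ratio; not a substitute for a statistical test."""
    if ratio < 1:
        return "consistent with selective constraint"
    if ratio > 1:
        return "candidate for adaptive change (requires a significance test)"
    return "consistent with no constraint"


# Hypothetical counts for one pair of homologous genes.
ratio = replacement_to_silent_ratio(repl_diffs=6, repl_sites=300,
                                    silent_diffs=20, silent_sites=100)
print(round(ratio, 3), interpret(ratio))
```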
Rapid evolution at particular nucleotide sites within sequences may also
indicate molecular adaptation, even if there is no overall excess of
replacement changes, and methods are being developed to identify such
sites using phylogenetic analyses [39].

High rates of sequence
divergence can, however, also occur through the accumulation of neutral
or weakly deleterious mutations in genes experiencing little selective
constraint [4].
In addition, for many genes, adaptation may have occurred through just
one or a few mutations, rather than in recurrent episodes. Such cases of
adaptation would not be detected by the phylogenetic approaches just
outlined. Information on patterns of within-species sequence variation
can sometimes provide invaluable tools for testing for selection (box
1), although this does not identify precisely the sites that are
actually the target of selection.

Some cases of rapidly evolving genes (box
1) turn out to provide no evidence for adaptive evolution, once data
on variation within species are taken into account [37,40].
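
A standard population genetic test that combines within-species and between-species data is the McDonald-Kreitman test; treating it as representative of the tests in box 1 is an assumption of this sketch. It compares the ratio of replacement to silent changes among fixed differences with the same ratio among polymorphisms, here using a hand-written two-sided Fisher's exact test. The function names and the counts are illustrative.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1, total = a + b, c + d, a + c, a + b + c + d

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed one.
    return sum(prob(x) for x in range(lowest, highest + 1)
               if prob(x) <= observed * (1 + 1e-9))


def mk_test(fixed_repl, fixed_silent, poly_repl, poly_silent):
    """Return (fixed ratio, polymorphism ratio, p-value).

    Under neutrality the two replacement:silent ratios are expected to be
    similar; an excess of fixed replacements suggests adaptive substitution.
    """
    fixed_ratio = fixed_repl / fixed_silent
    poly_ratio = poly_repl / poly_silent
    p_value = fisher_exact_two_sided(fixed_repl, fixed_silent, poly_repl, poly_silent)
    return fixed_ratio, poly_ratio, p_value


# Illustrative counts for a single gene.
print(mk_test(fixed_repl=7, fixed_silent=17, poly_repl=2, poly_silent=42))
```

As the text notes, a significant result points to selection on the gene as a whole and does not by itself identify the sites that were the target of selection.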
Comparisons between species are thus only the first step in
understanding adaptation. Large-scale studies that include
within-species variation are only starting to be done.

Coding regions may, of course,
be the wrong place to look for evidence of adaptation. Certainly some,
and maybe most, adaptation at the DNA level must occur through changes
in gene regulation [41],
and other differences in non-coding sequences can also have phenotypic
consequences, by affecting chromatin or RNA structure [42].
We have little information on the extent to which non-coding sequences
are constrained by selection, but the kinds of tests outlined in box
1 can be applied to such sequence regions, given some previous data
on the functions of the sequences, such as knowledge of binding sites of
regulatory proteins [43].

It is well established from
sequence analyses that even the use of different synonymous codons can
be subject to selection in Drosophila and other organisms [44,45].
As such selection is weak (of the order of the reciprocal of the effective
population size [44]),
accurate estimates of selection on codon usage will benefit from the
availability of the large sets of genes that will become available from
complete genome sequences. Selection on other types of silent nucleotide
differences may often be even weaker. Nevertheless, its intensity
relative to effective population size can, in principle, be estimated by
detailed analyses of patterns of within- and between-species differences
[44].

Mutation Rates

As reviewed in reference
2, the numbers of genes in the genomes of several organisms are
proving to be smaller than expected. Nevertheless, the large numbers of
bases in the genomes of higher eukaryotes mean that many mutations must
occur per generation. It is important to have estimates of what fraction
of mutations creates deleterious alleles (ones that reduce fitness
sufficiently that they are certain to be eliminated from populations [4]:
box 2).
The rate per genome at which such mutations arise is a key
parameter for several important problems in evolutionary biology [7,46],
including how, on the one hand, organisms can survive mutation pressure
[47],
and why, on the other hand, gene function degenerates in some
circumstances (for instance, when genes are duplicated).

Deleterious mutation rates can
be estimated from laborious experiments involving the accumulation of
mutations in laboratory stocks, under conditions where natural selection
is effectively prevented. However, interpreting these experiments is
difficult, and they inevitably miss many mutations with very slight
effects on fitness [48].
A new approach uses comparisons of homologous sequences between species
(box 2)
to estimate the proportion of sites in sequences at which mutations are
selectively eliminated, and then to calculate deleterious mutation rates
from estimates of total mutation rates. Data on amino acid replacements
in mammals, birds, and insects [30,46]
suggest that the human deleterious mutation rate exceeds one mutation
per individual per generation, but values in species with smaller
genomes and shorter generation times, such as Drosophila, are much
lower.

The ability to compare whole
genome sequences of closely related (but somewhat divergent) species
will greatly enhance the reliability of point mutation rate estimates
from this method, as well as providing data on other mutations, such as
duplication and gene loss, mentioned above. The identification of a
large set of selectively unconstrained pairs of orthologous sequences
(such as pseudogenes) will also give reliable benchmarks for the neutral
mutation rate. This is particularly important as there is already
evidence for wide variation in mutation rates among different sequences
within genomes, such as those of mammals [30].
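
The constraint-based calculation described in this section can be sketched as follows. The sketch assumes that divergence at selectively unconstrained sites measures the neutral mutation rate, and that the shortfall in replacement divergence relative to that benchmark measures the fraction of mutations eliminated by selection. The function name and all numerical values are hypothetical and are not taken from the cited studies.

```python
def deleterious_mutation_rate(repl_divergence, neutral_divergence,
                              replacement_sites, generations_since_split):
    """Estimate the diploid genomic deleterious mutation rate from divergence data.

    repl_divergence         -- per-site divergence at replacement sites
    neutral_divergence      -- per-site divergence at unconstrained sites
    replacement_sites       -- replacement sites per haploid genome
    generations_since_split -- generations since the two species diverged
    """
    if neutral_divergence <= 0 or generations_since_split <= 0:
        raise ValueError("neutral divergence and generations must be positive")
    # Fraction of replacement mutations removed by selection.
    constraint = 1.0 - repl_divergence / neutral_divergence
    # Divergence accumulates along both lineages, hence the factor of 2.
    mutation_rate_per_site = neutral_divergence / (2.0 * generations_since_split)
    # Two genome copies per diploid individual.
    rate_per_diploid = 2.0 * replacement_sites * mutation_rate_per_site * constraint
    return constraint, mutation_rate_per_site, rate_per_diploid


# Hypothetical inputs chosen only to make the arithmetic easy to follow.
constraint, mu, u = deleterious_mutation_rate(
    repl_divergence=0.002, neutral_divergence=0.01,
    replacement_sites=5e7, generations_since_split=2.5e5)
print(constraint, mu, u)
```

If genes with high amino acid divergence are missed, the replacement divergence is underestimated, so the constraint and the resulting rate are biased upwards.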
The extent of this variation, and its possible causal factors (including
regional differences in base composition, methylation,
and proximity to telomeres or centromeres) are, of course, interesting
questions in their own right. For these comparisons to be reliable, it
is critical that we compare orthologous rather than paralogous sequences
(figure 2); conservation of gene order, known from complete genome
sequences, is the safest method of establishing orthology [19,20].
Even more importantly, comparisons between species that are based on
complete genome sequences will allow comparisons of genes even if the
amino acid divergence of the genes is high. Currently there may be a
bias against detecting such genes, because genes are often detected in
new species based on sequence information about genes from well-studied
species. This bias would cause overestimation of the degree of
constraint, and hence of the deleterious mutation rate.

Conclusions

Despite the intrinsic interest
of the rapidly accumulating mass of finely detailed information about
genome sequences, understanding its implications will take a lot of
further work, much of which cannot be done on a high-throughput basis.
Even finding the genes among the enormous amounts of sequence can be
difficult and uncertain, and is biased towards genes similar to those
already known. Regions of the genome with high levels of repetitive DNA,
such as the pericentric heterochromatin,
have largely been avoided in sequencing projects [2,49],
so that work on the evolution of these regions will be slow, although
they may be important for studies of evolutionary questions [50].
Analysis of the patterns that are observed in the initial stages of
investigating complete genome sequences, and testing hypotheses
developed from these observations, will often require data on
within-species variability. The study of this variation on a large scale
creates technical challenges, which are just starting to be tackled [51,52].
Accuracy of sequencing is particularly critical when variability is low,
as in certain genomic regions, such as the centromeric regions of animal
and plant chromosomes, or in highly inbreeding species such as C.
elegans and A. thaliana [53],
or in species with low effective population sizes, such as humans [54].

As we have shown, sound
approaches exist for assessing how often the evolution of sequence
differences is driven by natural selection, and for detecting variation
that is actively maintained by selection. These population genetic
approaches can now be applied to large samples of genes, giving
increased accuracy and reduced bias. Evidence about the strength and
frequency of selection, and about which sites experience selection, will
advance our understanding of evolutionary mechanisms, and provide
hypotheses to be tested by functional approaches. A major goal for the
Human Genome Project is improving human health. Evolutionary biology has
much technical expertise to offer, as well as much to learn, in relation
to the origin of inherited diseases, the pathologies of old age, and
susceptibility to infectious disease.

Deborah Charlesworth and Brian Charlesworth research population genetics at the Institute of Cell, Animal and Population Biology at the University of Edinburgh. Gilean A.T. McVean researches population genetics at the Department of Statistics at the University of Oxford.

1999-2005 Bioinformation Center, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences