Genome Sequences and Evolutionary Biology, a Two-Way Interaction

 

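[Editor's note]

Comparative genomics is an important topic in bioinformatics research. Although the following article was published before the results of human genome sequencing were announced, its views on the direction and priorities of comparative genomics research are full of insight.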

by Deborah Charlesworth, Brian Charlesworth, and Gilean A.T. McVean

Complete genome sequences are accumulating rapidly, culminating with the announcement of the human genome sequence in February 2001. In addition to cataloguing the diversity of genes and other sequences, genome sequences will provide the first detailed and complete data on gene families and genome organization, including data on evolutionary changes. Reciprocally, evolutionary biology will make important contributions to the efforts to understand functions of genes and other sequences in genomes. Large-scale, detailed, and unbiased comparisons between species will illuminate the evolution of genes and genomes, and population genetics methods will enable detection of functionally important genes or sequences, including sequences that have been involved in adaptive changes.

Perhaps the most spectacular recent development in genetics has been the sequencing of complete genomes of prokaryotes and yeast and, recently, several further eukaryote species. The first aim of such projects is to catalogue the diversity of genes, regulatory sequences, and intervening regions. By classifying sequences into similar "families," and by assuming that conservation of sequence implies some degree of conservation of function, informed guesses can be made about the possible functional roles of newly discovered genes [2]. These can be fitted into our growing understanding of the genetic control of cellular and developmental processes and, ultimately, of biological diversity. Such a procedure rests implicitly on the theory of evolution, which interprets sequence similarity as generally implying shared ancestry, and views gene duplications as the ultimate source of evolutionary novelty and increased genome complexity.

Another major aim of genomics research is to identify differences between genomes of species or individuals. One of the biggest surprises from whole genome sequencing is the high proportion of genes in each new species studied that cannot be recognized from knowledge of other genomes, and the differences in the repertoire of genes from species to species [1-4]. As a first step towards characterizing the genetic basis of adaptations, comparing complete genomes of related species should allow us to identify the genetic basis of particularly interesting diverged traits. Closely related to this is the identification of regions of functional significance by looking for conserved genomic regions, on the reasonable assumption that natural selection usually removes harmful mutations and preserves functionally important sequences [5]. No approach other than complete genome sequencing provides more than fragmentary data on conserved and diverged sequences. It is, however, a major evolutionary problem to distinguish differences between species that are caused by selection pressures from changes that are irrelevant to organismal fitness and have been fixed by genetic drift.

The importance of evolutionary ideas for making use of genome sequences is already clear, and an increasing proportion of papers in journals devoted to genomic research mention evolution (21% of papers in the journal Genome Research in the year 2000). Evolutionary theory provides a framework for understanding the relationship between sequence divergence and change in gene function, but many evolutionary questions require study of only small samples of genes [6]. Here, we concentrate on questions that will benefit from the availability of entire genome sequences, touching on only a few important issues, and giving only a selection of recent references. Prokaryote genomes have been reviewed recently [7], so we focus on three areas where evolutionary approaches are particularly relevant to understanding eukaryote genomes and gene function: the evolution of gene families; the molecular basis of adaptation [8]; and estimating the genomic deleterious mutation rate [8]. Table 1 lists many other questions also illuminated by the new sequence data. For instance, no other approach can provide detailed information about genome rearrangements, such as inversions, which have become established over evolutionary time in the genomes of both eukaryotes and prokaryotes. Cytogenetic methods and genetic mapping either detect only large rearrangements, or are restricted to small genome regions [9]. Complete genome sequencing gives a view that is both high-magnification and wide in scope, providing a much more detailed picture of the evolutionary dynamics of whole genome organization. Eventually, this should lead to an understanding of the evolutionary forces affecting genome rearrangements.

Gene Families

A major part of understanding the evolution of genomes concerns gene families. These range in size from a few genes to many hundreds [10], and are known in genomes of all kinds of organisms. However, our knowledge of the abundance and scale of gene families is fragmentary, and we have only a poor understanding of the relative importance in real genomes of different evolutionary processes theoretically expected to affect non-single-copy genes. Only by studying complete genome sequences will we get complete information about how many genes of a given gene family exist in a sequenced genome, at all degrees of divergence from one another [11,12].

It is already clear that copy numbers of members of gene families are often underestimated by non-genomic approaches, because further genes are often found when detailed studies are done [12,13], and in eukaryote genomes that have been sequenced, such as Caenorhabditis elegans [11] and Arabidopsis thaliana [14], many duplicate genes are detectable [15]. Such data on the "natural history of genomes" are needed before we can ask about the rate and cause of gene turnover. For these studies, reliable evidence about which genes are orthologous (figure 1) must be obtained [1]. It is very difficult to be sure of orthology, because birth and death of genes occur at a significant rate in many gene families. Well-studied examples include the mammalian major histocompatibility complex (MHC) genes [16], plant disease resistance genes [17] and Adh genes [18]. We have little understanding of why some gene families undergo birth and death processes more than others. Birth and death is one of several genome evolutionary processes that obscure gene relationships [14], and it greatly complicates sequence comparisons, because one cannot be sure that sequences are orthologous. In addition, gene conversion between different members of gene families often causes them to have similar sequences ("concerted evolution" [12]; see glossary). It can also accelerate the sequence evolution of one particular member of a gene family, because information from a different member is incorporated [12]. Phylogenetic inference may therefore need to be restricted to sequences that are distinctive enough to be treated as single-copy loci, and that persist over long evolutionary time scales.

Genome sequences are providing unexpected data about one source of gene family expansion, large-scale duplications [14,19]. Duplications of large genome segments, or of entire genomes, can be detected when two or more sets of similar genes are found that are genetically linked or arranged in a physically similar order. In yeast, such findings suggest that the entire genome was duplicated [19]. The A. thaliana genome contains extensive duplications between different nuclear regions [13,20], and the second chromosome contains a large duplicated portion of the mitochondrial genome [21]. This confirms earlier evidence for duplication of organelle genome sequences in the nuclear genome [22], but it was not previously known how long these sequences could be. Such unusual genomic evolutionary events, and the relative frequencies of duplications producing tandem versus dispersed repeats of genes, are hard to determine other than from complete genome sequences and, even then, clear inferences are difficult to make [23]. Knowledge of the frequency and nature of such events will significantly advance our understanding of genome evolution.
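The idea of detecting duplicated segments from sets of similar genes in a similar arrangement can be illustrated with a minimal Python sketch. It assumes that genes have already been numbered by their order along each chromosome and that pairs of similar genes have been identified by some other means. The function name, the tuple layout, and the max_gap and min_pairs thresholds are hypothetical choices made for illustration, not a method taken from the studies cited.

```python
from collections import defaultdict


def find_duplicated_blocks(pairs, max_gap=3, min_pairs=3):
    """Group pairs of similar genes into candidate duplicated segments.

    pairs: iterable of (chrom_a, index_a, chrom_b, index_b), where each
    index is the gene's rank along its chromosome (an assumed input format).
    Returns a list of ((chrom_a, chrom_b), [(index_a, index_b), ...]).
    """
    by_chroms = defaultdict(list)
    for chrom_a, index_a, chrom_b, index_b in pairs:
        by_chroms[(chrom_a, chrom_b)].append((index_a, index_b))

    blocks = []
    for chroms, hits in by_chroms.items():
        hits.sort()  # walk along the first region in gene order
        current = [hits[0]]
        for hit in hits[1:]:
            prev = current[-1]
            close_a = hit[0] - prev[0] <= max_gap
            # abs() lets the second region run in either orientation
            close_b = abs(hit[1] - prev[1]) <= max_gap
            if close_a and close_b:
                current.append(hit)
            else:
                if len(current) >= min_pairs:
                    blocks.append((chroms, current))
                current = [hit]
        if len(current) >= min_pairs:
            blocks.append((chroms, current))
    return blocks


if __name__ == "__main__":
    # Invented example data, not taken from any real genome.
    example = [
        ("chr1", 10, "chr4", 50),
        ("chr1", 11, "chr4", 52),
        ("chr1", 13, "chr4", 53),
        ("chr1", 40, "chr2", 7),
    ]
    for chroms, hits in find_duplicated_blocks(example):
        print(chroms, hits)
```

This greedy chaining is deliberately crude: a single unrelated hit lying between two genuine ones will split a block, which is one reason why clear inferences remain difficult even with complete sequences.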

Genome sequences also provide information about the evolutionary events following the appearance of new members of gene families. Only 8% of the yeast genes present after the duplication about 10^8 years ago seem to have been retained, mostly as very similar, functionally redundant duplicates [18]. Why certain duplicate genes are retained is unknown. Sometimes, advantageous mutations may lead to new functions [23]. Sometimes, high activity of the proteins encoded by duplicate genes is beneficial, but duplicate genes may not always perform redundant, or even largely equivalent, functions. In yeast, the only organism where this has been tested, sequence similarity does not guarantee similar expression patterns [24]. Whole genome sequences also provide information about loss of duplicated genes, and have provided many examples of functionally silent pseudogenes, with mutations such as stop codons, frameshifts, or non-conservative amino acid changes in functionally important residues. As just one example, the well-studied AAD gene family in yeast includes several examples of both pseudogenes and genes with unknown functions, despite their sequence similarity to aryl alcohol dehydrogenases [26]. Instead of discovering pseudogenes serendipitously, data collected systematically from sequenced genomes, together with studies of transcription patterns [27] (and much laborious follow-up work to investigate the function of duplicate genes) will transform our knowledge of gene family histories.

These primary kinds of natural history data are fascinating evolutionary information in themselves, but genome sequences also allow tests of hypotheses about the evolutionary causes of the events that are revealed by the primary observations (table 1), using comparisons of genes in related species. We should be able to estimate the approximate time scale of processes such as changes in gene family sizes, or loss of duplicate gene function, which can be rapid. For instance, 40% of primate olfactory receptor genes seem to have become pseudogenes in the past ten million years [9]. Provided that orthologous sequences can be identified (which may, however, require detailed physical mapping, particularly if concerted evolution happens [12]), comparisons between species can also allow tests of whether sequences remain functional. A much lower rate of substitution at replacement sites (those where changes alter the amino acid) than at silent sites suggests that the sequence still encodes a functional protein. In contrast, more rapid amino acid than silent site divergence may indicate a new or altered function after duplication. A few such cases are detectable in yeast [18], and this phenomenon is already well known from non-genomic studies of gene families in animals [27] and plants [16,28].
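The comparison of replacement and silent divergence can be sketched in Python as follows. This is a simplified illustration in the spirit of the counting method of Nei and Gojobori, not the procedure used in the studies cited. It assumes two aligned, gap-free, upper-case coding sequences in the same reading frame and the standard genetic code, skips codons that differ at more than one position, and applies a Jukes-Cantor correction. All function names are hypothetical.

```python
import math
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def synonymous_sites(codon):
    """Number of silent sites in a codon: synonymous single changes / 3."""
    count = 0
    for pos in range(3):
        for base in BASES:
            if base != codon[pos]:
                mutant = codon[:pos] + base + codon[pos + 1:]
                if CODON_TABLE[mutant] == CODON_TABLE[codon]:
                    count += 1
    return count / 3.0


def count_changes(seq1, seq2):
    """Return (silent_diffs, replacement_diffs, silent_sites, replacement_sites)."""
    syn_diffs = rep_diffs = 0
    syn_sites = rep_sites = 0.0
    for i in range(0, len(seq1) - 2, 3):
        c1, c2 = seq1[i:i + 3], seq2[i:i + 3]
        n_diff = sum(a != b for a, b in zip(c1, c2))
        if n_diff > 1:
            continue  # simplification: ignore codons with several changes
        s = (synonymous_sites(c1) + synonymous_sites(c2)) / 2.0
        syn_sites += s
        rep_sites += 3.0 - s
        if n_diff == 1:
            if CODON_TABLE[c1] == CODON_TABLE[c2]:
                syn_diffs += 1
            else:
                rep_diffs += 1
    return syn_diffs, rep_diffs, syn_sites, rep_sites


def jukes_cantor(p):
    """Correct a proportion of differences for multiple hits (None if saturated)."""
    if p >= 0.75:
        return None
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def replacement_to_silent_ratio(seq1, seq2):
    """Ratio of replacement to silent divergence, or None if undefined."""
    syn_diffs, rep_diffs, syn_sites, rep_sites = count_changes(seq1, seq2)
    if syn_sites == 0 or rep_sites == 0:
        return None
    d_silent = jukes_cantor(syn_diffs / syn_sites)
    d_replacement = jukes_cantor(rep_diffs / rep_sites)
    if not d_silent or d_replacement is None:
        return None  # no silent divergence, or saturated estimates
    return d_replacement / d_silent


if __name__ == "__main__":
    # Invented sequences for illustration only.
    print(replacement_to_silent_ratio("ATGGCTGCTGCTGCTAAACCC",
                                      "ATGGCCGCTGCTGCTGAACCC"))
```

A ratio well below 1 is the pattern expected for a sequence under constraint, whereas a ratio above 1 is the pattern that may indicate a new or altered function. With short sequences the estimate is very noisy, so a sketch like this is only meaningful when applied to long alignments or many genes.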

Selection and Adaptation at the Molecular Level

Large-scale comparisons of sequences between species will give much improved information about the frequency of adaptive changes. Humans and chimpanzees are estimated to have roughly 250,000 amino acid differences between homologous genes [30], plus many more in non-coding regions, but we have almost no idea which differences were important in the evolution of modern humans or chimpanzees. We do not know how much gain or loss of genes was involved, nor whether regulatory interactions between genes have greatly changed. The extent of differences between genomes can be estimated from samples of genes, but only complete genome sequences give estimates free from any biases and preconceived ideas about which sequences will differ. Much greater sequence differences may sometimes be found [31] than in the conserved regions that, until now, have been the only sequences that could readily be studied. Most comparisons between genomes have so far involved distantly related species (e.g. humans and nematodes [32]), but large samples of genes can now be compared between a few pairs of closer, though still distant, relatives (e.g. mouse vs. human among the mammals [32] and different yeasts [34]).

The hope is to distinguish, among the many differences between genomes of species, those established by selection from neutral differences fixed by genetic drift. This cannot usually be done by direct study of gene functions or expression patterns. The phenotypic effects of genes are currently largely unpredictable, and selection may often be weaker than could be detected experimentally, but sequence data can provide evidence for selection during the evolutionary history of species, and between different individuals within species (box 1). In genomes that are well enough studied for orthologues to be identified confidently, genes with unusually high divergence (compared with other genes in the same genomes) may be particularly interesting [28,35-37]. However, sequence divergence does not necessarily imply adaptation, except in the rare cases when amino acids have diverged between homologous sequences significantly more, per replacement site, than silent sites [17,38]. Rapid evolution at particular nucleotide sites within sequences may also indicate molecular adaptation, even if there is no overall excess of replacement changes, and methods are being developed to identify such sites using phylogenetic analyses [39].

High rates of sequence divergence can, however, also occur through the accumulation of neutral or weakly deleterious mutations in genes experiencing little selective constraint [4]. In addition, for many genes, adaptation may have occurred through just one or a few mutations, rather than in recurrent episodes. Such cases of adaptation would not be detected by the phylogenetic approaches just outlined. Information on patterns of within-species sequence variation can sometimes provide invaluable tools for testing for selection (box 1), although this does not identify precisely the sites that are actually the target of selection. Some cases of rapidly evolving genes (box 1) turn out to provide no evidence for adaptive evolution, once data on variation within species are taken into account [37,40]. Comparisons between species are thus only the first step in understanding adaptation. Large-scale studies that include within-species variation are only starting to be done.
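One widely used way of combining divergence between species with variation within species is the McDonald-Kreitman test, which compares the ratio of replacement to silent changes among fixed differences with the same ratio among polymorphisms. Whether this is the test the authors describe in box 1 cannot be confirmed from the text here, so the Python sketch below should be read as an illustration of the general approach. The function names are hypothetical and the counts in the example are illustrative only.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact test p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1, total = a + b, c + d, a + c, a + b + c + d

    def prob(x):
        # Hypergeometric probability of x in the top-left cell
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    p_obs = prob(a)
    low, high = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables at least as extreme as the observed one
    return sum(p for p in (prob(x) for x in range(low, high + 1))
               if p <= p_obs * (1 + 1e-9))


def mcdonald_kreitman(fixed_rep, fixed_silent, poly_rep, poly_silent):
    """Return (neutrality_index, p_value) for the four counts.

    The neutrality index is (poly_rep / poly_silent) / (fixed_rep / fixed_silent);
    it is None when that ratio is undefined.
    """
    denominator = poly_silent * fixed_rep
    index = (poly_rep * fixed_silent) / denominator if denominator else None
    p_value = fisher_exact_two_sided(fixed_rep, fixed_silent, poly_rep, poly_silent)
    return index, p_value


if __name__ == "__main__":
    # Illustrative counts: 7 replacement and 17 silent fixed differences,
    # 2 replacement and 42 silent polymorphisms.
    index, p_value = mcdonald_kreitman(7, 17, 2, 42)
    print(index, p_value)
```

A neutrality index below 1 with a small p-value indicates an excess of fixed replacement differences, the pattern expected under recurrent adaptive substitution. An index near or above 1 for a rapidly evolving gene is the kind of result that provides no evidence for adaptive evolution.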

Coding regions may, of course, be the wrong place to look for evidence of adaptation. Certainly some, and maybe most, adaptation at the DNA level must occur through changes in gene regulation [41], and other differences in non-coding sequences can also have phenotypic consequences, by affecting chromatin or RNA structure [42]. We have little information on the extent to which non-coding sequences are constrained by selection, but the kinds of tests outlined in box 1 can be applied to such sequence regions, given some previous data on the functions of the sequences, such as knowledge of binding sites of regulatory proteins [43].

It is well established from sequence analyses that even the use of different synonymous codons can be subject to selection in Drosophila and other organisms [44,45]. As such selection is weak (of the order of the reciprocal of the effective population size [44]), accurate estimates of selection on codon usage will benefit from the availability of the large sets of genes that will become available from complete genome sequences. Selection on other types of silent nucleotide differences may often be even weaker. Nevertheless, its intensity relative to effective population size can, in principle, be estimated by detailed analyses of patterns of within- and between-species differences [44].

Mutation Rates

As reviewed in reference 2, the numbers of genes in the genomes of several organisms are proving to be smaller than expected. Nevertheless, the large numbers of bases in the genomes of higher eukaryotes mean that many mutations must occur per generation. It is important to have estimates of what fraction of mutations creates deleterious alleles (ones that reduce fitness sufficiently that they are certain to be eliminated from populations [4]; box 2). The rate per genome at which such mutations arise is a key parameter for several important problems in evolutionary biology [7,46], including how, on the one hand, organisms can survive mutation pressure [47], and why, on the other hand, gene function degenerates in some circumstances (for instance, when genes are duplicated).

Deleterious mutation rates can be estimated from laborious experiments involving the accumulation of mutations in laboratory stocks, under conditions where natural selection is effectively prevented. However, interpreting these experiments is difficult, and they must also miss many mutations with very slight effects on fitness [48]. A new approach uses comparisons of homologous sequences between species (box 2) to estimate the proportion of sites in sequences at which mutations are selectively eliminated, and then to calculate deleterious mutation rates from estimates of total mutation rates. Data on amino acid replacements in mammals, birds, and insects [30,46] suggest that the human deleterious mutation rate exceeds one mutation per individual per generation, but values in species with smaller genomes and shorter generation times, such as Drosophila, are much lower.
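The arithmetic behind the comparative approach can be sketched in Python. This is a minimal illustration under stated assumptions rather than the calculation given in box 2: the fraction of replacement mutations that are eliminated is taken to be one minus the ratio of replacement-site divergence to divergence at sites assumed to be neutral, and the function names and all numerical values are invented for the example.

```python
def constrained_fraction(replacement_divergence, neutral_divergence):
    """Proportion of replacement mutations removed by selection.

    Assumes neutral_divergence measures the divergence that replacement
    sites would show if they were free of selective constraint.
    """
    return 1.0 - replacement_divergence / neutral_divergence


def deleterious_mutation_rate(mutation_rate_per_site, n_replacement_sites,
                              constrained, ploidy=2):
    """Deleterious mutations per individual per generation.

    total replacement mutations = ploidy * rate per site * number of sites;
    the deleterious share is that total times the constrained fraction.
    """
    return ploidy * mutation_rate_per_site * n_replacement_sites * constrained


if __name__ == "__main__":
    # Hypothetical inputs chosen only to make the arithmetic easy to follow.
    c = constrained_fraction(replacement_divergence=0.002,
                             neutral_divergence=0.01)
    u = deleterious_mutation_rate(mutation_rate_per_site=2e-8,
                                  n_replacement_sites=5e7,
                                  constrained=c)
    print(c, u)
```

The sketch makes the dependencies explicit: any bias in the estimated constraint or in the per-site mutation rate carries through proportionally to the estimated deleterious rate.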

The ability to compare whole genome sequences of closely related (but somewhat divergent) species will greatly enhance the reliability of point mutation rate estimates from this method, as well as providing data on other mutations, such as duplication and gene loss, mentioned above. The identification of a large set of selectively unconstrained pairs of orthologous sequences (such as pseudogenes) will also give reliable benchmarks for the neutral mutation rate. This is particularly important as there is already evidence for wide variation in mutation rates among different sequences within genomes, such as those of mammals [30]. The extent of this variation, and its possible causal factors (including regional differences in base composition, methylation, and proximity to telomeres or centromeres) are, of course, interesting questions in their own right. For these comparisons to be reliable, it is critical that we compare orthologous rather than paralogous sequences (figure 2); conservation of gene order, known from complete genome sequences, is the safest method of establishing orthology [19,20]. Even more importantly, comparisons between species that are based on complete genome sequences will allow comparisons of genes even if the amino acid divergence of the genes is high. Currently there may be a bias against detecting such genes, because genes are often detected in new species based on sequence information about genes from well-studied species. This bias would cause overestimation of the degree of constraint, and hence of the deleterious mutation rate.

Conclusions

Despite the intrinsic interest of the rapidly accumulating mass of finely detailed information about genome sequences, understanding its implications will take a lot of further work, much of which cannot be done on a high-throughput basis. Even finding the genes among the enormous amounts of sequence can be difficult and uncertain, and is biased towards genes similar to those already known. Regions of the genome with high levels of repetitive DNA, such as the pericentric heterochromatin, have largely been avoided in sequencing projects [2,49], so that work on the evolution of these regions will be slow, although they may be important for studies of evolutionary questions [50]. Analysis of the patterns that are observed in the initial stages of investigating complete genome sequences, and testing hypotheses developed from these observations, will often require data on within-species variability. The study of this variation on a large scale creates technical challenges, which are just starting to be tackled [51,52]. Accuracy of sequencing is particularly critical when variability is low, as in certain genomic regions, such as the centromeric regions of animal and plant chromosomes, or in highly inbreeding species such as C. elegans and A. thaliana [53], or in species with low effective population sizes, such as humans [54].

As we have shown, sound approaches exist for assessing how often the evolution of sequence differences is driven by natural selection, and for detecting variation that is actively maintained by selection. These population genetic approaches can now be applied to large samples of genes, giving increased accuracy and reduced bias. Evidence about the strength and frequency of selection, and about which sites experience selection, will advance our understanding of evolutionary mechanisms, and provide hypotheses to be tested by functional approaches. A major goal for the Human Genome Project is improving human health. Evolutionary biology has much technical expertise to offer, as well as much to learn, in relation to the origin of inherited diseases, the pathologies of old age, and susceptibility to infectious disease.

Deborah Charlesworth researches population genetics at the Institute of Cell, Animal and Population Biology at the University of Edinburgh.

Brian Charlesworth researches population genetics at the Institute of Cell, Animal and Population Biology at the University of Edinburgh.

Gilean A.T. McVean researches population genetics at the Department of Statistics at the University of Oxford.


1999-2005 Bioinformation Center, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences
All rights reserved.