Evolution research: duplicated segments in the human genome
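[Editor's note] Taken from a recent article in Science, originally titled "Recent Segmental Duplications in the Human Genome". The authors developed a method for detecting and distinguishing duplicated segments within the genome. Their analysis confirms that the distribution of duplicated segments is nonrandom and that these segments may be closely linked to the evolution of protein function.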
Primate-specific
segmental duplications are considered important in human disease and
evolution. The inability to distinguish between allelic and
duplication sequence overlap has hampered their
characterization as well as assembly and annotation of our genome.
We developed a method whereby each public sequence is analyzed
at the clone level for overrepresentation within a whole-genome shotgun
sequence. This test has the ability to detect duplications larger
than 15 kilobases irrespective of copy number, location, or
high sequence similarity. We mapped 169 large regions flanked by
highly similar duplications. Twenty-four of these hot spots of
genomic instability have been associated with genetic disease. Our
analysis indicates a highly nonrandom chromosomal and genic distribution
of recent segmental duplications, with a likely role in
expanding protein diversity.

Initial
analyses of the human genome sequence have identified a large amount of
interspersed as well as tandem segmental duplications (1-3).
These observations raise the possibility that segmental
duplications may have played a significant role in gene and
genome evolution compared with whole-genome duplication models
(4).
Furthermore, segmental duplications may underlie a greater
amount of human phenotypic variation and disease than was
previously recognized (5,
6).
Unfortunately, duplicated regions of the genome are
marginalized within both private and public assemblies (7).
The overarching problem stems from the inability of current
assembly strategies to differentiate highly similar duplicated
sequence from true overlaps that remain unassembled.

Using
computational methods, we have developed a simple statistical test to
determine whether a given stretch of sequence is duplicated
based on its overrepresentation and average sequence identity
within a random sample of genomic sequence. Comparing a unique
sequence with a random sample will detect a limited number of
highly identical sequence matches. In contrast, a duplicated sequence
will also detect paralogous matches, increasing the overall number
of sequence alignments and decreasing the average pairwise sequence
identity. The power of such an approach requires that the
sample be randomly distributed and as large as possible. Currently, the
largest sample available for these purposes is the roughly fivefold coverage
of whole-genome shotgun (WGS) reads generated by Celera Genomics
(3).

To
test the random nature of this data set, we initially analyzed 27 autosomal
and X chromosomal loci that had been determined to be unique by
experimental analysis (Table
1) (table S1) (8).
Genomic sequence from a public GenBank accession was used as a
reference and compared against the WGS sequences over 5-kb
windows, sliding every 1 kb across the accession. Within the
unique control set, both the average read depth and average sequence
identity were tightly distributed around their respective means
indicative of a random sample of WGS reads. Next, we compared these
statistics with 14 known loci (Table
1) (table S1) that contain recent (<40 million years
ago) segmental duplications of various sizes, copy number, and
divergence (9).
We observed a significant increase in depth of coverage and
significant decrease in sequence identity (Table
1), although the latter became less sensitive as the
sequence identity of the duplicates approached 100%. Moreover,
graphic visualization of both statistics allowed duplicated
portions within the reference clones to be easily discerned
within 2 kb of previously characterized junctions (Fig.
1A) (fig. S1). For known duplications with experimentally
determined copy number, we assessed the depth of coverage
specifically over the duplicated segments. The number of reads
within 5-kb windows correlated strongly with the copy number (Fig.
1B; R² = 0.96). These data indicate that
the WGS library is sufficiently deep and random to develop a
duplication metric for large, highly homologous segmental
duplications.
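
The two window statistics described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the `ReadHit` record, the `window_stats` function, and the rule that a read counts toward a window if any part of its alignment overlaps it are all assumptions made for the example. Only the 5-kb window and 1-kb step come from the text.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class ReadHit:
    """One WGS read aligned to the reference clone (hypothetical record)."""
    start: int        # alignment start on the reference, 0-based
    end: int          # alignment end on the reference, exclusive
    identity: float   # percent identity of the read to the reference


def window_stats(hits, ref_len, window=5000, step=1000):
    """Return (window_start, read_count, mean_identity) per sliding window.

    mean_identity is None for a window that recruits no reads.
    """
    rows = []
    last_start = max(ref_len - window, 0)
    for w_start in range(0, last_start + 1, step):
        w_end = w_start + window
        # Assumption: a read is counted if its alignment overlaps the window.
        inside = [h for h in hits if h.start < w_end and h.end > w_start]
        avg = mean(h.identity for h in inside) if inside else None
        rows.append((w_start, len(inside), avg))
    return rows
```

Under this sketch, a unique region would show a steady read count with identity near 100%, whereas a duplicated region would show a higher count and a lower mean identity, because paralogous reads are recruited alongside the allelic ones.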

Fig. 1. WGS sequence detection of segmental duplications. (A) A genomic
reference sequence (U52111) containing a 26.5-kb creatine transporter
(SLC6A8) and 9.7-kb adrenoleukodystrophy (ABCD1) segmental duplication was
used to search all WGS reads (Celera) using the combining assembler
algorithm (3).
This analysis was performed independently of the Celera assembly of the
human genome. A multiple alignment (>94% sequence identity) was
constructed and the number of reads and average sequence identity were
calculated across 5-kb windows. The number of reads (x axis bottom)
begins to rise and the average sequence identity (x axis top) drops
precipitously, precisely at the known transition regions between unique
and duplicated sequence (red horizontal line represents the X chromosomal
threshold set at 3 SD above the mean depth coverage for unique X
chromosome sequence). Both segmental duplications are readily identified.
LINES and SINES are long and short interspersed repeat elements,
respectively; also shown is a scale in 10-kb increments. (B)
Correlation of number of WGS reads and known diploid copy number of
genomic segment. The number of reads for each 5-kb window overlying known
duplications (

We
chose to analyze independently each genomic accession underlying the
public assembly of the human genome. We compared each sequence
(32,610 clones) against the random WGS read data (27.3 million
reads) and constructed a multiple sequence alignment based on
the recruitment of sequence reads with >94% sequence identity. We
computed the average degree of sequence identity and the depth of
coverage in sliding windows of 5 kb along the alignment. The distribution
of random reads and test statistics is available for each clone
(10).
In our analysis, we extracted all regions exceeding defined
thresholds as potential segmental duplications and analyzed the
read distribution to precisely delineate the boundaries of each
duplicated region (Fig.
1A). We set our thresholds of duplication detection at 81 reads
per 5 kb for autosomes and 47 reads per 5 kb for
the sex chromosomes (3 SD beyond the mean, based on our
analysis of unique regions) (Table
1) (table S1). With such a database of duplicated sequence,
other sequences or assemblies could be screened and the
positions of highly similar duplications determined. A
consensus sequence from the multiple sequence alignment (both
the public clone and WGS reads) was constructed if the clone
showed an increased read depth (8).
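
The thresholding step described above (3 SD beyond the mean read depth of unique control regions) might be sketched like this. The function names are hypothetical, the use of the sample standard deviation is an assumption (the text does not say which estimator was used), and merging overlapping flagged windows into intervals is only a rough stand-in for the read-distribution analysis the authors use to delineate boundaries.

```python
from statistics import mean, stdev


def depth_threshold(control_counts, n_sd=3.0):
    """Mean + n_sd * SD of reads per window over unique control windows.

    Assumption: sample standard deviation (statistics.stdev).
    """
    return mean(control_counts) + n_sd * stdev(control_counts)


def flag_regions(window_rows, threshold, window=5000):
    """Merge windows whose read count exceeds the threshold into intervals.

    window_rows: iterable of (window_start, read_count), sorted by start.
    Returns a list of (start, end) intervals on the reference clone.
    """
    regions = []
    for w_start, count in window_rows:
        if count <= threshold:
            continue
        w_end = w_start + window
        if regions and w_start <= regions[-1][1]:
            # Overlaps or touches the previous flagged interval: extend it.
            regions[-1] = (regions[-1][0], max(regions[-1][1], w_end))
        else:
            regions.append((w_start, w_end))
    return regions
```

The reported cutoffs (81 reads per 5 kb for autosomes, 47 for the sex chromosomes) could be passed directly as `threshold` instead of being recomputed from control data.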
The consensus is analogous to the consensus sequences built for common
repeat elements. The resulting segmental duplication database
contains 8595 regions representing 130.5 megabases (Mb) of
DNA. This sequence database is available [(10);
see also the August 2001 assembly browser at the
University of California, Santa Cruz (UCSC)].

We
tested the power of this method to detect duplications in three ways.
First, we analyzed the depth of coverage across human chromosome
22, whose segmental duplication pattern has been extensively characterized
(fig. S2) (11-13).
Unique regions (28 Mb of sequence) showed a narrow
distribution of 50.4 ± 12.8 reads per 5 kb,
which attests to the uniform nature of the WGS reads. False-positive
increases in read number were almost
exclusively due to the presence of high-copy number repeats,
which were then filtered (8).
Within duplicated regions, all duplications >10 kb and with
>95% similarity had demonstrable increases in the number of
reads per 5 kb. Second, we analyzed a set of duplicated
BACs that had duplications detectable by fluorescence in situ
hybridization (FISH) that also had been sequenced (table S2).
We identified 36/37 of these BACs as duplicated based on our
standards, which suggests a false-negative rate of 2.5%. A
reciprocal experiment analyzing large-insert clones that tested
positive with WGS detection (WSSD) showed 13/14 as being duplicated
by metaphase and/or interphase FISH analysis (table S3). As a
final test of sensitivity, we examined whether our thresholds could
detect well-characterized duplications from the literature (table
S4) (6,
14,
15).
We analyzed a total of 27 genomic regions and detected all
duplications of >15 kb and with >95% identity, many of
which are associated with known genomic disorders. Because of
our initial alignment parameters (8),
duplications with a sequence identity of <94% were not
reliably predicted within this set. Such duplications, however, are
easily identified by genome assembly comparisons (see below).

Next,
we performed a whole-genome assembly comparison (WGAC) to detect
duplications (pairwise alignments

We
also examined the impact of duplications on single nucleotide polymorphism
(SNP) discovery by analyzing the content of the public SNP
database (dbSNP) as placed on the UCSC assembly (16).
We hypothesized that when duplications remain unrecognized,
paralogous sequence variants may be falsely identified as SNPs.
This would increase the apparent density of "SNPs" within duplicated
regions. The average SNP density was indeed increased in
duplicated regions compared with unique regions (1.33 versus 0.69 SNP
per kb, respectively; table S6). Because there is no reason to
expect that polymorphic variation is increased within duplicated
regions, the approximate doubling of SNP density suggests that
roughly one of two SNPs is, in fact, a paralogous sequence variant
rather than an allele. Current in silico methods examining sequence
overlaps account for most of these false positives (table S6).
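
The "roughly one of two" figure follows from simple arithmetic on the two reported densities, under the stated assumption that true polymorphism density is the same in duplicated and unique sequence. The helper name `excess_fraction` is introduced here purely for illustration.

```python
def excess_fraction(density_dup, density_unique):
    """Share of apparent SNPs in duplicated sequence above the unique baseline.

    Assumes real allelic variation occurs at the unique-sequence density,
    so the surplus is attributed to paralogous sequence variants.
    """
    return (density_dup - density_unique) / density_dup


ratio = 1.33 / 0.69                    # SNPs per kb, duplicated vs unique
frac = excess_fraction(1.33, 0.69)
print(f"density ratio: {ratio:.2f}")   # density ratio: 1.93
print(f"excess fraction: {frac:.2f}")  # excess fraction: 0.48
```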
We estimate that about 100,000 paralogous sequence variants currently
contaminate dbSNP.

Nonallelic
homologous recombination between blocks of duplicated sequence leads to
microdeletion, microduplication, and inversion of genomic
segments. If genes flanked by these duplications are rearranged,
disease may result (17-20).
To identify such potential regions of genomic instability, we assessed
the pattern of intrachromosomal duplication (Fig.
2). The most prevalent disorders usually involve
duplications that are >95% similar and >10 kb, separated
by 50 kb to 10 Mb of DNA (6).
Compiling the regions encompassed by duplications meeting these
criteria creates a genome map of likely rearrangement hot spots
(Fig.
2; gold bars below sequence). We identified a total of 169 regions
constituting roughly one-tenth of the genome (298 Mb).
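
The selection criteria stated above (>95% similarity, >10 kb, separated by 50 kb to 10 Mb) can be expressed as a simple filter over pairs of intrachromosomal duplications. This is a sketch under assumptions: the `DupPair` record and both function names are hypothetical, length is taken as the shorter of the two copies, and separation is measured between the inner edges of the copies. The text does not specify either convention.

```python
from dataclasses import dataclass


@dataclass
class DupPair:
    """A pairwise intrachromosomal duplication (hypothetical record)."""
    chrom: str
    a_start: int
    a_end: int
    b_start: int
    b_end: int
    identity: float  # percent sequence identity between the two copies


def _ordered(p):
    """Return the two copies as (start, end) tuples in chromosomal order."""
    return sorted([(p.a_start, p.a_end), (p.b_start, p.b_end)])


def is_hotspot_candidate(p, min_identity=95.0, min_len=10_000,
                         min_gap=50_000, max_gap=10_000_000):
    """Apply the >95%, >10 kb, 50 kb to 10 Mb criteria to one pair."""
    (s1, e1), (s2, e2) = _ordered(p)
    shortest = min(e1 - s1, e2 - s2)   # assumption: shorter copy sets length
    gap = s2 - e1                      # assumption: inner-edge separation
    return (p.identity > min_identity
            and shortest > min_len
            and min_gap <= gap <= max_gap)


def flanked_region(p):
    """Interval lying between the two copies of a qualifying pair."""
    (_, e1), (s2, _) = _ordered(p)
    return (p.chrom, e1, s2)
```

Collecting `flanked_region(p)` for every pair that passes `is_hotspot_candidate` and merging overlapping intervals would give a map comparable in spirit to the gold bars in Fig. 2.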
Twenty-four of these regions have already been associated with
genomic disorders.

Fig. 2. Patterns of intrachromosomal and interchromosomal duplication
(

Different
human chromosomes appear to show distinct landscapes for segmental
duplication (Fig.
2). Although interchromosomal duplications within
pericentromeric and subtelomeric regions are well documented (5,
21),
these biases have not been observed for all chromosomes. It
appears that many pericentromeric regions such as 3p, 3q, 4p,
4q, 5p, 6q, 8p, 8q, 12p, 18q, 20q, Xp, and Xq are quiescent,
showing no sign of recent duplication between chromosomes (Fig.
2) (fig. S4). Subtelomeric regions also show variability in
duplication content. Final assessment must await further
completion of the reference sequence because duplicated
pericentromeric and subtelomeric regions are underrepresented relative
to the rest of the genome.

To
assess the duplication distribution more directly, we developed a random
genome model of segmental duplication. The genome was
partitioned into 2881 segments of 100 kb (fig. S3 and table
S5), genome sequence was randomly assigned to each bin, and the
duplication content for each chromosome was calculated (n = 10,000
replicates). Human chromosomes 7, 9, 15, 16, 17, 19, 22, and
Y were significantly enriched for both inter- and
intrachromosomal duplications, whereas chromosomes 2, 3, 4, 5, 8, 14, and
20 appeared to be significantly reduced for segmental
duplication content (P < 0.0001). Such
variation was not due to the finished state of the chromosomes,
with which it shows no correlation (R² = 0.04)
(Fig.
2) (fig. S4).

It
has been argued that duplications may occur simply as a result of relaxed
negative selection in gene-poor regions that have no function;
thus, a negative correlation between gene density and
duplication content would be expected for chromosomes (22).
In fact, a significant positive, rather than negative, correlation
is seen when the relative gene density is compared with chromosomal
duplication content (R² = 0.16). The
correlation was due to intrachromosomal duplications (fig. S5; R² = 0.20;
P = 0.04; F test) and was absent for interchromosomal
duplications (R² = 0.002). The three
most gene-rich chromosomes showed high levels of duplication,
and the seven most gene-poor chromosomes were among the least
duplicated chromosomes.

To
determine what role recent segmental duplications have played in current
gene evolution, we characterized the gene content in our
filtered set of duplicated genomic sequence. We analyzed a
highly curated set of 13,351 mRNAs assigned to the human genome assembly
(RefSeq, www.ncbi.nlm.nih.gov/LocusLink/refseq.html).
We partitioned exons from each gene into a unique or duplicated
sequence on the basis of their map position (>90% sequence
identity). We identified a total of 7777 exons as being
transcribed from recently duplicated sequence, corresponding to
6.1% of all RefSeq exons (128,467). This is slightly greater
than the genomic representation of segmental duplication
(5.2%), which confirms that gene-poor regions have not been
preferentially duplicated. In many cases, a complete complement
of exons was not duplicated. These incomplete duplicated genes
were often found adjacent to other duplicated cassettes that
originated from elsewhere in the genome. By comparing our data
with human expressed sequence tag databases, we found evidence
for "chimeric" or fusion transcripts that emerged from the
physical juxtaposition of incomplete segmental duplications. Although
the mechanism for recent segmental duplications is not understood,
the existing data suggest the process may play a role in exon
shuffling associated with expanding protein diversity. A
complete list of all genes with one or more exons within duplicated genomic
sequence is available (8).

To
further assess whether specific kinds of genes or biological processes
have been preferentially duplicated, we compared all RefSeq
mRNAs on the basis of their INTERPRO protein domain classification
(Table
2) (table S7) (23).
In this analysis, we considered a gene duplicated only if all
its exons were contained within a duplicated genomic region.
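
The all-exons rule stated above can be illustrated with a small interval check. This is a sketch under assumptions: coordinates are on a single chromosome, an exon counts as duplicated only when it lies entirely inside one merged duplicated interval, and the function names and the "partial" label are hypothetical.

```python
from bisect import bisect_right


def merge_intervals(intervals):
    """Merge overlapping or touching (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def exon_is_duplicated(exon, merged):
    """True if the exon lies entirely within one merged duplicated interval."""
    start, end = exon
    starts = [s for s, _ in merged]
    i = bisect_right(starts, start) - 1   # last interval starting at or before
    return i >= 0 and merged[i][1] >= end


def classify_gene(exons, dup_intervals):
    """Label a gene 'duplicated' only if every exon is in duplicated sequence."""
    merged = merge_intervals(dup_intervals)
    flags = [exon_is_duplicated(e, merged) for e in exons]
    if flags and all(flags):
        return "duplicated"
    if any(flags):
        return "partial"
    return "unique"
```

Genes labeled "partial" here would correspond to the incomplete duplicated genes discussed earlier, which this domain analysis excludes.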
Our analysis suggests a nonrandom distribution of segmental
duplications within the proteome. Genes associated with
immunity and defense (natural killer receptors, defensins,
interferons, serine proteases, cytokines), membrane surface
interactions (galectins, HLA, lipocalins, carcinoembryonic antigens),
drug detoxification (cytochrome P450), and growth/development (somatotropins,
chorionic gonadotropins, pregnancy-specific glycoproteins) were
particularly enriched. It should be emphasized that our gene analysis
is restricted to genomic segments that show

Jeffrey A. Bailey,1 Zhiping Gu,2 Royden A. Clark,1 Knut Reinert,2 Rhea V. Samonte,1 Stuart Schwartz,1 Mark D. Adams,2 Eugene W. Myers,2 Peter W. Li,2 Evan E. Eichler1*
1999-2005 Bioinformatics Center, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences