|
Correlating overrepresented upstream motifs to gene expression |
|
[编者的话] 转录水平的调控是真核生物中主要的调控机制。文章介绍了一种发现调控元件的计算方法。通过对来自于芯片的数据进行分析以寻找编码区上游元件是近期的研究热点。
Abstract Background Gene regulation in eukaryotes is mainly effected
through transcription factors binding to rather short recognition motifs
generally located upstream of the coding region. We present a novel
computational method to identify regulatory elements in the upstream
region of eukaryotic genes. The genes are grouped in sets sharing an
overrepresented short motif in their upstream sequence. For each set,
the average expression level from a microarray experiment is determined:
If this level is significantly higher or lower than the average taken
over the whole genome, then the overerpresented motif shared by the
genes in the set is likely to play a role in their regulation. Results The method was tested by applying it to the genome of Saccharomyces
cerevisiae, using the publicly available results of a DNA microarray
experiment, in which expression levels for virtually all the genes were
measured during the diauxic shift from fermentation to respiration.
Several known motifs were correctly identified, and a new candidate
regulatory sequence was determined. Conclusions We have described and successfully tested a simple
computational method to identify upstream motifs relevant to gene
regulation in eukaryotes by studying the statistical correlation between
overepresented upstream motifs and gene expression levels. Introduction One of the biggest challenges of modern genetics is to
extract biologically meaningful information from the huge mass of raw
data that is becoming available. In particular, the availability of
complete genome sequences on one hand, and of genome-wide microarray
data on the other, provide invaluable tools to elucidate the mechanisms
underlying transcriptional regulation. The sheer amount of available
data and the complexity of the mechanisms at work require the
development of specific data analysis techniques to identify statistical
patterns and regularities, that can then be the subject of experimental
investigation. The regulation of gene expression in eukaryotes is
known to be mainly effected through transcription factors binding to
rather short recognition motifs generally located upstream of the coding
region. One of the main problems in studying regulation of gene
expression is to identify the motifs that have transcriptional meaning,
and the genes each motif regulates. The usual approach to this kind of analysis begins by
identifying groups of co-regulated genes, for example by applying
clustering techniques to the expression profiles obtained from
microarray experiments. One then studies the upstream sequences of a set
of coregulated genes looking for shared motifs. Examples of this
approach as applied to S. cerevisiae are Refs. [1,2,4]. In this paper we suggest an alternative method which
somehow follows the inverse route: genes are grouped into (non-disjoint)
sets, each set being characterized by a short motif which is
overrepresented in the upstream sequence. For each set, the average
expression is computed for a certain microarray experiment, and compared
to the genome-wide average expression from the same experiment. If a
statistically significant diffrence is found, then the motif that
defines the set of genes is a candidate regulatory sequence. The
rationale for looking for overrepresented motifs is that, in many
instances, regulatory motifs are known to appear repeated many times
within a relatively short upstream sequence [2,3],
so that the number of repetitions turns out to be much bigger than what
would be expected from chance alone. A somehow related approach, which does not require any
previous grouping of genes based on their expression profiles, was
presented in Ref. [5],
where the effect of upstream motifs on gene expression levels is modeled
by a sum of activating and inhibitory terms. Experimental expression
levels are then fitted to the model, and statistically significant
motifs are identified. Our approach differs in the importance given to
overrepresented motifs, thus considering activation and inhibition as an
effect that depends on a threshold number of repetitions of a motif
rather than on additive contributions from all motifs. Clearly the two
mechanisms are far from being mutually exclusive, therefore we expect
the candidate regulatory sites found with the two methods to
significantly overlap. However it is important to notice that the kind of
statistical correlation between upstream motifs and expression that our
algorithm identifies does not depend on any special assumption on the
functional dependence of expression levels on the number of motif
repetitions, as long as this dependence is strong enough to provide a
significant deviation from the average expression when enough copies of
the motif are present. A comparison of our results with those obtained
in Ref. [5]
is provided in the "Results and discussion" section. The method In general the motifs with known regulatory function
are not identified with a fixed nucleotide sequence, but rather with
sequences where substitutions are allowed, or spaced dyads of fixed
sequences, etc. However in this study, in order to test the method while
keeping the technical complications to a minimum, we will limit
ourselves to fixed short nucleotide sequences, that we call words.
While previous studies (see e.g [2])
show that even this simple analysis can give interesting results, the
method we present can easily be generalized to include variable
sequences and other more complicated patterns. The computational method we propose has two main
steps: first the open reading frames (ORFs) of an eukaryote genome are
grouped in (overlapping) sets based on words that are overrepresented in
their upstream region, compared to their frequencies in the reference
sample made of all the upstream regions of the whole genome. Each set is
labelled by a word. Then for each of these sets the average expression
in one or more microarray experiments are compared to the genome-wide
average: if a statistically significant difference is found, the word
that labels the set is a candidate regulatory site for the genes in the
set, either enhancing or inhibiting their expression. It is worth stressing that the grouping of the genes
into sets depends only on the upstream sequences and not on the
microarray experiment considered: It needs to be done only once for each
organism, and can then be used to analyse an arbitrary number of
microarray experiments. It is precisely this fact that should allow the
extension of the method to patterns more complex than fixed sequences,
while keeping the required computational resources within reasonable
limits. Constructing the sets We consider the upstream region of each open reading
frame (ORF), and we fix the maximum length K of the upstream
sequence to be considered. The choice of K depends on the typical
location of most regulatory sites: in general K is a number
between several hundred and a few thousand. For each ORF g, the
actual length of the sequence we consider is Kg
defined as the minimum between K and the available upstream
sequence before the coding region of the previous gene. For each word w of length l (6 ≤ l ≤ 8 in this study), and for each ORF g we
compute the number mg(w) of occurrences of w
in the upstream region of g. Non palindromic words are counted on
both strands: therefore we define the effective number of occurrences ng(w)
as
where
We define the global frequency p(w) of
each word w as
where, in order to count correctly the available space
for palindromic and non palindromic words,
p(w) is therefore the frequency
with which the word w appears in the upstream regions of the
whole genome: it is the "background frequency" against which
occurrences in the upstream regions of the individual genes are compared
to determine which words are overrepresented. For each ORF g and each word w we
compute the probability bg(w) of finding ng(w)
or more occurrences of w based on the global frequency p(w):
We define a maximum probability P, depending in
general on the length l of the words under consideration, and
consider, for each w, the set
of the ORFs in which the word w is
overepresented compared to the frequency of w in the upstream
regions of the whole genome. That is, w is considered
overrepresented in the upstream region of g if the probability of
finding ng(w) or more instances of w
based on the global frequency is less than P. This completes the construction of the sets S(w).
Two free parameters have to be fixed: the length K of the
upstream region to be considered and the probability cutoff P for
each length l of words considered. A result in Ref. [2]
suggests suitable choices of these two numbers: the authors list the 34
ORFs of S. cerevisiae that have 3 or more occurrences of the word
GATAAG in their 500 bp upstream region. 23 out of these 34 ORFs
correspond to a gene with known function, and 20 out of these 23 are
regulated by nitrogen. This result suggests to choose K = 500 for
the upstream length, and a value of the probability cutoff such that
three or more instances of GATAAG in the 500 bp upstream region of an
ORF are considered significant. Any choice of P between 0.018 and
0.1 would satisfy this criterion, and we chose P = 0:02.
Tentatively, we kept the same value of P for all values of l.
With this choice, the number of instances of a word that are necessary
to be considered overrepresented in a 500 bp upstream sequence can be as
high as six for common 6-letter words and as low as one for rare
8-letter words. In particular, our set S(GATAAG) almost1
coincides with the one discussed in [2].
However the word GATAAG will not turn out to be significant in our
study. As noted above, it would be natural to make the
probability cutoff P depend on the word length, simply because
the number of possible words increases with their length: For example
one could take the cutoff for each word length to be inversely
proportional to the number of independent words of such length. However
it turns out that this procedure tends to construct sets that are less
significant when tested for correlation with expression. Therefore we
chose to fix the cutoff at 0.02 for all word lengths. It is important to
keep in mind that no statistical significance whatsoever is attributed
to the sets per se: The only sets that are retained at the end of
the analysis are the ones that show significant correlation with
expression. Therefore the choice of the cutoff in the construction of
the sets can be based on such a pragmatic approach without jeopardizing
the statistical relevance of the final result. Studying the average expression level in each set The second step of our procedure consists in studying,
for each set S(w) defined as above, the expression
profiles of the ORFs belonging to S(w) in DNA microarray
experiments. The idea is that if the average expression profile in the
set S(w) for a certain experiment is significantly
different from the average expression for the same experiment computed
on the whole genome, then it is likely that some of the ORFs in S(w)
are coregulated and that the word w is a binding site for
the common regulating factor. To look for such instances we consider the gene
expression profiles during the diauxic shift, i.e. the
metabolic shift from fermentation to respiration, as measured with DNA
microarrays techniques in Ref. [1].
In the experiment gene expression levels were measured for virtually all
the genes of S. Cerevisiae at seven time-points while such
metabolic shift took place. The experimental results are publicly
available from the web supplement to Ref. [1]. We considered each time-point as a single experiment,
and for each gene g we defined the quantity rg(i)
(1 = 1,..., 7) as the log2 of the ratio between the mRNA
levels for the gene g at time-point i and the initial mRNA
level. Therefore e.g. rg(i) = 1 means a
two-fold increase in expression at timepoint i compared to
initial expression. For each time-point i we computed the
genome-wide average expression R(i) and its standard
deviation σ (i). These
are reported in Tab. 1,
where N(i) is the number of genes with available
expression value for each timepoint. Then for each word w we
compute the average expression in the subset of Sw
given by the genes for which an experimental result is available at
timepoint i (in most cases this coincides with Sw):
where N(i, w) is the number of ORFs in Sw
for which an experimental result at timepoint i is available, and
the difference
ΔRw(i)
is the discrepancy between the genome-wide average expression at
time-point i and the average expression at the same time-point of
the ORFs that share an abundance of the word w in their upstream
region. A significance index sig(i, w) is defined as
and the word w is considered significantly
correlated with expression at time point i if
In this work we chose Λ = 6: this means that we consider meaningful a
deviation of Rw(i) by six s.d.'s from its
expected value. The sign of sig(i, w) indicates whether w
acts as an enhancer or an inhibitor of gene expression. Results and discussion We found a total of 29 words of length between 6 and 8
above our significance threshold |sig| > 6. Most of them are related
to known regulatory motifs; two words turned out to be false positives
due to the presence, in their sets, of families of identical ORF's.
Finally, one word does not match any known motif and is a candidate new
binding site. The comparison between our significant words and known
motifs was performed using the database of regulatory motifs made
publicly available by the authors of Ref. [6],
and the CompareACE software [7]
available from the same web source. This package allowed us to compute
the Pearson correlation coeffcient of the best alignment between each of
our significant words and each known regulatory motif (expressed as a
set of nucleotide frequencies). We used the following criterion to associate our
significant words to known motifs: a motif is considered as identified
if at least one significant word scores better than 0.8 when compared to
it. A probability value for this choice of the cutoff can be estimated
to be a few percent: out of all the 2080 independent 6-letter words, 66
(that is 3.17%) score better than 0.8 with at least one motif. For 7-
and 8-letter words we have respectively 2.21% and 1.51%. Once a motif
has been identified, all words which score best with the motif are
attributed to it, independently of the score, provided their expression
pattern is consistent with the word(s) scoring better than 0.8. PAC and RRPE motifs Nine significant words can be associated to the PAC
motif [8,4,7],
all of them with rather high scores. They are shown in Tab. 2,
where, as in all the following tables, significativity indices are shown
only for those timepoints where they exceed our threshold jsigj
> 6. Given the perfect alignment of these words, it is not surprising
that these sets largely overlap each other: The union af all the nine
sets contains a total of 96 genes. As an example, in Fig. 1
we show the average expression for the genes associated with the word
GATGAG as a function of the time, compared to the average expression
computed over the whole genome. Fig. 2
shows the significance index for the same set. The genes in this set are
shown in Tab. 3
togather with their expression profiles. Two words can be associated with confidence to the
motif RRPE [4,7],
and are shown in Tab. 4.
The union of the two sets contains 76 genes. We see that genes
containing the motifs PAC and RRPE are repressed at the late
stage of the diauxic shift compared to the early stages. This result is
in agreement with the expression coherence score data available from the
web supplement to Ref. [6]:
There one can see that (1) of all known regulatory motifs, PAC and RRPE
show the highest expression coherence for the diauxic shift and (2) viceversa,
of the eight experimental conditions considered in Ref. [6],
the diauxic shift is the one in which both the PAC and RRPE motif show
the highest expression coherence score. STRE and MIG1 motifs A total of ten significant words can be associated to
the motifs STRE [9,10]
and MIG1 [11,12].
It is well known that these play an important role in glucose repression
(see e.g.[1,13]
and references therein). Most of these words show comparable scores for
the two motifs (due to their similarity) so we decided to show them
together in Tab. 5
which shows the two scores for each word. A total of 212 genes belong to
the union of all these sets. The UME6 motif Two words are associated to the known UME6 motif,
a.k.a. URS1 [14,15],
known to be a pleiotropic regulator implicated in glucose repression [16].
They are shown in Tab. 6.
The two sets do not overlap, so that a total of 56 genes are associated
to this motif. Other significant words Three words, shown in Tab. 7,
are of uncertain status: for the first one, the set S(ACTTTC) contains
only 2 genes, making the statistical significance of the result
questionable. The word CCCCTGAA scores best with the PDR motif (0.58):
given the low significance of this score, and the fact that PDR does not
seem to be relevant for any other word, this is most likely accidental.
The word should probably be considered as belonging to the STRE/MIG1
motif (the scores are STRE: 0.46, MIG1: 0.49). Finally the word GCCCCTGA
scores best with UME6 (0.55), but its expression pattern is more similar
to the STRE/MIG1 motifs (scores: STRE:0.44, MIG1: 0.46). False positives due to families of identical or nearly
identical ORF's The genome of S. cerevisiae contains a few
families of genes whose coding and upstream regions are identical
or nearly identical. Consider for example the COS1 gene (YNL336W): the
seven genes COS2-COS8 have both coding sequence and 500 kb upstream
sequence coinciding better than 80% with the COS1 sequence. Therefore if
the upstream sequence of COS1 contains over-erpresented words, they will
likely appear in all of the upstream regions. On the other hand, the
expression profiles of all the genes in the family will be the same when
measured by a microarray experiment, simply because the experimental
apparatus cannot distinguish between the mRNA produced by the various
members of the family, due to cross-hybridization between their mRNA.
Therefore all of the genes of the family are likely to occur in the sets
of the words that are overrepresented in their upstream region, and even
a small deviation from the genome-averaged expression acquires a
statistical significance. We found two instances of this in our analysis: the
words GACGTAGC and GGTCGCAC appear to be associated to significant
enhancement of the corresponding sets of genes at late timepoints in the
diauxic shift: however the two sets contain respectively seven out of
eight and all of the COS1-COS8 genes. Since the COS genes are mildly
overexpressed, this creates a false statistical significance. When one
corrects for this, by keeping only one representative of the family, the
statistical significance of the two sets disappears. A candidate new motif Finally, the word ATAAGGG/CCCTTAT is a candidate new
binding site, since it does not have good comparison scores with any of
the known motifs. It scores best with the AFT1 motif, with a 0.52 score
which is practically meaningless since 84.9% of all independent 7-letter
words score the same or better with at least one motif. It is associated
with 13 genes, as shown in Tab. 8,
which are overexpressed at late timepoints. The average expression
levels for the set and the significance index are shown as a function of
time in Figs. 3
and 4. Comparison with the results of Ref. [5] As stated in the introduction, the method proposed in
Ref.[5] also
allows one to identify regulatory motifs without any previous clustering
of gene expression data: a linear dependence of the logarithm of the
expression levels on the number of repetitions of each regulatory motifs
is postulated, and motifs are ranked according to the reduction in χ2 obtained when such dependence is
subtracted from the experimental expression levels. Iteration of the
procedure produces a model, that is a set of relevant regulatory
motifs, for each expression data set. In Ref. [5]
such a model is presented for the 14 min. time point in the α-synchronized cell-cycle experiment of Spellmann
et al., Ref. [17].
We used our algorithm on the same data set to compare the findings. Let
us concentrate on the 7-letter words (the longest considered in [5]).
We found 9 significant words, reported in Tab. 9.
Of these, five coincide with or are very similar to words found by the
authors of Ref.[5]
(see their Tab. 2).
The remaining four (AGGCTAA, GGCTAAG, GCTAAGC and CTAAGCG, whose
similarity clearly suggests the existence of a longer motif) are of
particular interest for the purpose of comparing the two methods: If one
looks at the dependence of the expression levels on the number of
occurrences of these words in the 500 bp upstream region, one clearly
sees the existence of an activation threshold (see Fig. 5,
where such dependence is shown for GGC-TAAG). On the other hand, by
looking at these data one hardly expects a significant reduction in χ2 when trying to describe this
dependence with a straight line. This should be compared to the same
dependence for the word AAAATTT, shown in Fig. 6,
which is found by both algorithms. On the other hand, there are
two 7-word motifs found in [5]
that do not pass our significativity threshold, that is CCTCGAC and
TAAACAA. We can conclude that the two methods tend to find
motifs with a different effect on gene expression: probably the best
results can be obtained by using them both on the same data set. Conclusions We have presented a new computational method to
identify regulatory motifs in eukaryotes, suitable to identify those
motifs that are effective when repeated many times in the upstream
sequence of a gene. The main feature that differentiates our method from
existing algorithms for motif discovery is the fact that genes are
grouped a priori based on similarities in their upstream
sequences. Most of the significant words the algorithm finds can
be associated to five known regulatory motifs: This fact consitutes a
strong validation of the method. Three of them (STRE, MIG1 and UME6)
were previously known to be implicated in glucose suppression, while the
fact that PAC and RRPE sites are relevant to regulation during the
diauxic shift is in agreement with expression coherence data as reported
in the web supplement to Ref. [6].
One of the significant words we find (ATAAGGG) cannot be identified with
any known motif, and is a candidate new binding site. It is easy, at least in principle, to extend the
method to a larger class of regulatory sites. According to our knowledge
of gene regulation, this should be done at least in two directions: (1)
the analysis should not be restricted to fixed sequences, but extended
to motifs with controlled variability; in particular the extension to
spaced dyads [18]
should be straightforward; (2) the combinatorial analysis of
binding sites [6]
could also be performed along the same lines, that is first grouping
genes according to which combinations of motifs appear in their upstream
region, and then analysing expression profiles within each group. 1Our set is smaller that the one
reported in Ref. [2]
because we do not allow the upstream sequence to overlap with the
previous gene: this eliminates 7 genes form the set. Acknowledgement Upstream sequences were downloaded, and many
cross-checks were performed, using the impressive collection of packages
available from the Regulatory Sequence Analysis Tools (RSAT) [19]
at http://www.ucmb.ulb.ac.be/bioinformatics/rsa-tools/
or ... http://embnet.cifn.unam.mx/rsa-tools/. Michele Caselle1
|
|
|
|
1999-2005 中国科学院上海生命科学研究院生物信息中心 |