|
The required num of microarray replicate |
|
[编者的话] 该文原来的题目是:How many replicates of arrays are required to detect gene expression changes in microarray experiments? A mixture model approach。看看这个题目,整个文章想解决的问题就不必编者多言了,想说的是,这是一个有趣的基本问题,该问题受多种因素影响,作者试图采用一个combined的方法去解决这一问题。
Abstract Background It has been recognized that replicates of arrays (or
spots) may be necessary for reliably detecting differentially expressed
genes in microarray experiments. However, the often-asked question of
how many replicates are required has barely been addressed in the
literature. In general, the answer depends on several factors: a given
magnitude of expression change, a desired statistical power (that is,
probability) to detect it, a specified Type I error rate, and the
statistical method being used to detect the change. Here, we discuss how
to calculate the number of replicates in the context of applying a
nonparametric statistical method, the normal mixture model approach, to
detect changes in gene expression. Results The methodology is applied to a data set containing
expression levels of 1,176 genes in rats with and without pneumococcal
middle-ear infection. We illustrate how to calculate the power functions
for 2, 4, 6 and 8 replicates. Conclusions The proposed method is potentially useful in designing
microarray experiments to discover differentially expressed genes. The
same idea can be applied to other statistical methods. Background Microarrays are used to measure the (relative)
expression levels of thousands of genes (or expressed sequence tags). A
comparison of gene expression in cells or tissues from two conditions
may provide useful information on important biological processes or
functions [1,2].
The challenge now is how to detect those genuine changes from noisy
data. It is now known that simply using fold changes, as in the earlier
days, is unreliable and inefficient [3,4].
More sophisticated statistical methods are called for. Many proposals
have appeared in the literature [3,4,5,6,7,8,9,10].
In particular, it has been noticed that it may be necessary to design an
experiment that uses multiple arrays (or multiple spots on each array)
containing multiple measurements for each gene under each condition. One
reason is that because of a high noise-to-signal ratio, a single array
may not provide enough information that can be reliably extracted [11].
More important, multiple measurements from each gene make it possible to
assess the potentially different variability of genes. The problem then
seems to fall within the traditional two-sample comparison in
statistics. Two of the best known two-sample statistical tests are the
two-sample t-test and the Wilcoxon test (or equivalently,
Mann-Whitney test). The t-test is parametric and is based on the
assumption that the gene-expression levels have normal distributions. In
contrast, the Wilcoxon test is nonparametric and is based on the ranks
of observed gene-expression levels. Although the t-test is robust
to departures from normality and the Wilcoxon test does not depend on
the normality assumption, the problem is that under non-normal
situations the t-test may be too conservative, and hence, as with
the Wilcoxon test, may have too low power, especially when the sample
size is small, which is the case for most microarray experiments. These
points have been verified in two case studies using real data [8,12].
In a class of nonparametric approaches [5,9,10],
a version of the two-sample t-statistic is used but its null
distribution is estimated nonparametrically, rather than directly
assumed to be a t-distribution. In addition, some earlier studies
have suggested that the variability of gene expression may be related to
the mean expression [3,4,6].
Therefore, it implies that the t-statistic being used should be
based on unequal variances for the two samples. An important and natural question often asked by
biologists is how many replicates are required. For microarray
experiments, unlike many other experimental contexts, this issue has
rarely been discussed in the literature. To our knowledge, the only
exception is the work by Black and Doerge [13],
which, however, is for the situation where parametric statistical
methods are applied to detect expression changes. In this paper, we
discuss the problem when a nonparametric method, the normal mixture
model approach [10],
is used to detect differential expression. But to facilitate
calculations of sample size, the formulation is slightly changed from
their original one. Nonparametric methods of microarray data analysis
have been pioneered by Efron and Tibshirani and co-workers [5,9].
They take advantage of the presence of replicates and thus can impose
much weaker modeling assumptions. For instance, the parametric methods
of Black and Doerge [13]
depend on the assumption on the log-normal or gamma distribution of
gene-expression levels, whereas the mixture model approach does not have
such a distributional assumption and directly estimates distributions
related to random errors. Note that modeling the distribution of random
errors has advantages over direct modeling of expression levels, and is
a common practice in applied statistics. For example, gene-expression
levels may be correlated (for example, as a result of coexpression of
some genes) whereas random errors can be more reasonably assumed to be
independent. This is similar to modeling longitudinal data using a
linear mixed-effects model [14]:
the responses from each subject (corresponding to a group of coregulated
genes here) are in general correlated, but the measurement errors from
the same subject can be considered to be independent after incorporating
a random-subject effect in the model. Note that the random effect will
be canceled out from the t-statistic for each gene. Our proposal
here also shows an attractive feature of the mixture model approach, as
compared to the other two nonparametric approaches [5,9],
because it is still unclear how the sample size/power calculation can be
done in the other two approaches. The problem of calculating the number of replicates
required in a microarray experiment is similar to that of sample
size/power calculations in clinical trials and other experiment designs;
the (to-be-determined) sample size in microarray experiments refers to
the number of replicates, whereas the number of genes is not an issue
here. As usual, we assume that the replicates are (approximately)
independent with each other, whether they are drawn from the same
individual or multiple individuals. In general, the required sample size
depends on several factors: the true magnitude of the change of gene
expression (say, d), the desired statistical power (that is,
probability) (
The proposed method is not restricted to any specific
microarray technology. From now on, the expression level can refer to a
summary measure of relative red-to-green-channel intensities in a
fluorescence-labeled cDNA array, a radioactive intensity of a
radiolabeled cDNA array (as used in the example later), or a summary
difference of the perfect match (PM) and mismatch (MM) scores from an
oligonucleotide array. The gene-expression levels may have been suitably
preprocessed, including dimension reduction, data normalization and data
transformation [5,15,16,17,18]. Results and discussion A statistical model We consider a generic situation that, for each gene i,
I = 1,2,..., N, we have (relative) expression levels X1i,...,
Xmi from m microarrays under condition 1, and Y1i,...,
Ymi from m arrays under condition 2. We need to
assume that m is an even integer. A general statistical model is
assumed for gene expression data: Xji
=
where
for any j = 1,..., m, l = 1,..., m
and i = 1,..., N. It is assumed that random errors
A goal is to detect all genes with
A test statistic To test the null hypothesis H0:
Note that the mean and variance of Zi
are
whereas the mean E(Zi) = 0
under H0. Hence, it can be seen that a large absolute
value of Zi, |Zi|, gives evidence
against H0. As the number of arrays (that is, m)
increases, the variance of the test statistic Zi
decreases. Hence, it is possible to reject H0 (that
is, detect differential expression for gene i) with any E(Zi)
≠ 0 if m is large enough. In other words, if the
Type I error rate and other parameters are fixed, then the statistical
power of the test will increase as m increases. This is the key
point that motivates the discussion on sample size calculations. To determine the cut-off point for |Zi|
to reject H0, we need to know or estimate the
distribution of Zi under H0, the
null distribution f0. In a parametric approach, based
on some full distributional assumptions for Xji and Yji,
one may derive the null distribution f0, such as in a
two-sample t-test. However, the validity of such a parametric
method critically depends on the correctness of assumed distributions,
which of course is not guaranteed. Here, we consider a nonparametric
approach: a finite normal mixture model is used to estimate f0
nonparametrically. Estimating the null distribution There may be various ways to estimate the null
distribution f0. For instance, using expression levels
of some housekeeping genes that are known to have non-differential
expression, one can construct their Zi scores and then
estimate f0 using the obtained Zi
scores. In practice, however, there may be only a small number of or no
housekeeping genes in a given experiment. Here, following the basic idea
in a class of nonparametric methods [5,9,10],
we construct a null score zi for each gene and then
use these null scores to estimate f0
nonparametrically. The null score is constructed from the same observed
gene expression data as used in Zi:
Under the assumption that
Thus zi and Zi
have the same distribution f0 under H0.
We use all zi values across all genes to estimate f0. In practice,
We assume that all the zi values for
i = 1,..., N are a random sample from f0;
thus we can use the observed zi values to estimate f0.
Pan et al. [10]
proposed estimating f0 using a finite normal mixture
model [19].
Specifically, it is assumed that
where
A mixture model can be fitted by maximum likelihood
using the expectation-maximization (EM) algorithm [19,20,21].
The number of components can be selected adaptively using the Akaike
Information Criterion (AIC) [22]
or the Bayesian Information Criterion (BIC) [23].
In using the AIC or BIC, one first fits a series of models with various
values of g0, then picks up the g0
corresponding to the first local minimum of AIC or BIC [24].
Some empirical studies seem to favor the use of BIC [24]. Determining the cut-off point Once we obtain an estimate of the null distribution f0,
we can determine the cut-off point of the rejection region for testing H0.
In general, as for a two-sample test, the rejection region can be
selected in the tails of f0 because, under the null
hypothesis, Zi should be close to the center of f0,
whereas if there is differential expression for gene i, Zi
is likely to be in one of the two tails of f0. The
specific choice may depend on the goal of the analysis. For example, if
we are only interested in detecting upregulated genes, we can choose the
rejection region at the right-tail of f0. Our proposed
method works for any specified way of determining the rejection region.
As f0 should be symmetric about its mean 0, and often
we are interested in both up- and downregulated genes, we propose to
take the rejection region at the two tails of f0, {z
: f0(z) < C
where
For microarray data, because we are testing H0
for each gene, the multiple test problem arises and some control on it
is necessary. Usually we can use Bonferroni's method. For instance, if
we want to maintain the genome-wide Type I error rate at the usual 5%
level, then the Bonferroni-adjusted gene-specific (that is,
test-specific) Type I error rate is
Once C
is the difference of the coefficients of variation
under the two conditions. If
Unsurprisingly, we can see that
Calculation of replicate numbers Now we describe how to calculate replicate numbers
based on some pilot data taken from earlier studies. We use zm,i
to explicitly denote the zi scores in (2) with m
replicates. Based on the data we can estimate the density function f0,m
(z;Ωg0)
of zm,ivalues as a normal mixture
From now on, we treat f0,m as
known in Equation (5). With estimated f0,m, we want
to estimate the density function f0,mk for zmk,i,
the zi scores based on mk replicates (with k
> 1). If we can have an estimate of f0,mk,
then we can obtain the corresponding power function
have the distribution f0,mk.
Thus, the density function for zmk,i
values is
For example, if we triple the number of replicates,
the resulting density function is
The number of components of f0,mk
may be too large. For example, if the number of components is g0
= 3 for m = n = 2, the corresponding numbers of components
for m = n = 4, m = n = 6 and m = n
= 8 are, respectively, g02 = 9, g03
= 27 and g04 = 81. In fact, some of these
components maybe very similar or have a negligible role, hence the form
of f0,mk, may be simplified. In the extreme
situation, as mk → ∞,
by the Central Limit Theorem, the mixture model will reduce to a
single-component normal distribution. Hence, we propose a
simulation-based method to select a more parsimonious model for f0,mk. On the basis of the mixture model f0,m
in Equation (5), we can generate a random sample of zm,i(j)
values [10],
from which we can calculate zmk,i
values using Equation (6). Using zmk,i
values we can fit a normal mixture model for f0,mk.
As we shall show later, we find such a fitted mixture model often
contains a smaller number of components than gk0,
as dictated in Equation (7), leading to a simplified form of f0,mk. Summary of the proposed method In summary, our proposed method of calculating the
required replicate number works in the following steps. Step 1. Suppose that we have pilot
gene expression data Xji and Yij
from m arrays under each condition. Use formula (2) to calculate
the scores zi,m. Step 2. Use zi,m
and the normal mixture model (5) to estimate f0,m. Step 3. For a specified Type I
error rate
Step 4. For any specified d,
calculate the power function
Step 5. For any given k >
1, use formula (7) or (6) to estimate f0,mk. Step 6. For a specified Type I
error rate
Step 7. For any specified d,
calculate the power function
Step 8. Repeat Steps 5 to 7 until
all k > 1 of interest have been tried. After the power functions for many possible mk
replicates have been obtained, we can determine an appropriate number of
replicates by considering all the factors involved, the desired power
and Type I error rate, the targeted expression changes and other
experimental constraints. An example To understand the pathogenesis of otitis media, a
study was conducted to identify genes involved in response to
pneumococcal middle-ear infection and to study their roles in otitis
media. Radioactively labeled DNA microarrays were applied to the mRNA
analysis of 1,176 genes in middle-ear mucosa of rats with and without
subacute pneumococcal middle-ear infection [26].
The data are available for the control group and for the pneumococcal
middle-ear infection group. A more detailed description of how the data
were collected and their public availability was provided in Pan et
al. [26].
For the purpose of sample size calculations and to mimic many practical
situations with only a small number of replicates, we only use m
= n = 2 arrays from each group. We first take a natural logarithm
transformation for all the observed gene-expression levels (that is,
radioactive intensities) so that the resulting distributions are less
skewed (which will reduce the number of components of a fitted mixture
model). Then, for each microarray, we standardize the transformed
gene-expression levels by subtracting their median. Because of the small m = 2, the sample SDs may
not be stable. One way is to add a small constant as suggested by Efron et
al. [5].
Here we follow the idea of Lin et al. [27]
and use a loess smoother [28]
to nonparametrically model the sample SDs in terms of the mean
expression levels (Figure 1).
Then we plug in the smoothed SD to calculate z2,i.
Note that an alternative use of SD or its modification in calculating z2,i
values will not change the basic idea and the following steps in sample
size calculations. We fitted three mixture models for f0,2
with g0 ranging from 1 to 3. Table 1
summarizes the model-fitting results. g0 = 1 was
selected as both AIC and BIC achieve their minima there. So the fitted f0
is a normal distribution, N(-0.0013, 0.1278). However, for the
purposes of general illustration, we choose g0 = 2 as
the fitted model: f0,2(z)
= 0.76
Figure 2a
presents the histogram of zi values and the fitted f0
with g0 = 1 and 2. There is not much difference
between the two fitted f0,2, both of which fit the
data well. In particular, f0,2 does not look like a t-distribution
with small degrees of freedom, as predicted from the t-test. A realization of z2,i can be
simulated in the following two steps. First, we draw a random number pi
from {1, 2} with probability 0.76 and 0.24 respectively. Second, if the
drawn pi = 1, zi is randomly drawn
from a normal distribution
If we want to have only one expected false-positive
result from testing each of 1,176 non-differentially expressed genes,
the gene-specific (or test-specific) Type I error rate is
Figures 4,5,6
give the results for testing N = 1,000, 5,000 and 10,000 genes,
respectively, while controlling the genome-wide Type I error rate at the
usual 5% level. It can be seen that as N increases, we also need
a larger number of arrays to maintain the power of the statistical test
when other parameters are fixed. For instance, for N = 10,000
(Figure 6),
even eight replicates cannot detect a change as large as d = 3
with 80% power, but six replicates can detect a change d = 4 with
80% power. Conclusions We have described a method for calculating the number
of replicates in microarray experiments. This method is designed for the
situation where the mixture approach is going to be taken to analyze the
data. Note that any method for sample size/power calculations has to
depend on a specific statistical test to be used in data analysis; this
explains why there is a huge literature on the topic for clinical
trials. However, because of the close relation between the mixture
approach and the other two recently proposed nonparametric approaches -
the empirical Bayes method [5]
and the statistical analysis of microarray (SAM) method [9]
- our proposed method can be also applied to provide some useful
guideline for designing microarray experiments even when one of the
latter two approaches (or other approaches) is planned to be used for
data analysis in a later stage. For instance, even though the null
distribution f0 is estimated using the null scores zi
in our proposal, there maybe alternative ways of estimating f0,
such as using an alternative nonparametric method (for example, kernel
or local likelihood), rather than the finite normal mixture model, to
estimate f0 or using the test statistics, Zi,
of a large number of housekeeping genes to estimate f0.
Some modifications to the test statistic Zi and the
null statistic zi are also possible, especially when
we consider differential gene expression across more than two
conditions. These are all interesting topics we are investigating now. In most sample size/power calculations, some pilot
data are needed to provide reasonable estimates of some parameters
needed for subsequent calculations. An alternative is to obtain
reasonable estimates from other similar studies in the literature.
However, because of the rapid development of microarray technology, the
latter is not likely and we expect a researcher will have to do his or
her own pilot study. This was the situation we considered in the
example. A particular challenge is how to obtain good estimates of the
variances of gene expression levels from a small number of replicates.
In our example, we considered a nonparametric method to smooth sample
variances. Some alternative smoothing methods have also appeared in the
literature. But it is not clear which one is the most desirable. This is
a topic for future study. The proposed method is straightforward to
statisticians and can be implemented in many existing statistical
packages. Our sample S-Plus program and data are available at [29]. Wei Pan1
|
|
|
|
1999-2005 中国科学院上海生命科学研究院生物信息中心 |