|
Fitting a mixture model by expectation maximization to discover motifs in biopolymers |
|
The algorithm described in this paper discovers one or more motifs in a collection ofDNA or protein sequences by using the technique of expectation maximization to fit a two-component finite mixture model to the set of sequences. Multiple motifs are found by fitting a two-component finite mixture model to the data, probabilistically erasing the occurrences of the motif thus found, and repeating the process to find successive motifs. The algorithm requires only a set of sequences and a number specifying the width of the motifs as input. It returns a model of each motif and a threshold which together can be used as a Bayes-optimal classifier for searching for occurrences of the motif in other databases. The algorithm estimates how many times each motif occurs in the input dataset and outputs an alignment of the occurrences of the motif. The algorithm is capable of discovering several different motifs with differing numbers of occurrences in a single dataset. Motifs are discovered by treating the set of sequences as though they were created by a stochastic process which can be modelled as a mixture of two densities, one of which generated the occurrences of the motif, and the other the rest of the positions in the sequences. Expectation maximization is used to estimate the parameters of the two densities and the mixing parameter.
Fitting a mixture model by expectation maximization to discover motifs in biopolymers MEME是识别保守DNA(氨基酸)序列区域的motif的广为流行的软件,其方法的创始人Timothy Bailey和Charles Elkan早在1994年发表的这篇文章中就提出了利用最大期望的方法在一组序列中求motif的算法,他们也成为这一领域的领军人物。该算法的输入是一组序列和一个确定motif长度的整型数据,其返回的是识别到的motifs以及它们在输入数据集中出现的次数。该算法的技巧之处在于将motif出现的次数和位置的联合密度看作随机过程,利用最大期望法估计这两种密度以及他们联合密度的参数。对算法感兴趣的朋友,这篇文章值得一看,常常以MEME作为工具的朋友,了解它背后的实现机理也是大有裨益的。
|
|
|
|
1999-2005 中国科学院上海生命科学研究院生物信息中心 |