Database verification studies of SWISS-PROT and GenBank

 

[Editor's note]

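We know that current databases are, to some degree, imperfect, and that much of their functional annotation is inaccurate. If we trust the information they provide without knowing how reliable the databases are, we risk amplifying and propagating their errors. The authors of this paper examined the relationships among SWISS-PROT, TrEMBL, and GenBank to study this problem. Their study had two goals: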

 

First is to determine whether users can reliably identify those proteins in SWISS-PROT whose functions were determined experimentally, as opposed to proteins whose functions were predicted computationally. If this information were present in reasonable quantities, it would allow researchers to decrease the propagation of incorrect function predictions during sequence annotation, and to assemble training sets for developing the next generation of sequence-analysis algorithms.

Second is to assess the consistency between translated GenBank sequences and the corresponding SWISS-PROT and TrEMBL sequences.

From their results, they drew the following conclusions:

(1) Contrary to claims by the SWISS-PROT authors, we conclude that SWISS-PROT does not identify a significant number of experimentally characterized proteins.

(2) SWISS-PROT is more incomplete than we expected in that version 38.0 from July 1999 lacks many proteins from the full genomes of important organisms that were sequenced years earlier.

(3) Even if we combine SWISS-PROT and TrEMBL, some sequences from the full genomes are missing from the combined dataset.

(4) In many cases, translated GenBank genes do not exactly match the corresponding SWISS-PROT sequences, for reasons that include missing or removed methionines, differing translation start positions, individual amino-acid differences, and inclusion of sequence data from multiple sequencing projects. For example, results show that for Escherichia coli, 80.6% of the proteins in the GenBank entry for the complete genome have identical sequence matches with SWISS-PROT/TrEMBL sequences, 13.4% have exact substring matches, and matches for 4.1% can be found using BLAST search; the remaining 2.0% of E. coli protein sequences (most of which are ORFs) have no clear matches to SWISS-PROT/TrEMBL. Although many of these differences can be explained by the complexity of the DB, and by the curation processes used to create it, the scale of the differences is notable.
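The match categories reported in conclusion (4) can be illustrated with a small sketch. This is not the authors' pipeline. It is a minimal Python example that assumes protein sequences are plain strings and that a substring match in either direction counts. The function names `classify_match` and `summarize` and the toy sequences are hypothetical. The BLAST step is left out, so anything that would need a similarity search is simply reported as unmatched here.

```python
from collections import Counter


def classify_match(query, reference_seqs):
    """Classify one translated GenBank protein against reference sequences.

    Returns 'identical', 'substring', or 'unmatched'. Assumption: a
    substring match in either direction counts, which would cover cases
    such as a removed initial methionine or a shifted start position.
    """
    if not query:
        return "unmatched"
    if query in reference_seqs:
        return "identical"
    for ref in reference_seqs:
        # Skip empty references: "" is a substring of every string.
        if ref and (query in ref or ref in query):
            return "substring"
    # A real study would fall back to a similarity search (e.g. BLAST) here.
    return "unmatched"


def summarize(queries, reference_seqs):
    """Return the percentage of queries in each match category."""
    if not queries:
        return {}
    refs = set(reference_seqs)
    counts = Counter(classify_match(q, refs) for q in queries)
    return {k: round(100.0 * v / len(queries), 1) for k, v in counts.items()}


# Toy data, invented for illustration only.
genbank_proteins = ["MKTAYIAK", "KTAYIAK", "MSSQQQ", "MAAAA"]
swissprot_trembl = ["MKTAYIAK", "MKTAYIAKQR", "MSSQQQLL"]
report = summarize(genbank_proteins, swissprot_trembl)
```

Checking for an exact match first and only then scanning for substrings keeps the categories mutually exclusive, which is what makes the reported percentages add up to the whole genome.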

Interested readers are referred to the original paper.

 


1999-2005 Bioinformation Center, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences
All rights reserved.