RSDB: representative protein sequence databases have high information content

Authors

Park, J Holm, L Heger, A Chothia, C

Citation

J. Park et al., RSDB: representative protein sequence databases have high information content, BIOINFORMAT, 16(5), 2000, pp. 458-464

Citations number

Subject categories

Multidisciplinary

Journal title

BIOINFORMATICS

ISSN journal

1367-4803

Volume

16

Issue

5

Year of publication

2000

Pages

458 - 464

Database

ISI

SICI code

1367-4803(200005)16:5<458:RRPSDH>2.0.ZU;2-#

Abstract

Motivation: Biological sequence databases are highly redundant for two main reasons: (1) various databanks keep redundant sequences, with many identical and nearly identical sequences; and (2) natural sequences often have high sequence identities due to gene duplication. We wanted to know how many sequences can be removed before the databases start losing homology information. Can a database of sequences with mutual sequence identity of 50% or less provide us with the same amount of biological information as the original full database?

Results: Comparisons of nine representative sequence databases (RSDB) derived from full protein databanks showed that the information content of sequence databases is not linearly proportional to their size. An RSDB reduced to mutual sequence identity of around 50% (RSDB50) was equivalent to the original full database in terms of the effectiveness of homology searching. It was a third of the full database size, which resulted in six times faster iterative profile searching. The RSDBs are produced at different granularity for efficient homology searching.

Availability: All the RSDB files generated and the full analysis results are available through the internet: ftp://ftp.ebi.ac.uk/pub/contrib/jong/RSDB/ and http://cyrah.ebi.ac.uk:1111/Proj/Bio/RSDB

Contact: jong@biosophy.org
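
Illustrative note (not part of the original record): the abstract does not say how the representative sets were built. The sketch below shows one common way to reduce a sequence set so that no two retained sequences exceed a mutual identity threshold, using a greedy pass. The function names `identity` and `greedy_representatives` are hypothetical, and `difflib.SequenceMatcher` is only a toy stand-in for alignment-based sequence identity; this should not be read as the authors' procedure.

```python
from difflib import SequenceMatcher


def identity(a: str, b: str) -> float:
    """Toy similarity score in [0, 1]; an assumption standing in for
    alignment-based sequence identity, which a real pipeline would use."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def greedy_representatives(seqs, threshold=0.5):
    """Keep a sequence only if its identity to every sequence already
    kept is at or below `threshold` (0.5 mirrors the 50% level)."""
    kept = []
    # Visit longer sequences first so they become the representatives.
    for s in sorted(seqs, key=len, reverse=True):
        if all(identity(s, k) <= threshold for k in kept):
            kept.append(s)
    return kept


if __name__ == "__main__":
    demo = ["MKTAYIAKQR", "MKTAYIAKQR", "MKTAYIAKQK", "GGGGGGGGGG"]
    print(greedy_representatives(demo))
```

In this toy input the exact duplicate and the near-duplicate are dropped, while the unrelated sequence is retained. The greedy pass compares each candidate against every kept sequence, so it scales poorly to full databanks; it is meant only to make the idea of a mutual identity cutoff concrete.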