RSDB: representative protein sequence databases have high information content

Citation
J. Park et al., RSDB: representative protein sequence databases have high information content, BIOINFORMAT, 16(5), 2000, pp. 458-464
Number of citations
38
Subject categories
Multidisciplinary
Journal title
BIOINFORMATICS
ISSN journal
1367-4803
Volume
16
Issue
5
Year of publication
2000
Pages
458 - 464
Database
ISI
SICI code
1367-4803(200005)16:5<458:RRPSDH>2.0.ZU;2-#
Abstract
Motivation: Biological sequence databases are highly redundant for two main reasons: (1) various databanks keep redundant sequences, with many identical and nearly identical sequences; (2) natural sequences often have high sequence identities due to gene duplication. We wanted to know how many sequences can be removed before the databases start losing homology information. Can a database of sequences with mutual sequence identity of 50% or less provide us with the same amount of biological information as the original full database?

Results: Comparisons of nine representative sequence databases (RSDB) derived from full protein databanks showed that the information content of sequence databases is not linearly proportional to their size. An RSDB reduced to mutual sequence identity of around 50% (RSDB50) was equivalent to the original full database in terms of the effectiveness of homology searching. It was a third of the full database size, which resulted in six times faster iterative profile searching. The RSDBs are produced at different granularity for efficient homology searching.

Availability: All the RSDB files generated and the full analysis results are available through the internet: ftp://ftp.ebi.ac.uk/pub/contrib/jong/RSDB/ and http://cyrah.ebi.ac.uk:1111/Proj/Bio/RSDB

Contact: jong@biosophy.org
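The idea of reducing a database to sequences with mutual identity at or below a threshold can be illustrated with a minimal Python sketch. This is not the procedure used to build the RSDBs, which the abstract does not describe. The function names `crude_identity` and `pick_representatives` are hypothetical, and `difflib.SequenceMatcher.ratio()` is only a crude stand-in for an alignment-based sequence identity.

```python
from difflib import SequenceMatcher


def crude_identity(a: str, b: str) -> float:
    """Rough similarity in [0, 1]; an assumed proxy, not a real alignment identity."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def pick_representatives(seqs: dict[str, str], max_identity: float = 0.5) -> list[str]:
    """Greedily keep names whose similarity to every kept sequence is <= max_identity."""
    kept: list[str] = []
    # Visit longer sequences first (ties broken by name) so they become representatives.
    for name in sorted(seqs, key=lambda n: (-len(seqs[n]), n)):
        if all(crude_identity(seqs[name], seqs[k]) <= max_identity for k in kept):
            kept.append(name)
    return kept
```

With `max_identity=0.5` the kept set plays the role of an RSDB50-style subset under this toy similarity measure. The greedy pass compares each candidate only against sequences already kept, so the result depends on the visiting order.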