Motivation: Biological sequence databases are highly redundant for two main
reasons.
1. various databanks keep redundant sequences with many identical and
nearly identical sequences
2. natural sequences often have high sequence identities due to gene
duplication.
We wanted to know how many sequences can be removed before the databases
start losing homology information. Can a database of sequences with mutual
sequence identity of 50% or less provide us with the same amount of
biological information as the original full database?
Results: Comparisons of nine representative sequence databases (RSDB)
derived from full protein databanks showed that the information content of
sequence databases is not linearly proportional to their size. An RSDB
reduced to mutual sequence identity of around 50% (RSDB50) was equivalent
to the original full database in terms of the effectiveness of homology
searching. It was a third of the full database size, which resulted in
six times faster iterative profile searching. The RSDBs are produced at
different granularities for efficient homology searching.
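As an illustration of what "mutual sequence identity of 50% or less" means for a reduced set, the following is a minimal sketch of a greedy filter, not the procedure used to build the RSDBs. The function names `identity` and `representative_set` are hypothetical, and the identity measure is a deliberately crude assumption: it compares residues position by position without any alignment, whereas a real reduction would use aligned identities.

```python
from typing import Iterable, List


def identity(a: str, b: str) -> float:
    """Crude, alignment-free identity (an assumption for illustration):
    matching positions divided by the length of the longer sequence."""
    if not a or not b:
        return 0.0
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


def representative_set(seqs: Iterable[str], max_identity: float = 0.5) -> List[str]:
    """Greedily keep a sequence only if it is at most `max_identity`
    identical to every representative kept so far."""
    reps: List[str] = []
    # Longest sequences are considered first so they become representatives.
    for s in sorted(seqs, key=len, reverse=True):
        if all(identity(s, r) <= max_identity for r in reps):
            reps.append(s)
    return reps


if __name__ == "__main__":
    toy = ["MKTAYIAKQR", "MKTAYIAKQR", "MKTAYIAKQL", "GGGGGGGGGG"]
    print(representative_set(toy, max_identity=0.5))
```

In this toy input the exact duplicate and the near-identical variant are dropped, so every pair of remaining sequences satisfies the chosen identity threshold.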
Availability: All the RSDB files generated and the full analysis results
are available through the internet:
ftp://ftp.ebi.ac.uk/pub/contrib/jong/RSDB/
http://cyrah.ebi.ac.uk:1111/Proj/Bio/RSDB
Contact: jong@biosophy.org