STATISTICS OF LARGE-SCALE SEQUENCE SEARCHING

Authors
Citation
R. Spang and M. Vingron, STATISTICS OF LARGE-SCALE SEQUENCE SEARCHING, BIOINFORMATICS, 14(3), 1998, pp. 279-284
Number of citations
22
Subject Categories
Computer Science Interdisciplinary Applications; Biology Miscellaneous; Biochemical Research Methods
Journal title
ISSN journal
1367-4803
Volume
14
Issue
3
Year of publication
1998
Pages
279 - 284
Database
ISI
SICI code
1367-4803(1998)14:3<279:SOLSS>2.0.ZU;2-8
Abstract
Motivation: Database search programs such as FASTA, BLAST or a rigorous Smith-Waterman algorithm produce lists of database entries, which are assumed to be related to the query. The computation of statistical significance of similarity scores is well established for single pairs of sequences and using purely random models. However, the multi-trial context of a database search poses new problems. The credibility of a certain score obtained in a database search decreases with the amount of data that is compared. To improve p-value computation for database search experiments, statistical properties of the databases, such as the distribution of sequence length and effects induced by frequently repeated sequence patterns, need to be taken into account. Results: We investigated the SWISS-PROT protein database Release 31.0, running extensive simulations of database searches. A discrepancy is observed between the theoretical predictions and the empirical distribution. To correct for this, we evaluate the statistical significance of scores in the context of a database search by a contrasting semi-random model. This model enhances purely random models by one additional parameter reflecting individual statistical properties of real databases. We call this parameter the effective size of the database.
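
The multi-trial effect described in the abstract can be illustrated with a minimal sketch. It is not the authors' model: it assumes the standard extreme-value (Karlin-Altschul) approximation for a single pairwise comparison, and it assumes that a database search behaves like a number of independent comparisons equal to an "effective size". The function names and the parameters `lam`, `k` and `effective_size` are hypothetical and chosen only for illustration.

```python
import math


def pairwise_p_value(score, lam, k, m, n):
    """P(S >= score) for one pair of random sequences of lengths m and n.

    Uses the extreme-value approximation P = 1 - exp(-K * m * n * exp(-lambda * S)).
    lam and k are assumed, scoring-system-dependent constants.
    """
    expected_hits = k * m * n * math.exp(-lam * score)
    return -math.expm1(-expected_hits)


def database_p_value(p_single, effective_size):
    """Probability that at least one of `effective_size` independent
    comparisons reaches a score whose single-comparison p-value is p_single.

    Assumption for this sketch: P_db = 1 - (1 - p_single) ** effective_size.
    """
    if p_single >= 1.0:
        return 1.0
    # expm1/log1p keep precision when p_single is very small.
    return -math.expm1(effective_size * math.log1p(-p_single))


if __name__ == "__main__":
    # Hypothetical numbers: the same pairwise p-value becomes less
    # convincing as the (effective) amount of compared data grows.
    p = 1e-6
    for size in (1, 1_000, 100_000):
        print(size, database_p_value(p, size))
```

In this sketch the effective size is the only parameter that separates the pairwise p-value from the database-level one, which mirrors the single additional parameter mentioned in the abstract; how that parameter is estimated from a real database is not shown here.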