STATISTICS OF LARGE-SCALE SEQUENCE SEARCHING

Authors

SPANG R; VINGRON M

Citation

R. Spang and M. Vingron, STATISTICS OF LARGE-SCALE SEQUENCE SEARCHING, BIOINFORMATICS, 14(3), 1998, pp. 279-284

Subject categories

Computer Science Interdisciplinary Applications; Biology Miscellaneous; Biochemical Research Methods

Journal title

BIOINFORMATICS

ISSN journal

1367-4803

Volume

14

Issue

3

Year of publication

1998

Pages

279 - 284

Database

ISI

SICI code

1367-4803(1998)14:3<279:SOLSS>2.0.ZU;2-8

Abstract

Motivation: Database search programs such as FASTA, BLAST or a rigorous Smith-Waterman algorithm produce lists of database entries, which are assumed to be related to the query. The computation of statistical significance of similarity scores is well established for single pairs of sequences and using purely random models. However, the multi-trial context of a database search poses new problems. The credibility of a certain score obtained in a database search decreases with the amount of data that is compared. To improve p-value computation for database search experiments, statistical properties of the databases, such as the distribution of sequence length and effects induced by frequently repeated sequence patterns, need to be taken into account.

Results: We investigated the SWISS-PROT protein database Release 31.0, running extensive simulations of database searches. A discrepancy is observed between the theoretical predictions and the empirical distribution. To correct for this, we evaluate the statistical significance of scores in the context of a database search by a contrasting semi-random model. This model enhances purely random models by one additional parameter reflecting individual statistical properties of real databases. We call this parameter the effective size of the database.
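
Illustration: the multi-trial effect described in the abstract can be sketched with the generic correction P_db = 1 - (1 - p)^n, where p is the p-value of a score for a single pairwise comparison and n is the number of independent comparisons. This is a textbook approximation shown only to make the idea concrete; it is not the semi-random model of the paper. The function name `database_p_value` and the parameter `effective_size` are hypothetical, and treating the effective size as a count of independent trials is an assumption of this sketch.

```python
import math


def database_p_value(pairwise_p: float, effective_size: float) -> float:
    """Chance that at least one of `effective_size` independent comparisons
    reaches a score whose single-comparison p-value is `pairwise_p`.

    Assumption of this sketch: comparisons are independent and the
    effective size stands in for the number of trials.
    """
    if not 0.0 <= pairwise_p <= 1.0:
        raise ValueError("pairwise_p must lie in [0, 1]")
    if effective_size < 0:
        raise ValueError("effective_size must be non-negative")
    if pairwise_p == 0.0 or effective_size == 0:
        return 0.0
    if pairwise_p == 1.0:
        return 1.0
    # 1 - (1 - p)^n, evaluated with log1p/expm1 so that very small
    # p-values do not lose precision to floating-point cancellation.
    return -math.expm1(effective_size * math.log1p(-pairwise_p))


if __name__ == "__main__":
    # The same pairwise p-value becomes less convincing as more data is searched.
    for n in (1, 1_000, 100_000):
        print(n, database_p_value(1e-6, n))
```

Under this approximation, a score with a fixed pairwise p-value loses significance as the effective size grows, which is the qualitative behaviour the abstract describes.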