Motivation: Database search programs such as FASTA, BLAST or a rigorous
Smith-Waterman algorithm produce lists of database entries, which are
assumed to be related to the query. The computation of statistical
significance of similarity scores is well established for single pairs
of sequences and using purely random models. However, the multi-trial
context of a database search poses new problems. The credibility of a
certain score obtained in a database search decreases with the amount
of data that is compared. To improve p-value computation for database
search experiments, statistical properties of the databases, such as
the distribution of sequence length and effects induced by frequently
repeated sequence patterns, need to be taken into account.

Results: We investigated the SWISS-PROT protein database Release 31.0
by running extensive simulations of database searches. A discrepancy is
observed between the theoretical predictions and the empirical
distribution.
To correct for this, we evaluate the statistical significance of scores
in the context of a database search by a contrasting semi-random model.
This model enhances purely random models by one additional parameter
reflecting individual statistical properties of real databases.
We call this parameter the effective size of the database.
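
The multi-trial effect described above can be illustrated with a minimal
sketch. It is not the model developed in this work: it assumes that a
search behaves like a number of independent pairwise comparisons, and
that this number is the effective size of the database. The function
name database_p_value and both of its parameters are hypothetical,
chosen only for this illustration.

```python
import math


def database_p_value(pairwise_p: float, effective_size: float) -> float:
    """Chance that at least one of `effective_size` comparisons reaches the score.

    Assumption (not taken from the source): the comparisons are independent,
    so the search-level p-value is 1 - (1 - p)^D, where p is the single-pair
    p-value and D is the effective size of the database.
    """
    if not 0.0 <= pairwise_p <= 1.0:
        raise ValueError("pairwise_p must lie in [0, 1]")
    if effective_size <= 0:
        raise ValueError("effective_size must be positive")
    if pairwise_p == 1.0:
        return 1.0
    # Numerically stable form of 1 - (1 - p)^D for very small p.
    return -math.expm1(effective_size * math.log1p(-pairwise_p))


# Hypothetical numbers: the same pairwise p-value becomes less convincing
# as the effective size of the database grows.
small_db = database_p_value(1e-6, 20_000)
large_db = database_p_value(1e-6, 40_000)
```

Under these assumptions, a score with a fixed pairwise p-value loses
credibility as the effective size increases, which is the qualitative
behaviour the abstract describes.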