A simple, general approximation for the distribution of gapped local
alignment scores is presented, suitable for assessing the significance of
comparisons between two protein sequences or between a sequence and a
profile. The approximation takes account of the scoring scheme (i.e. gap
penalty and substitution matrix or profile), sequence composition and
length. Use of this formula means it is unnecessary to fit an extreme-value
distribution to simulations or to the results of databank searches. The
method is based on the theoretical ideas introduced by R. Mott and R. Tribe
in 1999. Extensive simulation studies show that score thresholds produced
by the method are accurate to within +/-5%, 95% of the time. We also
investigate factors which affect the accuracy of alignment statistics, and
show that any method based on asymptotic theory is limited because
asymptotic behaviour is not strictly achieved for many real protein
sequences, due to extreme composition effects. Consequently, it may not be
practicable to find a general formula that is significantly more accurate
until the sub-asymptotic behaviour of alignments is better understood.
(C) 2000 Academic Press.
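
For orientation, the sketch below shows how a score threshold follows from an extreme-value tail once its parameters are known. It uses the generic extreme-value form P(S >= x) = 1 - exp(-K m n exp(-lambda x)); it is not the approximation presented in this paper, whose formula is not reproduced in the abstract. The function names and the numeric values of lam and K are hypothetical, chosen only for illustration.

```python
import math

def evd_pvalue(score, lam, K, m, n):
    """Tail probability P(S >= score) under the generic extreme-value form.

    lam and K are the scale and prefactor of the distribution; m and n are
    the lengths of the two sequences. All values are supplied by the caller.
    """
    expected = K * m * n * math.exp(-lam * score)
    # 1 - exp(-expected), computed stably for small expected values
    return -math.expm1(-expected)

def score_threshold(alpha, lam, K, m, n):
    """Score at which the tail probability equals alpha (0 < alpha < 1)."""
    target = -math.log1p(-alpha)  # equals -ln(1 - alpha), always positive
    return math.log(K * m * n / target) / lam

# Illustrative values only (assumed, not taken from the paper).
lam, K, m, n = 0.25, 0.05, 300, 300
threshold = score_threshold(0.01, lam, K, m, n)
p_at_threshold = evd_pvalue(threshold, lam, K, m, n)
```

Because the threshold depends on lambda and K through a logarithm and a division, small errors in the estimated parameters shift it directly, which is why the accuracy of score thresholds is the natural quantity to report.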