The Z-value is an attempt to estimate the statistical significance of a
Smith-Waterman dynamic alignment score (SW-score) through the use of a
Monte-Carlo process. It partly reduces the bias induced by the composition
and length of the sequences.
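
As a minimal sketch of the idea, the Z-value of a pair of sequences can be
computed by comparing the observed SW-score with the scores obtained after
shuffling one of the sequences. The definition used here, Z = (S - mean) / sd
over the shuffled scores, is the commonly used one and is an assumption, as
are the function names, the simple match/mismatch scoring with a linear gap
penalty (a real analysis would typically use a substitution matrix and affine
gaps) and the choice to shuffle only the second sequence.

```python
import random
import statistics


def sw_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score (Smith-Waterman, linear gap penalty).

    The scoring parameters are illustrative assumptions, not the paper's.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            cell = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


def z_value(a, b, n_shuffles=100, rng=None):
    """Monte-Carlo Z-value: (observed - mean) / sd over shuffled scores.

    Shuffling preserves the length and composition of b.
    """
    if rng is None:
        rng = random.Random(0)
    observed = sw_score(a, b)
    letters = list(b)
    scores = []
    for _ in range(n_shuffles):
        rng.shuffle(letters)
        scores.append(sw_score(a, "".join(letters)))
    sd = statistics.stdev(scores)
    if sd == 0:
        return float("nan")  # all shuffled scores identical: Z undefined
    return (observed - statistics.fmean(scores)) / sd
```

Increasing n_shuffles reduces the Monte-Carlo noise on the estimated mean and
standard deviation, at a proportional cost in alignment computations.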
This paper is not a theoretical study on the distribution of SW-scores and
Z-values. Rather, it presents a statistical analysis of Z-values on large
datasets of protein sequences, leading to a law of probability that the
experimental Z-values follow.
First, we determine the relationships between the computed Z-value, an
estimate of its variance and the number of randomizations in the Monte-Carlo
process. Then, we illustrate that Z-values are less correlated with sequence
lengths than SW-scores are.
Next, we show that pairwise alignments performed on 'quasi-real' sequences
(i.e., randomly shuffled sequences of the same length and amino acid
composition as the real ones) lead to Z-value distributions that
statistically fit the extreme value distribution, more precisely the Gumbel
distribution (global EVD, Extreme Value Distribution). However, for real
protein sequences, we observe an over-representation of high Z-values.
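
To make the Gumbel fit concrete, the sketch below estimates the location and
scale of a Gumbel law from a sample of Z-values and evaluates the tail
probability P(Z >= z) = 1 - exp(-exp(-(z - mu) / beta)). The method-of-moments
estimator is a stand-in chosen for brevity, not necessarily the fitting
procedure used in the paper, and all function names are hypothetical.

```python
import math
import random
import statistics

EULER_GAMMA = 0.5772156649015329


def fit_gumbel_moments(values):
    """Method-of-moments estimates (mu, beta) for a Gumbel (maximum) law.

    Uses mean = mu + gamma * beta and sd = beta * pi / sqrt(6).
    """
    beta = statistics.stdev(values) * math.sqrt(6) / math.pi
    mu = statistics.fmean(values) - EULER_GAMMA * beta
    return mu, beta


def gumbel_tail(z, mu, beta):
    """P(Z >= z) under Gumbel(mu, beta); expm1 keeps small tails accurate."""
    return -math.expm1(-math.exp(-(z - mu) / beta))


def sample_gumbel(n, mu, beta, rng):
    """Draw n Gumbel variates by inverse-transform sampling."""
    out = []
    for _ in range(n):
        u = max(rng.random(), 1e-300)  # guard against log(0)
        out.append(mu - beta * math.log(-math.log(u)))
    return out
```

An excess of observed Z-values above what gumbel_tail predicts for the fitted
parameters is the kind of over-representation described for real sequences.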
We first determine a cutoff value that separates these overestimated
Z-values from those that follow the global EVD. We then show that the
interesting part of the tail of the Z-value distribution can be approximated
by another EVD (i.e., an EVD which differs from the global EVD) or by a
Pareto law.
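
As an illustration of the Pareto alternative, the sketch below fits a Pareto
tail to the Z-values above a given cutoff. It assumes the common
parameterization P(Z >= z) = (cutoff / z)^alpha for z >= cutoff and uses the
maximum-likelihood estimate of alpha; the parameterization, the estimator and
the function names are assumptions, since the abstract does not specify them.

```python
import math


def fit_pareto_tail(values, cutoff):
    """Maximum-likelihood Pareto exponent for the values >= cutoff.

    alpha = n / sum(log(z_i / cutoff)) over the n tail values.
    """
    if cutoff <= 0:
        raise ValueError("cutoff must be positive")
    tail = [v for v in values if v >= cutoff]
    log_sum = sum(math.log(v / cutoff) for v in tail)
    if log_sum == 0:
        raise ValueError("no values strictly above the cutoff")
    return len(tail) / log_sum


def pareto_tail(z, cutoff, alpha):
    """P(Z >= z) for a Pareto law starting at cutoff."""
    return 1.0 if z <= cutoff else (cutoff / z) ** alpha
```

The cutoff itself is an input here; how it is determined from the data is the
subject of the analysis and is not reproduced in this sketch.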
This has been confirmed for all proteins analysed so far, whether extracted
from individual genomes, or from the ensemble of five complete microbial
genomes comprising altogether 16956 protein sequences.
(C) 1999 Elsevier Science Ltd. All rights reserved.