Significance of Z-value statistics of Smith-Waterman scores for protein alignments

Citation
Jp. Comet et al., Significance of Z-value statistics of Smith-Waterman scores for protein alignments, COMPUT CHEM, 23(3-4), 1999, pp. 317-331
Number of citations
31
Subject Categories
Chemistry
Journal title
COMPUTERS & CHEMISTRY
ISSN journal
0097-8485
Volume
23
Issue
3-4
Year of publication
1999
Pages
317 - 331
Database
ISI
SICI code
0097-8485(1999)23:3-4<317:SOZSOS>2.0.ZU;2-5
Abstract
The Z-value is an attempt to estimate the statistical significance of a Smith-Waterman dynamic alignment score (SW-score) through the use of a Monte-Carlo process. It partly reduces the bias induced by the composition and length of the sequences. This paper is not a theoretical study on the distribution of SW-scores and Z-values. Rather, it presents a statistical analysis of Z-values on large datasets of protein sequences, leading to a law of probability that the experimental Z-values follow. First, we determine the relationships between the computed Z-value, an estimation of its variance and the number of randomizations in the Monte-Carlo process. Then, we illustrate that Z-values are less correlated to sequence lengths than SW-scores. Then we show that pairwise alignments, performed on 'quasi-real' sequences (i.e., randomly shuffled sequences of the same length and amino acid composition as the real ones) lead to Z-value distributions that statistically fit the extreme value distribution, more precisely the Gumbel distribution (global EVD, Extreme Value Distribution). However, for real protein sequences, we observe an over-representation of high Z-values. We determine first a cutoff value which separates these overestimated Z-values from those which follow the global EVD. We then show that the interesting part of the tail of distribution of Z-values can be approximated by another EVD (i.e., an EVD which differs from the global EVD) or by a Pareto law. This has been confirmed for all proteins analysed so far, whether extracted from individual genomes, or from the ensemble of five complete microbial genomes comprising altogether 16956 protein sequences. (C) 1999 Elsevier Science Ltd. All rights reserved.
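Illustrative sketch
The Monte-Carlo procedure described in the abstract can be illustrated roughly as follows: score the real pair, score the first sequence against many shuffled copies of the second, and express the real score in standard deviations above the shuffled mean, Z = (S - mean) / sd. This is only a minimal sketch, not the authors' implementation. The function names `sw_score` and `z_value`, the match/mismatch/linear-gap scoring (in place of a substitution matrix), the choice to shuffle only the second sequence, and the default of 100 shuffles are all assumptions made for illustration.

```python
import random
import statistics


def sw_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best Smith-Waterman local alignment score.

    Assumption: a toy match/mismatch scheme with a linear gap penalty,
    used here instead of a real amino acid substitution matrix.
    """
    prev = [0] * (len(b) + 1)  # previous row of the DP matrix
    best = 0
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            # Local alignment: a cell never drops below zero.
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


def z_value(a, b, n_shuffles=100, seed=0):
    """Monte-Carlo Z-value of the SW-score of a against b.

    Assumption: only b is shuffled; each shuffle keeps its length and
    composition, as for the 'quasi-real' sequences in the abstract.
    """
    rng = random.Random(seed)
    observed = sw_score(a, b)
    letters = list(b)
    shuffled_scores = []
    for _ in range(n_shuffles):
        rng.shuffle(letters)
        shuffled_scores.append(sw_score(a, "".join(letters)))
    mean = statistics.mean(shuffled_scores)
    sd = statistics.stdev(shuffled_scores)
    if sd == 0:
        raise ValueError("shuffled scores have zero variance")
    return (observed - mean) / sd
```

Calling `z_value(seq_a, seq_b)` on two related sequences should give a clearly positive value, while unrelated sequences should give values near zero. Increasing `n_shuffles` reduces the variance of the estimate at the cost of more alignments, which is the trade-off the abstract refers to.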