A UNIFIED STATISTICAL FRAMEWORK FOR SEQUENCE COMPARISON AND STRUCTURE COMPARISON

Citation
M. Levitt and M. Gerstein, A UNIFIED STATISTICAL FRAMEWORK FOR SEQUENCE COMPARISON AND STRUCTURE COMPARISON, Proceedings of the National Academy of Sciences of the United States of America, 95(11), 1998, pp. 5913-5920
Number of citations
47
Subject Categories
Multidisciplinary Sciences
Journal ISSN
0027-8424
Volume
95
Issue
11
Year of publication
1998
Pages
5913 - 5920
Database
ISI
SICI code
0027-8424(1998)95:11<5913:AUSFFS>2.0.ZU;2-A
Abstract
We present an approach for assessing the significance of sequence and structure comparisons by using nearly identical statistical formalisms for both sequence and structure. Doing so involves an all-vs.-all comparison of protein domains [taken here from the Structural Classification of Proteins (scop) database] and then fitting a simple distribution function to the observed scores. By using this distribution, we can attach a statistical significance to each comparison score in the form of a P value, the probability that a better score would occur by chance. As expected, we find that the scores for sequence matching follow an extreme-value distribution. The agreement, moreover, between the P values that we derive from this distribution and those reported by standard programs (e.g., BLAST and FASTA) validates our approach. Structure comparison scores also follow an extreme-value distribution when the statistics are expressed in terms of a structural alignment score (essentially the sum of reciprocated distances between aligned atoms minus gap penalties). We find that the traditional metric of structural similarity, the rms deviation in atom positions after fitting aligned atoms, follows a different distribution of scores and does not perform as well as the structural alignment score. Comparison of the sequence and structure statistics for pairs of proteins known to be related distantly shows that structural comparison is able to detect approximately twice as many distant relationships as sequence comparison at the same error rate. The comparison also indicates that there are very few pairs with significant similarity in terms of sequence but not structure, whereas many pairs have significant similarity in terms of structure but not sequence.
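The step of fitting a distribution to observed scores and reading off a P value can be illustrated with a minimal Python sketch. It assumes the simplest form of the extreme-value (Gumbel) distribution and a method-of-moments fit; the abstract does not state how the authors performed their fit, so the fitting procedure, the function names (`fit_gumbel_moments`, `gumbel_p_value`, `sample_gumbel`), and the synthetic scores are all illustrative assumptions rather than the paper's method or data.

```python
import math
import random
import statistics

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def fit_gumbel_moments(scores):
    """Estimate Gumbel location (mu) and scale (beta) by the method of moments.

    Assumption: a moments fit is used here for brevity; it is not
    necessarily the fitting procedure used in the paper.
    """
    mean = statistics.fmean(scores)
    sd = statistics.stdev(scores)
    beta = sd * math.sqrt(6.0) / math.pi   # Var = (pi^2 / 6) * beta^2
    mu = mean - EULER_GAMMA * beta         # Mean = mu + gamma * beta
    return mu, beta


def gumbel_p_value(score, mu, beta):
    """Probability that a chance score is at least as good as `score`.

    P(S >= x) = 1 - exp(-exp(-(x - mu) / beta))
    """
    z = (score - mu) / beta
    if z < -700.0:                         # avoid overflow in exp(-z)
        return 1.0
    return -math.expm1(-math.exp(-z))


def sample_gumbel(rng, mu, beta):
    """Draw one Gumbel variate by inverting the CDF (synthetic data only)."""
    u = rng.random()
    while u == 0.0:                        # log(0) is undefined
        u = rng.random()
    return mu - beta * math.log(-math.log(u))


if __name__ == "__main__":
    rng = random.Random(0)
    # Synthetic stand-in for the scores of unrelated pairs (hypothetical values).
    background = [sample_gumbel(rng, 10.0, 2.0) for _ in range(20000)]
    mu_hat, beta_hat = fit_gumbel_moments(background)
    for s in (12.0, 20.0, 30.0):
        print(f"score={s:5.1f}  P={gumbel_p_value(s, mu_hat, beta_hat):.3e}")
```

In this sketch, higher scores map to smaller P values, so a threshold on P corresponds to a score cutoff; the same two functions could in principle be applied to sequence scores or to structural alignment scores, provided each set is fitted separately.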