M. Levitt and M. Gerstein, A UNIFIED STATISTICAL FRAMEWORK FOR SEQUENCE COMPARISON AND STRUCTURE COMPARISON, Proceedings of the National Academy of Sciences of the United States of America, 95(11), 1998, pp. 5913-5920
We present an approach for assessing the significance of sequence and
structure comparisons by using nearly identical statistical formalisms
for both sequence and structure. Doing so involves an all-vs.-all
comparison of protein domains [taken here from the Structural
Classification of Proteins (scop) database] and then fitting a simple
distribution function to the observed scores. By using this
distribution, we can
attach a statistical significance to each comparison score in the form
of a P value, the probability that a better score would occur by
chance. As expected, we find that the scores for sequence matching follow
an extreme-value distribution. The agreement, moreover, between the P
values that we derive from this distribution and those reported by
standard programs (e.g., BLAST and FASTA) validates our approach.
Structure comparison scores also follow an extreme-value distribution when the
statistics are expressed in terms of a structural alignment score
(essentially the sum of reciprocated distances between aligned atoms
minus gap penalties). We find that the traditional metric of structural
similarity, the rms deviation in atom positions after fitting aligned
atoms, follows a different distribution of scores and does not perform
as well as the structural alignment score. Comparison of the sequence
and structure statistics for pairs of proteins known to be related
distantly shows that structural comparison is able to detect approximately
twice as many distant relationships as sequence comparison at the
same error rate. The comparison also indicates that there are very few
pairs with significant similarity in terms of sequence but not structure,
whereas many pairs have significant similarity in terms of structure
but not sequence.
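
The step of fitting an extreme-value distribution to comparison scores
and turning a score into a P value can be illustrated with a minimal
sketch. Everything below is an assumption made for illustration, not
the authors' procedure: the function names are hypothetical, the
background scores are synthetic, and the fit uses the method of moments
for a Gumbel distribution rather than whatever fitting procedure the
paper applies to its all-vs.-all scores.

```python
import math
import random
import statistics

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def fit_gumbel_moments(scores):
    """Method-of-moments fit of a Gumbel (extreme-value) distribution.

    Illustrative assumption: returns (loc, scale) chosen so that the
    fitted distribution matches the sample mean and standard deviation.
    """
    mean = statistics.fmean(scores)
    std = statistics.stdev(scores)
    scale = std * math.sqrt(6.0) / math.pi
    loc = mean - EULER_GAMMA * scale
    return loc, scale


def p_value(score, loc, scale):
    """Probability of a score at least this high under the fitted Gumbel.

    Uses P = 1 - exp(-exp(-z)) with z = (score - loc) / scale.
    """
    z = (score - loc) / scale
    return -math.expm1(-math.exp(-z))


# Synthetic stand-in for chance comparison scores (not real data):
# Gumbel samples with location 20 and scale 5 via inverse-CDF sampling.
rng = random.Random(0)
background = []
for _ in range(5000):
    u = max(rng.random(), 1e-300)  # keep u strictly inside (0, 1)
    background.append(20.0 - 5.0 * math.log(-math.log(u)))

loc, scale = fit_gumbel_moments(background)

# A higher score receives a smaller P value.
for s in (30.0, 45.0, 60.0):
    print(f"score={s:5.1f}  P={p_value(s, loc, scale):.3g}")
```

In this sketch a score is called significant when its P value falls
below a chosen threshold; choosing the threshold fixes the error rate
at which sequence and structure comparison can be set side by side.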