E. Portugaly and M. Linial, Estimating the probability for a protein to have a new fold: A statistical computational model, P NAS US, 97(10), 2000, pp. 5161-5166
Number of citations
31
Subject Categories
Multidisciplinary
Journal title
PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA
Structural genomics aims to solve a large number of protein structures that
represent the protein space. Currently an exhaustive solution for all
structures seems prohibitively expensive, so the challenge is to define a
relatively small set of proteins with new, currently unknown folds. This paper
presents a method that assigns each protein a probability of having an
unsolved fold. The method makes extensive use of PROTOMAP, a sequence-based
classification, and SCOP, a structure-based classification. According to
PROTOMAP, the protein space encodes the relationship among proteins as a
graph whose vertices correspond to 13,354 clusters of proteins. A
representative fold for a cluster with at least one solved protein is determined after
superposition of all SCOP (release 1.37) folds onto PROTOMAP clusters.
Distances within the PROTOMAP graph are computed from each representative fold
to the neighboring folds. The distribution of these distances is used to
create a statistical model for distances among those folds that are already
known and those that have yet to be discovered. The distributions of distances
for solved and unsolved proteins differ significantly. This difference
makes it possible to use Bayes' rule to derive a statistical estimate that
any protein has a yet undetermined fold. Proteins that score the highest
probability to represent a new fold constitute the target list for structural
determination. Our predicted probabilities for unsolved proteins correlate
very well with the proportion of new folds among recently solved structures
(new SCOP 1.39 records) that are disjoint from our original training set.
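
The Bayes' rule step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the add-one smoothing, the integer-valued distances, the prior, and the toy distance samples are all assumptions made for illustration only. It estimates P(distance | class) from observed distances for known-fold and new-fold proteins, then computes P(new fold | distance).

```python
from collections import Counter


def empirical_likelihood(distances, max_distance):
    """Estimate P(distance | class) from integer distances.

    Add-one smoothing is an assumption of this sketch; it keeps every
    distance in 0..max_distance at a nonzero probability.
    """
    counts = Counter(distances)
    total = len(distances) + (max_distance + 1)
    return {d: (counts[d] + 1) / total for d in range(max_distance + 1)}


def posterior_new_fold(distance, lik_new, lik_known, prior_new):
    """Bayes' rule: P(new | d) = P(d | new) P(new) / P(d)."""
    numerator = lik_new[distance] * prior_new
    evidence = numerator + lik_known[distance] * (1.0 - prior_new)
    return numerator / evidence


# Toy, made-up distance samples (hypothetical, not data from the paper).
known_distances = [1, 1, 1, 2, 2, 3]
new_distances = [2, 3, 3, 4, 4, 4]

lik_known = empirical_likelihood(known_distances, max_distance=4)
lik_new = empirical_likelihood(new_distances, max_distance=4)

# A prior of 0.5 is an arbitrary choice for the example.
p_far = posterior_new_fold(4, lik_new, lik_known, prior_new=0.5)
p_near = posterior_new_fold(1, lik_new, lik_known, prior_new=0.5)
```

In this toy setting, a protein far from any solved fold receives a higher posterior probability of a new fold than one close to a solved fold, which is the ranking behavior the target list relies on.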