Estimating the probability for a protein to have a new fold: A statistical computational model

Citation
E. Portugaly and M. Linial, Estimating the probability for a protein to have a new fold: A statistical computational model, P NAS US, 97(10), 2000, pp. 5161-5166
Number of citations
31
Subject categories
Multidisciplinary
Journal title
PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA
ISSN journal
0027-8424
Volume
97
Issue
10
Year of publication
2000
Pages
5161 - 5166
Database
ISI
SICI code
0027-8424(20000509)97:10<5161:ETPFAP>2.0.ZU;2-2
Abstract
Structural genomics aims to solve a large number of protein structures that represent the protein space. Currently an exhaustive solution for all structures seems prohibitively expensive, so the challenge is to define a relatively small set of proteins with new, currently unknown folds. This paper presents a method that assigns each protein a probability of having an unsolved fold. The method makes extensive use of PROTOMAP, a sequence-based classification, and SCOP, a structure-based classification. According to PROTOMAP, the protein space encodes the relationship among proteins as a graph whose vertices correspond to 13,354 clusters of proteins. A representative fold for a cluster with at least one solved protein is determined after superposition of all SCOP (release 1.37) folds onto PROTOMAP clusters. Distances within the PROTOMAP graph are computed from each representative fold to the neighboring folds. The distribution of these distances is used to create a statistical model for distances among those folds that are already known and those that have yet to be discovered. The distributions of distances for solved and unsolved proteins are significantly different. This difference makes it possible to use Bayes' rule to derive a statistical estimate that any protein has a yet undetermined fold. Proteins that score the highest probability of representing a new fold constitute the target list for structural determination. Our predicted probabilities for unsolved proteins correlate very well with the proportion of new folds among recently solved structures (new SCOP 1.39 records) that are disjoint from our original training set.
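The Bayes' rule step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `fit_histogram` and `posterior_new_fold`, the toy distance samples, the add-one smoothing, and the prior of 0.3 are all assumptions made for illustration. The sketch estimates one distance distribution for proteins with known folds and one for proteins with new folds, then converts an observed graph distance into a posterior probability of a new fold.

```python
from collections import Counter


def fit_histogram(distances, max_distance):
    """Empirical distribution over integer distances 0..max_distance.

    Add-one (Laplace) smoothing is an assumption of this sketch; it keeps
    every distance at a non-zero probability.
    """
    counts = Counter(distances)
    total = len(distances) + (max_distance + 1)
    return [(counts.get(d, 0) + 1) / total for d in range(max_distance + 1)]


def posterior_new_fold(distance, p_d_given_new, p_d_given_known, prior_new):
    """Bayes' rule: P(new fold | distance to the nearest solved fold)."""
    numerator = p_d_given_new[distance] * prior_new
    denominator = numerator + p_d_given_known[distance] * (1.0 - prior_new)
    return numerator / denominator


# Hypothetical toy data: graph distances from a cluster to the nearest
# cluster with a solved fold, split by whether the fold turned out to be new.
known_fold_distances = [1, 1, 2, 1, 2, 3]
new_fold_distances = [3, 4, 4, 5, 3, 5]

p_known = fit_histogram(known_fold_distances, max_distance=5)
p_new = fit_histogram(new_fold_distances, max_distance=5)

# Rank candidate distances by their posterior probability of a new fold.
for d in range(6):
    print(d, round(posterior_new_fold(d, p_new, p_known, prior_new=0.3), 3))
```

In this toy setup, larger distances from solved folds yield higher posterior probabilities, which mirrors the idea of ranking proteins into a target list by their estimated chance of having a new fold.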