What are the baselines for protein fold recognition?

Citation
L.J. McGuffin et al., What are the baselines for protein fold recognition?, Bioinformatics, 17(1), 2001, pp. 63-72
Number of citations
23
Subject categories
Multidisciplinary
Journal title
BIOINFORMATICS
ISSN journal
1367-4803
Volume
17
Issue
1
Year of publication
2001
Pages
63 - 72
Database
ISI
SICI code
1367-4803(200101)17:1<63:WATBFP>2.0.ZU;2-N
Abstract
Motivation: What constitutes a baseline level of success for protein fold recognition methods? As fold recognition benchmarks are often presented without any thought to the results that might be expected from a purely random set of predictions, an analysis of fold recognition baselines is long overdue. Given varying amounts of basic information about a protein, ranging from the length of the sequence to a knowledge of its secondary structure, to what extent can the fold be determined by intelligent guesswork? Can simple methods that make use of secondary structure information assign folds more accurately than purely random methods, and could these methods be used to construct viable hierarchical classifications?
Experiments performed: A number of rapid automatic methods which score similarities between protein domains were devised and tested. These methods ranged from those that incorporated no secondary structure information, such as measuring absolute differences in sequence lengths, to more complex alignments of secondary structure elements. Each method was assessed for accuracy by comparison with the Class Architecture Topology Homology (CATH) classification. Methods were rated against both a random baseline fold assignment method as a lower control and FSSP as an upper control. Similarity trees were constructed in order to evaluate the accuracy of optimum methods at producing a classification of structure.
Results: Using a rigorous comparison of methods with CATH, the random fold assignment method set a lower baseline of 11% true positives allowing for 3% false positives, and FSSP set an upper benchmark of 47% true positives at 3% false positives. The optimum secondary structure alignment method used here achieved 27% true positives at 3% false positives. Using a less rigorous Critical Assessment of Structure Prediction (CASP)-like sensitivity measurement, the random assignment achieved 6%, FSSP 59% and the optimum secondary structure alignment method 32%. Similarity trees produced by the optimum method illustrate that these methods cannot be used alone to produce a viable protein structural classification system.
Conclusions: Simple methods that use perfect secondary structure information to assign folds cannot produce an accurate protein taxonomy; however, they do provide useful baselines for fold recognition. In terms of a typical CASP assessment, our results suggest that approximately 6% of targets with folds in the databases could be assigned correctly by randomly guessing, and as many as 32% could be recognised by trivial secondary structure comparison methods, given knowledge of their correct secondary structures.
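Illustrative sketch (not from the paper): the simplest method mentioned in the abstract scores a pair of domains by the absolute difference in their sequence lengths, and methods are rated by the fraction of true positives recovered at 3% false positives. The Python below shows one way such a measurement could be computed. It assumes that a "true positive" is a same-fold pair scored at or below a threshold and that the false positive rate is taken over all different-fold pairs; the abstract does not give the paper's exact definitions. The domain names, lengths and fold labels are invented toy data, and `tp_rate_at_fp_rate` is a hypothetical helper name.

```python
from itertools import combinations


def length_difference(len_a, len_b):
    """Dissimilarity score: absolute difference in sequence length."""
    return abs(len_a - len_b)


def tp_rate_at_fp_rate(domains, max_fp_rate=0.03):
    """Fraction of same-fold pairs recovered before the false positive
    rate exceeds max_fp_rate.

    domains: list of (name, sequence_length, fold_label) tuples.
    Assumption: pairs are ranked by increasing length difference and
    tied scores are accepted or rejected together.
    """
    pairs = [
        (length_difference(len_a, len_b), fold_a == fold_b)
        for (_, len_a, fold_a), (_, len_b, fold_b) in combinations(domains, 2)
    ]
    pairs.sort(key=lambda pair: pair[0])

    total_pos = sum(1 for _, same_fold in pairs if same_fold)
    total_neg = len(pairs) - total_pos
    if total_pos == 0 or total_neg == 0:
        return 0.0

    best = 0.0
    tp = fp = 0
    i = 0
    while i < len(pairs):
        # Consume every pair that shares the current score.
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            if pairs[j][1]:
                tp += 1
            else:
                fp += 1
            j += 1
        if fp / total_neg > max_fp_rate:
            break
        best = tp / total_pos
        i = j
    return best


# Toy data (invented): (name, length, fold label).
toy_domains = [
    ("d1", 100, "A"),
    ("d2", 102, "A"),
    ("d3", 150, "B"),
    ("d4", 153, "B"),
    ("d5", 300, "C"),
]
toy_result = tp_rate_at_fp_rate(toy_domains, max_fp_rate=0.03)
```

On real data the fold labels would come from a reference classification such as CATH, and the same function could be reused with any other pairwise score in place of `length_difference`.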