Motivation: What constitutes a baseline level of success for protein fold recognition methods? As fold recognition benchmarks are often presented without any thought to the results that might be expected from a purely random set of predictions, an analysis of fold recognition baselines is long overdue. Given varying amounts of basic information about a protein, ranging from the length of the sequence to a knowledge of its secondary structure, to what extent can the fold be determined by intelligent guesswork? Can simple methods that make use of secondary structure information assign folds more accurately than purely random methods, and could these methods be used to construct viable hierarchical classifications?
Experiments performed: A number of rapid automatic methods that score similarities between protein domains were devised and tested. These methods ranged from those that incorporated no secondary structure information, such as measuring absolute differences in sequence lengths, to more complex alignments of secondary structure elements. Each method was assessed for accuracy by comparison with the Class Architecture Topology Homology (CATH) classification. Methods were rated against both a random baseline fold assignment method as a lower control and FSSP as an upper control. Similarity trees were constructed in order to evaluate the accuracy of the optimum methods at producing a classification of structure.
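As an illustration of the simplest kind of method described above, the following is a minimal sketch of scoring by absolute difference in sequence length. The function names, the dictionary layout and the example lengths are hypothetical and are not taken from the study; the sketch only shows how such a score can rank library domains against a query.

```python
def length_difference(len_a: int, len_b: int) -> int:
    """Absolute difference in sequence length; smaller means more similar."""
    return abs(len_a - len_b)


def rank_by_length(query_len: int, library: dict) -> list:
    """Rank library domains (name -> sequence length) from most to least
    similar to the query, using length difference alone.

    This uses no secondary structure information at all.
    """
    return sorted(library, key=lambda name: length_difference(query_len, library[name]))


# Hypothetical domain lengths, for illustration only.
example_library = {"domA": 150, "domB": 105, "domC": 90}
ranking = rank_by_length(100, example_library)
```

Here `ranking` lists the hypothetical domains in order of increasing length difference from a 100-residue query.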
Results: Using a rigorous comparison of methods with CATH, the random fold assignment method set a lower baseline of 11% true positives allowing for 3% false positives, and FSSP set an upper benchmark of 47% true positives at 3% false positives. The optimum secondary structure alignment method used here achieved 27% true positives at 3% false positives. Using a less rigorous Critical Assessment of Structure Prediction (CASP)-like sensitivity measurement, the random assignment achieved 6%, FSSP achieved 59% and the optimum secondary structure alignment method achieved 32%. Similarity trees produced by the optimum method illustrate that these methods cannot be used alone to produce a viable protein structural classification system.
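A figure such as "true positives at 3% false positives" can be read as the fraction of same-fold pairs recovered before the false positive rate exceeds a fixed cut-off. The sketch below shows one way to compute such a value. It assumes that each scored pair carries a label saying whether the two domains share a fold in the reference classification. The function name, the handling of ties and the exact definition of the rates are assumptions, not the procedure used in the study.

```python
def tpr_at_fpr(scores, labels, max_fpr=0.03):
    """Highest true positive rate reached while the false positive rate
    stays at or below max_fpr.

    scores: similarity scores, higher meaning more similar.
    labels: True if the pair shares a fold in the reference classification.
    Ties in score are processed in input order (a simplifying assumption).
    """
    n_pos = sum(1 for is_pos in labels if is_pos)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative pair")

    # Walk down the list from the most to the least similar pair.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    tp = fp = 0
    best = 0.0
    for _, is_pos in ranked:
        if is_pos:
            tp += 1
        else:
            fp += 1
        if fp / n_neg > max_fpr:
            break  # the false positive budget is exhausted
        best = tp / n_pos
    return best
```

Applying the same function to a random scoring method and to a structure-based method would give lower and upper controls on a common scale, in the spirit of the comparison described above.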
Conclusions: Simple methods that use perfect secondary structure information to assign folds cannot produce an accurate protein taxonomy; however, they do provide useful baselines for fold recognition. In terms of a typical CASP assessment, our results suggest that approximately 6% of targets with folds in the databases could be assigned correctly by random guessing, and as many as 32% could be recognised by trivial secondary structure comparison methods, given knowledge of their correct secondary structures.