There are constraints on a protein sequence/structure for it to adopt a
particular fold. These constraints could be either a local signature involving
particular sequences or arrangements of secondary structure, or a global
signature involving features along the entire chain. To search systematically
for protein fold signatures, we have explored the use of Inductive Logic
Programming (ILP). ILP is a machine learning technique which derives rules
from observation and encoded principles. The derived rules are readily
interpreted in terms of concepts used by experts. For 20 populated folds in
SCOP, 59 rules were found automatically. The accuracy of these rules, which is
defined as the number of true positives plus true negatives over the total
number of examples, is 74% (cross-validated value). Further analysis was
carried out for 23 signatures covering 30% or more of the positive examples of
a particular fold. The work showed that signatures of protein folds exist; about
half of the rules discovered automatically coincide with the level of fold in
the SCOP classification. Other signatures correspond to a homologous family
and may be the consequence of a functional requirement. Examination of the
rules shows that many correspond to established principles published in
specific literature. However, in general, the list of signatures is not part of
standard biological databases of protein patterns. We find that the length
of the loops makes an important contribution to the signatures, suggesting
that this is an important determinant of the identity of protein folds. With
the expansion in the number of determined protein structures, stimulated
by structural genomics initiatives, there will be an increased need for
automated methods to extract principles of protein folding from coordinates.
(C) 2001 Academic Press.
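
The two quantities defined in the abstract, rule accuracy and the 30% coverage cut-off used to select signatures for further analysis, can be illustrated with a minimal Python sketch. The function names and all example counts below are hypothetical and chosen only for illustration; they are not taken from the study, which reports only the aggregate figures quoted above.

```python
def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Accuracy as defined in the abstract: (TP + TN) / total examples."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("at least one example is required")
    return (tp + tn) / total


def coverage(covered_positives: int, total_positives: int) -> float:
    """Fraction of a fold's positive examples that a rule covers."""
    if total_positives <= 0:
        raise ValueError("a fold needs at least one positive example")
    return covered_positives / total_positives


def passes_coverage_cutoff(covered_positives: int, total_positives: int,
                           threshold: float = 0.30) -> bool:
    """True if the rule covers `threshold` or more of the positive examples."""
    return coverage(covered_positives, total_positives) >= threshold


if __name__ == "__main__":
    # Illustrative counts only (assumed, not data from the paper).
    print(accuracy(tp=30, tn=44, fp=10, fn=16))
    print(passes_coverage_cutoff(covered_positives=3, total_positives=10))
```

Accuracy here counts both correctly accepted and correctly rejected examples, so it depends on the balance of positive and negative examples; the coverage check looks only at the positive examples of one fold.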