The genome sciences face the challenge of characterizing the structure and
function of a vast number of novel genes. Sequence search techniques are
used to infer functional and structural information from similarities to
experimentally characterized genes or proteins. The persistent goal is to
refine these techniques and to develop alternative and complementary
methods that increase the range of reliable inference.
Here, we focus on the structural and functional assignments that can be
inferred from the known three-dimensional structures of proteins. The study
uses all structures in the Protein Data Bank that were known by the end of
1997. The protein structures released in 1998 were then characterized in
terms of functional and structural similarity to the previously known
structures, yielding an estimate of the maximum amount of information on
novel protein sequences that can be obtained from inference techniques.
The 147 globular proteins corresponding to 196 domains released in 1998
have no clear sequence similarity to previously known structures. However,
75% of the domains have extensive structural similarity to previously known
folds, and, most importantly, in two out of three cases similarity in
structure coincides with related function. In view of this analysis, full
utilization of existing structure databases would provide information for
many new targets even if the relationship is not accessible from sequence
information alone. Currently, the most sophisticated techniques detect on
the order of one-third of these relationships. © 2000 Academic Press.