J. Gough et al., Assignment of homology to genome sequences using a library of hidden Markov models that represent all proteins of known structure, J MOL BIOL, 313(4), 2001, pp. 903-919
Of the sequence comparison methods, profile-based methods perform with greater selectivity than those that use pairwise comparisons. Of the profile methods, hidden Markov models (HMMs) are apparently the best. The first part of this paper describes calculations that (i) improve the performance of HMMs and (ii) determine a good procedure for creating HMMs for sequences of proteins of known structure. For a family of related proteins, more homologues are detected using multiple models built from diverse single seed sequences than from one model built from a good alignment of those sequences. A new procedure is described for detecting and correcting those errors that arise at the model-building stage of the procedure. These two improvements greatly increase selectivity and coverage.
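The idea of searching with several single-seed models and pooling their results can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name `best_hits`, the `(model_id, target_id, evalue)` tuple layout, the E-value cutoff of 0.01, and the example data are all assumptions made for the sketch.

```python
from typing import Dict, Iterable, Tuple

Hit = Tuple[str, str, float]  # (model_id, target_id, evalue); hypothetical layout


def best_hits(hits: Iterable[Hit], max_evalue: float = 0.01) -> Dict[str, Tuple[str, float]]:
    """Pool hits from several models of one family.

    A target counts as detected if at least one model matches it with an
    E-value at or below max_evalue (an assumed cutoff). For each detected
    target, the model with the lowest E-value is recorded.
    """
    best: Dict[str, Tuple[str, float]] = {}
    for model_id, target_id, evalue in hits:
        if evalue > max_evalue:
            continue  # too weak to count as a match
        current = best.get(target_id)
        if current is None or evalue < current[1]:
            best[target_id] = (model_id, evalue)
    return best


# Made-up scores for two models built from different seed sequences.
example_hits = [
    ("model_A", "seq1", 1e-20),
    ("model_B", "seq1", 1e-5),
    ("model_B", "seq2", 1e-3),   # found only by model_B
    ("model_A", "seq3", 0.5),    # above the cutoff, not counted
]
combined = best_hits(example_hits)
```

In this toy data, `seq2` is detected only because a second model is searched, which is the kind of coverage gain the abstract attributes to using multiple models.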
The second part of the paper describes the construction of a library of HMMs, called SUPERFAMILY, that represent essentially all proteins of known structure. The sequences of the domains in proteins of known structure that have identities less than 95% are used as seeds to build the models. Using the current data, this gives a library with 4894 models.
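One way to picture seed selection at a 95% identity cutoff is a greedy filter over sequences. The abstract does not say how the filtering was done, so everything here is an assumption: the sequences are taken to be already aligned to equal length, identity is counted over columns where neither sequence has a gap, and the names `percent_identity` and `select_seeds` are hypothetical.

```python
from typing import List, Tuple


def percent_identity(a: str, b: str) -> float:
    """Percent identity over columns where neither aligned sequence has a gap."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    matches = sum(1 for x, y in pairs if x == y)
    return 100.0 * matches / len(pairs)


def select_seeds(aligned: List[Tuple[str, str]], threshold: float = 95.0) -> List[str]:
    """Greedily keep sequences whose identity to every kept one is below threshold."""
    kept: List[Tuple[str, str]] = []
    for name, seq in aligned:
        if all(percent_identity(seq, kept_seq) < threshold for _, kept_seq in kept):
            kept.append((name, seq))
    return [name for name, _ in kept]


# Made-up 20-residue domain sequences.
domains = [
    ("d1", "ACDEFGHIKLMNPQRSTVWY"),
    ("d2", "ACDEFGHIKLMNPQRSTVWA"),  # 19/20 identical to d1, so not below 95%
    ("d3", "GCDEFGHIKLMNPQRSTVWA"),  # 18/20 identical to d1
]
seeds = select_seeds(domains)
```

A greedy filter depends on input order; a real pipeline would also need a way to align or compare domains of different lengths, which this sketch leaves out.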
The third part of the paper describes the use of the SUPERFAMILY model library to annotate the sequences of over 50 genomes. The models match twice as many target sequences as are matched by pairwise sequence comparison methods. For each genome, close to half of the sequences are matched in all or in part and, overall, the matches cover 35% of eukaryotic genomes and 45% of bacterial genomes. On average roughly 15% of genome sequences are labelled as being hypothetical yet homologous to proteins of known structure. The annotations derived from these matches are available from a public web server at: http://stash.mrc-lmb.cam.ac.uk/SUPERFAMILY. This server also enables users to match their own sequences against the SUPERFAMILY model library. (C) 2001 Academic Press.