Assignment of homology to genome sequences using a library of hidden Markov models that represent all proteins of known structure

Citation
J. Gough et al., Assignment of homology to genome sequences using a library of hidden Markov models that represent all proteins of known structure, J MOL BIOL, 313(4), 2001, pp. 903-919
Number of citations
31
Subject Categories
Molecular Biology & Genetics
Journal title
JOURNAL OF MOLECULAR BIOLOGY
ISSN journal
00222836
Volume
313
Issue
4
Year of publication
2001
Pages
903 - 919
Database
ISI
SICI code
0022-2836(20011102)313:4<903:AOHTGS>2.0.ZU;2-L
Abstract
Of the sequence comparison methods, profile-based methods perform with greater selectivity than those that use pairwise comparisons. Of the profile methods, hidden Markov models (HMMs) are apparently the best. The first part of this paper describes calculations that (i) improve the performance of HMMs and (ii) determine a good procedure for creating HMMs for sequences of proteins of known structure. For a family of related proteins, more homologues are detected using multiple models built from diverse single seed sequences than from one model built from a good alignment of those sequences. A new procedure is described for detecting and correcting those errors that arise at the model-building stage of the procedure. These two improvements greatly increase selectivity and coverage.

The second part of the paper describes the construction of a library of HMMs, called SUPERFAMILY, that represent essentially all proteins of known structure. The sequences of the domains in proteins of known structure that have identities less than 95% are used as seeds to build the models. Using the current data, this gives a library with 4894 models.

The third part of the paper describes the use of the SUPERFAMILY model library to annotate the sequences of over 50 genomes. The models match twice as many target sequences as are matched by pairwise sequence comparison methods. For each genome, close to half of the sequences are matched in all or in part and, overall, the matches cover 35% of eukaryotic genomes and 45% of bacterial genomes. On average roughly 15% of genome sequences are labelled as being hypothetical yet homologous to proteins of known structure. The annotations derived from these matches are available from a public web server at: http://stash.mrc-lmb.cam.ac.uk/SUPERFAMILY. This server also enables users to match their own sequences against the SUPERFAMILY model library. (C) 2001 Academic Press.
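The abstract states that domain sequences with identities below 95% serve as seeds, but it does not say how that filtering is carried out. The following is a minimal sketch of one possible approach, a greedy redundancy filter, and should not be read as the authors' procedure. The function names `percent_identity` and `select_seeds`, the assumption that sequences are already aligned to equal length, and the rule of skipping gapped columns are all illustrative choices made here.

```python
from typing import Dict, List, Tuple


def percent_identity(a: str, b: str) -> float:
    """Percent identity between two pre-aligned sequences of equal length.

    Assumption: columns where either sequence has a gap ('-') are skipped.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    matches = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            continue
        compared += 1
        if x == y:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def select_seeds(aligned: Dict[str, str], threshold: float = 95.0) -> List[str]:
    """Greedy filter (hypothetical): keep a sequence only if its identity
    to every previously kept seed is strictly below the threshold."""
    seeds: List[Tuple[str, str]] = []
    for name, seq in aligned.items():
        if all(percent_identity(seq, kept) < threshold for _, kept in seeds):
            seeds.append((name, seq))
    return [name for name, _ in seeds]


# Toy, made-up domain sequences for illustration only.
domains = {
    "d1": "ACDEFGHIKLMNPQRSTVWY",
    "d2": "ACDEFGHIKLMNPQRSTVWY",  # identical to d1
    "d3": "ACDEFGHIKLMNPQRSTVWA",  # one substitution relative to d1
    "d4": "ACDEFGHIKLMNPQRSTVAA",  # two substitutions relative to d1
}
seed_names = select_seeds(domains)
```

A greedy pass depends on input order, so a different ordering of the same sequences can yield a different seed set; a clustering-based filter would be an alternative design.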