Assignment of homology to genome sequences using a library of hidden Markov models that represent all proteins of known structure

Authors

Gough, J; Karplus, K; Hughey, R; Chothia, C

Citation

J. Gough et al., Assignment of homology to genome sequences using a library of hidden Markov models that represent all proteins of known structure, J MOL BIOL, 313(4), 2001, pp. 903-919

Subject categories

Molecular Biology & Genetics

Journal title

JOURNAL OF MOLECULAR BIOLOGY

ISSN journal

0022-2836

Volume

313

Issue

4

Year of publication

2001

Pages

903 - 919

Database

ISI

SICI code

0022-2836(20011102)313:4<903:AOHTGS>2.0.ZU;2-L

Abstract

Of the sequence comparison methods, profile-based methods perform with greater selectivity than those that use pairwise comparisons. Of the profile methods, hidden Markov models (HMMs) are apparently the best. The first part of this paper describes calculations that (i) improve the performance of HMMs and (ii) determine a good procedure for creating HMMs for sequences of proteins of known structure. For a family of related proteins, more homologues are detected using multiple models built from diverse single seed sequences than from one model built from a good alignment of those sequences. A new procedure is described for detecting and correcting those errors that arise at the model-building stage of the procedure. These two improvements greatly increase selectivity and coverage.

The second part of the paper describes the construction of a library of HMMs, called SUPERFAMILY, that represent essentially all proteins of known structure. The sequences of the domains in proteins of known structure, that have identities less than 95%, are used as seeds to build the models. Using the current data, this gives a library with 4894 models.

The third part of the paper describes the use of the SUPERFAMILY model library to annotate the sequences of over 50 genomes. The models match twice as many target sequences as are matched by pairwise sequence comparison methods. For each genome, close to half of the sequences are matched in all or in part and, overall, the matches cover 35% of eukaryotic genomes and 45% of bacterial genomes. On average roughly 15% of genome sequences are labelled as being hypothetical yet homologous to proteins of known structure. The annotations derived from these matches are available from a public web server at: http://stash.mrc-lmb.cam.ac.uk/SUPERFAMILY. This server also enables users to match their own sequences against the SUPERFAMILY model library. (C) 2001 Academic Press.