Databases of multiple sequence alignments are a valuable aid to protein
sequence classification and analysis. One of the main challenges when
constructing such a database is to simultaneously satisfy the conflicting
demands of completeness on the one hand and quality of alignment and
domain definitions on the other. The latter properties are best dealt
with by manual approaches, whereas completeness in practice is only
amenable to automatic methods. Herein we present a database based on
hidden Markov model profiles (HMMs), which combines high quality and
completeness. Our database, Pfam, consists of parts A and B. Pfam-A is
curated and contains well-characterized protein domain families with
high quality alignments, which are maintained by using manually checked
seed alignments and HMMs to find and align all members. Pfam-B contains
sequence families that were generated automatically by applying the
Domainer algorithm to cluster and align the remaining protein sequences
after removal of Pfam-A domains. By using Pfam, a large number of
previously unannotated proteins from the Caenorhabditis elegans genome
project were classified. We have also identified many novel family
memberships in known proteins, including new kazal, Fibronectin type III,
and response regulator receiver domains. Pfam-A families have permanent
accession numbers and form a library of HMMs available for searching and
automatic annotation of new protein sequences. (C) 1997 Wiley-Liss, Inc.