Motivation: Protein sequence classification is becoming an increasingly
important means of organizing the voluminous data produced by large-scale
genome sequencing projects. At present, there are several independent
classification methods. To aid the general classification effort, we have
created a unified protein family resource, MetaFam. MetaFam is a protein
family classification built upon 10 publicly accessible protein family
databases (Blocks+, DOMO, Pfam, PIR-ALN, PRINTS, PROSITE, ProDom, PROTOMAP,
SBASE, and SYSTERS). MetaFam's family 'supersets', as we call them, are
created automatically using set theory to compare families among the
databases. A family in one database is matched to a family in another when
the intersection of their members is larger than that of any other possible
family pairing between the two databases. Pairwise family matches are then
drawn together transitively to create a new list of protein family
supersets.
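
The matching and merging steps described above can be illustrated with a
minimal sketch. It is not the MetaFam implementation: the function names
(best_matches, build_supersets), the toy family names, and the protein IDs
are all hypothetical, and the matching rule is read here as "a pair is
matched only when its overlap is strictly larger than every competing
pairing that involves either family", which is an assumption about the
wording above. The transitive step is done with a simple union-find.

```python
from collections import defaultdict
from itertools import combinations


def best_matches(db_a, db_b):
    """Return family pairs whose member overlap beats all rival pairings.

    db_a and db_b map a family name to the set of its member protein IDs.
    Assumption: a pair matches only if its overlap is strictly larger than
    the overlap of either family with any other family in the opposite
    database. Ties and empty overlaps produce no match.
    """
    pairs = []
    for fam_a, members_a in db_a.items():
        for fam_b, members_b in db_b.items():
            size = len(members_a & members_b)
            if size == 0:
                continue
            # Overlaps of fam_a with every other family in db_b ...
            rivals = [len(members_a & m) for f, m in db_b.items() if f != fam_b]
            # ... and of fam_b with every other family in db_a.
            rivals += [len(m & members_b) for f, m in db_a.items() if f != fam_a]
            if all(size > rival for rival in rivals):
                pairs.append((fam_a, fam_b))
    return pairs


def build_supersets(databases):
    """Group (database, family) nodes linked by pairwise matches.

    databases maps a database name to its {family name: member set} dict.
    Matches are merged transitively with a union-find structure.
    """
    parent = {
        (db, fam): (db, fam)
        for db, families in databases.items()
        for fam in families
    }

    def find(node):
        # Follow parent links to the root, shortening the path on the way.
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    for name_a, name_b in combinations(databases, 2):
        for fam_a, fam_b in best_matches(databases[name_a], databases[name_b]):
            parent[find((name_a, fam_a))] = find((name_b, fam_b))

    groups = defaultdict(set)
    for node in parent:
        groups[find(node)].add(node)
    return list(groups.values())


# Hypothetical toy data: three databases with overlapping families.
toy = {
    "DB1": {"kinase": {"P1", "P2", "P3"}, "globin": {"P7", "P8"}},
    "DB2": {"pk": {"P2", "P3", "P4", "P5"}, "hb": {"P8", "P9"}},
    "DB3": {"kin_dom": {"P4", "P5", "P6"}},
}

for superset in build_supersets(toy):
    print(sorted(superset))
```

In this toy data, "kinase" (DB1) and "kin_dom" (DB3) share no members, yet
they end up in the same superset because each is matched to "pk" (DB2).
That is the transitive step at work.
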
Results: MetaFam family supersets have several useful features: (1) each
superset contains more members than the families from which it is composed,
because each of the component family databases only works with a subset of
our full non-redundant set of proteins; (2) conflicting assignments can be
pinpointed quickly, since our analysis identifies individual members that
are in conflict with the majority consensus; (3) family descriptions that
are absent from automated databases can frequently be assigned; (4)
statistics have been computed comparing domain boundaries, family size
distributions, and overall quality of MetaFam supersets; (5) the supersets
have been loaded into a relational database to allow for complex queries and
visualization of the connections among families in a superset and the
consensus of individual domain members; and (6) the quality of individual
supersets has been assessed using numerous quantitative measures such as
family consistency, connectedness, and size. We anticipate this new resource
will be particularly useful to genomic database curators.