MetaFam: a unified classification of protein families. I. Overview and statistics

Citation
K.A.T. Silverstein et al., MetaFam: a unified classification of protein families. I. Overview and statistics, Bioinformatics, 17(3), 2001, pp. 249-261
Number of citations
38
Subject categories
Multidisciplinary
Journal title
BIOINFORMATICS
Journal ISSN
1367-4803
Volume
17
Issue
3
Year of publication
2001
Pages
249 - 261
Database
ISI
SICI code
1367-4803(200103)17:3<249:MAUCOP>2.0.ZU;2-Y
Abstract
Motivation: Protein sequence classification is becoming an increasingly important means of organizing the voluminous data produced by large-scale genome sequencing projects. At present, there are several independent classification methods. To aid the general classification effort, we have created a unified protein family resource, MetaFam. MetaFam is a protein family classification built upon 10 publicly accessible protein family databases (Blocks+, DOMO, Pfam, PIR-ALN, PRINTS, PROSITE, ProDom, PROTOMAP, SBASE, and SYSTERS). MetaFam's family 'supersets', as we call them, are created automatically using set theory to compare families among the databases. Families of one database are matched to those in another when the intersection of their members exceeds all other possible family pairings between the two databases. Pairwise family matches are drawn together transitively to create a new list of protein family supersets.

Results: MetaFam family supersets have several useful features: (1) each superset contains more members than the families from which it is composed, because each of the component family databases only works with a subset of our full non-redundant set of proteins; (2) conflicting assignments can be pinpointed quickly, since our analysis identifies individual members that are in conflict with the majority consensus; (3) family descriptions that are absent from automated databases can frequently be assigned; (4) statistics have been computed comparing domain boundaries, family size distributions, and overall quality of MetaFam supersets; (5) the supersets have been loaded into a relational database to allow for complex queries and visualization of the connections among families in a superset and the consensus of individual domain members; and (6) the quality of individual supersets has been assessed using numerous quantitative measures such as family consistency, connectedness, and size. We anticipate this new resource will be particularly useful to genomic database curators.
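
The matching and transitive grouping described under Motivation can be illustrated with a small sketch. This is not the MetaFam implementation. It assumes one reading of the abstract: two families from different databases are matched when their shared-member count is strictly larger than that of any other pairing involving either family, and matches are then merged with a union-find structure. The function names (`best_matches`, `build_supersets`), the data layout, and the toy database and protein identifiers are all hypothetical.

```python
from itertools import combinations


def best_matches(db_a, db_b):
    """Pair families whose overlap strictly exceeds every rival pairing.

    db_a and db_b map a family id to a set of protein ids (assumed layout).
    A rival pairing shares exactly one family with the candidate pair.
    """
    overlap = {
        (a, b): len(members_a & members_b)
        for a, members_a in db_a.items()
        for b, members_b in db_b.items()
    }
    matches = []
    for (a, b), size in overlap.items():
        if size == 0:
            continue  # families with no shared members are never matched
        rivals = [s for (x, y), s in overlap.items() if (x == a) != (y == b)]
        if all(size > s for s in rivals):
            matches.append((a, b))
    return matches


def build_supersets(databases):
    """Merge pairwise matches transitively into supersets (union-find)."""
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    # Register every family so unmatched ones become singleton supersets.
    for name, families in databases.items():
        for fam in families:
            find((name, fam))

    # Compare each pair of databases and link their matched families.
    for name_a, name_b in combinations(databases, 2):
        for a, b in best_matches(databases[name_a], databases[name_b]):
            parent[find((name_a, a))] = find((name_b, b))

    groups = {}
    for node in parent:
        groups.setdefault(find(node), set()).add(node)
    return list(groups.values())


# Toy data: three hypothetical databases, not real family assignments.
toy = {
    "dbA": {"A1": {"p1", "p2", "p3"}, "A2": {"p7", "p8"}},
    "dbB": {"B1": {"p2", "p3", "p4"}, "B2": {"p8", "p9"}},
    "dbC": {"C1": {"p3", "p4", "p5"}},
}

if __name__ == "__main__":
    for superset in build_supersets(toy):
        print(sorted(superset))
```

On the toy data, A1, B1, and C1 end up in one superset and A2 and B2 in another. The first superset spans proteins p1 through p5, more than any single component family holds, which mirrors feature (1) of the Results.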