MetaFam: a unified classification of protein families. I. Overview and statistics

Authors

Silverstein, KAT Shoop, E Johnson, JE Retzel, EF

Citation

Kat. Silverstein et al., MetaFam: a unified classification of protein families. I. Overview and statistics, BIOINFORMAT, 17(3), 2001, pp. 249-261

Citations number

Subject categories

Multidisciplinary

Journal title

BIOINFORMATICS

ISSN journal

1367-4803

Volume

17

Issue

3

Year of publication

2001

Pages

249 - 261

Database

ISI

SICI code

1367-4803(200103)17:3<249:MAUCOP>2.0.ZU;2-Y

Abstract

Motivation: Protein sequence classification is becoming an increasingly important means of organizing the voluminous data produced by large-scale genome sequencing projects. At present, there are several independent classification methods. To aid the general classification effort, we have created a unified protein family resource, MetaFam. MetaFam is a protein family classification built upon 10 publicly-accessible protein family databases (Blocks+, DOMO, Pfam, PIR-ALN, PRINTS, PROSITE, ProDom, PROTOMAP, SBASE, and SYSTERS). MetaFam's family 'supersets', as we call them, are created automatically using set-theory to compare families among the databases. Families of one database are matched to those in another when the intersection of their members exceeds all other possible family pairings between the two databases. Pairwise family matches are drawn together transitively to create a new list of protein family supersets.

Results: MetaFam family supersets have several useful features: (1) each superset contains more members than the families from which it is composed, because each of the component family databases only works with a subset of our full non-redundant set of proteins; (2) conflicting assignments can be pinpointed quickly, since our analysis identifies individual members that are in conflict with the majority consensus; (3) family descriptions that are absent from automated databases can frequently be assigned; (4) statistics have been computed comparing domain boundaries, family size distributions, and overall quality of MetaFam supersets; (5) the supersets have been loaded into a relational database to allow for complex queries and visualization of the connections among families in a superset and the consensus of individual domain members; and (6) the quality of individual supersets has been assessed using numerous quantitative measures such as family consistency, connectedness, and size. We anticipate this new resource will be particularly useful to genomic database curators.
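
The matching procedure described under Motivation can be illustrated with a minimal Python sketch. This is not the authors' implementation: it assumes that "exceeds all other possible family pairings" means a reciprocal best match (each family is the other's single largest overlap, with ties rejected), and it uses a union-find structure to draw pairwise matches together transitively. The names `best_partner` and `build_supersets`, and the toy databases `dbA`, `dbB`, and `dbC`, are hypothetical.

```python
from itertools import combinations


def best_partner(members, candidates):
    """Return the id of the candidate family sharing the most members.

    Returns None when nothing overlaps or when the top overlap is tied
    (assumption: a match must strictly exceed every other pairing).
    """
    best_id, best_overlap, tied = None, 0, False
    for cand_id, cand_members in candidates.items():
        overlap = len(members & cand_members)
        if overlap > best_overlap:
            best_id, best_overlap, tied = cand_id, overlap, False
        elif overlap == best_overlap and overlap > 0:
            tied = True
    return None if tied else best_id


def build_supersets(databases):
    """Group (database, family) pairs into supersets.

    `databases` maps a database name to {family_id: set of protein ids}.
    """
    # Union-find: every family starts as its own group.
    parent = {(db, fam): (db, fam) for db, fams in databases.items() for fam in fams}

    def find(node):
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for db_a, db_b in combinations(databases, 2):
        fams_a, fams_b = databases[db_a], databases[db_b]
        for fam_a, members_a in fams_a.items():
            fam_b = best_partner(members_a, fams_b)
            # Keep the pair only if the match is reciprocal (assumption).
            if fam_b is not None and best_partner(fams_b[fam_b], fams_a) == fam_a:
                parent[find((db_a, fam_a))] = find((db_b, fam_b))

    # Collect the transitive groups.
    groups = {}
    for node in parent:
        groups.setdefault(find(node), []).append(node)
    return [sorted(group) for group in groups.values()]


# Toy data, invented for illustration only.
databases = {
    "dbA": {"A1": {"p1", "p2", "p3"}, "A2": {"p7", "p8"}},
    "dbB": {"B1": {"p2", "p3", "p4"}, "B2": {"p8", "p9"}},
    "dbC": {"C1": {"p4", "p5"}},
}
supersets = build_supersets(databases)
```

In this toy example A1 and C1 share no proteins, yet they end up in the same superset because each is matched to B1. The union of that superset's members (p1 to p5) is larger than any single component family, which is the effect described in point (1) of the Results.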