Protein family and fold occurrence in genomes: Power-law behaviour and evolutionary model

Citation
J. Qian et al., Protein family and fold occurrence in genomes: Power-law behaviour and evolutionary model, J MOL BIOL, 313(4), 2001, pp. 673-681
Number of citations
44
Subject categories
Molecular Biology & Genetics
Journal title
JOURNAL OF MOLECULAR BIOLOGY
ISSN journal
0022-2836
Volume
313
Issue
4
Year of publication
2001
Pages
673 - 681
Database
ISI
SICI code
0022-2836(20011102)313:4<673:PFAFOI>2.0.ZU;2-C
Abstract
Global surveys of genomes measure the usage of essential molecular parts, defined here as protein families, superfamilies or folds, in different organisms. Based on surveys of the first 20 completely sequenced genomes, we observe that the occurrence of these parts follows a power-law distribution. That is, the number of distinct parts (F) with a given genomic occurrence (V) decays as F = aV^(-b), with a few parts occurring many times and most occurring infrequently. For a given organism, the distributions of families, superfamilies and folds are nearly identical, and this is reflected in the size of the decay exponent b. Moreover, the exponent varies between different organisms, with those of smaller genomes displaying a steeper decay (i.e. larger b). Clearly, the power law indicates a preference to duplicate genes that encode for molecular parts which are already common. Here, we present a minimal, but biologically meaningful model that accurately describes the observed power law. Although the model performs equally well for all three protein classes, we focus on the occurrence of folds in preference to families and superfamilies. This is because folds are comparatively insensitive to the effects of point mutations that can cause a family member to diverge beyond detectable similarity. In the model, genomes evolve through two basic operations: (i) duplication of existing genes; (ii) net flow of new genes. The flow term is closely related to the exponent b and can accommodate considerable gene loss; however, we demonstrate that the observed data is reproduced best with a net inflow, i.e. with more gene gain than loss. Moreover, we show that prokaryotes have much higher rates of gene acquisition than eukaryotes, probably reflecting lateral transfer. A further natural outcome from our model is an estimation of the fold composition of the initial genome, which potentially relates to the common ancestor for modern organisms.
Supplementary material pertaining to this work is available from www.partslist.org/powerlaw. (C) 2001 Academic Press.
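Illustrative sketch
The two operations named in the abstract, duplication of existing genes and a net inflow of new genes, can be illustrated with a toy simulation. This is only a rough sketch and not the authors' model: the function names, the `inflow_prob` parameter, the default initial genome of 10 single-gene folds, and the rule that every inflowing gene brings a previously unseen fold are all assumptions made for illustration. The exponent b is estimated here by an ordinary least-squares line on log F versus log V, which is a simple stand-in and not necessarily the fitting procedure used in the paper.

```python
import math
import random
from collections import Counter


def simulate_genome(steps, inflow_prob, initial_folds=10, seed=0):
    """Grow a toy genome; return a list holding one fold label per gene.

    All parameters are hypothetical and chosen for illustration only.
    """
    rng = random.Random(seed)
    genes = list(range(initial_folds))  # assumed start: one gene per fold
    next_fold = initial_folds
    for _ in range(steps):
        # (i) Duplication: choosing a gene uniformly at random means folds
        # that are already common are copied more often.
        genes.append(rng.choice(genes))
        # (ii) Net inflow: with probability inflow_prob, add one gene whose
        # fold has not been seen before (assumption of this sketch).
        if rng.random() < inflow_prob:
            genes.append(next_fold)
            next_fold += 1
    return genes


def occurrence_histogram(genes):
    """Map occurrence V to the number of distinct folds F with that V."""
    genes_per_fold = Counter(genes)
    return Counter(genes_per_fold.values())


def fit_power_law(hist):
    """Least-squares fit of log F = log a - b * log V; returns (a, b)."""
    points = [(math.log(v), math.log(f)) for v, f in hist.items()]
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


if __name__ == "__main__":
    genes = simulate_genome(steps=20000, inflow_prob=0.3, seed=1)
    a, b = fit_power_law(occurrence_histogram(genes))
    print(f"fitted a = {a:.1f}, b = {b:.2f}")
```

Varying `inflow_prob` in this sketch is one way to explore how the inflow term and the fitted exponent move together, as the abstract describes. A plain log-log regression is sensitive to the sparse tail of the histogram, so the fitted values should be read as rough indications only.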