M. Gerstein and M. Levitt, A STRUCTURAL CENSUS OF THE CURRENT POPULATION OF PROTEIN SEQUENCES, Proceedings of the National Academy of Sciences of the United States of America, 94(22), 1997, pp. 11911-11916
We examine the occurrence of the approximately 300 known protein folds in different groups of organisms. To do this, we characterize a large fraction of the currently known protein sequences (approximately 140,000) in structural terms, by matching them to known structures via sequence comparison (or by secondary-structure class prediction for those without structural homologues). Overall, we find that an appreciable fraction of the known folds are present in each of the major groups of organisms (e.g., bacteria and eukaryotes share 156 of 275 folds), and most of the common folds are associated with many families of nonhomologous sequences (i.e., >10 sequence families for each common fold). However, different groups of organisms have characteristically distinct distributions of folds. So, for instance, some of the most common folds in vertebrates, such as globins or zinc fingers, are rare or absent in bacteria. Many of these differences in fold usage are biologically reasonable, such as the folds of metabolic enzymes being common in bacteria and those associated with extracellular transport and communication being common in animals. They also have important implications for database-based methods for fold recognition, suggesting that an unknown sequence from a plant is more likely to have a certain fold (e.g., a TIM barrel) than an unknown sequence from an animal.