A sensitive protein-fold recognition procedure was developed on the basis of
iterative database search using the PSI-BLAST program. A collection of 1193
position-dependent weight matrices that can be used as fold identifiers
was produced. In the completely sequenced genomes, folds could be
automatically identified for 20%-30% of the proteins, with 3%-6% more detectable by
additional analysis of conserved motifs. The distribution of the most common
folds is very similar in bacteria and archaea but distinct in eukaryotes.
Within the bacteria, this distribution differs between parasitic and
free-living species. In all analyzed genomes, the P-loop NTPases are the most
abundant fold. In bacteria and archaea, the next most common folds are
ferredoxin-like domains, TIM-barrels, and methyltransferases, whereas in
eukaryotes, the second to fourth places belong to protein kinases, beta-propellers
and TIM-barrels. The observed diversity of protein folds in different
proteomes is approximately twice as high as would be expected from a simple
stochastic model describing a proteome as a finite sample from an infinite
pool of proteins with an exponential distribution of the fold fractions. The
distribution of the number of domains with different folds in one protein fits
the geometric model, which is compatible with the evolution of multidomain
proteins by random combination of domains.
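
The stochastic sampling model mentioned above can be illustrated with a minimal sketch. It assumes that "exponential distribution of the fold fractions" means the fraction of the i-th most common fold decays as exp(-decay * i), and it truncates the infinite pool at a finite number of folds so the sum can be computed. The function names, the decay value, and the truncation point are all hypothetical choices made for illustration and are not taken from the study. Under these assumptions, the expected number of distinct folds seen in a proteome of n proteins is the sum over folds of 1 - (1 - p_i)^n.

```python
import math


def fold_fractions(decay, n_folds):
    """Fold fractions p_i proportional to exp(-decay * i), normalised to sum to 1.

    Assumption: the infinite pool is truncated at n_folds; for a large enough
    n_folds the neglected tail is negligible.
    """
    weights = [math.exp(-decay * i) for i in range(n_folds)]
    total = sum(weights)
    return [w / total for w in weights]


def expected_distinct_folds(fractions, sample_size):
    """Expected number of folds drawn at least once in sample_size independent draws."""
    return sum(1.0 - (1.0 - p) ** sample_size for p in fractions)


if __name__ == "__main__":
    # Illustrative parameters only; they are not fitted to any genome.
    fractions = fold_fractions(decay=0.05, n_folds=1000)
    for n in (500, 2000, 6000):
        print(n, round(expected_distinct_folds(fractions, n), 1))
```

Comparing such an expected value with the number of folds actually detected in a proteome is one way to express the kind of excess diversity described in the text; the choice of decay rate strongly affects the result, so any real comparison would need it estimated from data.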