S. Whelan and N. Goldman, A general empirical model of protein evolution derived from multiple protein families using a maximum-likelihood approach, Mol. Biol. Evol., 18(5), 2001, pp. 691-699
Phylogenetic inference from amino acid sequence data uses mainly empirical
models of amino acid replacement and is therefore dependent on those models.
Two of the more widely used models, the Dayhoff and JTT models, are
estimated using similar methods that can utilize large numbers of sequences from
many unrelated protein families but are somewhat unsatisfactory because they
rely on assumptions that may lead to systematic error and discard a large
amount of the information within the sequences. The alternative method of
maximum-likelihood estimation may utilize the information in the sequence
data more efficiently and suffers from no systematic error, but it has
previously been applicable to relatively few sequences related by a single
phylogenetic tree. Here, we combine the best attributes of these two methods
using an approximate maximum-likelihood method. We implemented this approach
to estimate a new model of amino acid replacement from a database of
globular protein sequences comprising 3,905 amino acid sequences split into
182 protein families. While the new model has an overall structure similar to
those of other commonly used models, there are significant differences. The
new model outperforms the Dayhoff and JTT models with respect to
maximum-likelihood values for a large majority of the protein families in our
database. This suggests that it provides a better overall fit to the
evolutionary process in globular proteins and may lead to more accurate
phylogenetic tree
estimates. Potentially, this matrix, and the methods used to generate it,
may also be useful in other areas of research, such as biological sequence
database searching, sequence alignment, and protein structure prediction,
for which an accurate description of amino acid replacement is required.