A. Rzhetsky and M. Nei, TESTS OF APPLICABILITY OF SEVERAL SUBSTITUTION MODELS FOR DNA-SEQUENCE DATA, Molecular Biology and Evolution, 12(1), 1995, pp. 131-151
Using linear invariants for various models of nucleotide substitution,
we developed test statistics for examining the applicability of a
specific model to a given dataset in phylogenetic inference. The models
examined are those developed by Jukes and Cantor (1969), Kimura (1980),
Tajima and Nei (1984), Hasegawa et al. (1985), Tamura (1992), Tamura
and Nei (1993), and a new model called the eight-parameter model. The
first six models are special cases of the last model. The test
statistics developed are independent of evolutionary time and phylogeny,
although the variances of the statistics contain phylogenetic information.
Therefore, these statistics can be used before a phylogenetic tree is
estimated. Our objective is to find the simplest model that is
applicable to a given dataset, keeping in mind that a simple model
usually gives an estimate of evolutionary distance (number of nucleotide
substitutions per site) with a smaller variance than a complicated model
when the simple model is correct. We have also developed a statistical
test of the homogeneity of nucleotide frequencies of a sample of several
sequences that takes into account possible phylogenetic correlations.
This test is used to examine the stationarity in time of the base
frequencies in the sample. For Hasegawa et al.'s and the eight-parameter
models, analytical formulas for estimating evolutionary distances are
presented. Application of the above tests to several sets of real data
has shown that the assumption of stationarity of base composition is
usually acceptable when the sequences studied are closely related but
otherwise it is rejected. Similarly, the simple models of nucleotide
substitution are almost always rejected when actual genes are distantly
related and/or the total number of nucleotides examined is large.
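
As an illustration of what "evolutionary distance (number of nucleotide substitutions per site)" means under the two simplest models named above, the following minimal Python sketch computes the standard textbook distance estimates for the Jukes and Cantor (1969) and Kimura (1980) two-parameter models from a pair of aligned sequences. It is not the test procedure of this paper and does not implement the eight-parameter model. The function names and the example sequences are hypothetical, chosen only for illustration.

```python
import math

PURINES = {"A", "G"}


def count_differences(seq1, seq2):
    """Count compared sites, transitions and transversions between two
    aligned sequences. Sites where either base is not A, C, G or T
    (gaps, ambiguity codes) are skipped. Hypothetical helper."""
    n = ts = tv = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue
        n += 1
        if a == b:
            continue
        # Same chemical class (purine/purine or pyrimidine/pyrimidine)
        # means a transition; otherwise it is a transversion.
        if (a in PURINES) == (b in PURINES):
            ts += 1
        else:
            tv += 1
    return n, ts, tv


def jc69_distance(p):
    """Jukes-Cantor distance from the proportion p of differing sites:
    d = -(3/4) ln(1 - 4p/3). math.log raises ValueError if p >= 0.75."""
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def k80_distance(P, Q):
    """Kimura two-parameter distance from the transition proportion P
    and transversion proportion Q:
    d = -(1/2) ln((1 - 2P - Q) * sqrt(1 - 2Q))."""
    return -0.5 * math.log((1.0 - 2.0 * P - Q) * math.sqrt(1.0 - 2.0 * Q))


if __name__ == "__main__":
    # Made-up sequences that differ by a single C -> T transition.
    n, ts, tv = count_differences("ACGTACGTAC", "ACGTACGTAT")
    print(jc69_distance((ts + tv) / n))
    print(k80_distance(ts / n, tv / n))
```

Both estimators use only pairwise difference proportions. The more parameter-rich models listed in the abstract require more quantities to be estimated from the same data, which is one way to see why a simple model, when it is correct, tends to give a distance estimate with a smaller variance.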