TESTS OF APPLICABILITY OF SEVERAL SUBSTITUTION MODELS FOR DNA-SEQUENCE DATA

Authors
A. Rzhetsky; M. Nei
Citation
A. Rzhetsky and M. Nei, TESTS OF APPLICABILITY OF SEVERAL SUBSTITUTION MODELS FOR DNA-SEQUENCE DATA, Molecular Biology and Evolution, 12(1), 1995, pp. 131-151
Number of citations
24
Subject Categories
Biology
ISSN journal
0737-4038
Volume
12
Issue
1
Year of publication
1995
Pages
131 - 151
Database
ISI
SICI code
0737-4038(1995)12:1<131:TOAOSS>2.0.ZU;2-Y
Abstract
Using linear invariants for various models of nucleotide substitution, we developed test statistics for examining the applicability of a specific model to a given dataset in phylogenetic inference. The models examined are those developed by Jukes and Cantor (1969), Kimura (1980), Tajima and Nei (1984), Hasegawa et al. (1985), Tamura (1992), Tamura and Nei (1993), and a new model called the eight-parameter model. The first six models are special cases of the last model. The test statistics developed are independent of evolutionary time and phylogeny, although the variances of the statistics contain phylogenetic information. Therefore, these statistics can be used before a phylogenetic tree is estimated. Our objective is to find the simplest model that is applicable to a given dataset, keeping in mind that a simple model usually gives an estimate of evolutionary distance (number of nucleotide substitutions per site) with a smaller variance than a complicated model when the simple model is correct. We have also developed a statistical test of the homogeneity of nucleotide frequencies of a sample of several sequences that takes into account possible phylogenetic correlations. This test is used to examine the stationarity in time of the base frequencies in the sample. For Hasegawa et al.'s and the eight-parameter models, analytical formulas for estimating evolutionary distances are presented. Application of the above tests to several sets of real data has shown that the assumption of stationarity of base composition is usually acceptable when the sequences studied are closely related but otherwise it is rejected. Similarly, the simple models of nucleotide substitution are almost always rejected when actual genes are distantly related and/or the total number of nucleotides examined is large.
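
To make "evolutionary distance (number of nucleotide substitutions per site)" concrete, the sketch below computes pairwise distances under the two simplest models named in the abstract, Jukes and Cantor (1969) and Kimura (1980), using their standard textbook formulas. It does not implement the invariant-based test statistics, the homogeneity test, or the new distance formulas described in the paper. The function names and the choice to skip gaps and ambiguous bases are assumptions made for this illustration only.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def count_differences(seq1, seq2):
    """Return (sites, transitions, transversions) for two aligned sequences.

    Assumption for this sketch: columns containing anything other than
    A, C, G or T (gaps, ambiguity codes) are skipped.
    """
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    sites = transitions = transversions = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue
        sites += 1
        if a == b:
            continue
        # A<->G and C<->T are transitions; every other change is a transversion.
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    if sites == 0:
        raise ValueError("no comparable sites")
    return sites, transitions, transversions


def jc69_distance(seq1, seq2):
    """Jukes-Cantor distance: d = -(3/4) ln(1 - 4p/3), p = proportion of differing sites."""
    n, ts, tv = count_differences(seq1, seq2)
    p = (ts + tv) / n
    # math.log raises ValueError when p >= 3/4 (distance undefined under this model).
    return -0.75 * math.log(1 - 4 * p / 3)


def k80_distance(seq1, seq2):
    """Kimura two-parameter distance: d = -(1/2) ln(1 - 2P - Q) - (1/4) ln(1 - 2Q)."""
    n, ts, tv = count_differences(seq1, seq2)
    P = ts / n  # proportion of transitional differences
    Q = tv / n  # proportion of transversional differences
    return -0.5 * math.log(1 - 2 * P - Q) - 0.25 * math.log(1 - 2 * Q)
```

Calling `jc69_distance(s1, s2)` and `k80_distance(s1, s2)` on the same aligned pair shows how the choice of model changes the estimate. Deciding which model is actually adequate for a dataset is the question the tests in the paper address.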