A. Rzhetsky and T. Sitnikova, WHEN IS IT SAFE TO USE AN OVERSIMPLIFIED SUBSTITUTION MODEL IN TREE-MAKING, Molecular biology and evolution, 13(9), 1996, pp. 1255-1265
The choice of an "optimal" mathematical model for computing
evolutionary distances from real sequences is not currently supported
by easy-to-use software applicable to large data sets, and an
investigator frequently selects one of the simplest models available.
Here we study properties of the observed proportion of differences
(p-distance) between sequences as an estimator of evolutionary distance
for tree-making. We show that p-distances allow for consistent
tree-making with any of the popular methods working with evolutionary
distances if evolution of sequences obeys a "molecular clock" (more
precisely, if it follows a stationary time-reversible Markov model of
nucleotide substitution). Next, we show that p-distances seem to be
efficient in recovering the correct tree topology under a "molecular
clock," but produce "statistically supported" wrong trees when
substitution rates vary among evolutionary lineages. Finally, we
outline a practical approach for selecting an "optimal" model of
nucleotide substitution in a real data analysis, and obtain a crude
estimate of a "prior" distribution of the expected tree branch lengths
under the Jukes-Cantor model. We conclude that the use of a model that
is obviously oversimplified is inadvisable unless it is justified by a
preliminary analysis of the real sequences.
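
For readers unfamiliar with the two quantities named in the abstract, the following is a minimal sketch of how a p-distance and the standard Jukes-Cantor distance, d = -(3/4) ln(1 - 4p/3), might be computed for one pair of aligned sequences. The function names and the example sequences are hypothetical and chosen only for illustration; nothing here is taken from the paper's own software or data.

```python
import math


def p_distance(seq_a, seq_b):
    """Observed proportion of differing sites between two aligned sequences.

    Hypothetical helper: sites where either sequence has a gap or an
    ambiguous symbol are skipped (an assumption, not the paper's rule).
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [
        (a, b)
        for a, b in zip(seq_a.upper(), seq_b.upper())
        if a in "ACGT" and b in "ACGT"
    ]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(a != b for a, b in pairs) / len(pairs)


def jukes_cantor_distance(p):
    """Jukes-Cantor distance estimate from a p-distance.

    The logarithm is undefined once p reaches 3/4 (saturation).
    """
    if not 0.0 <= p < 0.75:
        raise ValueError("p must lie in [0, 0.75) for the Jukes-Cantor formula")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


if __name__ == "__main__":
    # Made-up sequences that differ at 2 of 10 sites.
    p = p_distance("ACGTACGTAC", "ACGTACGTTT")
    d = jukes_cantor_distance(p)
    print(f"p-distance = {p:.3f}, Jukes-Cantor distance = {d:.3f}")
```

The p-distance is always smaller than the corrected distance because it ignores multiple substitutions at the same site; the abstract's point is that this bias need not mislead tree-making under a molecular clock, but can when rates differ among lineages.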