N.J. Tourasse and M. Gouy, EVOLUTIONARY DISTANCES BETWEEN NUCLEOTIDE SEQUENCES BASED ON THE DISTRIBUTION OF SUBSTITUTION RATES AMONG SITES AS ESTIMATED BY PARSIMONY, Molecular Biology and Evolution, 14(3), 1997, pp. 287-298
The rate of evolution of macromolecules such as ribosomal RNAs and proteins varies along the molecule because structural and functional constraints differ between sites. Many studies have shown that ignoring this variation when computing evolutionary distances leads to severe underestimation of sequence divergences, and thus can lead to misleading evolutionary tree inferences. We propose here a new parsimony-based method for computing evolutionary distances between pairs of sequences that takes this variation into account and estimates it from the data. This method applies to the number of substitutions per site in ribosomal RNA genes as well as to the number of nonsynonymous substitutions per codon for protein-coding genes, and is especially suitable when large data sets (at least 100 sequences) are analyzed. First, starting from a phylogeny constructed with usual distances, the maximum-parsimony method is used to infer the distribution of the number of substitutions that have occurred at each site (or codon) along this tree. This distribution is then fitted to an "invariant + truncated negative binomial" distribution that allows for invariant sites. Maximum-likelihood fitting of this distribution to different data sets showed that it agreed very well with real data. Notably, allowing for invariant sites seemed to be very important. Finally, two distance estimates were developed by introducing the distribution of site variability into the substitution models of Jukes and Cantor and of Kimura. The use of different numbers of aligned sequences (up to 1,000 rRNA sequences) showed that the parameters of the model are very sensitive to the number of sequences used to estimate them. However, if at least 100 sequences are considered, the two new distance estimates are quite stable with respect to the number of sequences used to fit the distribution. This stability holds for low as well as for high evolutionary distances. These new distances appeared to be much better estimates of the number of substitutions per site than the classical distances of Jukes and Cantor and of Kimura, which both greatly underestimate this number, so that the new distances can serve as indexes to detect saturation. We conclude that the new distances are particularly suitable for phylogenetic analysis when very distantly related species and relatively large data sets are considered. Trees reconstructed using these distances are generally different from those constructed by means of the classical estimates. Using this new method, we showed that the mean evolutionary distance between Prokaryotes and Eukaryotes is substantially higher for the small-subunit than for the large-subunit rRNAs. This suggests that the former might have experienced a drastic change during the early evolution of Eukaryotes.
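
The abstract does not give the formulas of the two new distances, so the following is only a minimal sketch of the general idea of introducing site variability into the Jukes and Cantor model. It assumes the textbook form in which a fraction theta of sites is invariant and rates at the remaining sites follow a gamma distribution with shape alpha (the rate model under which per-site substitution counts are negative binomial). The function names and the parameters p, alpha and theta are hypothetical, and this is not claimed to be the authors' exact estimator.

```python
import math


def jc_distance(p):
    """Classical Jukes-Cantor distance from the observed proportion p
    of differing sites between two aligned sequences."""
    if not 0.0 <= p < 0.75:
        raise ValueError("p must lie in [0, 0.75)")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def jc_gamma_inv_distance(p, alpha, theta):
    """Jukes-Cantor distance with rate variation among sites.

    Assumed model (illustrative, not taken from the paper):
      theta : fraction of invariant sites, 0 <= theta < 1
      alpha : gamma shape parameter for rates at the variable sites
    Returns the expected number of substitutions per site, averaged
    over all sites (invariant sites contribute zero).
    """
    if not 0.0 <= theta < 1.0:
        raise ValueError("theta must lie in [0, 1)")
    if alpha <= 0.0:
        raise ValueError("alpha must be positive")
    # All observed differences are attributed to the variable sites.
    p_var = p / (1.0 - theta)
    if not 0.0 <= p_var < 0.75:
        raise ValueError("p / (1 - theta) must lie in [0, 0.75)")
    d_var = 0.75 * alpha * ((1.0 - 4.0 * p_var / 3.0) ** (-1.0 / alpha) - 1.0)
    return (1.0 - theta) * d_var


if __name__ == "__main__":
    p = 0.3
    print("classical JC:       ", jc_distance(p))
    print("JC + gamma + invar.:", jc_gamma_inv_distance(p, alpha=0.5, theta=0.2))
```

With p = 0.3, the corrected value under these assumed parameters is more than twice the classical one, which illustrates the underestimation described above. In the paper the distribution parameters are fitted to the data rather than chosen by hand as they are here.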