EVOLUTIONARY DISTANCES BETWEEN NUCLEOTIDE-SEQUENCES BASED ON THE DISTRIBUTION OF SUBSTITUTION RATES AMONG SITES AS ESTIMATED BY PARSIMONY

Citation
N.J. Tourasse and M. Gouy, EVOLUTIONARY DISTANCES BETWEEN NUCLEOTIDE-SEQUENCES BASED ON THE DISTRIBUTION OF SUBSTITUTION RATES AMONG SITES AS ESTIMATED BY PARSIMONY, Molecular Biology and Evolution, 14(3), 1997, pp. 287-298
Number of citations
43
Subject Categories
Biology
ISSN journal
0737-4038
Volume
14
Issue
3
Year of publication
1997
Pages
287-298
Database
ISI
SICI code
0737-4038(1997)14:3<287:EDBNBO>2.0.ZU;2-7
Abstract
The rate of evolution of macromolecules such as ribosomal RNAs and proteins varies along the molecule because structural and functional constraints differ between sites. Many studies have shown that ignoring this variation in computing evolutionary distances leads to severe underestimation of sequence divergences, and thus can lead to misleading evolutionary tree inferences. We propose here a new parsimony-based method for computing evolutionary distances between pairs of sequences that takes into account this variation and estimates it from the data. This method applies to the number of substitutions per site in ribosomal RNA genes as well as to the number of nonsynonymous substitutions per codon for protein-coding genes and is especially suitable when large data sets (at least 100 sequences) are analyzed. First, starting from a phylogeny constructed with usual distances, the maximum-parsimony method is used to infer the distribution of the number of substitutions that have occurred at each site (or codon) along this tree. This distribution is then fitted to an "invariant + truncated negative binomial" distribution that allows for invariant sites. Maximum-likelihood fitting of this distribution to different data sets showed that it agreed very well with real data. Notably, allowing for invariant sites seemed to be very important. Finally, two distance estimates were developed by introducing the distribution of site variability into the substitution models of Jukes and Cantor and of Kimura. The use of different numbers of aligned sequences (up to 1,000 rRNA sequences) showed that the parameters of the model are very sensitive to the number of sequences used to estimate them. However, if at least 100 sequences are considered, the two new distance estimates are quite stable with respect to the number of sequences used to fit the distribution. This stability holds for low as well as for high evolutionary distances.
These new distances appeared to be much better estimates of the number of substitutions per site than the classical distances of Jukes and Cantor and of Kimura, which both greatly underestimate this number, so that they can serve as indexes to detect saturation. We conclude that the new distances are particularly suitable for phylogenetic analysis when very distantly related species and relatively large data sets are considered. Trees reconstructed using these distances are generally different from those constructed by means of the classical estimates. Using this new method, we showed that the mean evolutionary distance between Prokaryotes and Eukaryotes is substantially higher for the small-subunit than for the large-subunit rRNAs. This suggests that the former might have experienced a drastic change during the early evolution of Eukaryotes.
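Illustrative note
The abstract does not give the formulas of the two new distances, so the sketch below is not the authors' estimator. It is a minimal Python illustration of why ignoring rate variation among sites underestimates divergence. It compares the classical Jukes and Cantor distance with a textbook variant that assumes gamma-distributed rates among variable sites plus a fraction of invariant sites. (Gamma-distributed rates combined with Poisson substitutions give a negative binomial number of substitutions per site, which is why this variant is a natural stand-in.) The function names and the parameters alpha (gamma shape) and pinv (fraction of invariant sites) are assumptions made for this example.

```python
import math


def jukes_cantor(p):
    """Classical Jukes-Cantor distance.

    p is the observed proportion of sites that differ between two sequences.
    """
    if not 0 <= p < 0.75:
        raise ValueError("p must lie in [0, 0.75)")
    return -0.75 * math.log(1 - 4 * p / 3)


def jukes_cantor_gamma_inv(p, alpha, pinv=0.0):
    """Jukes-Cantor distance with rate variation among sites.

    Illustrative stand-in, NOT the estimator of the paper:
    - a fraction pinv of sites is assumed invariant (hypothetical parameter),
    - rates at the remaining sites are assumed gamma distributed with
      shape alpha (hypothetical parameter).
    Returns the mean number of substitutions per site over all sites.
    """
    if not 0 <= pinv < 1:
        raise ValueError("pinv must lie in [0, 1)")
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    # Observed differences can only come from the variable sites.
    p_var = p / (1 - pinv)
    if not 0 <= p_var < 0.75:
        raise ValueError("p / (1 - pinv) must lie in [0, 0.75)")
    # Gamma-corrected Jukes-Cantor distance at the variable sites.
    d_var = 0.75 * alpha * ((1 - 4 * p_var / 3) ** (-1 / alpha) - 1)
    # Average over all sites, invariant ones contributing zero.
    return (1 - pinv) * d_var


if __name__ == "__main__":
    p = 0.3
    print("classical JC:        ", jukes_cantor(p))
    print("gamma (alpha = 0.5): ", jukes_cantor_gamma_inv(p, 0.5))
    print("gamma + 20% invariant:", jukes_cantor_gamma_inv(p, 0.5, 0.2))
```

For the same observed proportion of differences, the corrected distance grows as rate variation becomes stronger (smaller alpha, larger pinv) and approaches the classical value as alpha becomes very large.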