Motivation: When analyzing protein sequences using sequence similarity
searches, orthologous sequences (that diverged by speciation) are more
reliable predictors of a new protein's function than paralogous sequences
(that diverged by gene duplication), because duplication enables functional
diversification. The utility of phylogenetic information in high-throughput
genome annotation ('phylogenomics') is widely recognized, but existing
approaches are either manual or indirect (e.g. not based on phylogenetic
trees). Our goal is to automate phylogenomics using explicit phylogenetic
inference. A necessary component is an algorithm to infer speciation and
duplication events in a given gene tree.
Results: We give an algorithm to infer speciation and duplication events on
a gene tree by comparison to a trusted species tree. This algorithm has a
worst-case running time of O(n^2), which is inferior to two previous
algorithms that run in approximately O(n) time for a gene tree of n
sequences. However, our algorithm is extremely simple, and its asymptotic
worst-case behavior is only realized on pathological data sets. We show
empirically, using 1750 gene trees constructed from the Pfam protein family
database, that it appears to be a practical (and often superior) algorithm
for analyzing real gene trees.
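
To make the idea of comparing a gene tree to a species tree concrete, the following is a minimal Python sketch of one common way to do it: map each gene tree node to the last common ancestor (LCA) of its children's species, and call the node a duplication when it maps to the same species tree node as one of its children. It assumes rooted binary trees and is not necessarily the exact procedure of the paper. The `Node` class, the function names, and the `GENE_species` leaf naming convention are all hypothetical, chosen only for illustration.

```python
class Node:
    """Node of a rooted tree; leaves carry a name, internal nodes carry children."""

    def __init__(self, name=None, children=()):
        self.name = name
        self.children = list(children)
        self.parent = None
        self.depth = 0       # used on species tree nodes only
        self.mapped = None   # gene tree nodes: the species tree node they map to
        self.event = None    # gene tree internal nodes: inferred event
        for child in self.children:
            child.parent = self


def prepare_species_tree(root):
    """Set the depth of every node and return a dict of leaves keyed by name."""
    leaves = {}
    stack = [root]
    while stack:
        node = stack.pop()
        if not node.children:
            leaves[node.name] = node
        for child in node.children:
            child.depth = node.depth + 1
            stack.append(child)
    return leaves


def lca(a, b):
    """Walk the deeper node upward until both pointers meet (naive LCA)."""
    while a is not b:
        if a.depth >= b.depth:
            a = a.parent
        else:
            b = b.parent
    return a


def infer_events(gene_root, species_root, species_of):
    """Label each internal gene tree node as a speciation or a duplication."""
    leaves = prepare_species_tree(species_root)

    def visit(g):  # postorder: children are mapped before their parent
        if not g.children:
            g.mapped = leaves[species_of(g.name)]
            return
        for child in g.children:
            visit(child)
        left, right = g.children  # assumption: binary gene tree
        g.mapped = lca(left.mapped, right.mapped)
        same = g.mapped is left.mapped or g.mapped is right.mapped
        g.event = "duplication" if same else "speciation"

    visit(gene_root)


# Hypothetical species tree: ((human, mouse), yeast)
human, mouse, yeast = Node("human"), Node("mouse"), Node("yeast")
hm = Node(children=[human, mouse])
species_root = Node(children=[hm, yeast])

# Hypothetical gene tree: (((A_human, A_mouse), (B_human, B_mouse)), A_yeast)
copy_a = Node(children=[Node("A_human"), Node("A_mouse")])
copy_b = Node(children=[Node("B_human"), Node("B_mouse")])
dup = Node(children=[copy_a, copy_b])
gene_root = Node(children=[dup, Node("A_yeast")])

infer_events(gene_root, species_root, lambda name: name.split("_")[1])
print(dup.event, gene_root.event)  # duplication speciation
```

In this sketch each LCA query walks parent pointers, so its cost grows with the height of the species tree. On a highly unbalanced species tree that cost can approach n per query, which is one way a simple method of this kind can reach quadratic time on pathological inputs while staying fast on typical trees.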