The self-organizing tree algorithm (SOTA) was recently introduced to
construct phylogenetic trees from biological sequences, based on the
principles of Kohonen's self-organizing maps and Fritzke's growing cell
structures. SOTA is designed so that the generation of new nodes can be
stopped once the sequences assigned to a node exceed a given similarity
threshold. In this way, a phylogenetic tree resolved at a high taxonomic
level can be obtained, a capability that is especially useful for
classifying sets of highly diversified sequences. SOTA was originally
designed to analyze pre-aligned sequences. It has now been adapted to
analyze patterns associated with the frequency of residues along a
sequence, such as protein dipeptide composition and other n-gram
compositions. In this work we show that the algorithm applied to these
data not only successfully constructs phylogenetic trees of protein
families, such as cytochrome c, triosephosphate isomerase, and hemoglobin
alpha chains, but also classifies highly diversified sequence data sets,
such as a mixture of interleukins and their receptors.
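
To make the notion of dipeptide and n-gram composition concrete, the following is a minimal sketch of how such a frequency pattern could be computed from an unaligned protein sequence. It is not the implementation used in this work: the function name ngram_composition, the use of overlapping windows, the normalization to relative frequencies, and the restriction to the 20 standard amino acids are all assumptions made for illustration.

```python
from collections import Counter
from itertools import product

# Assumption: only the 20 standard amino acids are counted.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def ngram_composition(sequence, n=2, alphabet=AMINO_ACIDS):
    """Relative frequency of every possible n-gram, in a fixed order.

    n=2 gives the dipeptide composition (20 * 20 = 400 values).
    Windows are overlapping; n-grams containing symbols outside
    the alphabet are ignored (both are illustrative assumptions).
    """
    seq = sequence.upper()
    # Fixed ordering of all possible n-grams: "AA", "AC", "AD", ...
    keys = ["".join(p) for p in product(alphabet, repeat=n)]
    # Count every overlapping window of length n.
    counts = Counter(seq[i:i + n] for i in range(len(seq) - n + 1))
    total = sum(counts[k] for k in keys)
    if total == 0:
        return [0.0] * len(keys)
    return [counts[k] / total for k in keys]


# Hypothetical toy sequence, not taken from any data set in this work.
vector = ngram_composition("ACAC")
```

Under these assumptions every sequence maps to a vector of the same length (400 for dipeptides) regardless of its own length, which is what allows unaligned sequences to be compared directly.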