G. Bejerano and G. Yona, Variations on probabilistic suffix trees: statistical modeling and prediction of protein families, Bioinformatics, 17(1), 2001, pp. 23-43
Motivation: We present a method for modeling protein families by means of
probabilistic suffix trees (PSTs). The method is based on identifying
significant patterns in a set of related protein sequences. The patterns can
be of arbitrary length, and the input sequences do not need to be aligned,
nor is delineation of domain boundaries required. The method is automatic and
can be applied, without assuming any preliminary biological information, with
surprising success. Basic biological considerations, such as amino acid
background probabilities and amino acid substitution probabilities, can be
incorporated to improve performance.
Results: The PST can serve as a predictive tool for protein sequence
classification and for detecting conserved patterns (possibly functionally or
structurally important) within protein sequences. The method was tested on
the Pfam database of protein families with more than satisfactory
performance. Exhaustive evaluations show that the PST model detects many more
related sequences than pairwise methods such as Gapped-BLAST, and is almost
as sensitive as a hidden Markov model that is trained from a multiple
alignment of the input sequences, while being much faster.
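
Illustration: the abstract does not give the construction details, so the following is only a minimal sketch of the general idea behind a variable-length context model, not the authors' algorithm. It counts which residue follows each short context in a set of unaligned sequences, predicts from the longest context it has stored, and scores a query by its average log-probability per residue. The names train, next_prob and score, and the parameters max_depth, min_count and pseudo, are hypothetical choices made for this sketch.

```python
import math
from collections import defaultdict

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def train(sequences, max_depth=3, min_count=2):
    """Count next-residue occurrences after every context of length 0..max_depth."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for i, sym in enumerate(seq):
            for d in range(0, min(max_depth, i) + 1):
                counts[seq[i - d:i]][sym] += 1  # d == 0 gives the empty context
    # Assumption of this sketch: keep a context only if it was seen often enough.
    return {ctx: dict(nxt) for ctx, nxt in counts.items()
            if ctx == "" or sum(nxt.values()) >= min_count}


def next_prob(model, history, sym, max_depth=3, pseudo=0.01):
    """Probability of sym given the longest suffix of history stored in the model."""
    for d in range(min(max_depth, len(history)), -1, -1):
        ctx = history[len(history) - d:]
        if ctx in model:
            nxt = model[ctx]
            total = sum(nxt.values())
            # Pseudocount smoothing so that unseen residues keep a nonzero probability.
            return (nxt.get(sym, 0) + pseudo) / (total + pseudo * len(ALPHABET))
    return 1.0 / len(ALPHABET)  # untrained model: fall back to uniform


def score(model, seq, max_depth=3):
    """Average log-probability per residue; higher means a better fit to the family."""
    if not seq:
        return 0.0
    total = sum(math.log(next_prob(model, seq[:i], seq[i], max_depth))
                for i in range(len(seq)))
    return total / len(seq)


if __name__ == "__main__":
    # Toy, made-up sequences used only to exercise the functions.
    family = ["ACDEFG", "ACDEFH", "KACDEF"]
    model = train(family)
    print(score(model, "ACDEF"), score(model, "WYWYW"))
```

In this sketch a query that shares contexts with the training set receives a higher score than an unrelated one, which is the sense in which such a model can be used for classification. The background and substitution probabilities mentioned in the abstract are not modeled here.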