Variations on probabilistic suffix trees: statistical modeling and prediction of protein families

Citation
G. Bejerano and G. Yona, Variations on probabilistic suffix trees: statistical modeling and prediction of protein families, BIOINFORMAT, 17(1), 2001, pp. 23-43
Number of citations
46
Subject categories
Multidisciplinary
Journal title
BIOINFORMATICS
Journal ISSN
1367-4803
Volume
17
Issue
1
Year of publication
2001
Pages
23 - 43
Database
ISI
SICI code
1367-4803(200101)17:1<23:VOPSTS>2.0.ZU;2-Z
Abstract
Motivation: We present a method for modeling protein families by means of probabilistic suffix trees (PSTs). The method is based on identifying significant patterns in a set of related protein sequences. The patterns can be of arbitrary length, and the input sequences do not need to be aligned, nor is delineation of domain boundaries required. The method is automatic, and can be applied, without assuming any preliminary biological information, with surprising success. Basic biological considerations such as amino acid background probabilities, and amino acids substitution probabilities can be incorporated to improve performance.
Results: The PST can serve as a predictive tool for protein sequence classification, and for detecting conserved patterns (possibly functionally or structurally important) within protein sequences. The method was tested on the Pfam database of protein families with more than satisfactory performance. Exhaustive evaluations show that the PST model detects much more related sequences than pairwise methods such as Gapped-BLAST, and is almost as sensitive as a hidden Markov model that is trained from a multiple alignment of the input sequences, while being much faster.