Variations on probabilistic suffix trees: statistical modeling and prediction of protein families

Authors

Bejerano, G; Yona, G

Citation

G. Bejerano and G. Yona, Variations on probabilistic suffix trees: statistical modeling and prediction of protein families, BIOINFORMAT, 17(1), 2001, pp. 23-43

Citations number

Subject categories

Multidisciplinary

Journal title

BIOINFORMATICS

ISSN journal

1367-4803

Volume

17

Issue

1

Year of publication

2001

Pages

23 - 43

Database

ISI

SICI code

1367-4803(200101)17:1<23:VOPSTS>2.0.ZU;2-Z

Abstract

Motivation: We present a method for modeling protein families by means of probabilistic suffix trees (PSTs). The method is based on identifying significant patterns in a set of related protein sequences. The patterns can be of arbitrary length, and the input sequences do not need to be aligned, nor is delineation of domain boundaries required. The method is automatic, and can be applied, without assuming any preliminary biological information, with surprising success. Basic biological considerations such as amino acid background probabilities and amino acid substitution probabilities can be incorporated to improve performance.

Results: The PST can serve as a predictive tool for protein sequence classification, and for detecting conserved patterns (possibly functionally or structurally important) within protein sequences. The method was tested on the Pfam database of protein families with more than satisfactory performance. Exhaustive evaluations show that the PST model detects many more related sequences than pairwise methods such as Gapped-BLAST, and is almost as sensitive as a hidden Markov model that is trained from a multiple alignment of the input sequences, while being much faster.