Systematic and fully automated identification of protein sequence patterns

Authors

Hart, RK Royyuru, AK Stolovitzky, G Califano, A

Citation

R.K. Hart et al., Systematic and fully automated identification of protein sequence patterns, J COMPUT BI, 7(3-4), 2000, pp. 585-600

Citations number

Subject categories

Biochemistry & Biophysics

Journal title

JOURNAL OF COMPUTATIONAL BIOLOGY

ISSN journal

1066-5277

Volume

7

Issue

3-4

Year of publication

2000

Pages

585 - 600

Database

ISI

SICI code

1066-5277(2000)7:3-4<585:SAFAIO>2.0.ZU;2-G

Abstract

We present an efficient algorithm to systematically and automatically identify patterns in protein sequence families. The procedure is based on the Splash deterministic pattern discovery algorithm and on a framework to assess the statistical significance of patterns. We demonstrate its application to the fully automated discovery of patterns in 974 PROSITE families (the complete subset of PROSITE families which are defined by patterns and contain DR records). Splash generates patterns with better specificity and undiminished sensitivity, or vice versa, in 28% of the families; identical statistics were obtained in 48% of the families, worse statistics in 15%, and mixed behavior in the remaining 9%. In about 75% of the cases, Splash patterns identify sequence sites that overlap more than 50% with the corresponding PROSITE pattern. The procedure is sufficiently rapid to enable its use for daily curation of existing motif and profile databases. Third, our results show that the statistical significance of discovered patterns correlates well with their biological significance. The trypsin subfamily of serine proteases is used to illustrate this method's ability to exhaustively discover all motifs in a family that are statistically and biologically significant. Finally, we discuss applications of sequence patterns to multiple sequence alignment and the training of more sensitive score-based motif models, akin to the procedure used by PSI-BLAST. All results are available at http://www.research.ibm.com/spat/.
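
The comparison described in the abstract rests on three quantities: the sensitivity and specificity of a pattern, and the overlap between the sites matched by two patterns. The following is a minimal Python sketch of how such quantities might be computed for a PROSITE-style pattern. It does not implement Splash or the authors' significance framework. The function names, the toy sequences, and the example pattern are hypothetical. Sensitivity and specificity follow the standard definitions (TP/(TP+FN) and TN/(TN+FP)), which may differ from the definitions used in the paper.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern such as 'C-x(2,4)-[ST]-{P}' into a regex."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        element = element.replace("{", "[^").replace("}", "]")  # {P}    -> [^P]
        element = element.replace("(", "{").replace(")", "}")    # x(2,4) -> x{2,4}
        element = element.replace("x", ".")                      # x      -> any residue
        element = element.replace("<", "^").replace(">", "$")    # terminus anchors
        parts.append(element)
    return "".join(parts)


def sensitivity_specificity(pattern, members, non_members):
    """Return (sensitivity, specificity) of a pattern on two non-empty sequence sets.

    members: sequences annotated as belonging to the family.
    non_members: sequences annotated as not belonging to it.
    """
    rx = re.compile(prosite_to_regex(pattern))
    tp = sum(1 for seq in members if rx.search(seq))      # family members that match
    fp = sum(1 for seq in non_members if rx.search(seq))  # non-members that match
    fn = len(members) - tp
    tn = len(non_members) - fp
    return tp / (tp + fn), tn / (tn + fp)


def overlap_fraction(site_a, site_b):
    """Fraction of site_b covered by site_a; sites are (start, end), end exclusive."""
    shared = min(site_a[1], site_b[1]) - max(site_a[0], site_b[0])
    return max(shared, 0) / (site_b[1] - site_b[0])


# Toy data, invented purely for illustration (not real protein sequences).
members = ["MKCAASTGL", "CGGGGSAQ", "MKLVCAP"]
non_members = ["MKLLVVAA", "CAASP", "ACGGSAK", "TTTT"]

sens, spec = sensitivity_specificity("C-x(2,4)-[ST]-{P}", members, non_members)
site_overlap = overlap_fraction((2, 8), (5, 9))
```

On this toy data the pattern matches two of the three family members and one of the four non-members, so the sketch yields a sensitivity of 2/3 and a specificity of 0.75. The two example sites share three of the four positions of the second site, an overlap of 0.75, which would count as "more than 50%" in the sense used above.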