Identifying DNA and protein patterns with statistically significant alignments of multiple sequences

Authors

Hertz, GZ; Stormo, GD

Citation

G.Z. Hertz and G.D. Stormo, Identifying DNA and protein patterns with statistically significant alignments of multiple sequences, Bioinformatics, 15(7-8), 1999, pp. 563-577

Subject categories

Multidisciplinary

Journal title

BIOINFORMATICS

ISSN journal

1367-4803

Volume

15

Issue

7-8

Year of publication

1999

Pages

563 - 577

Database

ISI

SICI code

1367-4803(199907/08)15:7-8<563:IDAPPW>2.0.ZU;2-J

Abstract

Motivation: Molecular biologists frequently can obtain interesting insight by aligning a set of related DNA, RNA or protein sequences. Such alignments can be used to determine either evolutionary or functional relationships. Our interest is in identifying functional relationships. Unless the sequences are very similar, it is necessary to have a specific strategy for measuring, or scoring, the relatedness of the aligned sequences. If the alignment is not known, one can be determined by finding an alignment that optimizes the scoring scheme.

Results: We describe four components to our approach for determining alignments of multiple sequences. First, we review a log-likelihood scoring scheme we call information content. Second, we describe two methods for estimating the P value of an individual information content score: (i) a method that combines a technique from large-deviation statistics with numerical calculations; (ii) a method that is exclusively numerical. Third, we describe how we count the number of possible alignments given the overall amount of sequence data. This count is multiplied by the P value to determine the expected frequency of an information content score and, thus, the statistical significance of the corresponding alignment. Statistical significance can be used to compare alignments having differing widths and containing differing numbers of sequences. Fourth, we describe a greedy algorithm for determining alignments of functionally related sequences. Finally, we test the accuracy of our P value calculations, and give an example of using our algorithm to identify binding sites for the Escherichia coli CRP protein.

Availability: Programs were developed under the UNIX operating system and are available by anonymous ftp from ftp://beagle.colorado.edu/pub/consensus.

Contact: hertz@colorado.edu.
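
The scoring and significance steps summarized in the abstract can be illustrated with a minimal sketch. It assumes the common relative-entropy form of information content, the sum over alignment columns of f * ln(f / p), where f is the observed letter frequency in a column and p is its background frequency. The abstract does not give the exact formula, any small-sample correction, or the rule for counting possible alignments, so the function names `information_content` and `expected_frequency`, the uniform background, and the example sites are all illustrative assumptions rather than the authors' implementation. The P value and the alignment count are taken as inputs here, not computed.

```python
import math
from collections import Counter


def information_content(aligned, background):
    """Relative-entropy score of an ungapped alignment, in nats.

    Assumed form: sum over columns of sum over letters of f * ln(f / p),
    with f the observed column frequency and p the background frequency.
    """
    n = len(aligned)
    width = len(aligned[0])
    if any(len(seq) != width for seq in aligned):
        raise ValueError("aligned sequences must all have the same width")
    total = 0.0
    for col in range(width):
        counts = Counter(seq[col] for seq in aligned)
        for letter, count in counts.items():
            f = count / n
            total += f * math.log(f / background[letter])
    return total


def expected_frequency(p_value, num_alignments):
    """Expected number of alignments scoring at least this well:
    the P value multiplied by the count of possible alignments."""
    return p_value * num_alignments


# Hypothetical example data: four identical 5-letter sites, uniform background.
uniform = {letter: 0.25 for letter in "ACGT"}
sites = ["TGTGA", "TGTGA", "TGTGA", "TGTGA"]
score = information_content(sites, uniform)
```

With a uniform background, a perfectly conserved column contributes ln 4 and a column in which all four letters are equally frequent contributes 0, so higher scores correspond to more conserved alignments. Some formulations scale the score by the number of sequences; that choice is not specified in the abstract.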