Identifying DNA and protein patterns with statistically significant alignments of multiple sequences

Citation
G.Z. Hertz and G.D. Stormo, Identifying DNA and protein patterns with statistically significant alignments of multiple sequences, Bioinformatics, 15(7-8), 1999, pp. 563-577
Number of citations
25
Subject Categories
Multidisciplinary
Journal title
BIOINFORMATICS
ISSN journal
1367-4803
Volume
15
Issue
7-8
Year of publication
1999
Pages
563 - 577
Database
ISI
SICI code
1367-4803(199907/08)15:7-8<563:IDAPPW>2.0.ZU;2-J
Abstract
Motivation: Molecular biologists frequently can obtain interesting insight by aligning a set of related DNA, RNA or protein sequences. Such alignments can be used to determine either evolutionary or functional relationships. Our interest is in identifying functional relationships. Unless the sequences are very similar, it is necessary to have a specific strategy for measuring, or scoring, the relatedness of the aligned sequences. If the alignment is not known, one can be determined by finding an alignment that optimizes the scoring scheme.
Results: We describe four components to our approach for determining alignments of multiple sequences. First, we review a log-likelihood scoring scheme we call information content. Second, we describe two methods for estimating the P value of an individual information content score: (i) a method that combines a technique from large-deviation statistics with numerical calculations; (ii) a method that is exclusively numerical. Third, we describe how we count the number of possible alignments given the overall amount of sequence data. This count is multiplied by the P value to determine the expected frequency of an information content score and thus the statistical significance of the corresponding alignment. Statistical significance can be used to compare alignments having differing widths and containing differing numbers of sequences. Fourth, we describe a greedy algorithm for determining alignments of functionally related sequences. Finally, we test the accuracy of our P value calculations, and give an example of using our algorithm to identify binding sites for the Escherichia coli CRP protein.
Availability: Programs were developed under the UNIX operating system and are available by anonymous ftp from ftp://beagle.colorado.edu/pub/consensus.
Contact: hertz@colorado.edu.
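Illustration (not part of the original record): the abstract names a log-likelihood score called information content and an expected frequency obtained by multiplying a P value by a count of possible alignments, but it gives no formulas. The sketch below assumes the common relative-entropy form, summing f_ij * ln(f_ij / p_j) over alignment columns i and letters j, where f_ij is the observed letter frequency and p_j an a priori letter probability. The function names, the uniform background and the toy sequences are hypothetical, and this is not the authors' program.

```python
import math
from collections import Counter


def information_content(aligned, background):
    """Relative-entropy score of an ungapped alignment, in nats.

    Assumed form: sum over columns i and letters j of
    f_ij * ln(f_ij / p_j). Letters absent from a column contribute 0.
    """
    n = len(aligned)
    width = len(aligned[0])
    if any(len(seq) != width for seq in aligned):
        raise ValueError("aligned sequences must share one width")
    total = 0.0
    for i in range(width):
        counts = Counter(seq[i] for seq in aligned)
        for letter, count in counts.items():
            f = count / n
            total += f * math.log(f / background[letter])
    return total


def expected_frequency(p_value, n_alignments):
    """Expected number of alignments scoring at least this well:
    the P value multiplied by the count of possible alignments."""
    return p_value * n_alignments


# Hypothetical uniform background over the four DNA letters.
uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
site_score = information_content(["TGTGA", "TGTGA", "TGTGA"], uniform)
```

A perfectly conserved column scores ln(1/p_j) under this form, and a column whose letter frequencies match the background scores zero, so higher totals indicate alignments that are less likely to arise by chance.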