G.Z. Hertz and G.D. Stormo, Identifying DNA and protein patterns with statistically significant alignments of multiple sequences, Bioinformatics, 15(7-8), 1999, pp. 563-577
Motivation: Molecular biologists can frequently obtain interesting insight
by aligning a set of related DNA, RNA or protein sequences. Such alignments
can be used to determine either evolutionary or functional relationships.
Our interest is in identifying functional relationships. Unless the
sequences are very similar, it is necessary to have a specific strategy for
measuring, or scoring, the relatedness of the aligned sequences. If the
alignment is not known, one can be determined by finding an alignment that
optimizes the scoring scheme.
Results: We describe four components to our approach for determining
alignments of multiple sequences. First, we review a log-likelihood scoring
scheme we call information content. Second, we describe two methods for
estimating the P value of an individual information content score: (i) a
method that combines a technique from large-deviation statistics with
numerical calculations; (ii) a method that is exclusively numerical. Third,
we describe how we count the number of possible alignments given the overall
amount of sequence data. This count is multiplied by the P value to
determine the expected frequency of an information content score and thus
the statistical significance of the corresponding alignment. Statistical
significance can be used to compare alignments having differing widths and
containing differing numbers of sequences. Fourth, we describe a greedy
algorithm for determining alignments of functionally related sequences.
Finally, we test the accuracy of our P value calculations and give an
example of using our algorithm to identify binding sites for the
Escherichia coli CRP protein.
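Illustration: The scoring and significance steps described above can be
sketched in a few lines of Python. This is a minimal sketch, not the
authors' program. It assumes that information content takes the usual
relative-entropy form, the sum over alignment columns j and letters i of
f_ij * ln(f_ij / p_i), where f_ij is the observed letter frequency and p_i
the a priori letter probability. It also assumes the simplest counting
case, exactly one site per sequence on a single strand. The function names
(information_content, count_alignments, expected_frequency) are
hypothetical, and the P value is taken as given rather than computed.

```python
import math
from collections import Counter


def information_content(sites, background):
    """Relative-entropy score of an ungapped alignment, in nats.

    sites: equal-length strings, one aligned site per sequence.
    background: dict mapping each letter to its a priori probability.
    Assumed form: sum over columns j and letters i of f_ij * ln(f_ij / p_i).
    """
    n = len(sites)
    width = len(sites[0])
    total = 0.0
    for j in range(width):
        counts = Counter(site[j] for site in sites)
        for letter, c in counts.items():
            f = c / n  # observed frequency of this letter in column j
            total += f * math.log(f / background[letter])
    return total


def count_alignments(seq_lengths, width):
    """Number of possible alignments under a simplifying assumption:
    exactly one site of the given width per sequence, one strand only."""
    count = 1
    for length in seq_lengths:
        count *= max(length - width + 1, 0)
    return count


def expected_frequency(p_value, n_alignments):
    """Expected number of alignments scoring at least this well by chance."""
    return p_value * n_alignments


if __name__ == "__main__":
    uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
    sites = ["TGTGA", "TGTGA", "TGCGA"]  # toy data, not real CRP sites
    print(information_content(sites, uniform))
    # The P value here is a placeholder, not a computed result.
    print(expected_frequency(1e-6, count_alignments([100, 100, 100], 5)))
```

Because the expected frequency already accounts for the number of
alignments that could have been formed, it can be compared across
alignments of different widths and different numbers of sequences, whereas
raw information content scores cannot.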
Availability: Programs were developed under the UNIX operating system and
are available by anonymous ftp from ftp://beagle.colorado.edu/pub/consensus.
Contact: hertz@colorado.edu.