F. Wolfertstetter et al., IDENTIFICATION OF FUNCTIONAL ELEMENTS IN UNALIGNED NUCLEIC-ACID SEQUENCES BY A NOVEL TUPLE SEARCH ALGORITHM, Computer applications in the biosciences, 12(1), 1996, pp. 71-80
We present an algorithm to identify potential functional elements, such
as protein binding sites, in DNA sequences solely from nucleotide
sequence data. Prerequisites are a set of at least seven not closely
related sequences with a common biological function that is correlated
with one or more unknown sequence elements present in most, but not
necessarily all, of the sequences. The algorithm is based on a search
for n-tuples that occur in at least a minimum percentage of the
sequences with no or one mismatch, which may be at any position of the
tuple. In contrast to functional tuples, random tuples show no preferred
pattern of mismatch locations within the tuple, nor is their
conservation extended beyond the tuple. Both features of functional
tuples are used to eliminate random tuples. Selection is carried out by
maximization of the information content, first for the n-tuple, then for
a region containing the tuple, and finally for the complete binding
site. Further matches are found in an additional selection step, using
the previously described ConsInd method. The algorithm is capable of
identifying and delimiting elements (e.g. protein binding sites)
represented by single short cores (e.g. the TATA box) in sets of
unaligned sequences of about 500 nucleotides, using no information other
than the nucleotide sequences. Furthermore, we show its ability to
identify multiple elements in a set of complete LTR sequences (more than
600 nucleotides per sequence).
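
The first two ideas in the abstract, the n-tuple search with at most one
mismatch and the scoring by information content, can be illustrated with
a minimal Python sketch. This is not the authors' implementation. The
function names (hamming, best_match, find_tuples, information_content)
are hypothetical, and because the abstract does not give the formula for
information content, the sketch assumes the usual Shannon-style measure
of 2 + sum(p * log2 p) bits per position for a four-letter DNA alphabet,
without any small-sample correction. The mismatch-pattern filter, the
region extension, and the ConsInd step are not covered.

```python
import math
from collections import Counter


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def best_match(tuple_, seq, max_mismatch=1):
    """Return (mismatches, window) for the best window in seq, or None.

    Only windows with at most max_mismatch mismatches qualify; the first
    window with the fewest mismatches wins.
    """
    n = len(tuple_)
    best = None
    for i in range(len(seq) - n + 1):
        window = seq[i:i + n]
        d = hamming(tuple_, window)
        if d <= max_mismatch and (best is None or d < best[0]):
            best = (d, window)
    return best


def find_tuples(seqs, n, min_fraction):
    """Map each qualifying n-tuple to its matched windows.

    Candidates are all n-tuples that occur in any sequence. A candidate
    qualifies if it matches (0 or 1 mismatch) in at least min_fraction
    of the sequences. One window is kept per matching sequence.
    """
    candidates = {s[i:i + n] for s in seqs for i in range(len(s) - n + 1)}
    hits = {}
    for cand in sorted(candidates):
        matches = (best_match(cand, s) for s in seqs)
        windows = [m[1] for m in matches if m is not None]
        if len(windows) / len(seqs) >= min_fraction:
            hits[cand] = windows
    return hits


def information_content(windows):
    """Total information in bits over all positions of aligned windows.

    Assumed definition: per position, 2 + sum(p * log2 p) over the
    observed base frequencies p (uniform background, no correction).
    """
    total = 0.0
    for column in zip(*windows):
        counts = Counter(column)
        size = len(column)
        total += 2.0 + sum((c / size) * math.log2(c / size)
                           for c in counts.values())
    return total
```

A candidate returned by find_tuples could then be ranked by calling
information_content on its matched windows, so that tuples whose matches
agree at more positions score higher. The brute-force search shown here
is quadratic in the total sequence length and is meant only to make the
selection criterion concrete.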