IDENTIFICATION OF FUNCTIONAL ELEMENTS IN UNALIGNED NUCLEIC-ACID SEQUENCES BY A NOVEL TUPLE SEARCH ALGORITHM

Citation
F. Wolfertstetter et al., IDENTIFICATION OF FUNCTIONAL ELEMENTS IN UNALIGNED NUCLEIC-ACID SEQUENCES BY A NOVEL TUPLE SEARCH ALGORITHM, Computer applications in the biosciences, 12(1), 1996, pp. 71-80
Number of citations
18
Subject Categories
"Mathematical Methods, Biology & Medicine","Computer Sciences, Special Topics","Computer Science Interdisciplinary Applications","Biology Miscellaneous"
ISSN journal
0266-7061
Volume
12
Issue
1
Year of publication
1996
Pages
71 - 80
Database
ISI
SICI code
0266-7061(1996)12:1<71:IOFEIU>2.0.ZU;2-M
Abstract
We present an algorithm to identify potential functional elements, such as protein binding sites, in DNA sequences solely from nucleotide sequence data. Prerequisites are a set of at least seven not closely related sequences with a common biological function that is correlated with one or more unknown sequence elements present in most, but not necessarily all, of the sequences. The algorithm is based on a search for n-tuples that occur in at least a minimum percentage of the sequences with no or one mismatch, which may be at any position of the tuple. In contrast to functional tuples, random tuples show no preferred pattern of mismatch locations within the tuple, nor is the conservation extended beyond the tuple. Both features of functional tuples are used to eliminate random tuples. Selection is carried out by maximization of the information content, first for the n-tuple, then for a region containing the tuple, and finally for the complete binding site. Further matches are found in an additional selection step, using the ConsInd method previously described. The algorithm is capable of identifying and delimiting elements (e.g. protein binding sites) represented by single short cores (e.g. TATA box) in sets of unaligned sequences of about 500 nucleotides using no information other than the nucleotide sequences. Furthermore, we show its ability to identify multiple elements in a set of complete LTR sequences (more than 600 nucleotides per sequence).
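
Illustrative sketch
The abstract does not give implementation details, so the following Python fragment is only a minimal sketch of the first step it describes: collecting n-tuples that occur, with no or one mismatch, in at least a minimum fraction of the sequences. It also shows one common way to score information content (Shannon information per column, assuming a uniform background of 2 bits per position). The function names, the choice to count every one-substitution variant as a candidate tuple, and the scoring formula are assumptions made for illustration, not the authors' method. The mismatch-pattern filter, the extension to the surrounding region, and the ConsInd step are not shown.

```python
from collections import defaultdict
from math import log2


def neighbors(tup, alphabet="ACGT"):
    """Yield tup itself and every variant that differs by one substitution."""
    yield tup
    for i, ch in enumerate(tup):
        for base in alphabet:
            if base != ch:
                yield tup[:i] + base + tup[i + 1:]


def find_tuples(seqs, n, min_fraction):
    """Map each candidate n-tuple to the indices of sequences containing it
    with at most one mismatch, keeping tuples found in at least
    min_fraction of the sequences (hypothetical helper)."""
    hits = defaultdict(set)
    for idx, seq in enumerate(seqs):
        for pos in range(len(seq) - n + 1):
            # Every window supports itself and all its one-mismatch variants.
            for cand in neighbors(seq[pos:pos + n]):
                hits[cand].add(idx)
    needed = min_fraction * len(seqs)
    return {t: s for t, s in hits.items() if len(s) >= needed}


def information_content(aligned):
    """Sum over columns of 2 + sum_b p_b * log2(p_b), in bits.
    Assumes equal-length strings and a uniform background (an assumption)."""
    total = 0.0
    for column in zip(*aligned):
        size = len(column)
        freqs = [column.count(b) / size for b in set(column)]
        total += 2.0 + sum(p * log2(p) for p in freqs)
    return total


# Toy data (invented for illustration, not from the paper).
seqs = ["GGTATAAAGG", "CCTATATACC", "ATTATAAAGT", "CGCGCGCGCG"]
found = find_tuples(seqs, n=6, min_fraction=0.75)
```

In this toy set, TATAAA is reported because it occurs exactly in the first and third sequences and with one mismatch (TATATA) in the second. Candidates that pass this filter would then be ranked, for example by `information_content` over their aligned matches.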