IDENTIFICATION OF FUNCTIONAL ELEMENTS IN UNALIGNED NUCLEIC-ACID SEQUENCES BY A NOVEL TUPLE SEARCH ALGORITHM

Authors

WOLFERTSTETTER F, FRECH K, HERRMANN G, WERNER T

Citation

F. Wolfertstetter et al., IDENTIFICATION OF FUNCTIONAL ELEMENTS IN UNALIGNED NUCLEIC-ACID SEQUENCES BY A NOVEL TUPLE SEARCH ALGORITHM, Computer applications in the biosciences, 12(1), 1996, pp. 71-80

Subject Categories

Mathematical Methods, Biology & Medicine; Computer Sciences, Special Topics; Computer Science Interdisciplinary Applications; Biology Miscellaneous

Journal title

Computer applications in the biosciences

ISSN journal

0266-7061

Volume

12

Issue

1

Year of publication

1996

Pages

71 - 80

Database

ISI

SICI code

0266-7061(1996)12:1<71:IOFEIU>2.0.ZU;2-M

Abstract

We present an algorithm to identify potential functional elements, such as protein binding sites, in DNA sequences solely from nucleotide sequence data. Prerequisites are a set of at least seven not closely related sequences with a common biological function that is correlated to one or more unknown sequence elements present in most, but not necessarily all, of the sequences. The algorithm is based on a search for n-tuples that occur in at least a minimum percentage of the sequences with no or one mismatch, which may be at any position of the tuple. In contrast to functional tuples, random tuples show no preferred pattern of mismatch locations within the tuple, nor is the conservation extended beyond the tuple. Both features of functional tuples are used to eliminate random tuples. Selection is carried out by maximization of the information content, first for the n-tuple, then for a region containing the tuple, and finally for the complete binding site. Further matches are found in an additional selection step, using the ConsInd method previously described. The algorithm is capable of identifying and delimiting elements (e.g. protein binding sites) represented by single short cores (e.g. TATA box) in sets of unaligned sequences of about 500 nucleotides using no information other than the nucleotide sequences. Furthermore, we show its ability to identify multiple elements in a set of complete LTR sequences (more than 600 nucleotides per sequence).
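
Illustrative sketch

The abstract gives no implementation details, so the following Python sketch is only a rough illustration of the first two ideas it names: collecting n-tuples that recur with at most one mismatch in a minimum fraction of the sequences, and scoring aligned occurrences by information content. All function names, the brute-force scan, the uniform background (2 bits per position) and the lack of a small-sample correction are assumptions made for this example. The paper's mismatch-pattern filter, region extension and ConsInd step are not modelled.

```python
from math import log2


def matches_with_one_mismatch(tuple_, window):
    """True if window differs from tuple_ at no more than one position."""
    return sum(a != b for a, b in zip(tuple_, window)) <= 1


def find_occurrence(tuple_, seq):
    """First index where tuple_ matches seq with <= 1 mismatch, else None."""
    n = len(tuple_)
    for i in range(len(seq) - n + 1):
        if matches_with_one_mismatch(tuple_, seq[i:i + n]):
            return i
    return None


def candidate_tuples(seqs, n, min_fraction):
    """Map each n-tuple seen in seqs to its per-sequence hit positions,
    keeping only tuples found in at least min_fraction of the sequences.
    (Hypothetical helper: a plain exhaustive scan, not the paper's search.)"""
    seen = {s[i:i + n] for s in seqs for i in range(len(s) - n + 1)}
    result = {}
    for t in seen:
        hits = [find_occurrence(t, s) for s in seqs]
        support = sum(h is not None for h in hits)
        if support / len(seqs) >= min_fraction:
            result[t] = hits
    return result


def information_content(windows):
    """Sum over alignment columns of (2 - Shannon entropy) in bits,
    assuming a uniform A/C/G/T background (an assumption of this sketch)."""
    total = 0.0
    for column in zip(*windows):
        entropy = 0.0
        for base in "ACGT":
            p = column.count(base) / len(column)
            if p > 0:
                entropy -= p * log2(p)
        total += 2.0 - entropy
    return total


if __name__ == "__main__":
    # Toy data invented for illustration only (not from the paper).
    seqs = ["GGTATAAACC", "CCTATAAAGG", "AATATATAGT", "TTTATAAACA", "GCGCGCGCGC"]
    cands = candidate_tuples(seqs, n=6, min_fraction=0.8)
    hits = cands["TATAAA"]
    windows = [s[h:h + 6] for s, h in zip(seqs, hits) if h is not None]
    print(hits, round(information_content(windows), 3))
```

On the toy data, the tuple TATAAA is retained because it occurs, exactly or with a single mismatch, in four of the five sequences. Ranking the retained tuples by the information content of their aligned occurrences is one plausible reading of the first selection step described above.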