Building a dictionary for genomes: Identification of presumptive regulatory sites by statistical analysis

Citation
H.J. Bussemaker et al., Building a dictionary for genomes: Identification of presumptive regulatory sites by statistical analysis, Proc. Natl. Acad. Sci. USA, 97(18), 2000, pp. 10096-10100
Number of citations
23
Subject categories
Multidisciplinary
Journal title
PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA
Journal ISSN
0027-8424
Volume
97
Issue
18
Year of publication
2000
Pages
10096 - 10100
Database
ISI
SICI code
0027-8424(20000829)97:18<10096:BADFGI>2.0.ZU;2-4
Abstract
The availability of complete genome sequences and mRNA expression data for all genes creates new opportunities and challenges for identifying DNA sequence motifs that control gene expression. An algorithm, "MobyDick," is presented that decomposes a set of DNA sequences into the most probable dictionary of motifs or words. This method is applicable to any set of DNA sequences: for example, all upstream regions in a genome or all genes expressed under certain conditions. Identification of words is based on a probabilistic segmentation model in which the significance of longer words is deduced from the frequency of shorter ones of various lengths, eliminating the need for a separate set of reference data to define probabilities. We have built a dictionary with 1,200 words for the 6,000 upstream regulatory regions in the yeast genome; the 500 most significant words (some with as few as 10 copies in all of the upstream regions) match 114 of 443 experimentally determined sites (a significance level of 18 standard deviations). When analyzing all of the genes up-regulated during sporulation as a group, we find many motifs in addition to the few previously identified by analyzing the expression subclusters individually. Applying MobyDick to the genes derepressed when the general repressor Tup1 is deleted, we find known as well as putative binding sites for its regulatory partners.
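
The probabilistic segmentation model mentioned in the abstract can be illustrated with a minimal sketch. It assumes that a sequence is generated by concatenating words drawn independently from a dictionary, so the likelihood of the sequence is the sum over all segmentations of the product of the word probabilities. The function name `segmentation_likelihood` and the toy dictionary are hypothetical and chosen for illustration only; this is not the published MobyDick implementation, which also fits the word probabilities and tests candidate words for significance.

```python
from typing import Dict


def segmentation_likelihood(seq: str, word_probs: Dict[str, float]) -> float:
    """Sum the probabilities of all ways to split `seq` into dictionary words.

    Assumption (not taken from the record): words are drawn independently,
    so each segmentation contributes the product of its word probabilities.
    """
    n = len(seq)
    # z[i] holds the total probability of all segmentations of seq[:i].
    z = [0.0] * (n + 1)
    z[0] = 1.0  # the empty prefix has exactly one (empty) segmentation
    for end in range(1, n + 1):
        for word, p in word_probs.items():
            start = end - len(word)
            # Extend every segmentation of seq[:start] by a word ending at `end`.
            if start >= 0 and seq[start:end] == word:
                z[end] += p * z[start]
    return z[n]


# Toy dictionary (hypothetical values): two single letters and one longer word.
toy_dictionary = {"A": 0.5, "T": 0.3, "AT": 0.2}

# "AT" can be read as A|T (0.5 * 0.3) or as the single word AT (0.2).
likelihood_at = segmentation_likelihood("AT", toy_dictionary)
```

In this toy case the longer word "AT" adds probability beyond what the single-letter reading A|T explains. This is the general idea behind judging longer words against the frequencies of shorter ones, although the significance test used in the paper is not reproduced here.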