We performed a systematic analysis of gene upstream regions in the yeast
genome for occurrences of regular expression-type patterns, with the goal
of identifying potential regulatory elements. To this end, we developed a
new sequence pattern discovery algorithm that searches exhaustively for a
priori unknown regular expression-type patterns that are over-represented
in a given set of sequences. We applied the algorithm in two cases: (1)
discovery of patterns in the complete set of >6000 sequences taken upstream
of the putative yeast genes and (2) discovery of patterns in the regions
upstream of genes with similar expression profiles. In the first case, we
looked for patterns that occur more frequently in the gene upstream regions
than in the genome overall. In the second case, we first clustered the
upstream regions of all the genes by similarity of their expression
profiles, on the basis of publicly available gene expression data, and then
looked for sequence patterns that are over-represented in each cluster. In
both cases we considered each pattern that occurred in at least some
minimum number of sequences and rated it on the basis of its
over-representation. Among the highest-rating patterns, most have matches
to substrings in known yeast transcription factor-binding sites. Moreover,
several of them are known to be relevant to the expression of the genes
from the respective clusters. Experiments on simulated data show that the
majority of the discovered patterns are not expected to occur by chance.
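
The rating step described above can be illustrated with a minimal Python sketch. It is not the algorithm used in this work: it does not search the pattern space exhaustively, it only rates a handful of candidate patterns supplied by hand, and the scoring rule (a binomial tail probability against the background frequency, with a pseudocount) is an assumption chosen for illustration. The function names `count_matching`, `binomial_tail` and `rate_patterns`, the `min_support` parameter and the toy sequences are all hypothetical.

```python
import re
from math import comb


def count_matching(pattern, sequences):
    """Return how many sequences contain at least one match of the pattern."""
    rx = re.compile(pattern)
    return sum(1 for seq in sequences if rx.search(seq))


def binomial_tail(k, n, p):
    """Return P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def rate_patterns(patterns, cluster, background, min_support=2):
    """Rate candidate patterns by over-representation in `cluster`.

    Assumption: the score is the probability of seeing at least as many
    matching cluster sequences if each sequence matched independently at
    the background rate. Smaller scores mean stronger over-representation.
    """
    rated = []
    for pat in patterns:
        k = count_matching(pat, cluster)
        if k < min_support:  # keep only patterns with a minimum number of occurrences
            continue
        b = count_matching(pat, background)
        # Pseudocount keeps the background frequency strictly between 0 and 1.
        p = (b + 1) / (len(background) + 2)
        rated.append((pat, k, b, binomial_tail(k, len(cluster), p)))
    return sorted(rated, key=lambda r: r[3])


# Toy data, invented for illustration only.
cluster = ["ACGCGTAA", "TTACGCGT", "ACGCGTTT", "GGGGCCCC"]
background = cluster + [
    "AAAAAAAA", "TTTTTTTT", "CCCCGGGG", "GATTACAG",
    "CATCATCA", "TGTGTGTG", "AGAGAGAG", "CTCTCTCT",
]

for pat, k, b, score in rate_patterns(["ACGCGT", "[AT]{2}", "G{4}"], cluster, background):
    print(f"{pat}: {k}/{len(cluster)} in cluster, {b}/{len(background)} in background, score={score:.4f}")
```

In this toy run the pattern `G{4}` is dropped because it matches only one cluster sequence, and `ACGCGT` is rated above `[AT]{2}` because it is rare in the background but frequent in the cluster. A real application would replace the hand-written candidate list with a systematic enumeration of patterns and would need a correction for the large number of patterns tested.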