Clustering is one of the main mathematical challenges in large-scale gene e
xpression analysis. We describe a clustering procedure based on a sequentia
l k-means algorithm with additional I refinements that is able to handle hi
gh-throughput data in the order of hundreds of thousands of data items meas
ured on hundreds of variables. The practical motivation for our algorithm i
s oligonucleotide fingerprinting-a method for simultaneous determination of
expression level for every active gene of a specific tissue-although the a
lgorithm can be applied as well Co other large-scale projects like EST clus
tering and qualitative clustering of DNA-chip data. As a pairwise similarit
y measure between two p-dimensional data points, x and y, we introduce mutu
al information that can be interpreted as the amount of information about x
in y,and vice versa. We show that For our purposes this measure is superio
r to commonly used metric distances, for example, Euclidean distance. We al
so introduce a modified version of mutual information as a novel method for
validating clustering results when the true clustering is known. The perfo
rmance of our algorithm with respect to experimental noise is shown by exte
nsive simulation studies. The algorithm is tested on a subset of 2029 cDNA
clones coming from 15 different genes From a cDNA library derived from huma
n dendritic cells. Furthermore, the clustering of these 2029 cDNA clones is
demonstrated when the entire set of 76,032 cDNA clones is processed.