Large-scale clustering of cDNA-fingerprinting data

Citation
R. Herwig et al., Large-scale clustering of cDNA-fingerprinting data, GENOME RES, 9(11), 1999, pp. 1093-1105
Citations number
30
Categorie Soggetti
Molecular Biology & Genetics
Journal title
GENOME RESEARCH
ISSN journal
10889051 → ACNP
Volume
9
Issue
11
Year of publication
1999
Pages
1093 - 1105
Database
ISI
SICI code
1088-9051(199911)9:11<1093:LCOCD>2.0.ZU;2-2
Abstract
Clustering is one of the main mathematical challenges in large-scale gene e xpression analysis. We describe a clustering procedure based on a sequentia l k-means algorithm with additional I refinements that is able to handle hi gh-throughput data in the order of hundreds of thousands of data items meas ured on hundreds of variables. The practical motivation for our algorithm i s oligonucleotide fingerprinting-a method for simultaneous determination of expression level for every active gene of a specific tissue-although the a lgorithm can be applied as well Co other large-scale projects like EST clus tering and qualitative clustering of DNA-chip data. As a pairwise similarit y measure between two p-dimensional data points, x and y, we introduce mutu al information that can be interpreted as the amount of information about x in y,and vice versa. We show that For our purposes this measure is superio r to commonly used metric distances, for example, Euclidean distance. We al so introduce a modified version of mutual information as a novel method for validating clustering results when the true clustering is known. The perfo rmance of our algorithm with respect to experimental noise is shown by exte nsive simulation studies. The algorithm is tested on a subset of 2029 cDNA clones coming from 15 different genes From a cDNA library derived from huma n dendritic cells. Furthermore, the clustering of these 2029 cDNA clones is demonstrated when the entire set of 76,032 cDNA clones is processed.