Concept decompositions for large sparse text data using clustering

Citation
Is. Dhillon et Ds. Modha, Concept decompositions for large sparse text data using clustering, MACH LEARN, 42(1-2), 2001, pp. 143-175
Citations number
40
Categorie Soggetti
AI Robotics and Automatic Control
Journal title
MACHINE LEARNING
ISSN journal
08856125 → ACNP
Volume
42
Issue
1-2
Year of publication
2001
Pages
143 - 175
Database
ISI
SICI code
0885-6125(200101)42:1-2<143:CDFLST>2.0.ZU;2-2
Abstract
Unlabeled document collections are becoming increasingly common and availab le; mining such data sets represents a major contemporary challenge. Using words as features, text documents are often represented as high-dimensional and sparse vectors-a few thousand dimensions and a sparsity of 95 to 99% i s typical. In this paper, we study a certain spherical k-means algorithm fo r clustering such document vectors. The algorithm outputs k disjoint cluste rs each with a concept vector that is the centroid of the cluster normalize d to have unit Euclidean norm. As our first contribution, we empirically de monstrate that, owing to the high-dimensionality and sparsity of the text d ata, the clusters produced by the algorithm have a certain "fractal-like" a nd "self-similar" behavior. As our second contribution, we introduce concep t decompositions to approximate the matrix of document vectors; these decom positions are obtained by taking the least-squares approximation onto the l inear subspace spanned by all the concept vectors. We empirically establish that the approximation errors of the concept decompositions are close to t he best possible, namely, to truncated singular value decompositions. As ou r third contribution, we show that the concept vectors are localized in the word space, are sparse, and tend towards orthonormality. In contrast, the singular vectors are global in the word space and are dense. Nonetheless, w e observe the surprising fact that the linear subspaces spanned by the conc ept vectors and the leading singular vectors are quite close in the sense o f small principal angles between them. In conclusion, the concept vectors p roduced by the spherical k-means algorithm constitute a powerful sparse and localized "basis" for text data sets.