Unlabeled document collections are becoming increasingly common and available; mining such data sets represents a major contemporary challenge. Using words as features, text documents are often represented as high-dimensional and sparse vectors: a few thousand dimensions and a sparsity of 95 to 99% is typical. In this paper, we study a certain spherical k-means algorithm for clustering such document vectors. The algorithm outputs k disjoint clusters, each with a concept vector that is the centroid of the cluster normalized to have unit Euclidean norm. As our first contribution, we empirically demonstrate that, owing to the high dimensionality and sparsity of the text data, the clusters produced by the algorithm have a certain "fractal-like" and "self-similar" behavior. As our second contribution, we introduce concept decompositions to approximate the matrix of document vectors; these decompositions are obtained by taking the least-squares approximation onto the linear subspace spanned by all the concept vectors. We empirically establish that the approximation errors of the concept decompositions are close to the best possible, namely, to those of truncated singular value decompositions. As our third contribution, we show that the concept vectors are localized in the word space, are sparse, and tend towards orthonormality. In contrast, the singular vectors are global in the word space and are dense. Nonetheless, we observe the surprising fact that the linear subspaces spanned by the concept vectors and the leading singular vectors are quite close in the sense of small principal angles between them. In conclusion, the concept vectors produced by the spherical k-means algorithm constitute a powerful sparse and localized "basis" for text data sets.