Unlabeled document collections are becoming increasingly common and available; mining such data sets represents a major contemporary challenge. Using words as features, text documents are often represented as high-dimensional and sparse vectors: a few thousand dimensions and a sparsity of 95 to 99% is typical. In this paper, we study a certain spherical k-means algorithm for clustering such document vectors. The algorithm outputs k disjoint clusters, each with a concept vector that is the centroid of the cluster normalized to have unit Euclidean norm. As our first contribution, we empirically demonstrate that, owing to the high dimensionality and sparsity of the text data, the clusters produced by the algorithm have a certain "fractal-like" and "self-similar" behavior. As our second contribution, we introduce concept decompositions to approximate the matrix of document vectors; these decompositions are obtained by taking the least-squares approximation onto the linear subspace spanned by all the concept vectors. We empirically establish that the approximation errors of the concept decompositions are close to the best possible, namely, to those of truncated singular value decompositions. As our third contribution, we show that the concept vectors are localized in the word space, are sparse, and tend towards orthonormality. In contrast, the singular vectors are global in the word space and are dense. Nonetheless, we observe the surprising fact that the linear subspaces spanned by the concept vectors and the leading singular vectors are quite close in the sense of small principal angles between them. In conclusion, the concept vectors produced by the spherical k-means algorithm constitute a powerful sparse and localized "basis" for text data sets.