We present a nonparametric approach to deriving clusters from categorical (
nominal scale) data using a new clustering procedure called K-modes, which
is analogous to the traditional K-Means procedure (MacQueen 1967) for clust
ering interval scale data. Unlike most existing methods for clustering nomi
nal scale data, the K-modes procedure explicitly optimizes a loss function
based on the Lo norm (defined as the limit of an L-p norm as p approaches z
ero).
In Monte Carlo simulations, both K-modes and latent class procedures (e.g.,
Goodman 1974) performed with equal efficiency in recovering a known underl
ying cluster structure. However, K-modes is an order of magnitude faster th
an the latent class procedure in speed and suffers from fewer problems of l
ocal optima than do latent class procedures. For data sets involving a larg
e number of categorical variables, latent class procedures become computati
onally extremely slow and hence infeasible.
We conjecture that, although in some cases latent class procedures might pe
rform better than K-modes, it could out-perform latent class procedures in
other cases. Hence, we recommend that these two approaches be used as "comp
lementary" procedures in performing cluster analysis. We also present an em
pirical comparison of K-modes and latent class, where the former method pre
vails.