We propose a method (the 'gap statistic') for estimating the number of
clusters (groups) in a set of data. The technique uses the output of any
clustering algorithm (e.g. K-means or hierarchical), comparing the change in
within-cluster dispersion with that expected under an appropriate reference
null distribution. Some theory is developed for the proposal and a simulation
study shows that the gap statistic usually outperforms other methods that
have been proposed in the literature.
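
The comparison described above can be illustrated with a minimal sketch in Python. It is not a definitive implementation, and several choices in it are assumptions not stated in this abstract: a plain K-means with random restarts as the clustering algorithm, within-cluster dispersion measured as the pooled sum of squared distances to cluster means, a uniform distribution over the bounding box of the data as the reference null distribution, and the rule of picking the smallest k whose gap is within one standard error of the next gap. The names `kmeans`, `dispersion`, `log_wk`, `gap_statistic` and `demo_points` are hypothetical helpers introduced for illustration.

```python
import math
import random


def kmeans(points, k, rng, iters=100):
    """Plain Lloyd's algorithm; returns one cluster label per point."""
    centers = rng.sample(points, k)
    labels = None
    for _ in range(iters):
        new_labels = [min(range(k), key=lambda j: math.dist(p, centers[j]))
                      for p in points]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster becomes empty
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels


def dispersion(points, labels, k):
    """Pooled within-cluster sum of squared distances to cluster means."""
    total = 0.0
    for j in range(k):
        members = [p for p, lab in zip(points, labels) if lab == j]
        if members:
            mean = tuple(sum(c) / len(members) for c in zip(*members))
            total += sum(math.dist(p, mean) ** 2 for p in members)
    return total


def log_wk(points, k, rng, restarts=5):
    """Log dispersion of the best of several K-means runs (assumed choice)."""
    return math.log(min(dispersion(points, kmeans(points, k, rng), k)
                        for _ in range(restarts)))


def gap_statistic(points, k_max, n_ref=10, seed=0):
    """Return (estimated k, list of gaps for k = 1..k_max)."""
    rng = random.Random(seed)
    lows = [min(c) for c in zip(*points)]
    highs = [max(c) for c in zip(*points)]
    gaps, errs = [], []
    for k in range(1, k_max + 1):
        observed = log_wk(points, k, rng)
        ref = []
        for _ in range(n_ref):
            # Assumed reference null: uniform over the data's bounding box.
            sample = [tuple(rng.uniform(lo, hi) for lo, hi in zip(lows, highs))
                      for _ in points]
            ref.append(log_wk(sample, k, rng))
        mean_ref = sum(ref) / n_ref
        sd = math.sqrt(sum((r - mean_ref) ** 2 for r in ref) / n_ref)
        gaps.append(mean_ref - observed)
        errs.append(sd * math.sqrt(1 + 1 / n_ref))
    # Assumed rule: smallest k with gap(k) >= gap(k + 1) - err(k + 1).
    for k in range(1, k_max):
        if gaps[k - 1] >= gaps[k] - errs[k]:
            return k, gaps
    return k_max, gaps


def demo_points(seed=1, per_cluster=20, spread=0.3):
    """Synthetic 2-D data: three well-separated Gaussian blobs."""
    rng = random.Random(seed)
    centers = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
    return [(rng.gauss(cx, spread), rng.gauss(cy, spread))
            for cx, cy in centers for _ in range(per_cluster)]
```

Calling `gap_statistic(demo_points(), k_max=5)` returns the estimated number of clusters together with the gap values for k = 1 to 5. On well-separated synthetic data like this, the gap would be expected to rise sharply at the true number of groups. Any other clustering routine could be swapped in for `kmeans`, since the sketch only needs a label for each point.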