Estimating the number of clusters in a data set via the gap statistic

Citation
R. Tibshirani et al., Estimating the number of clusters in a data set via the gap statistic, J ROY STA B, 63, 2001, pp. 411-423
Number of citations
20
Subject categories
Mathematics
Journal title
JOURNAL OF THE ROYAL STATISTICAL SOCIETY SERIES B-STATISTICAL METHODOLOGY
Journal ISSN
1369-7412
Volume
63
Year of publication
2001
Part
2
Pages
411 - 423
Database
ISI
SICI code
1369-7412(2001)63:<411:ETNOCI>2.0.ZU;2-A
Abstract
We propose a method (the 'gap statistic') for estimating the number of clusters (groups) in a set of data. The technique uses the output of any clustering algorithm (e.g. K-means or hierarchical), comparing the change in within-cluster dispersion with that expected under an appropriate reference null distribution. Some theory is developed for the proposal and a simulation study shows that the gap statistic usually outperforms other methods that have been proposed in the literature.
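The comparison described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes K-means as the clustering algorithm, a uniform reference distribution over the bounding box of the data, and the selection rule from the full paper (the smallest k with Gap(k) >= Gap(k+1) - s_{k+1}). The function names, the farthest-first seeding, and the toy data are hypothetical choices made for illustration only.

```python
import math
import random

def kmeans_labels(points, k, rng, max_iter=100):
    # Farthest-first seeding (a simple choice for this sketch), then Lloyd's iterations.
    centers = [rng.choice(points)]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    labels = None
    for _ in range(max_iter):
        new = [min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points]
        if new == labels:  # assignments stopped changing
            break
        labels = new
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:  # keep the old center if a cluster empties
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels

def log_within_dispersion(points, k, rng):
    # log W_k: pooled within-cluster sum of squared distances to the cluster means.
    labels = kmeans_labels(points, k, rng)
    total = 0.0
    for j in range(k):
        members = [p for p, l in zip(points, labels) if l == j]
        if members:
            mean = [sum(c) / len(members) for c in zip(*members)]
            total += sum(math.dist(p, mean) ** 2 for p in members)
    return math.log(total)

def gap_statistic(points, k_max, n_ref=10, seed=0):
    rng = random.Random(seed)
    lo = [min(c) for c in zip(*points)]
    hi = [max(c) for c in zip(*points)]
    # Reference data sets: uniform draws over the bounding box (assumed null model).
    refs = [[tuple(rng.uniform(a, b) for a, b in zip(lo, hi)) for _ in points]
            for _ in range(n_ref)]
    gaps, errs = [], []
    for k in range(1, k_max + 1):
        ref_logs = [log_within_dispersion(r, k, rng) for r in refs]
        mean_ref = sum(ref_logs) / n_ref
        sd = math.sqrt(sum((x - mean_ref) ** 2 for x in ref_logs) / n_ref)
        gaps.append(mean_ref - log_within_dispersion(points, k, rng))
        errs.append(sd * math.sqrt(1 + 1 / n_ref))
    # Smallest k such that Gap(k) >= Gap(k+1) - s_{k+1}; fall back to k_max.
    for k in range(1, k_max):
        if gaps[k - 1] >= gaps[k] - errs[k]:
            return k, gaps
    return k_max, gaps

# Toy data: three well-separated Gaussian blobs in the plane.
data_rng = random.Random(1)
data = [(data_rng.gauss(cx, 0.5), data_rng.gauss(cy, 0.5))
        for cx, cy in [(0, 0), (10, 0), (5, 9)] for _ in range(30)]
best_k, gaps = gap_statistic(data, k_max=5)
print(best_k, [round(g, 2) for g in gaps])
```

On data like this, the gap curve should peak at the true number of groups, since the within-cluster dispersion of the data drops far more sharply at that point than it does for the structureless reference samples.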