Estimating the number of clusters in a data set via the gap statistic

Authors

Tibshirani, R; Walther, G; Hastie, T

Citation

R. Tibshirani et al., Estimating the number of clusters in a data set via the gap statistic, J ROY STA B, 63, 2001, pp. 411-423

Citations number

Subject categories

Mathematics

Journal title

JOURNAL OF THE ROYAL STATISTICAL SOCIETY SERIES B-STATISTICAL METHODOLOGY

ISSN journal

1369-7412

Volume

63

Year of publication

2001

Part

Pages

411 - 423

Database

ISI

SICI code

1369-7412(2001)63:<411:ETNOCI>2.0.ZU;2-A

Abstract

We propose a method (the 'gap statistic') for estimating the number of clusters (groups) in a set of data. The technique uses the output of any clustering algorithm (e.g. K-means or hierarchical), comparing the change in within-cluster dispersion with that expected under an appropriate reference null distribution. Some theory is developed for the proposal and a simulation study shows that the gap statistic usually outperforms other methods that have been proposed in the literature.
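
The comparison described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code. It assumes a plain Lloyd-style K-means, a reference null distribution that is uniform over the bounding box of the data, and the commonly cited selection rule (the smallest k whose gap is at least the next gap minus its simulation error). The function names `kmeans_wss`, `gap_statistic` and `choose_k`, and the parameters `n_ref`, `n_init` and `n_iter`, are hypothetical and introduced only for illustration.

```python
import math
import random


def kmeans_wss(points, k, rng, n_init=5, n_iter=50):
    """Smallest within-cluster sum of squares over n_init Lloyd runs (assumed clusterer)."""
    best = float("inf")
    for _ in range(n_init):
        centers = rng.sample(points, k)  # random distinct data points as starting centers
        for _ in range(n_iter):
            clusters = [[] for _ in range(k)]
            for p in points:
                j = min(range(k), key=lambda c: math.dist(p, centers[c]))
                clusters[j].append(p)
            # An empty cluster keeps its previous center.
            new_centers = [
                tuple(sum(x) / len(c) for x in zip(*c)) if c else centers[i]
                for i, c in enumerate(clusters)
            ]
            if new_centers == centers:
                break
            centers = new_centers
        wss = sum(min(math.dist(p, c) ** 2 for c in centers) for p in points)
        best = min(best, wss)
    return best


def gap_statistic(points, k_max, n_ref=10, seed=0):
    """Return (gaps, errs) for k = 1..k_max; index 0 corresponds to k = 1."""
    rng = random.Random(seed)
    lows = [min(x) for x in zip(*points)]
    highs = [max(x) for x in zip(*points)]
    gaps, errs = [], []
    for k in range(1, k_max + 1):
        log_w = math.log(kmeans_wss(points, k, rng))
        ref_logs = []
        for _ in range(n_ref):
            # Assumed null model: uniform draws over the data's bounding box.
            ref = [
                tuple(rng.uniform(lo, hi) for lo, hi in zip(lows, highs))
                for _ in points
            ]
            ref_logs.append(math.log(kmeans_wss(ref, k, rng)))
        mean_ref = sum(ref_logs) / n_ref
        sd = math.sqrt(sum((v - mean_ref) ** 2 for v in ref_logs) / n_ref)
        gaps.append(mean_ref - log_w)  # expected log-dispersion minus observed
        errs.append(sd * math.sqrt(1 + 1 / n_ref))  # simulation error term
    return gaps, errs


def choose_k(gaps, errs):
    """Smallest k with gap(k) >= gap(k+1) - err(k+1); falls back to the largest k tried."""
    for i in range(len(gaps) - 1):
        if gaps[i] >= gaps[i + 1] - errs[i + 1]:
            return i + 1
    return len(gaps)
```

Calling `gap_statistic(points, k_max=5)` on a list of coordinate tuples returns the two lists that `choose_k` consumes. The sketch takes the logarithm of the dispersion, so it assumes more distinct points than clusters.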