How many clusters? Which clustering method? Answers via model-based cluster analysis

Citation
C. Fraley and A.E. Raftery, How many clusters? Which clustering method? Answers via model-based cluster analysis, COMPUTER J, 41(8), 1998, pp. 578-588
Number of citations
65
Subject Categories
Computer Science & Engineering
Journal title
COMPUTER JOURNAL
Journal ISSN
0010-4620
Volume
41
Issue
8
Year of publication
1998
Pages
578 - 588
Database
ISI
SICI code
0010-4620(1998)41:8<578:HMCWCM>2.0.ZU;2-P
Abstract
We consider the problem of determining the structure of clustered data, without prior knowledge of the number of clusters or any other information about their composition. Data are represented by a mixture model in which each component corresponds to a different cluster. Models with varying geometric properties are obtained through Gaussian components with different parametrizations and cross-cluster constraints. Noise and outliers can be modelled by adding a Poisson process component. Partitions are determined by the expectation-maximization (EM) algorithm for maximum likelihood, with initial values from agglomerative hierarchical clustering. Models are compared using an approximation to the Bayes factor based on the Bayesian information criterion (BIC); unlike significance tests, this allows comparison of more than two models at the same time, and removes the restriction that the models compared be nested. The problems of determining the number of clusters and the clustering method are solved simultaneously by choosing the best model. Moreover, the EM result provides a measure of uncertainty about the associated classification of each data point. Examples are given, showing that this approach can give performance that is much better than standard procedures, which often fail to identify groups that are either overlapping or of varying sizes and shapes.
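
Illustrative sketch (not taken from the paper): the workflow described in the abstract, reduced to a toy one-dimensional case in plain Python. Several simplifications are assumptions made here for brevity: all components share a single variance (only one of the possible parametrizations), the starting partition comes from contiguous chunks of the sorted data as a crude stand-in for agglomerative hierarchical clustering, there is no Poisson noise component, and BIC is taken as 2 * loglik - (number of parameters) * log(n), so that larger values are preferred. The names `e_step`, `fit_mixture` and `quantile_sample` are hypothetical.

```python
import math
from statistics import NormalDist


def e_step(xs, weights, means, var):
    """Return (responsibilities, log-likelihood) for the current parameters."""
    resp, loglik = [], 0.0
    for x in xs:
        logs = [math.log(max(w, 1e-300)) - 0.5 * math.log(2 * math.pi * var)
                - (x - m) ** 2 / (2 * var)
                for w, m in zip(weights, means)]
        top = max(logs)  # log-sum-exp for numerical stability
        log_total = top + math.log(sum(math.exp(v - top) for v in logs))
        loglik += log_total
        resp.append([math.exp(v - log_total) for v in logs])
    return resp, loglik


def fit_mixture(xs, k, max_iter=500, tol=1e-8):
    """EM for a 1-D Gaussian mixture whose k components share one variance."""
    n = len(xs)
    s = sorted(xs)
    # Assumed initialisation: k contiguous chunks of the sorted data.
    cuts = [round(j * n / k) for j in range(k + 1)]
    chunks = [s[cuts[j]:cuts[j + 1]] for j in range(k)]
    means = [sum(c) / len(c) for c in chunks]
    weights = [len(c) / n for c in chunks]
    var = sum((x - m) ** 2 for c, m in zip(chunks, means) for x in c) / n
    resp, loglik = e_step(xs, weights, means, var)
    for _ in range(max_iter):
        # M-step: re-estimate weights, means and the shared variance.
        nk = [max(sum(r[j] for r in resp), 1e-300) for j in range(k)]
        weights = [c / n for c in nk]
        means = [sum(r[j] * x for r, x in zip(resp, xs)) / nk[j]
                 for j in range(k)]
        var = sum(r[j] * (x - means[j]) ** 2
                  for r, x in zip(resp, xs) for j in range(k)) / n
        var = max(var, 1e-12)
        # E-step with the updated parameters.
        resp, new_loglik = e_step(xs, weights, means, var)
        converged = new_loglik - loglik < tol
        loglik = new_loglik
        if converged:
            break
    n_params = (k - 1) + k + 1  # mixing weights, means, shared variance
    bic = 2 * loglik - n_params * math.log(n)
    return {"k": k, "bic": bic, "means": means, "resp": resp}


def quantile_sample(center, size):
    """Deterministic, Gaussian-shaped sample: normal quantiles around center."""
    nd = NormalDist(mu=center, sigma=1.0)
    return [nd.inv_cdf((i + 0.5) / size) for i in range(size)]


# Two well-separated synthetic groups (made-up data, for illustration only).
data = quantile_sample(0.0, 50) + quantile_sample(10.0, 50)

# Fit candidate models and keep the one with the largest BIC.
fits = {k: fit_mixture(data, k) for k in (1, 2, 3)}
best = max(fits.values(), key=lambda f: f["bic"])

# Classification uncertainty of each point: 1 - its largest membership probability.
uncertainty = [1.0 - max(r) for r in best["resp"]]

print("selected k =", best["k"])
print("largest uncertainty =", max(uncertainty))
```

Choosing the number of components and choosing the covariance parametrization would both amount to extending the set of candidate models passed to the BIC comparison; only the first is shown in this sketch.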