Hierarchical model-based clustering for large datasets

Authors
Citation
C. Posse, Hierarchical model-based clustering for large datasets, J COMPU G S, 10(3), 2001, pp. 464-486
Number of citations
34
Subject categories
Mathematics
Journal title
JOURNAL OF COMPUTATIONAL AND GRAPHICAL STATISTICS
Journal ISSN
1061-8600
Volume
10
Issue
3
Year of publication
2001
Pages
464 - 486
Database
ISI
SICI code
1061-8600(200109)10:3<464:HMCFLD>2.0.ZU;2-K
Abstract
In recent years, hierarchical model-based clustering has provided promising results in a variety of applications. However, its use with large datasets has been hindered by a time and memory complexity that are at least quadratic in the number of observations. To overcome this difficulty, this article proposes to start the hierarchical agglomeration from an efficient classification of the data in many classes rather than from the usual set of singleton clusters. This initial partition is derived from a subgraph of the minimum spanning tree associated with the data. To this end, we develop graphical tools that assess the presence of clusters in the data and uncover observations difficult to classify. We use this approach to analyze two large, real datasets: a multiband MRI image of the human brain and data on global precipitation climatology. We use the real datasets to discuss ways of integrating the spatial information in the clustering analysis. We focus on two-stage methods, in which a second stage of processing using established methods is applied to the output from the algorithm presented in this article, viewed as a first stage.
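The idea of deriving an initial partition from a subgraph of the minimum spanning tree can be illustrated with a small sketch. The abstract does not say how the subgraph is chosen, so the rule used here (drop the longest tree edges and take the connected components as classes) is an assumption made for illustration only, not the article's procedure. The function names `mst_edges` and `initial_partition` and the toy data are hypothetical, and the subsequent model-based agglomeration is not shown.

```python
import math

def mst_edges(points):
    """Prim's algorithm on the complete Euclidean graph; returns (weight, i, j) edges."""
    n = len(points)
    in_tree = [False] * n
    best = [math.inf] * n   # cheapest known distance from the tree to each point
    parent = [-1] * n       # tree vertex that achieves that distance
    best[0] = 0.0
    edges = []
    for _ in range(n):
        # Pick the point outside the tree that is closest to it.
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((best[u], parent[u], u))
        # Update the cheapest connection for the remaining points.
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best[v]:
                    best[v] = d
                    parent[v] = u
    return edges

def initial_partition(points, n_classes):
    """ASSUMED rule: drop the n_classes - 1 longest MST edges, label the components."""
    edges = sorted(mst_edges(points))
    kept = edges[: len(points) - n_classes]
    root = list(range(len(points)))

    def find(i):
        # Union-find lookup with path halving.
        while root[i] != i:
            root[i] = root[root[i]]
            i = root[i]
        return i

    for _, i, j in kept:
        root[find(i)] = find(j)
    ids = {}
    # Number the components in order of first appearance.
    return [ids.setdefault(find(i), len(ids)) for i in range(len(points))]

# Toy data: two tight groups and one isolated point.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (9.0, 0.0)]
labels = initial_partition(points, n_classes=3)
print(labels)
```

In a two-phase scheme of the kind the abstract describes, the classes returned here would replace singleton clusters as the starting point of the hierarchical agglomeration, so that the costly merging step operates on far fewer units than there are observations. Note that this sketch builds the tree from all pairwise distances and is itself quadratic in the number of points.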