A highly popular method for examining the stability of a data clustering is
to split the data into two parts, cluster the observations in Part A, assign
the objects in Part B to their nearest centroid in Part A, and then
independently cluster the Part B objects. One then examines how close the two
partitions are (say, by the Rand measure). Another proposal is to split the
data into k parts, and see how their centroids cluster. By means of synthetic
data analyses, we demonstrate that these approaches fail to identify the
appropriate number of clusters, particularly as sample size becomes large
and the variables exhibit higher correlations.
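
The split-half procedure described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the implementation used in the analyses: it assumes k-means (plain Lloyd iterations with random initial centroids) as the clustering method, the unadjusted Rand index as the agreement measure, and points stored as tuples of floats. The names kmeans, rand_index, and split_half_stability are hypothetical.

```python
import random
from itertools import combinations


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def nearest(point, centroids):
    """Index of the centroid closest to `point`."""
    return min(range(len(centroids)), key=lambda j: sq_dist(point, centroids[j]))


def kmeans(points, k, rng, n_iter=50):
    """Plain Lloyd iterations (an assumed clustering method).

    Returns (centroids, labels). Initial centroids are k sampled data points.
    """
    centroids = rng.sample(points, k)
    for _ in range(n_iter):
        labels = [nearest(p, centroids) for p in points]
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster empties
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    labels = [nearest(p, centroids) for p in points]
    return centroids, labels


def rand_index(labels_a, labels_b):
    """Fraction of object pairs on which two partitions agree
    (both place the pair together, or both place it apart)."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)


def split_half_stability(points, k, seed=0):
    """Split-half check: compare two partitions of the Part B objects."""
    rng = random.Random(seed)
    shuffled = list(points)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    part_a, part_b = shuffled[:half], shuffled[half:]

    # Step 1: cluster the observations in Part A.
    centroids_a, _ = kmeans(part_a, k, rng)
    # Step 2: assign each Part B object to its nearest Part A centroid.
    assigned = [nearest(p, centroids_a) for p in part_b]
    # Step 3: cluster the Part B objects independently.
    _, clustered = kmeans(part_b, k, rng)
    # Step 4: measure how close the two partitions of Part B are.
    return rand_index(assigned, clustered)


if __name__ == "__main__":
    gen = random.Random(1)
    # Two synthetic, well-separated groups in two dimensions (made-up data).
    data = [(gen.gauss(0, 1), gen.gauss(0, 1)) for _ in range(30)]
    data += [(gen.gauss(10, 1), gen.gauss(10, 1)) for _ in range(30)]
    for k in (2, 3, 4):
        print(k, round(split_half_stability(data, k), 3))
```

In the usual application one would repeat the split over many random seeds and candidate values of k, and take high average agreement as evidence for that k. The sketch only makes the procedure concrete and takes no position on how well it recovers the true number of clusters.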