
Ties in proximity and clustering compounds

Authors

MacCuish, J; Nicolaou, C; MacCuish, NE

Citation

J. MacCuish et al., Ties in proximity and clustering compounds, J CHEM INF, 41(1), 2001, pp. 134-146

Citations number

Subject categories

Chemistry

Journal title

JOURNAL OF CHEMICAL INFORMATION AND COMPUTER SCIENCES

ISSN journal

0095-2338

Volume

41

Issue

1

Year of publication

2001

Pages

134 - 146

Database

ISI

SICI code

0095-2338(200101/02)41:1<134:TIPACC>2.0.ZU;2-T

Abstract

Hierarchical clustering algorithms such as Ward's or complete-link are commonly used in compound selection and diversity analysis. Many such applications utilize binary representations of chemical structures, such as MACCS keys or Daylight fingerprints, and dissimilarity measures, such as the Euclidean or the Soergel measure. However, hierarchical clustering algorithms can generate ambiguous results owing to what is known in the cluster analysis literature as the ties in proximity problem, i.e., compounds or clusters of compounds that are equidistant from a compound or cluster in a given collection. Ambiguous ties can occur when clustering only a few hundred compounds, and the larger the number of compounds to be clustered, the greater the chance for significant ambiguity. Namely, as the number of "ties in proximity" increases relative to the total number of proximities, the possibility of ambiguity also increases. To ensure that there are no ambiguous ties, we show by a probabilistic argument that the number of compounds needs to be less than 2n^(1/4), where n is the total number of proximities, and the measure used to generate the proximities creates a uniform distribution without statistically preferred values. The common measures do not produce uniformly distributed proximities, but rather statistically preferred values that tend to increase the number of ties in proximity. Hence, the number of possible proximities and the distribution of statistically preferred values of a similarity measure, given a bit vector representation of a specific length, are directly related to the number of ties in proximities for a given data set. We explore the ties in proximity problem, using a number of chemical collections with varying degrees of diversity, given several common similarity measures and clustering algorithms. Our results are consistent with our probabilistic argument and show that this problem is significant for relatively small compound sets.
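
The ties in proximity problem described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code: the function names are hypothetical, the fingerprints are random bit vectors rather than real chemical structures (166 bits is chosen only to mirror the MACCS key length), and the bound is read from the abstract as 2n^(1/4), which is an assumption about the intended notation. The sketch counts how many pairwise Soergel dissimilarities share their value with at least one other pair.

```python
import random
from collections import Counter
from fractions import Fraction
from itertools import combinations


def soergel(a, b):
    """Soergel dissimilarity for binary vectors stored as int bitmasks.

    For binary data this equals 1 - |a AND b| / |a OR b|.
    Fractions keep the values exact, so equal distances compare equal.
    """
    union = bin(a | b).count("1")
    if union == 0:
        return Fraction(0)  # two empty fingerprints are treated as identical
    return 1 - Fraction(bin(a & b).count("1"), union)


def count_tied_pairs(fingerprints):
    """Return (total pairwise proximities, proximities sharing a value with another pair)."""
    values = Counter(soergel(a, b) for a, b in combinations(fingerprints, 2))
    total = sum(values.values())
    tied = sum(count for count in values.values() if count > 1)
    return total, tied


def max_compounds_without_ties(n):
    """Bound as read from the abstract (assumed form): 2 * n**(1/4)."""
    return 2 * n ** 0.25


random.seed(0)  # fixed seed so the illustration is repeatable
fps = [random.getrandbits(166) for _ in range(200)]  # hypothetical random fingerprints
total, tied = count_tied_pairs(fps)
print(f"{tied} of {total} pairwise proximities are tied with at least one other pair")
```

Because a 166-bit fingerprint admits only a limited set of distinct Soergel values, a few hundred compounds already produce more pairs than possible values, so ties are unavoidable. Whether those ties make a clustering ambiguous depends on the algorithm and on where the ties fall, which this sketch does not model.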