This paper presents an algorithm for clustering large sets of discrete-valued n-tuple data and describes how it is used for analyzing biomolecular data. The algorithm consists of a cluster initiation phase and a cluster regrouping phase. The initiation phase analyzes the nearest-neighbour distance configuration using the probability estimate of each sample in the data set. This phase considers only a subset of variables with "consigned" or "transferred" interdependency. That is, these variables reflect many of the interdependencies found across the ensemble. The regrouping phase involves: (1) the selection of relevant attribute values based on their statistical dependence on the initial clusters formed, and (2) the inference of the cluster label based on the weight of evidence that the selected attribute values of a sample provide for a certain cluster over the others. Because only a subset of selected attribute values is considered, the final clusters can be of any "shape" and need not be "globular". For the same reason, the algorithm is not affected by the presence of irrelevant attribute values. Experimental results on several control data sets as well as a biomolecular data set demonstrate the algorithm's efficacy for molecular sequence analysis and taxonomy analysis.
(C) 2001 Elsevier Science Inc. All rights reserved.
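
The regrouping step described in the abstract can be illustrated with a minimal sketch. It assumes one common definition of weight of evidence, the log-likelihood ratio log P(x_j = v | c) / P(x_j = v | not c), estimated with Laplace smoothing. The abstract does not give the paper's exact measure or its statistical selection test, so the function names, the `alpha` smoothing constant, and the `min_abs_weight` threshold (a crude stand-in for attribute-value selection) are all hypothetical.

```python
import math
from collections import Counter


def weight_of_evidence(samples, labels, alpha=1.0):
    """Return (w, clusters), where w[(j, v, c)] is the smoothed
    log-likelihood ratio log P(x_j = v | c) / P(x_j = v | not c).

    This is an assumed definition, not necessarily the paper's measure.
    """
    clusters = sorted(set(labels))
    n_attrs = len(samples[0])
    # Distinct values observed at each attribute position.
    values = [sorted({s[j] for s in samples}) for j in range(n_attrs)]
    size = Counter(labels)
    total = len(samples)

    # count[(j, v, c)]: samples in cluster c with value v at position j.
    count = Counter()
    for s, c in zip(samples, labels):
        for j, v in enumerate(s):
            count[(j, v, c)] += 1

    w = {}
    for j in range(n_attrs):
        k = len(values[j])
        for v in values[j]:
            n_v = sum(count[(j, v, c)] for c in clusters)
            for c in clusters:
                inside = count[(j, v, c)]
                p_in = (inside + alpha) / (size[c] + alpha * k)
                p_out = (n_v - inside + alpha) / (total - size[c] + alpha * k)
                w[(j, v, c)] = math.log(p_in / p_out)
    return w, clusters


def regroup(samples, labels, min_abs_weight=0.0):
    """Reassign each sample to the cluster with the largest total weight
    of evidence, summing only attribute values whose absolute weight
    reaches min_abs_weight (hypothetical selection rule)."""
    w, clusters = weight_of_evidence(samples, labels)
    new_labels = []
    for s in samples:
        scores = {
            c: sum(
                w[(j, v, c)]
                for j, v in enumerate(s)
                if abs(w[(j, v, c)]) >= min_abs_weight
            )
            for c in clusters
        }
        new_labels.append(max(clusters, key=lambda c: scores[c]))
    return new_labels


# Toy sequences: the third sample starts in the "wrong" initial cluster.
samples = [
    ("A", "C", "G"), ("A", "C", "T"), ("A", "C", "G"),
    ("T", "G", "G"), ("T", "G", "T"), ("T", "G", "T"),
]
initial_labels = [0, 0, 1, 1, 1, 1]
final_labels = regroup(samples, initial_labels)
```

In this toy example the third position is equally distributed across both clusters, so it contributes zero weight and the first two positions alone decide the label. In practice one might call `regroup` repeatedly until the labels stop changing, although the abstract does not say whether the original algorithm iterates.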