Background: While several algorithms for the comparison of univariate distributions arising from flow cytometric analyses have been developed and studied for many years, algorithms for comparing multivariate distributions remain elusive. Such algorithms could be useful for comparing differences between samples based on several independent measurements, rather than differences based on any single measurement. It is conceivable that distributions could be completely distinct in multivariate space, but unresolvable in any combination of univariate histograms. Multivariate comparisons could also be useful for providing feedback about instrument stability, when only subtle changes in measurements are occurring.
Methods: We apply a variant of Probability Binning, described in the accompanying article, to multidimensional data. In this approach, hyper-rectangles of n dimensions (where n is the number of measurements being compared) comprise the bins used for the chi-squared statistic. These hyper-dimensional bins are constructed such that the control sample has the same number of events in each bin; the bins are then applied to the test samples for chi-squared calculations.
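The binning step described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the hyper-rectangles are built by recursively splitting the control events at the median of the highest-variance measurement, and it assumes a normalized chi-squared computed on bin fractions. The names `build_bins`, `bin_index`, `bin_counts`, and `chi_squared` are hypothetical, as are the simulated three-parameter Gaussian samples.

```python
import random

def variance(values):
    """Population variance of a list of numbers."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)

def build_bins(control, depth):
    """Split control events into 2**depth hyper-rectangles of equal count.

    Assumption: each split is at the median of the measurement with the
    largest variance. A leaf is None; an internal node is the tuple
    (dimension, cut value, low subtree, high subtree).
    """
    if depth == 0:
        return None
    n_dims = len(control[0])
    dim = max(range(n_dims), key=lambda d: variance([e[d] for e in control]))
    ordered = sorted(control, key=lambda e: e[dim])
    half = len(ordered) // 2
    cut = ordered[half][dim]  # events below the cut go to the low side
    return (dim, cut,
            build_bins(ordered[:half], depth - 1),
            build_bins(ordered[half:], depth - 1))

def bin_index(tree, event):
    """Walk the split tree and return the index of the bin holding event."""
    idx, node = 0, tree
    while node is not None:
        dim, cut, low, high = node
        if event[dim] < cut:
            idx, node = 2 * idx, low
        else:
            idx, node = 2 * idx + 1, high
    return idx

def bin_counts(tree, events, n_bins):
    """Count how many events fall into each bin."""
    counts = [0] * n_bins
    for event in events:
        counts[bin_index(tree, event)] += 1
    return counts

def chi_squared(control_counts, test_counts):
    """Normalized chi-squared over bin fractions (an assumed form)."""
    n_c, n_t = sum(control_counts), sum(test_counts)
    total = 0.0
    for c, t in zip(control_counts, test_counts):
        p_c, p_t = c / n_c, t / n_t
        if p_c + p_t > 0:
            total += (p_c - p_t) ** 2 / (p_c + p_t)
    return total

def sample(n, shift=0.0):
    """Simulated events: three independent Gaussian measurements."""
    return [tuple(random.gauss(shift, 1.0) for _ in range(3)) for _ in range(n)]

random.seed(0)
DEPTH = 4                      # 2**4 = 16 bins
N_BINS = 2 ** DEPTH
control = sample(1024)
tree = build_bins(control, DEPTH)

control_counts = bin_counts(tree, control, N_BINS)
same_counts = bin_counts(tree, sample(1024), N_BINS)
shifted_counts = bin_counts(tree, sample(1024, shift=1.0), N_BINS)

chi_same = chi_squared(control_counts, same_counts)
chi_shifted = chi_squared(control_counts, shifted_counts)
print(control_counts)
print(chi_same, chi_shifted)
```

With continuous data (no tied values) and a control size divisible by the number of bins, every bin holds the same number of control events, which is the property the method relies on. A test sample drawn from a shifted distribution populates the bins unevenly and yields a larger chi-squared value.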
Results: Using a Monte Carlo simulation, we determined the distribution of chi-squared values obtained by comparing sets of events from the same distribution; this distribution of chi-squared values was identical to that obtained for the univariate algorithm. Hence, the same formulae can be used to construct a metric, analogous to a t-score, that estimates the probability with which distributions are distinct. As for univariate comparisons, this metric scales with the difference between two distributions, and can be used to rank samples according to similarity to a control. We apply the algorithm to multivariate immunophenotyping data, and demonstrate that it can be used to discriminate distinct samples and to rank samples according to a biologically meaningful difference.
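The Monte Carlo step and the t-score-like metric can be sketched as follows. This is an illustration under stated assumptions, not the published formulae: the null mean and standard deviation are estimated empirically from simulated same-distribution comparisons rather than taken from the closed-form expressions in the accompanying article. The binning rule (median split on the highest-variance measurement), the normalized chi-squared, and the names `make_binner`, `chi_squared`, and `score` are all hypothetical.

```python
import random
from statistics import mean, pstdev

def variance(values):
    """Population variance of a list of numbers."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)

def make_binner(control, depth):
    """Return a function mapping an event to a bin label such as '0110'.

    Assumption: bins come from recursive median splits of the control
    events along the measurement with the largest variance.
    """
    if depth == 0:
        return lambda event: ""
    dims = range(len(control[0]))
    d = max(dims, key=lambda k: variance([e[k] for e in control]))
    ordered = sorted(control, key=lambda e: e[d])
    half = len(ordered) // 2
    cut = ordered[half][d]
    low = make_binner(ordered[:half], depth - 1)
    high = make_binner(ordered[half:], depth - 1)
    return lambda event: ("0" + low(event)) if event[d] < cut else ("1" + high(event))

def chi_squared(control, test, depth):
    """Normalized chi-squared between bin fractions (an assumed form)."""
    binner = make_binner(control, depth)
    c_counts, t_counts = {}, {}
    for e in control:
        label = binner(e)
        c_counts[label] = c_counts.get(label, 0) + 1
    for e in test:
        label = binner(e)
        t_counts[label] = t_counts.get(label, 0) + 1
    total = 0.0
    for label in sorted(set(c_counts) | set(t_counts)):
        p_c = c_counts.get(label, 0) / len(control)
        p_t = t_counts.get(label, 0) / len(test)
        total += (p_c - p_t) ** 2 / (p_c + p_t)
    return total

def sample(n, shift=0.0):
    """Simulated events: three independent Gaussian measurements."""
    return [tuple(random.gauss(shift, 1.0) for _ in range(3)) for _ in range(n)]

random.seed(1)
DEPTH = 4   # 16 bins
N = 512     # events per sample

# Monte Carlo: chi-squared values when both samples share one distribution.
null_values = [chi_squared(sample(N), sample(N), DEPTH) for _ in range(200)]
null_mean, null_sd = mean(null_values), pstdev(null_values)

def score(control, test):
    """Standard deviations above the simulated null mean, floored at zero."""
    return max(0.0, (chi_squared(control, test, DEPTH) - null_mean) / null_sd)

control = sample(N)
shifts = (0.0, 0.25, 0.5, 1.0)
scores = [score(control, sample(N, shift)) for shift in shifts]
for shift, s in zip(shifts, scores):
    print(shift, round(s, 2))
```

Sorting the test samples by `score` gives a ranking by dissimilarity to the control, which is how the abstract describes the metric being used. The empirical null is specific to the chosen sample size and bin count and would need to be re-simulated if either changes.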
Conclusion: Probability binning, as shown here, provides a useful metric for determining the probability with which two or more multivariate distributions represent distinct sets of data. The metric can be used to identify the similarity or dissimilarity of samples. Finally, as demonstrated in the accompanying paper, the algorithm can be used to gate on events in one sample that are different from a control sample, even if those events cannot be distinguished on the basis of any combination of univariate or bivariate displays. Cytometry 45:47-55, 2001. Published 2001 Wiley-Liss, Inc.