CLASSIFICATION OF BINARY VECTORS BY STOCHASTIC COMPLEXITY

Citation
M. Gyllenberg et al., CLASSIFICATION OF BINARY VECTORS BY STOCHASTIC COMPLEXITY, Journal of Multivariate Analysis, 63(1), 1997, pp. 47-72
Citations number
54
Categorie Soggetti
Statistic & Probability","Statistic & Probability
ISSN journal
0047259X
Volume
63
Issue
1
Year of publication
1997
Pages
47 - 72
Database
ISI
SICI code
0047-259X(1997)63:1<47:COBVBS>2.0.ZU;2-Y
Abstract
Stochastic complexity is treated as a tool of classification, i.e., of inferring the number of classes, the class descriptions, and the clas s memberships for a given data set of binary vectors. The stochastic c omplexity is evaluated with respect to the family of statistical model s defined by finite mixtures of multivariate Bernoulli distributions o btained by the principle of maximum entropy. It is shown that stochast ic complexity is asymptotically related to the classification maximum likelihood estimate. The Formulae for stochastic complexity have an in terpretation as minimum code lengths for certain universal source code s for storing the binary data Vectors and their assignments into the c lasses in a classification. There is also a decomposition of the class ification uncertainty in a sum of an intraclass uncertainty, an interc lass uncertainty, and a special parsimony term. It is shown that minim izing the stochastic complexity amounts to maximizing the information content of the classification. An algorithm of alternating minimizatio n of stochastic complexity is given. We discuss the relation of the me thod to the AUTOCLASS system of Bayesian classification. The applicati on of classification by stochastic complexity to an extensive data bas e of strains of Enterobacteriaceae is described. (C) 1997 Academic Pre ss.