A new method for mining regression classes in large data sets

Citation
Y. Leung et al., A new method for mining regression classes in large data sets, IEEE PATT A, 23(1), 2001, pp. 5-21
Number of citations
35
Subject categories
AI Robotics and Automatic Control
Journal title
IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE
ISSN journal
0162-8828
Volume
23
Issue
1
Year of publication
2001
Pages
5 - 21
Database
ISI
SICI code
0162-8828(200101)23:1<5:ANMFMR>2.0.ZU;2-S
Abstract
Extracting patterns and models of interest from large databases is attracting much attention in a variety of disciplines. Knowledge discovery in databases (KDD) and data mining (DM) are areas of common interest to researchers in machine learning, pattern recognition, statistics, artificial intelligence, and high performance computing. An effective and robust method, coined the regression-class mixture decomposition (RCMD) method, is proposed in this paper for the mining of regression classes in large data sets, especially those contaminated by noise. A new concept, called "regression class," which is defined as a subset of the data set that is subject to a regression model, is proposed as a basic building block on which the mining process is based. A large data set is treated as a mixture population in which there are many such regression classes and others not accounted for by the regression models. Iterative and genetic-based algorithms for the optimization of the objective function in the RCMD method are also constructed. It is demonstrated that the RCMD method can resist a very large proportion of noisy data, identify each regression class, assign an inlier set of data points supporting each identified regression class, and determine the a priori unknown number of statistically valid models in the data set. Although the models are extracted sequentially, the final result is almost independent of the extraction order due to a novel dynamic classification strategy employed in the handling of overlapping regression classes. The effectiveness and robustness of the RCMD method are substantiated by a set of simulation experiments and a real-life application showing the way it can be used to fit mixed data to linear regression classes and nonlinear structures in various situations.
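The idea of extracting regression classes one at a time, each with its own inlier set, and leaving unexplained points as noise can be illustrated with a small sketch. The code below is not the RCMD method: it does not use the paper's objective function, its iterative or genetic optimizers, or its dynamic classification strategy for overlapping classes. It swaps in a plain exhaustive consensus search (RANSAC-style, over all point pairs) for straight lines only. The function names and the `tol` and `min_support` parameters are assumptions made for this illustration.

```python
from itertools import combinations


def fit_line(p, q):
    """Return (slope, intercept) of the line through p and q, or None if vertical."""
    (x1, y1), (x2, y2) = p, q
    if x1 == x2:
        return None
    slope = (y2 - y1) / (x2 - x1)
    return slope, y1 - slope * x1


def inliers_of(model, points, tol):
    """Points whose vertical residual to the line is at most tol."""
    slope, intercept = model
    return [pt for pt in points if abs(pt[1] - (slope * pt[0] + intercept)) <= tol]


def extract_classes(points, tol=0.5, min_support=5):
    """Sequentially peel off lines supported by at least min_support points.

    Returns (classes, leftover), where classes is a list of
    ((slope, intercept), inlier_points) and leftover holds unassigned points.
    """
    remaining = list(points)
    classes = []
    while len(remaining) >= min_support:
        best_model, best_inliers = None, []
        # Hypothetical stand-in for the optimization step: try every point pair.
        for p, q in combinations(remaining, 2):
            model = fit_line(p, q)
            if model is None:
                continue
            support = inliers_of(model, remaining, tol)
            if len(support) > len(best_inliers):
                best_model, best_inliers = model, support
        if len(best_inliers) < min_support:
            break  # no remaining structure has enough support
        classes.append((best_model, best_inliers))
        remaining = [pt for pt in remaining if pt not in best_inliers]
    return classes, remaining


# Toy data: two lines plus three gross outliers (all values invented).
line_a = [(x, 2 * x + 1) for x in range(8)]
line_b = [(x, -x + 20) for x in range(8)]
outliers = [(1, 50), (3, -30), (6, 40)]

classes, leftover = extract_classes(line_a + line_b + outliers)
for model, support in classes:
    print(model, len(support))
print("unassigned:", leftover)
```

On this toy data the loop recovers both lines and leaves the three outliers unassigned, so the number of models is determined by the stopping rule rather than fixed in advance. Unlike the result claimed for RCMD, a greedy scheme like this one can depend on extraction order when classes overlap.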