Classification of large datasets is an important data mining problem. Many classification algorithms have been proposed in the literature, but studies have shown that so far no algorithm uniformly outperforms all other algorithms in terms of quality. In this paper, we present a unifying framework called RainForest for classification tree construction that separates the scalability aspects of algorithms for constructing a tree from the central features that determine the quality of the tree. The generic algorithm is easy to instantiate with specific split selection methods from the literature (including C4.5, CART, CHAID, FACT, ID3 and extensions, SLIQ, SPRINT and QUEST).
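
As a rough sketch of this separation, the following Python fragment (with hypothetical names such as build_tree and SplitSelector; this is not the algorithm from the paper) parameterizes a generic top-down tree builder by a pluggable split selection method, so different quality criteria can be dropped into the same construction skeleton.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# Hypothetical sketch: a generic top-down tree builder that takes the split
# selection method as a parameter, so the same construction skeleton can host
# different quality criteria (gini, entropy / gain ratio, ...).

Record = Tuple[Tuple, str]                       # (attribute values, class label)
Split = Tuple[int, float]                        # (attribute index, threshold)
SplitSelector = Callable[[List[Record]], Optional[Split]]

@dataclass
class Node:
    label: Optional[str] = None                  # set for leaves
    split: Optional[Split] = None                # set for internal nodes
    left: Optional["Node"] = None
    right: Optional["Node"] = None

def majority_label(records: List[Record]) -> str:
    labels = [c for _, c in records]
    return max(set(labels), key=labels.count)

def build_tree(records: List[Record], select_split: SplitSelector,
               min_size: int = 2) -> Node:
    # Stop when the partition is pure or too small, or when the plugged-in
    # split selection method finds no useful split.
    if len({c for _, c in records}) == 1 or len(records) < min_size:
        return Node(label=majority_label(records))
    split = select_split(records)
    if split is None:
        return Node(label=majority_label(records))
    attr, threshold = split
    left = [r for r in records if r[0][attr] <= threshold]
    right = [r for r in records if r[0][attr] > threshold]
    if not left or not right:                    # degenerate split: make a leaf
        return Node(label=majority_label(records))
    return Node(split=split,
                left=build_tree(left, select_split, min_size),
                right=build_tree(right, select_split, min_size))
```

A split selection method drawn from any of the algorithms above would then be supplied as the select_split argument.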
In addition to its generality, in that it yields scalable versions of a wide range of classification algorithms, our approach also offers performance improvements of over a factor of three over the SPRINT algorithm, the fastest scalable classification algorithm proposed previously. In contrast to SPRINT, however, our generic algorithm requires a certain minimum amount of main memory, proportional to the number of distinct values in a column of the input relation. Given current main memory costs, this requirement is readily met in most if not all workloads.
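
A minimal sketch of why that memory bound is per-column rather than per-record: aggregating counts of (attribute value, class label) pairs for each column during a single scan yields tables whose size depends only on the number of distinct values in the column, not on the number of records scanned. The helper name column_class_counts below is hypothetical and simplified; it illustrates only the scaling, not the paper's data structures.

```python
from collections import defaultdict
from typing import Dict, Iterable, Sequence, Tuple

# Hypothetical sketch: per-column counts of (attribute value, class label)
# pairs. Each column's table has one entry per distinct value it contains,
# so its size is bounded by the column's number of distinct values.

def column_class_counts(
    records: Iterable[Tuple[Sequence, str]], num_attributes: int
) -> Dict[int, Dict[Tuple[object, str], int]]:
    counts: Dict[int, Dict[Tuple[object, str], int]] = {
        a: defaultdict(int) for a in range(num_attributes)
    }
    for values, label in records:               # one sequential scan of the data
        for a in range(num_attributes):
            counts[a][(values[a], label)] += 1
    return counts

# Example: three records, two attributes; the per-column tables stay small
# even if the record stream were much longer.
data = [((1, "red"), "yes"), ((2, "red"), "no"), ((1, "blue"), "yes")]
tables = column_class_counts(data, num_attributes=2)
# tables[1] == {("red", "yes"): 1, ("red", "no"): 1, ("blue", "yes"): 1}
```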