This paper examines the performance and memory-access behavior of the C4.5
decision-tree induction program, a representative example of data mining ap
plications, for both uniprocessor and parallel implementations. The goals o
f this paper are to characterize C4.5, in particular its memory hierarchy u
sage, and to decrease the run-time of C4.5 via algorithmic improvement and
parallelization. Performance is studied via RSIM, an execution driven simul
ator, for three uniprocessor models that exploit instruction level parallel
ism to varying degrees. This paper makes the following four contributions.
The first contribution is presenting a complete characterization of the C4.
5 decision-tree induction program. The results show that with the exception
of the input data set, the working set fits into an 8-Kbyte data cache; th
e instruction working set also fits into an 8-Kbyte instruction cache. For
data sets larger than the L2 cache, performance is limited by accesses to m
ain memory. The results further establish that four-way issue can provide u
p to a factor of two performance improvement over single-issue for larger L
2 caches; for smaller L2 caches, out-of-order dispatch provides a large per
formance improvement over in-order dispatch. The second contribution made b
y this paper is examining the effect on the memory hierarchy of changing th
e layout of the input dataset in memory, showing again that the performance
is limited by memory accesses. One proposed data layout decreases the dyna
mic instruction count by up to 24%, but usually results in lower performanc
e due to worse cache behavior. Another proposed data layout does not improv
e the dynamic instruction count over the original layout, but has better ca
che behavior and decreases the run-time by up to a factor of two. Third, th
is paper presents the first decision-tree induction program parallelized fo
r a ccNUMA architecture. A method for splitting the decision tree hash tabl
e is discussed that allows the hash table to be updated and accessed simult
aneously without the use of locks. The performance of the parallel version
is compared to the original version of C4.5 and a uniprocessor version of C
4.5 using the best data layout found. Speedup curves from a six-processor S
un E4000 SMP system show a speedup on the induction step of 3.99, and simul
ation results show that the performance is mostly unaffected by increasing
the remote memory access time until it is over a factor of ten greater than
the local memory access time. Last, this paper characterizes the paralleli
zed decision-tree induction program. Compared to the uniprocessor version,
the parallel version exerts significantly less pressure on the memory hiera
rchy, with the exception of having a much larger first level data working s
et. (C) 2001 Academic Press.