The efficient mining of large, commercially credible databases requires a
solution to at least two problems: (a) better integration between existing
Knowledge Discovery algorithms and popular DBMS; (b) the ability to exploit
opportunities for computational speedup, such as data parallelism. Both
problems need to be addressed in a generic manner, since the stated
requirements of end-users cover a range of data mining paradigms, DBMS, and
(parallel) platforms. In this paper we present a family of generic,
set-based, primitive operations for Knowledge Discovery in Databases (KDD).
We show how a number of well-known KDD classification metrics, drawn from
paradigms such as Bayesian classifiers, Rule-Induction/Decision Tree
algorithms, Instance-Based Learning methods, and Genetic Programming, can
all be computed via our generic primitives. We then show how these
primitives may be mapped into SQL and, where appropriate, optimised for
good performance with respect to practical factors such as client-server
communication overheads. We demonstrate how our primitives can support
C4.5, a widely-used rule induction system. Performance evaluation figures
are presented for commercially available parallel platforms, such as the
IBM SP/2. © 1999 Elsevier Science B.V. All rights reserved.
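
As a rough illustration of the idea of mapping a set-based primitive into
SQL, the sketch below computes a table of (attribute value, class) counts
with a single GROUP BY query and derives information gain from it on the
client. Information gain is the entropy-based measure that C4.5's gain ratio
builds on. This is not the paper's own primitive set: the function names
`class_counts`, `entropy` and `information_gain`, the `weather` table, and
the use of SQLite are all assumptions made for the example.

```python
import math
import sqlite3


def class_counts(conn, table, attribute, class_column):
    """Hypothetical count primitive: tuples per (attribute value, class) pair.

    The aggregation runs inside the DBMS, so only the small count table
    crosses the client-server boundary, not the raw tuples.
    Identifiers are interpolated directly, so they must be trusted names.
    """
    query = (
        f'SELECT "{attribute}", "{class_column}", COUNT(*) '
        f'FROM "{table}" GROUP BY "{attribute}", "{class_column}"'
    )
    return {(value, label): n for value, label, n in conn.execute(query)}


def entropy(counts):
    """Shannon entropy (in bits) of a list of non-negative counts."""
    counts = list(counts)
    total = sum(counts)
    return -sum((n / total) * math.log2(n / total) for n in counts if n > 0)


def information_gain(counts):
    """Information gain of an attribute, computed from the count table alone."""
    by_class = {}
    by_value = {}
    for (value, label), n in counts.items():
        by_class[label] = by_class.get(label, 0) + n
        by_value.setdefault(value, []).append(n)
    total = sum(by_class.values())
    # Expected entropy of the class after splitting on the attribute.
    remainder = sum(sum(ns) / total * entropy(ns) for ns in by_value.values())
    return entropy(by_class.values()) - remainder


# Tiny made-up data set, used only to exercise the functions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE weather (outlook TEXT, windy TEXT, play TEXT)")
conn.executemany(
    "INSERT INTO weather VALUES (?, ?, ?)",
    [
        ("sunny", "t", "no"),
        ("sunny", "f", "no"),
        ("rain", "t", "yes"),
        ("rain", "f", "yes"),
    ],
)

outlook_counts = class_counts(conn, "weather", "outlook", "play")
gain_outlook = information_gain(outlook_counts)
gain_windy = information_gain(class_counts(conn, "weather", "windy", "play"))
print(gain_outlook, gain_windy)
```

In this toy data `outlook` separates the classes perfectly while `windy`
carries no information, so the first gain is the larger of the two. The
design point is that the metric needs only the aggregated counts, which is
what makes pushing the aggregation into the DBMS attractive.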