Java has become a language of choice for applications that execute in heterogeneous environments and make use of distributed objects and multithreading. Handling large data sets requires scalable and efficient implementations of data mining approaches, which generally employ computationally intensive algorithms. Conventional Java implementations do not directly support the data structures often encountered in such algorithms, and they also lack repeatability in numerical precision across platforms.
This paper describes a distributed framework that employs task and data parallelism and is implemented in high performance Java (HPJava). Issues of interest for data mining algorithms are identified, and possible solutions are discussed for overcoming limitations in the Java Virtual Machine. The framework supports parallelism across workstation clusters, using the Message Passing Interface (MPI) as middleware, and can support different analysis algorithms, wrapped as Java objects and linked to various databases through the Java Database Connectivity (JDBC) interface. Guidelines are provided for implementing parallel and distributed data mining on large data sets, and a proof-of-concept data mining application is analysed using a neural network.
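
As a rough illustration of the data-parallel pattern summarised above, the following minimal Java sketch wraps an analysis algorithm as a Java object, applies it to disjoint partitions of a data set, and combines the partial results in a fixed order. It is not the framework's actual code: the names `DataParallelSketch`, `AnalysisTask`, `SumOfSquares` and `runPartitioned` are hypothetical, and a local thread pool stands in for the MPI processes that would run on separate workstations.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class DataParallelSketch {

    /** Hypothetical wrapper: an analysis algorithm exposed as a Java object. */
    interface AnalysisTask {
        /** Analyses the slice data[from..to) and returns a partial result. */
        double analyse(double[] data, int from, int to);
    }

    /** Illustrative task only: sum of squares over one partition. */
    static final class SumOfSquares implements AnalysisTask {
        @Override
        public double analyse(double[] data, int from, int to) {
            double sum = 0.0;
            for (int i = from; i < to; i++) {
                sum += data[i] * data[i];
            }
            return sum;
        }
    }

    /**
     * Splits the data into contiguous partitions, runs the task on each
     * partition in parallel, and adds the partial results in partition order.
     * Assumption: threads stand in for MPI ranks on a workstation cluster.
     */
    static double runPartitioned(AnalysisTask task, double[] data, int workers)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            List<Future<Double>> partials = new ArrayList<>();
            int chunk = (data.length + workers - 1) / workers; // ceiling division
            for (int w = 0; w < workers; w++) {
                final int from = Math.min(data.length, w * chunk);
                final int to = Math.min(data.length, from + chunk);
                Callable<Double> job = () -> task.analyse(data, from, to);
                partials.add(pool.submit(job));
            }
            double total = 0.0;
            for (Future<Double> partial : partials) {
                total += partial.get(); // combined in a fixed, repeatable order
            }
            return total;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        double[] data = {1.0, 2.0, 3.0, 4.0, 5.0};
        System.out.println(runPartitioned(new SumOfSquares(), data, 2));
    }
}
```

Combining the partial results in partition order, rather than in completion order, keeps the floating-point summation order the same from run to run for a given number of workers. In a cluster setting the partitions would presumably be read from a database and exchanged as messages rather than shared in memory.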