This paper demonstrates experimentally that concluding which induction algorithm is more accurate on the basis of results from a single partition of the instances into cross-validation folds may lead to statistically erroneous conclusions. Comparing two decision-tree induction algorithms and one naive-Bayes induction algorithm, we find situations in which one algorithm is judged more accurate at the p = 0.05 level under one partition of the training instances, while the other algorithm is judged more accurate at the p = 0.05 level under an alternate partition. We recommend a new significance procedure that performs cross-validation using multiple instance-space partitions. Significance is determined by applying the paired Student t-test separately to the results from each cross-validation partition, averaging the resulting t values, and converting this average into a significance value.