COMPARING INFORMATION-THEORETIC ATTRIBUTE SELECTION MEASURES - A STATISTICAL APPROACH

Citation
R. López de Mántaras et al., COMPARING INFORMATION-THEORETIC ATTRIBUTE SELECTION MEASURES - A STATISTICAL APPROACH, AI Communications, 11(2), 1998, pp. 91-100
Number of citations
14
Subject categories
Computer Science, Artificial Intelligence
Journal title
AI Communications
Journal ISSN
0921-7126
Volume
11
Issue
2
Year of publication
1998
Pages
91 - 100
Database
ISI
SICI code
0921-7126(1998)11:2<91:CIASM->2.0.ZU;2-C
Abstract
In [7], a new information-theoretic attribute selection method for decision tree induction was introduced. This method consists in computing, for each node, a distance between the partition generated by the values of each candidate attribute in the node and the correct partition of the subset of training examples in this node. The chosen attribute is the one whose corresponding partition is closest to the correct partition (i.e., the partition that perfectly classifies the training data). In that paper it was also formally proved that this distance is not biased towards attributes with a large number of values, in the sense specified by Quinlan in [12], and some initial experimental evidence suggested that the predictive accuracy of the induced trees was not significantly different from that obtained with the most widely used information-theoretic attribute selection measures, namely Quinlan's Gain and Quinlan's Gain Ratio. However, it seemed that the distance induced smaller trees, especially when the attributes had different numbers of values. That paper could not confirm that the differences were statistically significant, owing to the small number of experiments performed. In this paper we report experimental results that allow us to confirm that the distance induces trees whose size, without losing accuracy, is not significantly different from that of trees obtained using Quinlan's Gain but smaller than that of trees obtained with Quinlan's Gain Ratio. These experimental results are supported by a statistical analysis performed using two statistical hypothesis tests: the sign test and the signed rank test.
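The selection rule and the sign test described in the abstract can be illustrated with a minimal Python sketch. The abstract does not state the distance formula, so the sketch assumes the normalized form d_N = (H(A|C) + H(C|A)) / H(A, C), where A is the partition induced by an attribute and C is the class partition. All function names, the toy examples, and the tree sizes are hypothetical and are not taken from the paper.

```python
from collections import Counter
from math import comb, log2


def entropy(labels):
    """Shannon entropy (in bits) of the partition induced by `labels`."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def normalized_distance(attr_values, classes):
    """Assumed distance: (H(A|C) + H(C|A)) / H(A, C), a value in [0, 1]."""
    h_a = entropy(attr_values)
    h_c = entropy(classes)
    h_joint = entropy(list(zip(attr_values, classes)))
    if h_joint == 0.0:
        return 0.0  # both partitions are trivial, hence identical
    # H(A|C) + H(C|A) equals 2*H(A,C) - H(A) - H(C)
    return (2 * h_joint - h_a - h_c) / h_joint


def select_attribute(rows, classes):
    """Pick the attribute whose partition is closest to the class partition."""
    return min(
        rows[0].keys(),
        key=lambda a: normalized_distance([r[a] for r in rows], classes),
    )


def sign_test_p_value(sizes_a, sizes_b):
    """Two-sided exact sign test on paired values; ties are discarded."""
    wins_a = sum(a < b for a, b in zip(sizes_a, sizes_b))
    wins_b = sum(a > b for a, b in zip(sizes_a, sizes_b))
    n = wins_a + wins_b
    if n == 0:
        return 1.0
    k = min(wins_a, wins_b)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical node with four training examples: "id" has one value per
# example, while "outlook" reproduces the class partition exactly.
rows = [
    {"outlook": "sunny", "id": 1},
    {"outlook": "sunny", "id": 2},
    {"outlook": "rain", "id": 3},
    {"outlook": "rain", "id": 4},
]
classes = ["no", "no", "yes", "yes"]
print(select_attribute(rows, classes))  # outlook

# Made-up paired tree sizes for two measures over six runs.
print(sign_test_p_value([5, 7, 6, 9, 4, 8], [6, 9, 6, 12, 5, 7]))  # 0.375
```

In the toy node, the many-valued "id" attribute gets a distance of 0.5 under the assumed formula while "outlook" gets 0, so the many-valued attribute is not preferred. The signed rank test also mentioned in the abstract is not sketched here.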