COMPARING INFORMATION-THEORETIC ATTRIBUTE SELECTION MEASURES - A STATISTICAL APPROACH

Authors

DEMANTARAS RL; CERQUIDES J; GARCIA P

Citation

Rl. Demantaras et al., COMPARING INFORMATION-THEORETIC ATTRIBUTE SELECTION MEASURES - A STATISTICAL APPROACH, AI communications, 11(2), 1998, pp. 91-100

Citations number

Subject Categories

Computer Science, Artificial Intelligence

Journal title

AI communications

ISSN journal

09217126

Volume

11

Issue

2

Year of publication

1998

Pages

91 - 100

Database

ISI

SICI code

0921-7126(1998)11:2<91:CIASM->2.0.ZU;2-C

Abstract

In [7], a new information-theoretic attribute selection method for decision tree induction was introduced. This method consists in computing, for each node, a distance between the partition generated by the values of each candidate attribute in the node and the correct partition of the subset of training examples in this node. The chosen attribute is the one whose corresponding partition is the closest to the correct partition (i.e., the partition that perfectly classifies the training data). In that paper it was also formally proved that such a distance is not biased towards attributes with a large number of values in the sense specified by Quinlan in [12], and some initial experimental evidence suggested that the predictive accuracy of the induced trees was not significantly different from that obtained with the most widely used information-theoretic attribute selection measures, that is, Quinlan's Gain and Quinlan's Gain Ratio. However, it seemed that the distance induced smaller trees, especially when the attributes had different numbers of values. In that paper it was not confirmed that the differences were statistically significant, due to the small number of experiments performed. In this paper we report experimental results that allow us to confirm that the distance induces trees whose size, without losing accuracy, is not significantly different from those obtained using Quinlan's Gain but smaller than those obtained with Quinlan's Gain Ratio. These experimental results are supported by a statistical analysis performed using two statistical hypothesis tests: the sign test and the signed rank test.
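
The selection step described in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the formula for the distance, so the sketch assumes a normalized form, 1 - I(A;C) / H(A,C), where A is the partition induced by an attribute and C is the class partition. The function names (`entropy`, `normalized_distance`, `select_attribute`) and the row format (one dict per training example) are hypothetical, chosen only for illustration.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of the partition induced by `labels`."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def normalized_distance(attr_values, classes):
    """Assumed normalized distance 1 - I(A;C) / H(A,C); not taken from the abstract."""
    h_attr = entropy(attr_values)
    h_class = entropy(classes)
    h_joint = entropy(list(zip(attr_values, classes)))
    if h_joint == 0:
        return 0.0  # both partitions have a single block, so they coincide
    mutual_info = h_attr + h_class - h_joint
    return 1.0 - mutual_info / h_joint


def select_attribute(rows, classes):
    """Return the attribute whose partition is closest to the class partition."""
    names = rows[0].keys()
    return min(
        names,
        key=lambda name: normalized_distance([r[name] for r in rows], classes),
    )
```

On a toy node where one attribute reproduces the class partition exactly, that attribute gets distance 0 and is selected, while an identifier-like attribute with one distinct value per example is penalized under this assumed form.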