Data analysis by positive decision trees

Citation
K. Makino et al., Data analysis by positive decision trees, IEICE T INF, E82D(1), 1999, pp. 76-88
Number of citations
17
Subject Categories
Information Technology & Communication Systems
Journal title
IEICE TRANSACTIONS ON INFORMATION AND SYSTEMS
ISSN journal
0916-8532
Volume
E82D
Issue
1
Year of publication
1999
Pages
76 - 88
Database
ISI
SICI code
0916-8532(199901)E82D:1<76:DABPDT>2.0.ZU;2-X
Abstract
Decision trees are used as a convenient means to explain given positive examples and negative examples, which is a form of data mining and knowledge discovery. Standard methods such as ID3 may produce non-monotonic decision trees, in the sense that data with larger values in all attributes are sometimes classified into a class with a smaller output value. (In the case of binary data, this is equivalent to saying that the discriminant Boolean function that the decision tree represents is not positive.) This study is motivated by the observation that real-world data are often positive, and in such cases it is natural to build decision trees which represent positive (i.e., monotone) discriminant functions. To this end, we propose a modification of existing procedures such as ID3, so that the resulting decision tree represents a positive discriminant function. In this procedure, we add some new data to recover the positivity of the data, which the original data had but which was lost in the process of decomposing data sets by such methods as ID3. To compare the performance of our method with existing methods, we test (1) positive data, which are randomly generated from a hidden positive Boolean function after adding dummy attributes, and (2) breast cancer data as an example of real-world data. The experimental results on (1) show that, although positive decision trees are somewhat larger than those built without the positivity assumption, they exhibit higher accuracy and tend to choose the correct attributes, i.e., those on which the hidden positive Boolean function is defined. For the breast cancer data set, we observe a similar tendency; i.e., positive decision trees are larger but give higher accuracy.
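The notion of positivity used in the abstract can be illustrated with a small sketch for binary data. It is not the procedure proposed in the paper; it only checks the definition. All names (`leq`, `has_positive_extension`, `is_positive_function`, the two toy classifiers) and the toy examples are hypothetical and chosen for illustration. The sketch relies on the standard fact that a set of positive and negative examples can be explained by some positive Boolean function exactly when no positive example is componentwise less than or equal to a negative example.

```python
from itertools import product


def leq(a, b):
    """Componentwise order: True if a[i] <= b[i] for every attribute i."""
    return all(x <= y for x, y in zip(a, b))


def has_positive_extension(positives, negatives):
    """True if no positive example lies componentwise below a negative one."""
    return not any(leq(p, n) for p in positives for n in negatives)


def is_positive_function(f, n_attrs):
    """Brute-force check that f: {0,1}^n -> {0,1} is monotone non-decreasing."""
    points = list(product((0, 1), repeat=n_attrs))
    return all(f(a) <= f(b) for a in points for b in points if leq(a, b))


# Hypothetical toy data over three binary attributes.
positives = [(0, 1, 0), (1, 1, 1)]
negatives = [(0, 0, 1), (1, 0, 0)]


def non_positive_tree(x):
    """Hand-written tree that fits the toy data but is not positive."""
    if x[2] == 0:
        return 1 if x[0] == 0 else 0
    return x[0]


def positive_tree(x):
    """Hand-written tree that fits the toy data and is positive."""
    return x[1]


def fits(f):
    """True if f classifies every toy example correctly."""
    return all(f(p) == 1 for p in positives) and all(f(n) == 0 for n in negatives)


data_is_positive = has_positive_extension(positives, negatives)
```

Both toy trees classify the four examples correctly, but `non_positive_tree` outputs 1 on (0, 1, 0) and 0 on the larger point (1, 1, 0), which is the kind of non-monotonic behaviour the abstract attributes to standard methods on data that are themselves positive.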