Controlling overfitting in software quality models: Experiments with regression trees and classification

Authors

Khoshgoftaar, TM; Allen, EB; Deng, JY

Citation

T.M. Khoshgoftaar et al., Controlling overfitting in software quality models: Experiments with regression trees and classification, SEVENTH INTERNATIONAL SOFTWARE METRICS SYMPOSIUM - METRICS 2001, PROCEEDINGS, 2000, pp. 190-198

Citations number

Subject Categories

Current Book Contents

Journal title

SEVENTH INTERNATIONAL SOFTWARE METRICS SYMPOSIUM - METRICS 2001, PROCEEDINGS

Year of publication

2000

Pages

190 - 198

Database

ISI

SICI code

Abstract

In this day of "faster, cheaper, better" release cycles, software developers must focus enhancement efforts on those modules that need improvement the most. Predictions of which modules are likely to have faults during operations are an important tool to guide such improvement efforts during maintenance. Tree-based models are attractive because they readily model nonmonotonic relationships between a response variable and predictors. However, tree-based models are vulnerable to overfitting, where the model reflects the structure of the training data set too closely. Even though a model appears to be accurate on training data, if overfitted, it may be much less accurate when applied to a current data set. To account for the severe consequences of misclassifying fault-prone modules, our measure of overfitting is based on expected costs of misclassification, rather than the total number of misclassifications. In this paper, we apply a regression-tree algorithm in the S-Plus system to classification of software modules by application of our classification rule that accounts for the preferred balance between misclassification rates. We conducted a case study of a very large legacy telecommunication system, and investigated two parameters of the regression-tree algorithm. We found that minimum deviance was strongly related to overfitting and can be used to control it, but the effect of minimum node size on overfitting is ambiguous.
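
The cost-based view of overfitting described in the abstract can be illustrated with a minimal sketch. It is not the authors' S-Plus procedure: the function names, the cost values (1 for a false alarm, 10 for a missed fault-prone module), and the definition of the gap as test cost minus training cost are all assumptions made here for illustration. The sketch also assumes that class proportions are estimated from the data set itself, so the expected cost per module reduces to the cost-weighted error counts divided by the number of modules.

```python
def expected_cost(actual, predicted, cost_type1=1.0, cost_type2=10.0):
    """Expected cost of misclassification per module.

    actual, predicted: equal-length sequences of booleans, True = fault-prone.
    Type I error:  a not fault-prone module is flagged as fault-prone.
    Type II error: a fault-prone module is classified as not fault-prone.
    The cost values are hypothetical defaults, not taken from the paper.
    """
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("need two non-empty sequences of equal length")
    type1 = sum(1 for a, p in zip(actual, predicted) if not a and p)
    type2 = sum(1 for a, p in zip(actual, predicted) if a and not p)
    return (cost_type1 * type1 + cost_type2 * type2) / len(actual)


def overfitting_gap(train_actual, train_pred, test_actual, test_pred, **costs):
    """Assumed overfitting indicator: test cost minus training cost.

    A large positive value means the model looks much cheaper on the
    training data than on the later data set.
    """
    return (expected_cost(test_actual, test_pred, **costs)
            - expected_cost(train_actual, train_pred, **costs))


if __name__ == "__main__":
    # Toy labels: the model is perfect on training data ...
    train_actual = [True, False, False, False]
    train_pred = [True, False, False, False]
    # ... but makes one Type I and one Type II error on the later data set.
    test_actual = [True, True, False, False]
    test_pred = [True, False, True, False]
    print(overfitting_gap(train_actual, train_pred, test_actual, test_pred))
```

Weighting the two error types differently is what distinguishes this from a plain count of misclassifications: with the assumed costs, one missed fault-prone module contributes ten times as much as one false alarm.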