ON MEASURING AND CORRECTING THE EFFECTS OF DATA MINING AND MODEL SELECTION

Authors
Citation
Jm. Ye, ON MEASURING AND CORRECTING THE EFFECTS OF DATA MINING AND MODEL SELECTION, Journal of the American Statistical Association, 93(441), 1998, pp. 120-131
Number of citations
28
Subject Categories
Statistics & Probability
Volume
93
Issue
441
Year of publication
1998
Pages
120 - 131
Database
ISI
SICI code
Abstract
In the theory of linear models, the concept of degrees of freedom plays an important role. This concept is often used for measurement of model complexity, for obtaining an unbiased estimate of the error variance, and for comparison of different models. I have developed a concept of generalized degrees of freedom (GDF) that is applicable to complex modeling procedures. The definition is based on the sum of the sensitivity of each fitted value to perturbation in the corresponding observed value. The concept is nonasymptotic in nature and does not require analytic knowledge of the modeling procedures. The concept of GDF offers a unified framework under which complex and highly irregular modeling procedures can be analyzed in the same way as classical linear models. By using this framework, many difficult problems can be solved easily. For example, one can now measure the number of observations used in a variable selection process. Different modeling procedures, such as a tree-based regression and a projection pursuit regression, can be compared on the basis of their residual sums of squares and the GDF that they cost. I apply the proposed framework to measure the effect of variable selection in linear models, leading to corrections of selection bias in various goodness-of-fit statistics. The theory also has interesting implications for the effect of general model searching by a human modeler.
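
The sensitivity-based definition in the abstract can be illustrated with a small Monte Carlo sketch. It treats the modeling procedure as a black box, perturbs the observed values with small Gaussian noise, and sums the regression slopes of each fitted value on its own perturbation. The function names (`estimate_gdf`, `fit_mean`), the perturbation scale `tau`, and the number of replications are illustrative assumptions, not details taken from the record above.

```python
import random


def estimate_gdf(fit, y, tau=0.5, n_reps=200, seed=0):
    """Rough perturbation estimate of generalized degrees of freedom.

    fit  : hypothetical black-box procedure mapping observations to fitted values
    tau  : standard deviation of the Gaussian perturbations (assumed value)
    """
    rng = random.Random(seed)
    n = len(y)
    deltas, fits = [], []
    for _ in range(n_reps):
        d = [rng.gauss(0.0, tau) for _ in range(n)]
        deltas.append(d)
        # Refit the whole procedure on the perturbed observations.
        fits.append(fit([yi + di for yi, di in zip(y, d)]))

    gdf = 0.0
    for i in range(n):
        # Least-squares slope of fitted value i on perturbation i:
        # an estimate of the sensitivity of fit i to observation i.
        d_bar = sum(d[i] for d in deltas) / n_reps
        f_bar = sum(f[i] for f in fits) / n_reps
        s_xy = sum((d[i] - d_bar) * (f[i] - f_bar) for d, f in zip(deltas, fits))
        s_xx = sum((d[i] - d_bar) ** 2 for d in deltas)
        gdf += s_xy / s_xx
    return gdf


def fit_mean(values):
    """Simplest linear model: every fitted value is the sample mean."""
    m = sum(values) / len(values)
    return [m] * len(values)


if __name__ == "__main__":
    y = [1.0, 2.0, 0.5, 3.0, 2.5, 1.5, 0.0, 4.0, 2.0, 1.0]
    print("mean fit:", estimate_gdf(fit_mean, y))
    print("interpolating fit:", estimate_gdf(lambda v: list(v), y))
```

For the sample mean the estimate should land near 1, the classical degrees of freedom, up to Monte Carlo noise. A procedure that reproduces the data exactly gives a value equal to the number of observations. Any adaptive procedure, such as variable selection followed by refitting, can be passed in as `fit` in the same way.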