Using generalized correlation to effect variable selection in very high dimensional problems

Citation
Hall, Peter and Miller, Hugh, Using generalized correlation to effect variable selection in very high dimensional problems, Journal of Computational and Graphical Statistics, 18(3), 2009, pp. 533-550
ISSN journal
10618600
Volume
18
Issue
3
Year of publication
2009
Pages
533 - 550
Database
ACNP
SICI code
Abstract
Using the traditional linear model to implement variable selection can perform very effectively in some cases, provided the response to relevant components is approximately monotone and its gradient changes only slowly. In other circumstances, nonlinearity of response can result in significant vector components being overlooked. Even if good results are obtained by linear model fitting, they can sometimes be bettered by using a nonlinear approach. These circumstances can arise in practice, with real data, and they motivate alternative methodologies. We suggest an approach based on ranking generalized empirical correlations between the response variable and components of the explanatory vector. This technique is not prediction-based, and can identify variables that are influential but not explicitly part of a predictive model. We explore the method’s performance for real and simulated data, and give a theoretical argument demonstrating its validity. The method can also be used in conjunction with, rather than as an alternative to, conventional prediction-based variable selectors, by providing a preliminary “massive dimension reduction” step as a prelude to using alternative techniques (e.g., the adaptive lasso) that do not always cope well with very high dimensions. Supplemental materials relating to the numerical sections of this paper are available online.
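The ranking idea in the abstract can be illustrated with a minimal sketch. It assumes, purely for illustration, that the generalized correlation of a component with the response is the largest correlation attainable by a polynomial of that component of degree at most 3, which equals the square root of R² from a least-squares polynomial fit. The function names `generalized_correlation` and `rank_variables`, the polynomial class, and the simulated data are hypothetical and are not taken from the paper.

```python
import numpy as np


def generalized_correlation(x, y, degree=3):
    """Largest correlation between y and a polynomial of x of degree <= `degree`.

    Assumption: the function class is polynomials; the value equals sqrt(R^2)
    of the least-squares fit of y on 1, x, ..., x^degree.
    """
    x = (x - x.mean()) / x.std()            # standardize for numerical stability
    basis = np.vander(x, degree + 1)        # columns x^degree, ..., x, 1
    coef, *_ = np.linalg.lstsq(basis, y, rcond=None)
    fitted = basis @ coef
    ss_tot = np.sum((y - y.mean()) ** 2)
    ss_res = np.sum((y - fitted) ** 2)
    return np.sqrt(max(0.0, 1.0 - ss_res / ss_tot))


def rank_variables(X, y, degree=3):
    """Score every column of X and return column indices sorted by score, descending."""
    scores = np.array(
        [generalized_correlation(X[:, j], y, degree) for j in range(X.shape[1])]
    )
    order = np.argsort(-scores)
    return order, scores


# Simulated example (hypothetical data): far more variables than observations,
# with a non-monotone signal in column 0 only.
rng = np.random.default_rng(0)
n, p = 200, 500
X = rng.standard_normal((n, p))
y = X[:, 0] ** 2 + 0.1 * rng.standard_normal(n)

order, scores = rank_variables(X, y)
top_candidates = order[:10]  # a small set to pass on to a second-stage selector
```

In this sketch the top-ranked indices play the role of the preliminary dimension reduction step: only those columns would be handed to a prediction-based method such as the adaptive lasso. A purely quadratic signal like the one simulated here has near-zero ordinary correlation with its component, which is the kind of case the abstract says a linear approach can overlook.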