Incorporating auxiliary information for improved prediction in high-dimensional datasets: an ensemble of shrinkage approaches

Citation
Boonstra, Philip S. et al., Incorporating auxiliary information for improved prediction in high-dimensional datasets: an ensemble of shrinkage approaches, Biostatistics (Oxford. Print), 14(2), 2013, pp. 259-272
ISSN journal
14654644
Volume
14
Issue
2
Year of publication
2013
Pages
259 - 272
Database
ACNP
Abstract
With advancement in genomic technologies, it is common that two high-dimensional datasets are available, both measuring the same underlying biological phenomenon with different techniques. We consider predicting a continuous outcome Y using X, a set of p markers which is the best available measure of the underlying biological process. This same biological process may also be measured by W, coming from a prior technology but correlated with X. On a moderately sized sample, we have (Y, X, W), and on a larger sample we have (Y, W). We utilize the data on W to boost the prediction of Y by X. When p is large and the subsample containing X is small, this is a p > n situation. When p is small, this is akin to the classical measurement error problem; however, ours is not the typical goal of calibrating W for use in future studies. We propose to shrink the regression coefficients β of Y on X toward different targets that use information derived from W in the larger dataset. We compare these proposals with the classical ridge regression of Y on X, which does not use W. We also unify all of these methods as targeted ridge estimators. Finally, we propose a hybrid estimator which is a linear combination of multiple estimators of β. With an optimal choice of weights, the hybrid estimator balances efficiency and robustness in a data-adaptive way to theoretically yield a smaller prediction error than any of its constituents. The methods, including a fully Bayesian alternative, are evaluated via simulation studies. We also apply them to a gene-expression dataset. mRNA expression measured via quantitative real-time polymerase chain reaction is used to predict survival time in lung cancer patients, with auxiliary information from microarray technology available on a larger sample.
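
The idea of shrinking toward a target can be illustrated with a minimal sketch. It assumes the generic textbook form of a ridge estimator with a nonzero target, namely the minimizer of ||y - Xβ||² + λ||β - β₀||², which may differ in detail from the estimators defined in the paper. The function names `targeted_ridge` and `hybrid`, the penalty `lam`, and the `target` vector are hypothetical names chosen for illustration. How a target is derived from W in the larger sample is not specified in the abstract and is not shown here, and the weights of the linear combination are taken as given rather than chosen optimally.

```python
import numpy as np


def targeted_ridge(X, y, lam, target=None):
    """Ridge estimate of beta, shrunk toward `target` instead of zero.

    Minimizes ||y - X beta||^2 + lam * ||beta - target||^2.
    With target=None (a zero vector) this is classical ridge regression.
    Generic illustrative form; not necessarily the paper's exact estimator.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    p = X.shape[1]
    if target is None:
        target = np.zeros(p)
    target = np.asarray(target, dtype=float)
    # Normal equations: (X'X + lam I) beta = X'y + lam * target
    lhs = X.T @ X + lam * np.eye(p)
    rhs = X.T @ y + lam * target
    return np.linalg.solve(lhs, rhs)


def hybrid(estimates, weights):
    """Linear combination of several estimates of beta.

    `estimates` is a sequence of length-p vectors and `weights` a sequence
    of the same length. The weights are supplied by the caller here; the
    data-adaptive choice described in the abstract is not implemented.
    """
    weights = np.asarray(weights, dtype=float)
    stacked = np.column_stack([np.asarray(b, dtype=float) for b in estimates])
    return stacked @ weights
```

For any `lam > 0` the linear system has a unique solution even when p > n, which is why this kind of shrinkage remains usable in the high-dimensional setting the abstract describes. Calling `targeted_ridge(X, y, lam)` without a target gives the classical ridge fit that ignores W, and passing it together with one or more targeted fits to `hybrid` gives a combined estimate.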