Causal Dantzig: Fast inference in linear structural equation models with hidden variables under additive interventions

Citation
Dominik Rothenhäusler et al., Causal Dantzig: Fast inference in linear structural equation models with hidden variables under additive interventions, Annals of Statistics, 47(3), 2019, pp. 1688-1722
Journal title
ISSN journal
00905364
Volume
47
Issue
3
Year of publication
2019
Pages
1688 - 1722
Database
ACNP
SICI code
Abstract
Causal inference is known to be very challenging when only observational data are available. Randomized experiments are often costly and impractical, and in instrumental variable regression the number of instruments has to exceed the number of causal predictors. It was recently shown in Peters, Bühlmann and Meinshausen (2016) (J. R. Stat. Soc. Ser. B. Stat. Methodol. 78 947–1012) that causal inference for the full model is possible when data from distinct observational environments are available, exploiting the fact that the conditional distribution of a response variable is invariant under the correct causal model. Two shortcomings of such an approach are the high computational effort for large-scale data and the assumed absence of hidden confounders. Here, we show that these two shortcomings can be addressed if one is willing to make a more restrictive assumption on the type of interventions that generate different environments. To this end, we look at a different notion of invariance, namely inner-product invariance. By avoiding a computationally cumbersome reverse-engineering approach such as in Peters, Bühlmann and Meinshausen (2016), it allows for large-scale causal inference in linear structural equation models. We discuss identifiability conditions for the causal parameter and derive asymptotic confidence intervals in the low-dimensional setting. In the case of nonidentifiability, we show that the solution set of causal Dantzig has predictive guarantees under certain interventions. We derive finite-sample bounds in the high-dimensional setting and investigate the method's performance on simulated datasets.
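
Illustrative sketch (not part of the catalogued record): the inner-product invariance mentioned in the abstract can be read as requiring that the inner product between the predictors and the residual Y - Xβ is the same in every environment. With two environments, one plausible unregularized version solves the linear system obtained by differencing the empirical moments across environments. The Python code below is a minimal sketch under that reading. The function names, the simulated data-generating process, and all numeric settings are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np


def causal_dantzig_unregularized(X1, Y1, X2, Y2):
    """Solve (G1 - G2) beta = (Z1 - Z2), where G_e = X_e^T X_e / n_e
    and Z_e = X_e^T Y_e / n_e for environments e = 1, 2.

    Hypothetical helper: a two-environment, low-dimensional sketch only.
    """
    n1, n2 = X1.shape[0], X2.shape[0]
    G = X1.T @ X1 / n1 - X2.T @ X2 / n2  # difference of Gram matrices
    Z = X1.T @ Y1 / n1 - X2.T @ Y2 / n2  # difference of cross moments
    return np.linalg.solve(G, Z)


def simulate(n, shift_scale, beta, rng):
    """Assumed toy model: a hidden variable H confounds X and Y, and the
    environment adds independent noise of scale shift_scale to X."""
    p = beta.shape[0]
    H = rng.normal(size=n)                         # hidden confounder
    shift = shift_scale * rng.normal(size=(n, p))  # additive intervention
    X = H[:, None] + rng.normal(size=(n, p)) + shift
    Y = X @ beta + 2.0 * H + rng.normal(size=n)
    return X, Y


rng = np.random.default_rng(0)
beta = np.array([1.0, -0.5])                 # assumed causal coefficients
X1, Y1 = simulate(200_000, 0.0, beta, rng)   # environment without shift
X2, Y2 = simulate(200_000, 2.0, beta, rng)   # environment with shift

beta_hat = causal_dantzig_unregularized(X1, Y1, X2, Y2)

# Pooled least squares for comparison; it is biased by the hidden confounder.
X_pool = np.vstack([X1, X2])
Y_pool = np.concatenate([Y1, Y2])
beta_ols = np.linalg.lstsq(X_pool, Y_pool, rcond=None)[0]

print("moment-difference estimate:", beta_hat)
print("pooled least squares:      ", beta_ols)
```

In this toy setup the confounding term contributes equally to the moments of both environments, so it cancels in the difference, while pooled least squares absorbs it into the coefficients. The regularized, high-dimensional variant and the confidence intervals described in the abstract are not covered by this sketch.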