The art of data augmentation

Citation
D.A. van Dyk and X.L. Meng, The art of data augmentation, J. Comput. Graph. Stat., 10(1), 2001, pp. 1-50
Number of citations
66
Subject categories
Mathematics
Journal title
JOURNAL OF COMPUTATIONAL AND GRAPHICAL STATISTICS
ISSN journal
1061-8600
Volume
10
Issue
1
Year of publication
2001
Pages
1 - 50
Database
ISI
SICI code
1061-8600(200103)10:1<1:TAODA>2.0.ZU;2-6
Abstract
The term data augmentation refers to methods for constructing iterative optimization or sampling algorithms via the introduction of unobserved data or latent variables. For deterministic algorithms, the method was popularized in the general statistical community by the seminal article by Dempster, Laird, and Rubin on the EM algorithm for maximizing a likelihood function or, more generally, a posterior density. For stochastic algorithms, the method was popularized in the statistical literature by Tanner and Wong's Data Augmentation algorithm for posterior sampling and in the physics literature by Swendsen and Wang's algorithm for sampling from the Ising and Potts models and their generalizations; in the physics literature, the method of data augmentation is referred to as the method of auxiliary variables. Data augmentation schemes were used by Tanner and Wong to make simulation feasible and simple, while auxiliary variables were adopted by Swendsen and Wang to improve the speed of iterative simulation. In general, however, constructing data augmentation schemes that result in both simple and fast algorithms is a matter of art in that successful strategies vary greatly with the (observed-data) models being considered. After an overview of data augmentation/auxiliary variables and some recent developments in methods for constructing such efficient data augmentation schemes, we introduce an effective search strategy that combines the ideas of marginal augmentation and conditional augmentation, together with a deterministic approximation method for selecting good augmentation schemes. We then apply this strategy to three common classes of models (specifically, multivariate t, probit regression, and mixed-effects models) to obtain efficient Markov chain Monte Carlo algorithms for posterior sampling. We provide theoretical and empirical evidence that the resulting algorithms, while requiring similar programming effort, can show dramatic improvement over the Gibbs samplers commonly used for these models in practice. A key feature of all these new algorithms is that they are positive recurrent subchains of nonpositive recurrent Markov chains constructed in larger spaces.
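To illustrate the kind of data augmentation scheme the abstract describes, the following is a minimal sketch of the classical data-augmentation Gibbs sampler for probit regression (the Albert and Chib scheme, one of the three model classes named above). It assumes a flat prior on the coefficient vector; the paper's own improved samplers refine schemes like this one, and this sketch is illustrative rather than a reproduction of the authors' algorithm.

```python
import numpy as np
from scipy.stats import truncnorm

def probit_da_gibbs(X, y, n_iter=2000, seed=0):
    """Data-augmentation Gibbs sampler for probit regression
    (Albert-Chib scheme; minimal sketch, flat prior on beta)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = np.zeros(p)
    draws = np.empty((n_iter, p))
    for t in range(n_iter):
        mu = X @ beta
        # Augmentation step: latent z_i ~ N(mu_i, 1), truncated to
        # (0, inf) if y_i = 1 and (-inf, 0) if y_i = 0.  truncnorm
        # takes bounds in standardized units, hence the shift by mu.
        lo = np.where(y == 1, -mu, -np.inf)
        hi = np.where(y == 1, np.inf, -mu)
        z = mu + truncnorm.rvs(lo, hi, random_state=rng)
        # Posterior step: beta | z ~ N((X'X)^{-1} X'z, (X'X)^{-1}).
        mean = XtX_inv @ (X.T @ z)
        beta = rng.multivariate_normal(mean, XtX_inv)
        draws[t] = beta
    return draws
```

Introducing the latent z turns an intractable posterior into two easy conditional draws; the cost, as the abstract notes, is that a poorly chosen augmentation can mix slowly, which is what the marginal/conditional augmentation strategies aim to fix.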