Bayesian data mining in large frequency tables, with an application to the FDA spontaneous reporting system

Authors
Citation
W. Dumouchel, Bayesian data mining in large frequency tables, with an application to the FDA spontaneous reporting system, AM STATISTN, 53(3), 1999, pp. 177-190
Number of citations
27
Subject categories
Mathematics
Journal title
AMERICAN STATISTICIAN
ISSN journal
0003-1305
Volume
53
Issue
3
Year of publication
1999
Pages
177 - 190
Database
ISI
SICI code
0003-1305(199908)53:3<177:BDMILF>2.0.ZU;2-B
Abstract
A common data mining task is the search for associations in large databases. Here we consider the search for "interestingly large" counts in a large frequency table, having millions of cells, most of which have an observed frequency of 0 or 1. We first construct a baseline or null hypothesis expected frequency for each cell, and then suggest and compare screening criteria for ranking the cell deviations of observed from expected count. A criterion based on the results of fitting an empirical Bayes model to the cell counts is recommended. An example compares these criteria for searching the FDA Spontaneous Reporting System database maintained by the Division of Pharmacovigilance and Epidemiology. In the example, each cell count is the number of reports combining one of 1,398 drugs with one of 952 adverse events (total of cell counts = 4.9 million), and the problem is to screen the drug-event combinations for possible further investigation.
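
The screening idea in the abstract can be illustrated with a minimal sketch. The abstract does not say how the baseline expected frequency is built, so this example assumes a simple row-by-column independence baseline (expected count = row total × column total / grand total) and ranks cells by the raw observed-to-expected ratio. This is not the empirical Bayes criterion the abstract recommends; the function names `expected_counts` and `rank_cells` and the toy counts are hypothetical.

```python
def expected_counts(observed):
    """Expected cell counts under an ASSUMED independence baseline."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


def rank_cells(observed, top=3):
    """Rank cells by the raw observed/expected ratio, largest first."""
    expected = expected_counts(observed)
    scored = []
    for i, row in enumerate(observed):
        for j, n in enumerate(row):
            e = expected[i][j]
            if e > 0:  # skip cells whose baseline is zero
                scored.append((n / e, i, j))
    scored.sort(reverse=True)
    return [(i, j, ratio) for ratio, i, j in scored[:top]]


# Toy table: rows stand in for drugs, columns for adverse events.
counts = [
    [20, 1, 0],
    [2, 15, 1],
    [0, 1, 30],
]
ranked = rank_cells(counts)
for drug, event, ratio in ranked:
    print(f"drug {drug}, event {event}: observed/expected = {ratio:.2f}")
```

A raw ratio of this kind is unstable when the expected count is very small, which is the usual situation in a table where most cells hold 0 or 1. That instability is one reason to prefer a criterion that shrinks the estimates, such as the empirical Bayes approach named in the abstract.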