W. Dumouchel, Bayesian data mining in large frequency tables, with an application to the FDA spontaneous reporting system, AM STATISTN, 53(3), 1999, pp. 177-190
A common data mining task is the search for associations in large databases.
Here we consider the search for "interestingly large" counts in a large
frequency table, having millions of cells, most of which have an observed
frequency of 0 or 1. We first construct a baseline or null hypothesis
expected frequency for each cell, and then suggest and compare screening
criteria for ranking the cell deviations of observed from expected count.
A criterion based on the results of fitting an empirical Bayes model to the
cell counts is recommended. An example compares these criteria for searching
the FDA Spontaneous Reporting System database maintained by the Division of
Pharmacovigilance and Epidemiology. In the example, each cell count is the
number of reports combining one of 1,398 drugs with one of 952 adverse events
(total of cell counts = 4.9 million), and the problem is to screen the
drug-event combinations for possible further investigation.
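
The screening idea in the abstract can be illustrated with a minimal sketch. It makes two assumptions that the abstract does not state. First, the baseline expected frequency is taken to be the one implied by independence of drug and event (row total times column total, divided by the grand total). Second, the cells are ranked by a simple additive shrinkage of the observed-to-expected ratio, which is only a stand-in for the empirical Bayes criterion the paper recommends and is not that model. The names expected_counts, rank_cells and shrink are hypothetical.

```python
from collections import defaultdict


def expected_counts(counts):
    """Expected count per cell under an ASSUMED independence baseline.

    counts maps (drug, event) -> observed number of reports.
    """
    row = defaultdict(int)  # reports per drug
    col = defaultdict(int)  # reports per event
    total = 0
    for (drug, event), n in counts.items():
        row[drug] += n
        col[event] += n
        total += n
    if total == 0:
        return {}
    # Every drug-event pair gets a baseline, including unobserved pairs.
    return {(d, e): row[d] * col[e] / total for d in row for e in col}


def rank_cells(counts, shrink=0.5):
    """Rank cells by a shrunken observed/expected ratio, largest first.

    The additive constant `shrink` is a hypothetical stand-in for a fitted
    empirical Bayes model; it only damps ratios built on tiny counts.
    """
    ranked = []
    for cell, exp in expected_counts(counts).items():
        obs = counts.get(cell, 0)
        raw = obs / exp if exp > 0 else 0.0  # unstable when exp is small
        shrunk = (obs + shrink) / (exp + shrink)
        ranked.append((cell, obs, exp, raw, shrunk))
    ranked.sort(key=lambda r: r[4], reverse=True)
    return ranked


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    toy = {("A", "x"): 8, ("A", "y"): 2, ("B", "x"): 2, ("B", "y"): 8}
    for cell, obs, exp, raw, shrunk in rank_cells(toy):
        print(cell, obs, exp, round(raw, 3), round(shrunk, 3))
```

In a table where most cells hold 0 or 1, the raw ratio can be very large simply because the expected count is tiny, which is why the sketch sorts on the damped ratio rather than the raw one.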