J.R. Movellan and P. Mineiro, ROBUST SENSOR FUSION - ANALYSIS AND APPLICATION TO AUDIO-VISUAL SPEECH RECOGNITION, Machine Learning, 32(2), 1998, pp. 85-100
This paper analyzes the issue of catastrophic fusion, a problem that
occurs in multimodal recognition systems that integrate the output from
several modules while working in non-stationary environments. For
concreteness we frame the analysis with regard to the problem of
automatic audio-visual speech recognition (AVSR), but the issues at hand
are very general and arise in multimodal recognition systems that need
to work in a wide variety of contexts. Catastrophic fusion is said to
have occurred when the performance of a multimodal system is inferior to
the performance of some isolated modules, e.g., when the performance of
the audio-visual speech recognition system is inferior to that of the
audio system alone. Catastrophic fusion arises because recognition
modules make implicit assumptions and thus operate correctly only within
a certain context. Practice shows that when modules are tested in
contexts inconsistent with their assumptions, their influence on the
fused product tends to increase, with catastrophic results. We propose a
principled solution to this problem based upon Bayesian ideas of
competitive models and inference robustification. We study the approach
analytically on a classic Gaussian discrimination task and then apply it
to a realistic AVSR problem with excellent results.
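
The effect described in the abstract can be illustrated with a toy two-class Gaussian discrimination problem. The sketch below is not taken from the paper: the class means, the nominal noise level, the observation values, and the function names (`naive_llr`, `robust_llr`, `fuse`) are all assumptions chosen for illustration. It fuses two modules by summing their log-likelihood ratios, shows how an out-of-context observation in one module can dominate the sum, and then applies one simple form of robustification in which each class model is allowed to re-estimate its noise variance (never below the nominal value) before the modules are combined.

```python
import math

MU = 1.0         # assumed class means: +MU for class A, -MU for class B
SIGMA_NOM = 1.0  # assumed nominal noise standard deviation of each module


def log_gauss(x, mean, sigma):
    """Log density of a univariate Gaussian N(mean, sigma^2) at x."""
    return -0.5 * math.log(2.0 * math.pi * sigma ** 2) - (x - mean) ** 2 / (2.0 * sigma ** 2)


def naive_llr(x):
    """Log-likelihood ratio (A vs. B) assuming the nominal noise level always holds."""
    return log_gauss(x, MU, SIGMA_NOM) - log_gauss(x, -MU, SIGMA_NOM)


def robust_llr(x):
    """Log-likelihood ratio where each class model picks its own noise level.

    For a single observation the maximum-likelihood standard deviation is
    |x - mean|; it is floored at SIGMA_NOM so the likelihood stays bounded.
    This is an illustrative choice, not necessarily the paper's formulation.
    """
    def profiled(mean):
        sigma = max(abs(x - mean), SIGMA_NOM)
        return log_gauss(x, mean, sigma)

    return profiled(MU) - profiled(-MU)


def fuse(llrs):
    """Fuse conditionally independent modules by summing log-likelihood ratios."""
    return sum(llrs)


# Hypothetical observations; the true class is A (positive ratio means "A").
audio_x = 0.8   # in-context audio observation, close to the class A mean
video_x = -6.0  # out-of-context video observation, far from both class means

naive_fused = fuse([naive_llr(audio_x), naive_llr(video_x)])
robust_fused = fuse([robust_llr(audio_x), robust_llr(video_x)])

print("audio alone :", naive_llr(audio_x))
print("naive fused :", naive_fused)
print("robust fused:", robust_fused)
```

With the nominal variance fixed, the log-likelihood ratio grows linearly in the observation, so the corrupted video module outvotes the reliable audio module and the fused decision is worse than audio alone. Letting each model explain the outlier with a larger variance shrinks that module's ratio toward zero, so the fused decision follows the module that is still operating within its assumptions.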