This paper describes a robust system for information extraction (IE) from
spoken language data. The system extends previous hidden Markov model (HMM)
work in IE, using a state topology designed for explicit modeling of
variable-length phrases and class-based statistical language model smoothing
to produce state-of-the-art performance for a wide range of speech error
rates. Experiments on broadcast news data show that the system performs well
with temporal and source differences in the data. In addition, strategies
for integrating word-level confidence estimates into the model are
introduced, showing improved performance by using a generic error token for
incorrectly recognized words in the training data and low confidence words
in the test data.
(C) 2000 Elsevier Science B.V. All rights reserved.
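
The error-token strategy mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the token name `<ERR>`, the function names, the 0.5 confidence threshold, and the assumption that recognizer output is already aligned word for word with the reference transcript are all hypothetical choices made for illustration only.

```python
from typing import List, Sequence

# Hypothetical token name; the abstract only says "a generic error token".
ERROR_TOKEN = "<ERR>"


def mask_training_words(hypothesis: Sequence[str],
                        reference: Sequence[str]) -> List[str]:
    """Replace incorrectly recognized words in training data.

    Assumes (for illustration) that the recognizer hypothesis and the
    reference transcript are already aligned position by position.
    """
    if len(hypothesis) != len(reference):
        raise ValueError("hypothesis and reference must be pre-aligned")
    return [h if h == r else ERROR_TOKEN
            for h, r in zip(hypothesis, reference)]


def mask_test_words(hypothesis: Sequence[str],
                    confidences: Sequence[float],
                    threshold: float = 0.5) -> List[str]:
    """Replace low-confidence words in test data.

    The threshold value is an assumed placeholder, not a value from the paper.
    """
    if len(hypothesis) != len(confidences):
        raise ValueError("one confidence score is required per word")
    return [w if c >= threshold else ERROR_TOKEN
            for w, c in zip(hypothesis, confidences)]


if __name__ == "__main__":
    # Training side: the reference transcript tells us which words are wrong.
    print(mask_training_words(["the", "president", "of", "friends"],
                              ["the", "president", "of", "france"]))
    # Test side: no reference exists, so confidence scores stand in for it.
    print(mask_test_words(["the", "president", "of", "friends"],
                          [0.95, 0.90, 0.85, 0.20]))
```

In this sketch both the training and test inputs map unreliable words to the same symbol, so a downstream model would see a consistent token wherever the recognizer output is not trusted.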