Statistical models for text segmentation

Citation
D. Beeferman et al., Statistical models for text segmentation, MACH LEARN, 34(1-3), 1999, pp. 177-210
Number of citations
25
Subject categories
AI Robotics and Automatic Control
Journal title
MACHINE LEARNING
ISSN journal
0885-6125
Volume
34
Issue
1-3
Year of publication
1999
Pages
177 - 210
Database
ISI
SICI code
0885-6125(199902)34:1-3<177:SMFTS>2.0.ZU;2-Y
Abstract
This paper introduces a new statistical approach to automatically partitioning text into coherent segments. The approach is based on a technique that incrementally builds an exponential model to extract features that are correlated with the presence of boundaries in labeled training text. The models use two classes of features: topicality features that use adaptive language models in a novel way to detect broad changes of topic, and cue-word features that detect occurrences of specific words, which may be domain-specific, that tend to be used near segment boundaries. Assessment of our approach on quantitative and qualitative grounds demonstrates its effectiveness in two very different domains, Wall Street Journal news articles and television broadcast news story transcripts. Quantitative results on these domains are presented using a new probabilistically motivated error metric, which combines precision and recall in a natural and flexible way. This metric is used to make a quantitative assessment of the relative contributions of the different feature types, as well as a comparison with decision trees and previously proposed text segmentation algorithms.
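Illustrative sketch
The abstract does not define the error metric, so the following is only a minimal sketch of one common way to read a "probabilistically motivated" segmentation error: the fraction of word pairs a fixed distance k apart on which a reference and a hypothesized segmentation disagree about whether the two words fall in the same segment. The function names, the segment-length input format, and the default choice of k (half the mean reference segment length) are assumptions made for illustration, not details taken from this record.

```python
def labels_from_lengths(segment_lengths):
    """Expand segment lengths, e.g. [3, 3], into per-word segment ids [0,0,0,1,1,1]."""
    labels = []
    for seg_id, length in enumerate(segment_lengths):
        labels.extend([seg_id] * length)
    return labels


def window_disagreement(reference, hypothesis, k=None):
    """Fraction of word pairs (i, i + k) on which the two segmentations disagree.

    reference, hypothesis: per-word segment ids of equal length.
    k: probe distance in words; if None, half the mean reference
       segment length is used (an assumed default).
    """
    if len(reference) != len(hypothesis):
        raise ValueError("segmentations must cover the same number of words")
    n = len(reference)
    if k is None:
        # Count reference segments as one plus the number of label changes.
        num_segments = 1 + sum(
            reference[i] != reference[i - 1] for i in range(1, n)
        )
        k = max(1, round(n / num_segments / 2))
    if not 0 < k < n:
        raise ValueError("k must satisfy 0 < k < number of words")

    errors = 0
    for i in range(n - k):
        same_in_ref = reference[i] == reference[i + k]
        same_in_hyp = hypothesis[i] == hypothesis[i + k]
        # An error is any pair where the two segmentations disagree.
        if same_in_ref != same_in_hyp:
            errors += 1
    return errors / (n - k)


if __name__ == "__main__":
    ref = labels_from_lengths([3, 3])   # one boundary after the third word
    hyp = labels_from_lengths([6])      # hypothesis proposes no boundary
    print(window_disagreement(ref, hyp, k=2))
```

A score of 0 means the two segmentations agree on every probed pair. Because near misses disturb fewer pairs than badly misplaced boundaries, a measure of this kind penalizes them less than exact-match precision and recall would.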