This paper introduces a new statistical approach to automatically partitioning text into coherent segments. The approach is based on a technique that incrementally builds an exponential model to extract features that are correlated with the presence of boundaries in labeled training text. The models use two classes of features: topicality features that use adaptive language models in a novel way to detect broad changes of topic, and cue-word features that detect occurrences of specific words, which may be domain-specific, that tend to be used near segment boundaries. Assessment of our approach on quantitative and qualitative grounds demonstrates its effectiveness in two very different domains, Wall Street Journal news articles and television broadcast news story transcripts. Quantitative results on these domains are presented using a new probabilistically motivated error metric, which combines precision and recall in a natural and flexible way. This metric is used to make a quantitative assessment of the relative contributions of the different feature types, as well as a comparison with decision trees and previously proposed text segmentation algorithms.
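To make the flavor of such a probabilistically motivated metric concrete, the sketch below shows a Pk-style segmentation error: an estimate of the probability that two text units a fixed distance k apart are classified inconsistently (same segment versus different segments) by the reference and the hypothesized segmentation. This is a minimal illustration under assumed conventions, not the paper's own formulation; the function name pk_error and the representation of a segmentation as one segment id per unit are illustrative choices.

    def pk_error(ref_labels, hyp_labels, k):
        """Estimate a Pk-style segmentation error.

        ref_labels, hyp_labels: one segment id per text unit (e.g., per
        sentence), so units i and j lie in the same segment iff their ids
        are equal. Returns the fraction of unit pairs (i, i + k) on which
        the reference and hypothesis disagree about co-segmentation.
        (Illustrative representation, not the paper's notation.)
        """
        assert len(ref_labels) == len(hyp_labels)
        n = len(ref_labels)
        if n <= k:
            return 0.0
        disagreements = 0
        for i in range(n - k):
            same_ref = ref_labels[i] == ref_labels[i + k]
            same_hyp = hyp_labels[i] == hyp_labels[i + k]
            if same_ref != same_hyp:
                disagreements += 1
        return disagreements / (n - k)

    if __name__ == "__main__":
        reference  = [0, 0, 0, 1, 1, 1, 2, 2, 2]   # three reference segments
        hypothesis = [0, 0, 0, 0, 1, 1, 1, 2, 2]   # boundaries shifted by one unit
        print(pk_error(reference, hypothesis, k=2))

A metric of this shape penalizes both missed boundaries and spurious ones through a single probability of error, which is how it combines the roles of precision and recall; in practice k is often tied to the average reference segment length, though that choice is a convention assumed here rather than stated in the abstract.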