A system is presented for creating a summary indicating the contents
of an imaged document. The summary is composed from selected regions
extracted from the imaged document. The regions may include sentences,
key phrases, headings, and figures. The extracts are identified without
the use of optical character recognition. The imaged document is first
processed to identify the word-bounding boxes, the reading order of
words, and the location of sentence and paragraph boundaries in the
text. The word-bounding boxes are grouped into equivalence classes to
mimic the terms in a text document. Equivalence classes representing
content words are identified, and key phrases are identified from the
set of content words. Summary sentences are selected using a
statistically based classifier applied to a set of discrete sentence
features. Evaluation of sentence selection against a set of abstracts
created by a professional abstracting company is given.
(C) 1998 Academic Press.
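
As an illustration of the grouping step, the following is a minimal
sketch of how word-bounding boxes might be collected into equivalence
classes without recognizing any characters. It is not the system's
actual matching procedure, which the abstract does not specify. The
WordBox fields, the "profile" shape signature, the similar() test, and
the tolerance value are all hypothetical and chosen only to make the
idea concrete.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WordBox:
    """A word-bounding box plus a hypothetical shape signature."""
    x: int
    y: int
    width: int
    height: int
    profile: tuple  # assumed feature, e.g. ink counts per column band


def similar(a, b, tol=2):
    """Assumed similarity test: close in size and in shape signature."""
    if abs(a.width - b.width) > tol or abs(a.height - b.height) > tol:
        return False
    if len(a.profile) != len(b.profile):
        return False
    return sum(abs(p - q) for p, q in zip(a.profile, b.profile)) <= tol


def group_into_classes(boxes, tol=2):
    """Group box indices into equivalence classes using union-find."""
    parent = list(range(len(boxes)))

    def find(i):
        # Follow parent links to the class root, halving paths as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Merge the classes of every pair of boxes judged similar.
    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            if similar(boxes[i], boxes[j], tol):
                ri, rj = find(i), find(j)
                if ri != rj:
                    parent[rj] = ri

    # Collect the members of each class by their root.
    classes = {}
    for i in range(len(boxes)):
        classes.setdefault(find(i), []).append(i)
    return sorted(classes.values())


# Toy data (invented for illustration): three look-alike boxes, one other.
boxes = [
    WordBox(0, 0, 30, 10, (3, 5, 4)),
    WordBox(50, 0, 31, 10, (3, 5, 5)),
    WordBox(0, 20, 60, 10, (9, 9, 9)),
    WordBox(90, 0, 30, 11, (3, 4, 4)),
]
classes = group_into_classes(boxes)
```

Each resulting class plays the role of a term in a text document, so
term counts can be taken over classes rather than over recognized
words. Union-find makes the grouping transitive: two boxes end up in
the same class if a chain of pairwise-similar boxes links them. This is
a design choice of the sketch, not something stated in the abstract.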