We show how to extract plain text from PostScript files. A textual scan
is inadequate because PostScript interpreters can generate characters on
the page that do not appear in the source file. Furthermore, word and
line breaks are implicit in the graphical rendition, and must be
inferred from the positioning of word fragments. We present a robust
technique for extracting text and recognizing words and paragraphs. The
method uses a standard PostScript interpreter but redefines several
PostScript operators, and simple heuristics are employed to locate word
and line breaks. The scheme has been used to create a full-text index,
and plain-text versions, of 40,000 technical reports (34 Gbytes of
PostScript). Other text-extraction systems are reviewed: none offer the
same combination of robustness and simplicity. © 1998 John Wiley &
Sons, Ltd.