EXTRACTING TEXT FROM POSTSCRIPT

Citation
C.G. Nevill-Manning et al., EXTRACTING TEXT FROM POSTSCRIPT, Software: Practice and Experience, 28(5), 1998, pp. 481-491
Number of citations
6
Subject categories
Computer Science, Software, Graphics, Programming
ISSN journal
0038-0644
Volume
28
Issue
5
Year of publication
1998
Pages
481 - 491
Database
ISI
SICI code
0038-0644(1998)28:5<481:ETFP>2.0.ZU;2-C
Abstract
We show how to extract plain text from PostScript files. A textual scan is inadequate because PostScript interpreters can generate characters on the page that do not appear in the source file. Furthermore, word and line breaks are implicit in the graphical rendition, and must be inferred from the positioning of word fragments. We present a robust technique for extracting text and recognizing words and paragraphs. The method uses a standard PostScript interpreter but redefines several PostScript operators, and simple heuristics are employed to locate word and line breaks. The scheme has been used to create a full-text index, and plain-text versions, of 40,000 technical reports (34 Gbytes of PostScript). Other text-extraction systems are reviewed: none offer the same combination of robustness and simplicity. (C) 1998 John Wiley & Sons, Ltd.
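The abstract says that word and line breaks must be inferred from the positions of word fragments. The following Python sketch illustrates that general idea only; it is not the authors' implementation. The `Fragment` record, the `join_fragments` function, and the threshold values are all hypothetical, and the sketch assumes that some earlier step (for example, a redefined text-painting operator) has already reported each fragment's string, position, and width in drawing order.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    """One piece of text as painted on the page (hypothetical record)."""
    text: str
    x: float      # left edge of the fragment
    y: float      # baseline position
    width: float  # horizontal advance of the fragment


def join_fragments(fragments, font_size=10.0, space_ratio=0.3, line_ratio=0.5):
    """Rebuild lines of text from fragments given in drawing order.

    The thresholds are illustrative assumptions, not values from the paper:
    a baseline change larger than line_ratio * font_size starts a new line,
    and a horizontal gap larger than space_ratio * font_size becomes a space.
    """
    lines = []
    current = ""
    prev = None
    for frag in fragments:
        if prev is None:
            current = frag.text
        elif abs(frag.y - prev.y) > line_ratio * font_size:
            # The baseline moved: close the current line and start another.
            lines.append(current)
            current = frag.text
        else:
            # Same line: measure the gap from the end of the previous fragment.
            gap = frag.x - (prev.x + prev.width)
            if gap > space_ratio * font_size:
                current += " "
            current += frag.text
        prev = frag
    if prev is not None:
        lines.append(current)
    return lines


if __name__ == "__main__":
    page = [
        Fragment("Post", 0, 100, 20),
        Fragment("Script", 20, 100, 28),  # touches "Post": same word
        Fragment("files", 52, 100, 20),   # gap of 4 units: new word
        Fragment("Second", 0, 88, 30),    # lower baseline: new line
    ]
    print(join_fragments(page))  # ['PostScript files', 'Second']
```

Fixed thresholds scaled by a single font size keep the sketch short. A real extractor would probably need to adapt them to each font and to justified text, where the spacing between words varies from line to line.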