EXTRACTING TEXT FROM POSTSCRIPT

Authors

NEVILL-MANNING CG; REED T; WITTEN IH

Citation

C.G. Nevill-Manning et al., EXTRACTING TEXT FROM POSTSCRIPT, Software, practice & experience, 28(5), 1998, pp. 481-491

Citations number

Subject categories

Computer Science, Software, Graphics, Programming

Journal title

Software, practice & experience

ISSN journal

0038-0644

Volume

28

Issue

5

Year of publication

1998

Pages

481 - 491

Database

ISI

SICI code

0038-0644(1998)28:5<481:ETFP>2.0.ZU;2-C

Abstract

We show how to extract plain text from PostScript files. A textual scan is inadequate because PostScript interpreters can generate characters on the page that do not appear in the source file. Furthermore, word and line breaks are implicit in the graphical rendition, and must be inferred from the positioning of word fragments. We present a robust technique for extracting text and recognizing words and paragraphs. The method uses a standard PostScript interpreter but redefines several PostScript operators, and simple heuristics are employed to locate word and line breaks. The scheme has been used to create a full-text index, and plain-text versions, of 40,000 technical reports (34 Gbytes of PostScript). Other text-extraction systems are reviewed: none offer the same combination of robustness and simplicity. (C) 1998 John Wiley & Sons, Ltd.
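
The idea of inferring word and line breaks from fragment positions can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Fragment` record, the `join_fragments` function, and the `space_gap` and `line_gap` thresholds are hypothetical names and values chosen for illustration, and the sketch assumes the fragments have already been captured, in rendering order, together with their page coordinates and rendered widths.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    """A piece of text as placed on the page (hypothetical record)."""
    text: str
    x: float      # left edge of the fragment
    y: float      # baseline of the fragment
    width: float  # rendered width of the fragment


def join_fragments(fragments, space_gap=1.5, line_gap=5.0):
    """Rebuild plain text from positioned fragments given in rendering order.

    space_gap and line_gap are illustrative thresholds, not values
    taken from the paper.
    """
    out = []
    prev = None
    for frag in fragments:
        if prev is not None:
            if abs(frag.y - prev.y) > line_gap:
                # Baseline moved noticeably: treat as a line break.
                out.append("\n")
            elif frag.x - (prev.x + prev.width) > space_gap:
                # Horizontal gap after the previous fragment: word break.
                out.append(" ")
            # Otherwise the fragments abut and belong to the same word.
        out.append(frag.text)
        prev = frag
    return "".join(out)


fragments = [
    Fragment("Post", 10.0, 700.0, 20.0),
    Fragment("Script", 30.0, 700.0, 28.0),  # abuts "Post": same word
    Fragment("text", 62.0, 700.0, 18.0),    # gap of 4 units: new word
    Fragment("next", 10.0, 688.0, 18.0),    # baseline drops: new line
]
print(join_fragments(fragments))
```

Fixed thresholds keep the sketch short; a practical system would more plausibly scale them with the current font size, since inter-word gaps and line spacing vary between documents.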