SUMMARY OF IMAGED DOCUMENTS WITHOUT OCR

Citation
F.R. Chen and D.S. Bloomberg, SUMMARY OF IMAGED DOCUMENTS WITHOUT OCR, Computer Vision and Image Understanding, 70(3), 1998, pp. 307-320
Number of citations
16
Subject Categories
Computer Science, Software, Graphics, Programming
Journal ISSN
1077-3142
Volume
70
Issue
3
Year of publication
1998
Pages
307 - 320
Database
ISI
SICI code
1077-3142(1998)70:3<307:SOIDWO>2.0.ZU;2-Q
Abstract
A system is presented for creating a summary indicating the contents of an imaged document. The summary is composed from selected regions extracted from the imaged document. The regions may include sentences, key phrases, headings, and figures. The extracts are identified without the use of optical character recognition. The imaged document is first processed to identify the word-bounding boxes, the reading order of words, and the location of sentence and paragraph boundaries in the text. The word-bounding boxes are grouped into equivalence classes to mimic the terms in a text document. Equivalence classes representing content words are identified, and key phrases are identified from the set of content words. Summary sentences are selected using a statistically based classifier applied to a set of discrete sentence features. Evaluation of sentence selection against a set of abstracts created by a professional abstracting company is given. (C) 1998 Academic Press.
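
The selection step described in the abstract can be illustrated with a small Python sketch. It is not the authors' method: it assumes the word images have already been grouped into equivalence classes (each word is represented only by an integer class ID), and it swaps the statistically based classifier for a simple frequency heuristic. The function name `rank_sentences` and the parameters `stop_fraction` and `top_k` are hypothetical.

```python
from collections import Counter


def rank_sentences(sentences, stop_fraction=0.2, top_k=2):
    """Pick summary sentences from equivalence-class IDs alone (no OCR text).

    sentences: list of sentences, each a list of integer class IDs,
               one ID per word image, in reading order.
    Returns (indices of selected sentences in reading order, content classes).
    """
    counts = Counter(cid for sent in sentences for cid in sent)

    # Assumption: the most frequent classes behave like function words
    # ("the", "of", ...), so the top fraction is treated as a stop list.
    n_stop = int(len(counts) * stop_fraction)
    stop = {cid for cid, _ in counts.most_common(n_stop)}

    # Assumption: a content class is a non-stop class seen at least twice.
    content = {cid for cid, c in counts.items() if cid not in stop and c >= 2}

    # Score each sentence by the share of its words that are content words.
    scored = []
    for i, sent in enumerate(sentences):
        hits = sum(1 for cid in sent if cid in content)
        scored.append((hits / len(sent) if sent else 0.0, i))

    # Keep the top_k highest scores; ties go to the earlier sentence.
    best = sorted(scored, key=lambda t: (-t[0], t[1]))[:top_k]
    return sorted(i for _, i in best), content


if __name__ == "__main__":
    doc = [[1, 2, 3, 1], [1, 4, 5], [1, 2, 6, 2], [1, 7]]
    print(rank_sentences(doc))
```

A trained classifier over several discrete sentence features, as the abstract describes, would take the place of the single content-word ratio used here.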