Automatic cataloguing and searching for retrospective data by use of OCR text

Authors
Yh. Tseng
Citation
Yh. Tseng, Automatic cataloguing and searching for retrospective data by use of OCR text, J AM SOC IN, 52(5), 2001, pp. 378-390
Number of citations
26
Subject categories
Library & Information Science
Journal title
JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE AND TECHNOLOGY
ISSN journal
1532-2882
Volume
52
Issue
5
Year of publication
2001
Pages
378 - 390
Database
ISI
SICI code
1532-2882(200103)52:5<378:ACASFR>2.0.ZU;2-T
Abstract
This article describes our efforts in supporting information retrieval from OCR-degraded text. In particular, we report our approach to an automatic cataloging and searching contest for books in multiple languages. In this contest, 500 books in English, German, French, and Italian, published from the 1770s to the 1970s, were scanned into images and OCRed to digital text. The goal was to use only automatic means to extract information for sophisticated searching. We adopted the vector space retrieval model, an n-gram indexing method, and a special weighting scheme to tackle this problem. Although the performance of this approach is slightly inferior to that of the best approach, which is mainly based on regular expression matching, one advantage of our approach is that it is less language dependent and less layout sensitive, and is thus readily applicable to other languages and document collections. Problems of OCR text retrieval for some Asian languages are also discussed in this article, and solutions are suggested.
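Illustrative sketch (not part of the original record)
The abstract names a vector space retrieval model combined with n-gram indexing. The Python sketch below shows one common way such a combination can tolerate OCR errors: documents and queries are broken into overlapping character trigrams, weighted, and compared by cosine similarity. The function names, the trigram length, and the sample titles are all hypothetical, and the weighting is plain smoothed TF-IDF, used here only as a stand-in because the abstract does not specify the article's special weighting scheme.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Return overlapping character n-grams of lowercased, whitespace-normalized text."""
    s = " ".join(text.lower().split())
    return [s[i:i + n] for i in range(len(s) - n + 1)]


def build_index(docs, n=3):
    """Build one TF-IDF vector (dict of n-gram -> weight) per document.

    Assumption: smoothed TF-IDF stands in for the article's unspecified weighting.
    """
    tfs = [Counter(char_ngrams(d, n)) for d in docs]
    df = Counter(g for tf in tfs for g in tf)  # document frequency per n-gram
    num_docs = len(docs)
    idf = {g: math.log((1 + num_docs) / (1 + c)) + 1.0 for g, c in df.items()}
    vectors = [{g: f * idf[g] for g, f in tf.items()} for tf in tfs]
    return vectors, idf


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(g, 0.0) for g, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query, vectors, idf, n=3):
    """Return (score, doc_index) pairs sorted from best to worst match."""
    q_tf = Counter(char_ngrams(query, n))
    q_vec = {g: f * idf[g] for g, f in q_tf.items() if g in idf}
    scores = [(cosine(q_vec, v), i) for i, v in enumerate(vectors)]
    return sorted(scores, reverse=True)


# Hypothetical OCR output: "0" for "o" and "1" for "l" in the first title.
docs = [
    "A Treatise on the Hist0ry of Phi1osophy",
    "Elements of Chemistry and Mineralogy",
    "Voyage autour du monde",
]
vectors, idf = build_index(docs)
ranked = search("history of philosophy", vectors, idf)
best_score, best_doc = ranked[0]
print(best_doc, docs[best_doc])
```

In this sample, neither "history" nor "philosophy" appears as an intact word in the first title, so exact word matching would miss it, yet enough trigrams survive the OCR errors for that title to rank first. The same property is what makes character n-grams comparatively independent of language and page layout, since no word list or tokenizer is required.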