Automatic cataloguing and searching for retrospective data by use of OCR text

Authors

Tseng, YH

Citation

Y.-H. Tseng, Automatic cataloguing and searching for retrospective data by use of OCR text, Journal of the American Society for Information Science and Technology, 52(5), 2001, pp. 378-390

Subject categories

Library & Information Science

Journal title

JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE AND TECHNOLOGY

ISSN journal

1532-2882

Volume

52

Issue

5

Year of publication

2001

Pages

378 - 390

Database

ISI

SICI code

1532-2882(200103)52:5<378:ACASFR>2.0.ZU;2-T

Abstract

This article describes our efforts in supporting information retrieval from OCR-degraded text. In particular, we report our approach to an automatic cataloging and searching contest for books in multiple languages. In this contest, 500 books in English, German, French, and Italian, published from the 1770s to the 1970s, were scanned into images and OCRed into digital text. The goal is to use only automatic means to extract information for sophisticated searching. We adopted the vector space retrieval model, an n-gram indexing method, and a special weighting scheme to tackle this problem. Although the performance of this approach is slightly inferior to that of the best approach, which is mainly based on regular expression matching, one advantage of our approach is that it is less language dependent and less layout sensitive, and is thus readily applicable to other languages and document collections. Problems of OCR text retrieval for some Asian languages are also discussed in this article, and solutions are suggested.
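
The combination of n-gram indexing and vector space retrieval named in the abstract can be illustrated with a minimal Python sketch. It is not the author's implementation: the abstract does not specify the n-gram length or the special weighting scheme, so the character trigrams, the smoothed TF-IDF weights, and the function names `char_ngrams`, `build_index`, `cosine`, and `search` are all assumptions made for illustration.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Overlapping character n-grams of lowercased, whitespace-normalized text."""
    s = " ".join(text.lower().split())
    return [s[i:i + n] for i in range(len(s) - n + 1)]


def build_index(docs, n=3):
    """Return (idf, vectors): n-gram IDF weights and one sparse vector per document."""
    tfs = [Counter(char_ngrams(d, n)) for d in docs]
    df = Counter()
    for tf in tfs:
        df.update(tf.keys())  # each document counts once per distinct n-gram
    total = len(docs)
    # Smoothed IDF: an assumed stand-in for the paper's unspecified weighting scheme.
    idf = {g: math.log((total + 1) / (c + 1)) + 1.0 for g, c in df.items()}
    vectors = [{g: f * idf[g] for g, f in tf.items()} for tf in tfs]
    return idf, vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(g, 0.0) for g, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return dot / (na * nb)


def search(query, idf, vectors, n=3):
    """Rank documents by cosine similarity to the query; returns (score, doc_index) pairs."""
    qtf = Counter(char_ngrams(query, n))
    # N-grams never seen in the collection get weight 0 and do not affect the ranking.
    q = {g: f * idf.get(g, 0.0) for g, f in qtf.items()}
    scores = [(cosine(q, v), i) for i, v in enumerate(vectors)]
    return sorted(scores, reverse=True)


# Hypothetical OCR output: "Hiftory" imitates a long s misread as "f".
docs = [
    "A treatise on the philosophy of nature",
    "Hiftory of the Roman Empire",
    "Voyage autour du monde",
]
idf, vectors = build_index(docs)
ranking = search("history of the roman empire", idf, vectors)
```

Because a single misrecognized character only damages the few n-grams that overlap it, the query "history" still shares most of its trigrams with the OCR output "Hiftory", which is one reason n-gram indexing tolerates OCR noise without language-specific rules.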