This article describes our efforts in supporting information retrieval from
OCR-degraded text. In particular, we report our approach to an automatic
cataloging and searching contest for books in multiple languages. In this
contest, 500 books in English, German, French, and Italian published between
the 1770s and the 1970s were scanned into images and OCRed into digital text.
The goal is to use only automatic methods to extract information for
sophisticated searching. We adopted the vector space retrieval model, an
n-gram indexing method, and a special weighting scheme to tackle this
problem. Although the
performance of this approach is slightly inferior to that of the best
approach, which is mainly based on regular expression matching, one advantage
of our approach is that it is less language dependent and less layout
sensitive, and is thus
readily applicable to other languages and document collections. Problems of
OCR text retrieval for some Asian languages are also discussed in this
article, and solutions are suggested.
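
To make the combination of n-gram indexing and vector space retrieval concrete, the following is a minimal sketch rather than the system described in this article. It assumes character trigrams as the indexing unit and substitutes a standard smoothed TF-IDF weight with cosine similarity for the special weighting scheme, whose details are not given here. The function names (char_ngrams, build_index, cosine, search) and the sample strings are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Return overlapping character n-grams of a lowercased string.

    Whitespace runs are collapsed to single spaces first, so line breaks
    in OCR output do not create spurious n-grams.
    """
    s = " ".join(text.lower().split())
    return [s[i:i + n] for i in range(len(s) - n + 1)]


def build_index(docs, n=3):
    """Build one sparse TF-IDF vector (dict) per document.

    Assumption: plain smoothed TF-IDF stands in for the article's
    special weighting scheme, which is not specified in the abstract.
    """
    counts = [Counter(char_ngrams(d, n)) for d in docs]
    df = Counter(g for c in counts for g in c)  # document frequency
    idf = {g: math.log((1 + len(docs)) / (1 + k)) + 1.0 for g, k in df.items()}
    vectors = [{g: tf * idf[g] for g, tf in c.items()} for c in counts]
    return vectors, idf


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(g, 0.0) for g, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def search(query, vectors, idf, n=3):
    """Rank documents by similarity to the query, best match first.

    Query n-grams that never occur in the collection are ignored.
    Returns a list of (score, document_index) pairs.
    """
    q_counts = Counter(char_ngrams(query, n))
    q = {g: tf * idf[g] for g, tf in q_counts.items() if g in idf}
    scores = [(cosine(q, v), i) for i, v in enumerate(vectors)]
    return sorted(scores, reverse=True)


# Hypothetical OCR output; the first title contains a recognition error
# ("histoiy" for "history").
docs = [
    "The histoiy of England",
    "Voyage autour du monde",
    "Geschichte der deutschen Sprache",
]
vectors, idf = build_index(docs)
ranking = search("history of england", vectors, idf)
```

Because a single misrecognized character disturbs only the few n-grams that overlap it, the damaged title still shares most of its trigrams with the query and ranks first, whereas exact word matching would miss "histoiy". No word segmentation or language-specific rules are involved, which is what makes this style of indexing comparatively language independent.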