Multilingual machine printed OCR

Authors

Natarajan, P; Lu, ZD; Schwartz, R; Bazzi, I; Makhoul, J

Citation

P. Natarajan et al., Multilingual machine printed OCR, INT J PATT, 15(1), 2001, pp. 43-63

Subject categories

AI Robotics and Automatic Control

Journal title

INTERNATIONAL JOURNAL OF PATTERN RECOGNITION AND ARTIFICIAL INTELLIGENCE

ISSN journal

0218-0014

Volume

15

Issue

1

Year of publication

2001

Pages

43 - 63

Database

ISI

SICI code

0218-0014(200102)15:1<43:MMPO>2.0.ZU;2-V

Abstract

This paper presents a script-independent methodology for optical character recognition (OCR) based on the use of hidden Markov models (HMM). The feature extraction, training and recognition components of the system are all designed to be script independent. The training and recognition components were taken without modification from a continuous speech recognition system; the only component that is specific to OCR is the feature extraction component. To port the system to a new language, all that is needed is text image training data from the new language, along with ground truth which gives the identity of the sequences of characters along each line of each text image, without specifying the location of the characters on the image. The parameters of the character HMMs are estimated automatically from the training data, without the need for laborious handwritten rules. The system does not require presegmentation of the data, neither at the word level nor at the character level. Thus, the system is able to handle languages with connected characters in a straightforward manner. The script independence of the system is demonstrated in three languages with different types of script: Arabic, English, and Chinese. The robustness of the system is further demonstrated by testing the system on fax data. An unsupervised adaptation method is then described to improve performance under degraded conditions.
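
Illustrative sketch (not taken from the paper): the segmentation-free recognition idea in the abstract can be pictured as Viterbi decoding over a network of left-to-right character HMMs, where character boundaries fall out of the best state path instead of being fixed beforehand. Everything below is an assumption made for illustration: the function name `viterbi_line`, the toy models `TOY_MODELS`, the use of discrete symbols in place of real image features, and the fixed transition probabilities. It is a minimal toy, not the authors' system.

```python
import math

# Hypothetical toy character models: each character is a left-to-right HMM
# with two states, and each state has a discrete emission table.
TOY_MODELS = {
    "a": [{"x": 0.9, "y": 0.1}, {"y": 0.9, "x": 0.1}],
    "b": [{"z": 0.9, "w": 0.1}, {"w": 0.9, "z": 0.1}],
}

def viterbi_line(frames, models, stay=0.5):
    """Decode a frame sequence into characters without presegmentation.

    frames: list of discrete symbols, one per narrow slice of a text line.
    models: dict mapping a character to its list of per-state emission dicts.
            Assumes every model has at least two states.
    """
    states = [(c, i) for c, m in models.items() for i in range(len(m))]
    n_chars = len(models)
    floor = 1e-6  # emission probability for symbols a state never saw

    def emit(s, x):
        return math.log(models[s[0]][s[1]].get(x, floor))

    def is_last(s):
        return s[1] == len(models[s[0]]) - 1

    log_stay = math.log(stay)
    log_next = math.log(1.0 - stay)
    # Leaving a character's last state, any character may follow.
    log_enter = log_next - math.log(n_chars)

    # A line must start in the first state of some character.
    score = {
        s: emit(s, frames[0]) - math.log(n_chars) if s[1] == 0 else -math.inf
        for s in states
    }
    back = []
    for x in frames[1:]:
        new_score, ptr = {}, {}
        for s in states:
            c, i = s
            cands = [(score[s] + log_stay, s)]  # self-loop
            if i > 0:
                p = (c, i - 1)
                cands.append((score[p] + log_next, p))  # advance inside char
            else:
                # Enter this character from the last state of any character.
                cands.extend(
                    (score[p] + log_enter, p) for p in states if is_last(p)
                )
            best, prev = max(cands, key=lambda t: t[0])
            new_score[s] = best + emit(s, x)
            ptr[s] = prev
        score = new_score
        back.append(ptr)

    # A line must end in the last state of some character; then trace back.
    end = max((s for s in states if is_last(s)), key=lambda s: score[s])
    path = [end]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    path.reverse()

    # A new character begins wherever the path enters a first state.
    chars = [path[0][0]]
    for prev, cur in zip(path, path[1:]):
        if cur[1] == 0 and prev != cur:
            chars.append(cur[0])
    return "".join(chars)

print(viterbi_line(list("xxyyzzwwxy"), TOY_MODELS))
```

In this toy, the decoder is given only the frame sequence; where one character ends and the next begins is decided jointly with the character identities, which is the property the abstract relies on for connected scripts.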