Multilingual machine printed OCR

Citation
P. Natarajan et al., Multilingual machine printed OCR, INT J PATT, 15(1), 2001, pp. 43-63
Number of citations
51
Subject categories
AI Robotics and Automatic Control
Journal title
INTERNATIONAL JOURNAL OF PATTERN RECOGNITION AND ARTIFICIAL INTELLIGENCE
ISSN journal
02180014
Volume
15
Issue
1
Year of publication
2001
Pages
43 - 63
Database
ISI
SICI code
0218-0014(200102)15:1<43:MMPO>2.0.ZU;2-V
Abstract
This paper presents a script-independent methodology for optical character recognition (OCR) based on the use of hidden Markov models (HMM). The feature extraction, training and recognition components of the system are all designed to be script independent. The training and recognition components were taken without modification from a continuous speech recognition system; the only component that is specific to OCR is the feature extraction component. To port the system to a new language, all that is needed is text image training data from the new language, along with ground truth which gives the identity of the sequences of characters along each line of each text image, without specifying the location of the characters on the image. The parameters of the character HMMs are estimated automatically from the training data, without the need for laborious handwritten rules. The system does not require presegmentation of the data, neither at the word level nor at the character level. Thus, the system is able to handle languages with connected characters in a straightforward manner. The script independence of the system is demonstrated in three languages with different types of script: Arabic, English, and Chinese. The robustness of the system is further demonstrated by testing the system on fax data. An unsupervised adaptation method is then described to improve performance under degraded conditions.
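
Illustrative sketch (not from the paper)
The abstract says the system needs no word-level or character-level presegmentation and that feature extraction is the only OCR-specific component. One common way to obtain that property, assumed here purely for illustration, is to slide a narrow window along each text line and emit one feature vector per window position, so that the line becomes a frame sequence much like a speech signal. The function name `frame_features`, the window parameters, and the ink-fraction features below are hypothetical and are not the features used by the authors.

```python
from typing import List

Image = List[List[int]]  # rows of 0/1 pixels, 1 = ink (assumed representation)


def frame_features(line: Image, width: int = 2, step: int = 1,
                   bands: int = 2) -> List[List[float]]:
    """Slice a text-line image into overlapping vertical frames.

    Each frame yields one feature vector holding the fraction of ink
    pixels in each of `bands` horizontal bands. Character boundaries
    are never consulted, so no presegmentation is required.
    """
    height = len(line)
    n_cols = len(line[0]) if height else 0
    if height == 0 or n_cols < width:
        return []
    if height % bands != 0:
        raise ValueError("image height must be divisible by bands")
    band_h = height // bands

    features = []
    # Move the window left to right across the line, `step` columns at a time.
    for left in range(0, n_cols - width + 1, step):
        vec = []
        for b in range(bands):
            rows = line[b * band_h:(b + 1) * band_h]
            ink = sum(sum(row[left:left + width]) for row in rows)
            vec.append(ink / (band_h * width))
        features.append(vec)
    return features


if __name__ == "__main__":
    # Toy 2x4 "line image"; real input would be a scanned text line.
    toy_line = [[1, 0, 0, 1],
                [0, 0, 1, 1]]
    for frame in frame_features(toy_line):
        print(frame)
```

A frame sequence of this kind could then be passed to any HMM trainer or decoder, with the line-level character transcription as the only supervision, which is the arrangement the abstract describes.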