This paper presents a script-independent methodology for optical character
recognition (OCR) based on the use of hidden Markov models (HMMs). The
feature extraction, training and recognition components of the system are
all designed to be script independent. The training and recognition
components were taken without modification from a continuous speech
recognition system;
the only component that is specific to OCR is the feature extraction
component. To port the system to a new language, all that is needed is text
image
training data from the new language, along with ground truth which gives
the identity of the sequences of characters along each line of each text
image, without specifying the location of the characters on the image. The
parameters of the character HMMs are estimated automatically from the
training
data, without the need for laborious handwritten rules. The system does
not require presegmentation of the data, either at the word level or at the
character level. Thus, the system is able to handle languages with
connected characters in a straightforward manner. The script independence
of the system is demonstrated in three languages with different types of
script: Arabic, English, and Chinese. The robustness of the system is
further demonstrated by testing the system on fax data. An unsupervised
adaptation method is then described to improve performance under degraded
conditions.
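
The idea that line-level ground truth suffices, with no character locations,
can be illustrated with a toy forced alignment. The sketch below is not the
system's implementation: it assumes one scalar feature per vertical image
slice, a single Gaussian emission per state with unit variance, and no
transition probabilities. The names forced_align, log_gauss and char_means
are hypothetical. It only shows how a left-to-right chain of character
models, concatenated in transcript order, lets Viterbi decoding find the
character boundaries on its own.

```python
import math


def log_gauss(x, mean, var=1.0):
    """Log density of a 1-D Gaussian; a stand-in for a real emission model."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)


def forced_align(frames, transcript, char_means):
    """Viterbi alignment of frames to a left-to-right chain of character states.

    frames     : list of floats, one toy feature per vertical image slice
    transcript : characters known to appear on the line, in order
    char_means : dict mapping char -> list of per-state emission means
                 (hypothetical toy parameters, assumed already trained)
    Returns the character label assigned to each frame.
    """
    # Concatenate the character models into one composite line model.
    states = [(ch, mean) for ch in transcript for mean in char_means[ch]]
    n_states, n_frames = len(states), len(frames)
    if n_frames < n_states:
        raise ValueError("need at least one frame per state")

    neg_inf = float("-inf")
    score = [[neg_inf] * n_states for _ in range(n_frames)]
    back = [[0] * n_states for _ in range(n_frames)]

    # The path must start in the first state of the first character.
    score[0][0] = log_gauss(frames[0], states[0][1])

    for t in range(1, n_frames):
        for s in range(n_states):
            stay = score[t - 1][s]                          # remain in state s
            move = score[t - 1][s - 1] if s > 0 else neg_inf  # advance from s-1
            best, prev = (stay, s) if stay >= move else (move, s - 1)
            if best > neg_inf:
                score[t][s] = best + log_gauss(frames[t], states[s][1])
                back[t][s] = prev

    # The path must end in the last state; trace back to recover boundaries.
    path = [n_states - 1]
    for t in range(n_frames - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return [states[s][0] for s in path]
```

For example, with char_means = {"a": [0.0], "b": [5.0], "c": [10.0]}, the
transcript "abc" and the frames [0.1, -0.2, 0.0, 5.1, 4.9, 9.8, 10.2, 10.0],
the function labels the first three frames "a", the next two "b" and the last
three "c", although no boundary positions were supplied.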