Recognition of printed Arabic text based on global features and decision tree learning techniques

Authors
A. Amin
Citation
A. Amin, Recognition of printed Arabic text based on global features and decision tree learning techniques, PATT RECOG, 33(8), 2000, pp. 1309-1323
Number of citations
44
Subject Categories
AI Robotics and Automatic Control
Journal title
PATTERN RECOGNITION
Journal ISSN
0031-3203
Volume
33
Issue
8
Year of publication
2000
Pages
1309 - 1323
Database
ISI
SICI code
0031-3203(200008)33:8<1309:ROPATB>2.0.ZU;2-D
Abstract
Machine simulation of human reading has been the subject of intensive research for almost three decades. A large number of research papers and reports have already been published on Latin, Chinese and Japanese characters. However, little work has been conducted on the automatic recognition of Arabic characters, either on-line or off-line. This is a result of the lack of adequate support in terms of funding and of other utilities such as Arabic text databases and dictionaries, and of course because of the cursive nature of Arabic writing rules; the problem is still an open research field. This paper presents a new technique for the recognition of Arabic text using the C4.5 machine learning system. The advantages of machine learning are twofold: it can generalize over the large degree of variation between different fonts and writing styles, and recognition rules can be constructed from examples. The technique can be divided into three major steps. The first step is digitization and pre-processing to create connected components, detect the skew of the document image and correct it. The second step is feature extraction, where global features of the input Arabic word are extracted, such as the number of subwords, the number of peaks within each subword, and the number and position of the complementary characters, to avoid the difficulty of a segmentation stage. Finally, the C4.5 machine learning system is used to generate a decision tree for classifying each word. The system was tested with 1000 Arabic words in different fonts (each word has 15 samples), and the average correct recognition rate obtained using cross-validation was 92%. © 2000 Pattern Recognition Society. Published by Elsevier Science Ltd. All rights reserved.
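
The final step described in the abstract, learning a decision tree over global word features, can be illustrated with a minimal sketch. This is not the paper's implementation: the feature names (`subwords`, `peaks`, `dots`), the feature values, and the word labels are hypothetical placeholders, and the code covers only splitting on discrete attributes by gain ratio, the criterion associated with C4.5. Continuous attributes, missing values, and pruning are omitted.

```python
import math
from collections import Counter, defaultdict

def entropy(values):
    """Shannon entropy (in bits) of a list of discrete values."""
    total = len(values)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(values).values())

def gain_ratio(rows, labels, attr):
    """Information gain of splitting on attr, divided by the split information."""
    parts = defaultdict(list)
    for row, label in zip(rows, labels):
        parts[row[attr]].append(label)
    total = len(labels)
    remainder = sum(len(p) / total * entropy(p) for p in parts.values())
    gain = entropy(labels) - remainder
    split_info = entropy([row[attr] for row in rows])
    return gain / split_info if split_info > 0 else 0.0

def build_tree(rows, labels, attrs):
    """Return a leaf label, or a node (attr, {value: subtree})."""
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or not attrs:
        return majority
    best = max(attrs, key=lambda a: gain_ratio(rows, labels, a))
    if gain_ratio(rows, labels, best) == 0.0:
        return majority  # no attribute separates the classes any further
    groups = defaultdict(lambda: ([], []))
    for row, label in zip(rows, labels):
        groups[row[best]][0].append(row)
        groups[row[best]][1].append(label)
    rest = [a for a in attrs if a != best]
    return (best, {v: build_tree(r, l, rest) for v, (r, l) in groups.items()})

def classify(tree, row, default=None):
    """Walk the tree; return default if a feature value was never seen."""
    while isinstance(tree, tuple):
        attr, branches = tree
        if row[attr] not in branches:
            return default
        tree = branches[row[attr]]
    return tree

# Hypothetical training samples: global features of a word image -> word label.
TRAIN = [
    ({"subwords": 1, "peaks": 3, "dots": 1}, "word_a"),
    ({"subwords": 1, "peaks": 3, "dots": 2}, "word_b"),
    ({"subwords": 2, "peaks": 2, "dots": 1}, "word_c"),
    ({"subwords": 2, "peaks": 2, "dots": 2}, "word_c"),
    ({"subwords": 2, "peaks": 4, "dots": 1}, "word_d"),
]
ROWS = [features for features, _ in TRAIN]
LABELS = [label for _, label in TRAIN]
TREE = build_tree(ROWS, LABELS, ["subwords", "peaks", "dots"])
```

Calling `classify(TREE, features)` on a feature dictionary returns the predicted word label, or `None` when a feature value did not occur in training. Because each word is classified as a whole from its global features, no character segmentation is needed, which is the motivation given in the abstract.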