Machine simulation of human reading has been the subject of intensive
research for almost three decades. A large number of research papers and
reports
have already been published on Latin, Chinese and Japanese characters.
However, little work, whether on-line or off-line, has been conducted on
the automatic recognition of Arabic characters. This is a result of the
lack of adequate support in terms of funding and other utilities such as
Arabic text databases,
dictionaries, etc., and, of course, of the cursive nature of Arabic writing
rules; the problem remains an open research field. This paper presents a
new technique for the recognition of Arabic text using the C4.5 machine
learning system. The advantages of machine learning are twofold: it can
generalize over the large degree of variation between different fonts and
writing styles, and recognition rules can be constructed from examples. The
technique can be divided into three major steps. The first step is
digitization and pre-processing to create connected components, detect the
skew of a
document image, and correct it. The second step is feature extraction,
where global features of the input Arabic word, such as the number of
subwords, the number of peaks within each subword, and the number and
position of the complementary characters, are extracted, which avoids the
difficulty of a segmentation stage. Finally, the C4.5 machine learning
system is used to generate a decision tree for classifying each word. The
system was tested with 1000 Arabic words in different fonts (each word has
15 samples), and the average correct recognition
rate obtained using cross-validation was 92%. © 2000 Pattern Recognition
Society. Published by Elsevier Science Ltd. All rights reserved.
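
As an illustration of the final step, the following minimal sketch shows
the gain ratio criterion that C4.5 uses to choose which attribute to split
on when building a decision tree. It is not the authors' implementation:
the feature names (subwords, peaks), the toy samples, and the class labels
are invented for illustration, and continuous attributes, pruning, and
missing values are left out.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a list of discrete values."""
    total = len(values)
    return -sum((n / total) * log2(n / total) for n in Counter(values).values())


def gain_ratio(rows, labels, feature):
    """Gain ratio of splitting on one discrete feature (C4.5 criterion)."""
    total = len(rows)
    # Group the class labels by the value the feature takes in each row.
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[feature], []).append(label)
    # Information gain: entropy before the split minus weighted entropy after.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    gain = entropy(labels) - remainder
    # Split information penalizes features with many distinct values.
    split_info = entropy([row[feature] for row in rows])
    return gain / split_info if split_info > 0 else 0.0


# Hypothetical toy data: each row is one word sample described by two
# invented global features; the labels are placeholder word classes.
rows = [
    {"subwords": 1, "peaks": 2},
    {"subwords": 1, "peaks": 3},
    {"subwords": 2, "peaks": 2},
    {"subwords": 2, "peaks": 3},
]
labels = ["word_a", "word_a", "word_b", "word_b"]

# The feature with the highest gain ratio becomes the root test of the tree.
best = max(["subwords", "peaks"], key=lambda f: gain_ratio(rows, labels, f))
print(best)
```

In this toy set the number of subwords separates the two classes perfectly
while the number of peaks carries no information, so the former would be
chosen for the first split. A full tree is obtained by applying the same
choice recursively to each resulting subset.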