Gw. Leng et al., A DIFFERENTIAL-PROCESSING EXTRACTION APPROACH TO TEXT AND IMAGE SEGMENTATION, Engineering applications of artificial intelligence, 7(6), 1994, pp. 639-651
To efficiently store the information found in paper documents, text
and non-text regions need to be separated. Non-text regions include
half-tone photographs and line diagrams. The text regions can be
converted (via an optical character reader) to a computer-searchable
form, and the non-text regions can be extracted and preserved in
compressed form using image-compression algorithms. In this paper, an
effective system for automatically segmenting a document image into
regions of text and non-text is proposed. The system first performs
adaptive thresholding to obtain a binarized image. Subsequently, the
binarized image is smeared using a run-length differential algorithm.
The smeared image is then subjected to a text characteristic filter to
remove erroneous smearing of non-text regions. Next, baseline
cumulative blocking is used to rectangularize the smeared region.
Finally, a text block growing algorithm is used to block out a text
sentence. The recognition of text is carried out on a text sentence
basis.
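
The first two stages of the pipeline (binarization, then horizontal
smearing) can be illustrated with a minimal sketch. The abstract does
not specify either algorithm, so everything below is an assumption:
a local-mean threshold stands in for the paper's adaptive
thresholding, and a classic run-length smoothing pass stands in for
the paper's run-length differential algorithm. The function names and
the parameters radius, offset, and max_gap are hypothetical.

```python
# Illustrative stand-ins only; NOT the method from the paper.

def adaptive_threshold(gray, radius=1, offset=10):
    """Mark a pixel as ink (1) when it is darker than the mean of its
    (2*radius+1)-square neighbourhood by more than `offset`."""
    h, w = len(gray), len(gray[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            ys = range(max(0, y - radius), min(h, y + radius + 1))
            xs = range(max(0, x - radius), min(w, x + radius + 1))
            window = [gray[j][i] for j in ys for i in xs]
            local_mean = sum(window) / len(window)
            out[y][x] = 1 if gray[y][x] < local_mean - offset else 0
    return out


def smear_row(row, max_gap):
    """Fill background runs of length <= max_gap that lie between two
    ink pixels, so nearby characters merge into one blob."""
    out = list(row)
    last_ink = None  # index of the most recent ink pixel seen
    for x, value in enumerate(row):
        if value == 1:
            if last_ink is not None and 0 < x - last_ink - 1 <= max_gap:
                for i in range(last_ink + 1, x):
                    out[i] = 1
            last_ink = x
    return out


def smear_image(binary, max_gap):
    """Apply the horizontal smear to every row of a binarized image."""
    return [smear_row(row, max_gap) for row in binary]
```

For example, smear_row([1, 0, 0, 1, 0, 0, 0, 0, 1, 0], 2) bridges the
two-pixel gap but leaves the four-pixel gap open, which is the
behaviour that lets characters in a line merge while wider gutters
keep neighbouring regions apart. In practice max_gap would be tuned
to the scan resolution and font size.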