STRUCTURE RECOGNITION AND INFORMATION EXTRACTION FROM TABULAR DOCUMENTS

Citation
S. Chandran et al., STRUCTURE RECOGNITION AND INFORMATION EXTRACTION FROM TABULAR DOCUMENTS, International Journal of Imaging Systems and Technology, 7(4), 1996, pp. 289-303
Number of citations
11
Subject Categories
Optics; Engineering, Electrical & Electronic
ISSN journal
08999457
Volume
7
Issue
4
Year of publication
1996
Pages
289 - 303
Database
ISI
SICI code
0899-9457(1996)7:4<289:SRAIEF>2.0.ZU;2-N
Abstract
We present a system for the extraction of the structural information of a table from its image. Following the initial binarization and deskewing operations, the image is scanned to extract all horizontal and vertical lines that may be present. The table's dimensions are estimated based on these lines. Unlike other systems, the procedure described here does not depend on the sole existence of lines to mark the item blocks. White streams are recognized in both the horizontal and vertical directions as substitutes for any missing demarcation lines. A structure interpretation procedure uses the extracted demarcation information to identify each of the item blocks in the table. Subsequently, the interrelations of these item blocks are used to recognize the structure of the tabulated data. The interpretation can be done for one-dimensional as well as two-dimensional tables. Interpretation of the tabular document involves character recognition, which in turn depends on the structure of the table. The above procedure to extract the structural information of the tabular document can be used to extract useful information from different types of tabular drawings. In this article, we focus our attention on interpreting telephone company central office drawings. These drawings contain additional information in the form of crossed-out entries and repeated entries, which must be detected and recognized to interpret the document completely. Hence, after extracting the basic structure of the drawing, the additional information is extracted and cell block location is obtained in order to develop a database representing the tabular document. The telephone company drawings are very large in size, resulting in images as large as 15,000 x 10,000 pixels. Thus, designing efficient and fast algorithms is an important criterion in this research. (C) 1996 John Wiley & Sons, Inc.
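
The abstract does not say how white streams are detected. As a rough illustration only, the sketch below assumes a binarized, deskewed image stored as rows of 0/1 values (1 = ink) and treats any sufficiently long run of completely empty rows or columns as a white stream. The projection-profile approach, the names white_runs and white_streams, and the min_len threshold are all assumptions made for this example, not the authors' method.

```python
from typing import List, Tuple

# Assumed representation: a binarized image as rows of pixels,
# where 1 = ink (black) and 0 = background (white).
Image = List[List[int]]


def white_runs(profile: List[int], min_len: int) -> List[Tuple[int, int]]:
    """Return (start, end) pairs, end exclusive, for each run of zero
    entries in the profile that is at least min_len long."""
    runs = []
    start = None
    for i, count in enumerate(profile):
        if count == 0:
            if start is None:
                start = i  # a new empty run begins here
        else:
            if start is not None and i - start >= min_len:
                runs.append((start, i))
            start = None
    # Close a run that extends to the end of the profile.
    if start is not None and len(profile) - start >= min_len:
        runs.append((start, len(profile)))
    return runs


def white_streams(image: Image, min_len: int = 2):
    """Find candidate horizontal and vertical white streams.

    A horizontal stream is a band of rows containing no ink; a vertical
    stream is a band of columns containing no ink. min_len is a
    hypothetical threshold that filters out ordinary gaps between
    characters or text lines.
    """
    row_profile = [sum(row) for row in image]        # ink count per row
    col_profile = [sum(col) for col in zip(*image)]  # ink count per column
    return white_runs(row_profile, min_len), white_runs(col_profile, min_len)


# Toy 5 x 6 image: four ink blocks separated by one horizontal and
# one vertical white band, with no ruling lines drawn.
image = [
    [1, 1, 0, 0, 1, 1],
    [1, 1, 0, 0, 1, 1],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 1, 1],
]
horizontal, vertical = white_streams(image)
print(horizontal, vertical)
```

A global profile like this only finds streams that span the whole image. On drawings of the size mentioned in the abstract, a practical system would more likely work within regions bounded by the detected lines and would use a run-length or array-based representation rather than nested Python lists.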