Duplicate document detection by template matching

Authors
Citation
Rs. Caprari, Duplicate document detection by template matching, IMAGE VIS C, 18(8), 2000, pp. 633-643
Number of citations
13
Subject Categories
AI Robotics and Automatic Control
Journal title
IMAGE AND VISION COMPUTING
Journal ISSN
0262-8856
Volume
18
Issue
8
Year of publication
2000
Pages
633 - 643
Database
ISI
SICI code
0262-8856(20000515)18:8<633:DDDBTM>2.0.ZU;2-R
Abstract
We discuss some operational issues pertaining to the detection of duplicates in the databases of bitmapped binary document images, and reason that efficient and effective duplicate document detection probably needs a combination of an efficient primary detector and an effective subordinate detector to be achieved. An algorithm that executes binary pattern template matching by cross-correlation is proposed as a duplicate document detection methodology. The template matching operation is amenable to pixel-parallel computation on serial architecture computers by bitwise integer operations. A description of the algorithm is accompanied by a discussion of issues that arise in its practical implementation. Duplicate detection by template matching is especially well suited to facsimile (i.e. fax) databases, in particular for detecting the single feed-multiple transmissions that often dominate the occurrence of duplicates in fax databases. Detailed experimental results presented for fax documents demonstrate that template matching is suitable as both a primary detector when conducted with small template and search area sizes, and a subordinate detector when conducted with moderate template and search area sizes. (C) 2000 Elsevier Science B.V. All rights reserved.
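
The abstract's idea of pixel-parallel template matching by bitwise integer operations can be illustrated with a minimal sketch. This is not the paper's algorithm: it assumes each bitmap row is packed into one integer and scores a candidate offset by counting mismatched pixels with XOR and a bit count, which stands in for the binary cross-correlation score. The function names (pack_rows, mismatch, best_match) and the exhaustive search over all offsets are hypothetical choices made for illustration.

```python
def pack_rows(bitmap):
    """Pack each row of a 0/1 bitmap into one integer (leftmost pixel = most significant bit)."""
    return [int("".join(str(p) for p in row), 2) for row in bitmap]


def mismatch(template_rows, image_rows, tmpl_width, img_width, dx, dy):
    """Count pixels that differ between the template and the image window at offset (dx, dy)."""
    mask = (1 << tmpl_width) - 1
    # Shift so that the window's rightmost column lands on bit 0.
    shift = img_width - tmpl_width - dx
    total = 0
    for i, t in enumerate(template_rows):
        window = (image_rows[dy + i] >> shift) & mask
        # One XOR compares a whole row of pixels at once; the bit count sums the mismatches.
        total += bin(window ^ t).count("1")
    return total


def best_match(template, image):
    """Return (mismatches, dx, dy) for the best-matching offset of template inside image."""
    th, tw = len(template), len(template[0])
    ih, iw = len(image), len(image[0])
    t_rows = pack_rows(template)
    i_rows = pack_rows(image)
    best = None
    for dy in range(ih - th + 1):
        for dx in range(iw - tw + 1):
            m = mismatch(t_rows, i_rows, tw, iw, dx, dy)
            if best is None or m < best[0]:
                best = (m, dx, dy)
    return best


# Illustrative data only: a 2x3 template embedded in a 4x5 image.
template = [[1, 0, 1],
            [0, 1, 1]]
image = [[0, 0, 0, 0, 0],
         [0, 0, 1, 0, 1],
         [0, 0, 0, 1, 1],
         [0, 0, 0, 0, 0]]
result = best_match(template, image)
```

In a duplicate detector along these lines, a low mismatch count at the best offset would suggest that the two documents share the template region; the threshold for declaring a duplicate, and the template and search area sizes, would have to be chosen experimentally.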