Probe, count, and classify: Categorizing hidden-web databases

Citation
P.G. Ipeirotis et al., Probe, count, and classify: Categorizing hidden-web databases, SIGMOD RECORD, 30(2), 2001, pp. 67-78
Number of citations
31
Subject categories
Computer Science & Engineering
Journal title
SIGMOD RECORD
Journal ISSN
0163-5808
Volume
30
Issue
2
Year of publication
2001
Pages
67 - 78
Database
ISI
SICI code
0163-5808(200106)30:2<67:PCACCH>2.0.ZU;2-U
Abstract
The contents of many valuable web-accessible databases are only accessible through search interfaces and are hence invisible to traditional web "crawlers." Recent studies have estimated the size of this "hidden web" to be 500 billion pages, while the size of the "crawlable" web is only an estimated two billion pages. Recently, commercial web sites have started to manually organize web-accessible databases into Yahoo!-like hierarchical classification schemes. In this paper, we introduce a method for automating this classification process by using a small number of query probes. To classify a database, our algorithm does not retrieve or inspect any documents or pages from the database, but rather just exploits the number of matches that each query probe generates at the database in question. We have conducted an extensive experimental evaluation of our technique over collections of real documents, including over one hundred web-accessible databases. Our experiments show that our system has low overhead and achieves high classification accuracy across a variety of databases.
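
The idea in the abstract, classifying a database from match counts alone, can be illustrated with a minimal Python sketch. This is not the authors' algorithm: the abstract does not say how probes are chosen or how counts are turned into a decision, so the scoring rule here (each category's share of the total match count) is an assumption made for illustration. The function name `classify_by_probe_counts`, the probe lists, and the match counts are all hypothetical.

```python
from typing import Callable, Dict, List


def classify_by_probe_counts(
    count_matches: Callable[[str], int],
    probes_by_category: Dict[str, List[str]],
) -> Dict[str, float]:
    """Return each category's share of the total match count over all probes.

    count_matches stands in for a database's search interface: it takes a
    query string and returns only the reported number of matches. No
    documents are retrieved or inspected.
    """
    totals = {
        category: sum(count_matches(query) for query in queries)
        for category, queries in probes_by_category.items()
    }
    grand_total = sum(totals.values())
    if grand_total == 0:
        # No probe matched anything, so there is no evidence for any category.
        return {category: 0.0 for category in totals}
    return {category: total / grand_total for category, total in totals.items()}


# Hypothetical probes and made-up match counts, for illustration only.
PROBES = {
    "Health": ["cancer", "diabetes"],
    "Sports": ["soccer", "baseball"],
}
FAKE_COUNTS = {"cancer": 780, "diabetes": 220, "soccer": 15, "baseball": 35}

shares = classify_by_probe_counts(lambda q: FAKE_COUNTS.get(q, 0), PROBES)
best = max(shares, key=shares.get)
print(best, shares)
```

With these made-up counts, the Health probes account for most of the matches, so the sketch assigns the database to that category. A real system would also need a rule for databases that span several categories or match none of them well.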