Probe, count, and classify: Categorizing hidden-web databases

Authors

Ipeirotis, PG; Gravano, L; Sahami, M

Citation

P.G. Ipeirotis et al., Probe, count, and classify: Categorizing hidden-web databases, SIGMOD RECORD, 30(2), 2001, pp. 67-78

Citations number

Subject Categories

Computer Science & Engineering

Journal title

SIGMOD RECORD

ISSN journal

0163-5808

Volume

30

Issue

2

Year of publication

2001

Pages

67 - 78

Database

ISI

SICI code

0163-5808(200106)30:2<67:PCACCH>2.0.ZU;2-U

Abstract

The contents of many valuable web-accessible databases are only accessible through search interfaces and are hence invisible to traditional web "crawlers." Recent studies have estimated the size of this "hidden web" to be 500 billion pages, while the size of the "crawlable" web is only an estimated two billion pages. Recently, commercial web sites have started to manually organize web-accessible databases into Yahoo!-like hierarchical classification schemes. In this paper, we introduce a method for automating this classification process by using a small number of query probes. To classify a database, our algorithm does not retrieve or inspect any documents or pages from the database, but rather just exploits the number of matches that each query probe generates at the database in question. We have conducted an extensive experimental evaluation of our technique over collections of real documents, including over one hundred web-accessible databases. Our experiments show that our system has low overhead and achieves high classification accuracy across a variety of databases.
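
The abstract's core idea, classifying a database from match counts alone, can be illustrated with a small Python sketch. This is not the authors' algorithm, whose details the abstract does not give; it is a simplified illustration that assumes each category has a fixed list of probe queries and picks the category with the largest total match count. All names (`classify_by_probe_counts`, `make_toy_database`, the probe lists) are hypothetical.

```python
from typing import Callable, Dict, List


def classify_by_probe_counts(
    count_matches: Callable[[str], int],
    probes_by_category: Dict[str, List[str]],
) -> str:
    """Pick the category whose probes yield the most matches in total.

    Only match counts are used; no documents are retrieved or inspected.
    Simplifying assumption: a plain sum per category, highest sum wins.
    """
    totals = {
        category: sum(count_matches(query) for query in probes)
        for category, probes in probes_by_category.items()
    }
    return max(totals, key=totals.get)


def make_toy_database(documents: List[str]) -> Callable[[str], int]:
    """Hypothetical stand-in for a search interface that reports match counts."""

    def count_matches(query: str) -> int:
        terms = query.lower().split()
        # A document matches when it contains every query term.
        return sum(
            all(term in doc.lower().split() for term in terms)
            for doc in documents
        )

    return count_matches


# Toy data, invented for illustration only.
toy_db = make_toy_database([
    "cancer treatment trial",
    "diabetes insulin dosage",
    "basketball cancer charity game",
])
toy_probes = {
    "Health": ["cancer", "diabetes"],
    "Sports": ["basketball", "soccer"],
}
print(classify_by_probe_counts(toy_db, toy_probes))
```

The toy database scans its own documents internally, but the classifier only ever sees the integer returned by `count_matches`, which mirrors the constraint described in the abstract.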