Finding replicated web collections

Citation
Jh. Cho et al., Finding replicated web collections, SIGMOD RECORD, 29(2), 2000, pp. 355-366
Number of citations
12
Subject categories
Computer Science & Engineering
Journal title
SIGMOD RECORD
Journal ISSN
0163-5808
Volume
29
Issue
2
Year of publication
2000
Pages
355 - 366
Database
ISI
SICI code
0163-5808(200006)29:2<355:FRWC>2.0.ZU;2-E
Abstract
Many web documents (such as JAVA FAQs) are being replicated on the Internet. Often entire document collections (such as hyperlinked Linux manuals) are being replicated many times. In this paper, we make the case for identifying replicated documents and collections to improve web crawlers, archivers, and ranking functions used in search engines. The paper describes how to efficiently identify replicated documents and hyperlinked document collections. The challenge is to identify these replicas from an input data set of several tens of millions of web pages and several hundreds of gigabytes of textual data. We also present two real-life case studies where we used replication information to improve a crawler and a search engine. We report these results for a data set of 25 million web pages (about 150 gigabytes of HTML data) crawled from the web.