Many web documents (such as JAVA FAQs) are being replicated on the Internet. Often entire document collections (such as hyperlinked Linux manuals) are being replicated many times. In this paper, we make the case for identifying replicated documents and collections to improve web crawlers, archivers, and ranking functions used in search engines. The paper describes how to efficiently identify replicated documents and hyperlinked document collections. The challenge is to identify these replicas from an input data set of several tens of millions of web pages and several hundreds of gigabytes of textual data. We also present two real-life case studies where we used replication information to improve a crawler and a search engine. We report these results for a data set of 25 million web pages (about 150 gigabytes of HTML data) crawled from the web.
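To make the identification task concrete, the following is a minimal sketch of exact-replica detection by content fingerprinting; it is an illustrative simplification, not the algorithm developed in the paper, and the URLs, function names, and normalization rule are hypothetical. Pages whose normalized text hashes to the same digest are grouped in a single pass, which scales to large crawls because only fixed-size fingerprints are kept in memory.

```python
import hashlib
from collections import defaultdict

def fingerprint(text: str) -> str:
    """Return a fixed-size digest of a page's normalized text content."""
    normalized = " ".join(text.split()).lower()  # collapse whitespace, ignore case
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()

def group_replicas(pages: dict[str, str]) -> list[list[str]]:
    """Group URLs whose content fingerprints match exactly.

    `pages` maps URL -> page text. Clusters with two or more members are
    candidate replicated documents.
    """
    clusters = defaultdict(list)
    for url, text in pages.items():
        clusters[fingerprint(text)].append(url)
    return [urls for urls in clusters.values() if len(urls) > 1]

if __name__ == "__main__":
    # Hypothetical mirror scenario: two copies of the same FAQ, one unrelated page.
    pages = {
        "http://site-a.example/faq.html": "Q: What is Java?  A: A programming language.",
        "http://mirror-b.example/faq.html": "Q: What is Java? A: A programming language.",
        "http://site-c.example/other.html": "Completely different content.",
    }
    print(group_replicas(pages))
    # [['http://site-a.example/faq.html', 'http://mirror-b.example/faq.html']]
```

Detecting replicated collections (rather than single documents) additionally requires reasoning over hyperlink structure, which this sketch does not attempt.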