Many web documents (such as JAVA FAQs) are being replicated on the Internet. Often entire document collections (such as hyperlinked Linux manuals) are being replicated many times. In this paper, we make the case for identifying replicated documents and collections to improve web crawlers, archivers, and ranking functions used in search engines. The paper describes how to efficiently identify replicated documents and hyperlinked document collections. The challenge is to identify these replicas from an input data set of several tens of millions of web pages and several hundreds of gigabytes of textual data. We also present two real-life case studies where we used replication information to improve a crawler and a search engine. We report these results for a data set of 25 million web pages (about 150 gigabytes of HTML data) crawled from the web.
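To make the identification task concrete, the following is a minimal sketch of exact-replica detection by content fingerprinting; it is an illustrative simplification, not the algorithm developed in the paper, and the URLs, function names, and normalization rule are hypothetical. Pages whose normalized text hashes to the same digest are grouped in a single pass, which scales to large crawls because only fixed-size fingerprints are kept in memory.

```python
import hashlib
from collections import defaultdict

def fingerprint(text: str) -> str:
    """Return a fixed-size digest of a page's normalized text content."""
    normalized = " ".join(text.split()).lower()  # collapse whitespace, ignore case
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()

def group_replicas(pages: dict[str, str]) -> list[list[str]]:
    """Group URLs whose content fingerprints match exactly.

    `pages` maps URL -> page text. Clusters with two or more members are
    candidate replicated documents.
    """
    clusters = defaultdict(list)
    for url, text in pages.items():
        clusters[fingerprint(text)].append(url)
    return [urls for urls in clusters.values() if len(urls) > 1]

if __name__ == "__main__":
    # Hypothetical mirror scenario: two copies of the same FAQ, one unrelated page.
    pages = {
        "http://site-a.example/faq.html": "Q: What is Java?  A: A programming language.",
        "http://mirror-b.example/faq.html": "Q: What is Java? A: A programming language.",
        "http://site-c.example/other.html": "Completely different content.",
    }
    print(group_replicas(pages))
    # [['http://site-a.example/faq.html', 'http://mirror-b.example/faq.html']]
```

Detecting replicated collections (rather than single documents) additionally requires reasoning over hyperlink structure, which this sketch does not attempt.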