Mirror, mirror on the Web: a study of host pairs with replicated content

Citation
K. Bharat and A. Broder, Mirror, mirror on the Web: a study of host pairs with replicated content, COMPUT NET, 31(11-16), 1999, pp. 1579-1590
Number of citations
16
Subject Categories
Information Technology & Communication Systems
Journal title
COMPUTER NETWORKS-THE INTERNATIONAL JOURNAL OF COMPUTER AND TELECOMMUNICATIONS NETWORKING
ISSN journal
1389-1286
Volume
31
Issue
11-16
Year of publication
1999
Pages
1579 - 1590
Database
ISI
SICI code
1389-1286(19990517)31:11-16<1579:MMOTWA>2.0.ZU;2-A
Abstract
Two previous studies, one done at Stanford in 1997 based on data collected by the Google search engine, and one done at Digital in 1996 based on AltaVista data, revealed that almost a third of the Web consists of duplicate pages. Both studies identified mirroring, that is, the systematic replication of content over a pair of hosts, as the principal cause of duplication, but did not further investigate this phenomenon. The main aim of this paper is to present a clearer picture of mirroring on the Web. As input we used a set of 179 million URLs found during a Web crawl done in the summer of 1998. We looked at all hosts with more than 100 URLs in our input (about 238,000), and discovered that about 10% were mirrored to varying degrees. The paper presents data about the prevalence of mirroring based on a mirroring classification scheme that we define. There are numerous reasons for mirroring: technical (e.g., to improve access time), commercial (e.g., different intermediaries offering the same products), cultural (e.g., same content in two languages), social (e.g., sharing of research data), and so forth. Although we have not done an exhaustive study of the causes of replication, we discuss and provide examples for several representative cases. Our technique for detecting mirrored hosts from large sets of collected URLs depends mostly on the syntactic analysis of URL strings, and requires retrieval and content analysis only for a small number of pages. We are able to detect both partial and total mirroring, and handle cases where the content is not byte-wise identical. Furthermore, our technique is computationally very efficient and does not assume that the initial set of URLs gathered from each host is comprehensive. Hence, this approach has practical uses beyond our study, and can be applied in other settings. For instance, for Web crawlers and caching proxies, detecting mirrors can be valuable to avoid redundant fetching, and knowledge of mirroring can be used to compensate for broken links. (C) 1999 Published by Elsevier Science B.V. All rights reserved.
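The abstract says mirror detection relies mostly on syntactic analysis of URL strings but gives no algorithmic detail. The following is only an illustrative sketch of that general idea, not the authors' method: it groups URLs by host and scores host pairs by the Jaccard overlap of their path sets. The function names `paths_by_host` and `candidate_mirrors` and the `min_overlap` threshold are hypothetical, chosen for this example.

```python
from collections import defaultdict
from itertools import combinations
from urllib.parse import urlsplit


def paths_by_host(urls):
    """Group the URL paths seen in the input by host name."""
    hosts = defaultdict(set)
    for url in urls:
        parts = urlsplit(url)
        if parts.hostname:
            hosts[parts.hostname].add(parts.path or "/")
    return hosts


def candidate_mirrors(urls, min_overlap=0.5):
    """Return host pairs whose path sets overlap strongly.

    The score is the Jaccard similarity of the two path sets.
    The threshold is an assumption made for this sketch only.
    """
    hosts = paths_by_host(urls)
    pairs = []
    for a, b in combinations(sorted(hosts), 2):
        shared = hosts[a] & hosts[b]
        union = hosts[a] | hosts[b]
        score = len(shared) / len(union)
        if score >= min_overlap:
            pairs.append((a, b, score))
    # Highest-scoring candidates first.
    return sorted(pairs, key=lambda p: -p[2])


urls = [
    "http://a.example/x/1.html",
    "http://a.example/x/2.html",
    "http://b.example/x/1.html",
    "http://b.example/x/2.html",
    "http://c.example/y.html",
]
result = candidate_mirrors(urls)
```

Comparing every host pair is quadratic in the number of hosts, so a sketch like this would not scale to the roughly 238,000 hosts mentioned in the abstract without some further way of narrowing the candidate pairs. Pairs flagged this way are only candidates; as the abstract notes, confirming them still requires retrieving and comparing a small number of pages.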