Two previous studies, one done at Stanford in 1997 based on data collected
by the Google search engine, and one done at Digital in 1996 based on
AltaVista data, revealed that almost a third of the Web consists of
duplicate pages. Both studies identified mirroring, that is, the systematic
replication of content over a pair of hosts, as the principal cause of
duplication, but did not further investigate this phenomenon. The main aim
of this paper is to present a clearer picture of mirroring on the Web. As
input we used a
set of 179 million URLs found during a Web crawl done in the summer of
1998. We looked at all hosts with more than 100 URLs in our input (about
238,000), and discovered that about 10% were mirrored to varying degrees.
The paper presents data about the prevalence of mirroring based on a
mirroring classification scheme that we define. There are numerous reasons
for mirroring: technical (e.g., to improve access time), commercial (e.g.,
different intermediaries offering the same products), cultural (e.g., same
content in two languages), social (e.g., sharing of research data), and so
forth. Although we have not done an exhaustive study of the causes of
replication, we discuss and provide examples for several representative
cases. Our technique for detecting mirrored hosts from large sets of
collected URLs depends mostly on the syntactic analysis of URL strings, and
requires retrieval and content analysis only for a small number of pages.
We are able to detect both partial and total mirroring, and handle cases
where the content is not byte-wise identical. Furthermore, our technique is
computationally very efficient and does not assume that the initial set of
URLs gathered from each host is comprehensive. Hence, this approach has
practical uses beyond our study,
and can be applied in other settings. For instance, for Web crawlers and
caching proxies, detecting mirrors can be valuable to avoid redundant
fetching, and knowledge of mirroring can be used to compensate for broken
links.
(C) 1999 Published by Elsevier Science B.V. All rights reserved.
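
To make the idea of detecting mirrors from URL strings alone more concrete,
the following is a minimal sketch and not the technique used in the paper,
whose details are not given in this abstract. It assumes, purely for
illustration, that two hosts are candidate mirrors when the sets of URL paths
seen on them overlap strongly (Jaccard similarity). The function names, the
min_paths and threshold parameters, and the example host names are all
hypothetical. Comparing every pair of hosts, as done here, would not scale to
hundreds of thousands of hosts.

```python
from collections import defaultdict
from itertools import combinations
from urllib.parse import urlsplit


def paths_by_host(urls):
    """Group the URL paths seen in the input by (lowercased) host name."""
    hosts = defaultdict(set)
    for url in urls:
        parts = urlsplit(url)
        if parts.hostname:
            hosts[parts.hostname].add(parts.path or "/")
    return hosts


def candidate_mirrors(urls, min_paths=2, threshold=0.5):
    """Return (host_a, host_b, score) for host pairs with similar path sets.

    score is the Jaccard similarity of the two path sets. This is an
    illustrative heuristic only; no page content is fetched or compared.
    """
    hosts = {h: p for h, p in paths_by_host(urls).items() if len(p) >= min_paths}
    pairs = []
    for a, b in combinations(sorted(hosts), 2):
        shared = hosts[a] & hosts[b]
        score = len(shared) / len(hosts[a] | hosts[b])
        if score >= threshold:
            pairs.append((a, b, score))
    # Highest-scoring (most fully mirrored) pairs first.
    return sorted(pairs, key=lambda item: -item[2])


if __name__ == "__main__":
    sample = [
        "http://a.example/docs/index.html",
        "http://a.example/docs/faq.html",
        "http://b.example/docs/index.html",
        "http://b.example/docs/faq.html",
        "http://c.example/shop/cart",
        "http://c.example/shop/list",
    ]
    for host_a, host_b, score in candidate_mirrors(sample):
        print(host_a, host_b, round(score, 2))
```

A score of 1.0 would correspond to total mirroring of the sampled paths and a
lower score to partial mirroring. Under this sketch, a small number of pages
from each candidate pair would still need to be retrieved and compared to
confirm that the content actually matches.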