A comparison of techniques to find mirrored hosts on the WWW

Authors

Bharat, K; Broder, A; Dean, J; Henzinger, MR

Citation

K. Bharat et al., A comparison of techniques to find mirrored hosts on the WWW, J AM S INFO, 51(12), 2000, pp. 1114-1122

Subject Categories

Library & Information Science

Journal title

JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE

ISSN journal

0002-8231

Volume

51

Issue

12

Year of publication

2000

Pages

1114 - 1122

Database

ISI

SICI code

0002-8231(200010)51:12<1114:ACOTTF>2.0.ZU;2-C

Abstract

We compare several algorithms for identifying mirrored hosts on the World Wide Web. The algorithms operate on the basis of URL strings and linkage data: the type of information about Web pages easily available from Web proxies and crawlers. Identification of mirrored hosts can improve Web-based information retrieval in several ways: first, by identifying mirrored hosts, search engines can avoid storing and returning duplicate documents. Second, several new information retrieval techniques for the Web make inferences based on the explicit links among hypertext documents; mirroring perturbs their graph model and degrades performance. Third, mirroring information can be used to redirect users to alternate mirror sites to compensate for various failures, and can thus improve the performance of Web browsers and proxies. We evaluated four classes of "top-down" algorithms for detecting mirrored host pairs (that is, algorithms that are based on page attributes such as URL, IP address, and hyperlinks between pages, and not on the page content) on a collection of 140 million URLs (on 230,000 hosts) and their associated connectivity information. Our best approach is one which combines five algorithms and achieved a precision of 0.57 for a recall of 0.86 considering 100,000 ranked host pairs.
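
The abstract reports precision and recall over a ranked list of candidate host pairs. The following is a minimal sketch of how such figures could be computed, not the authors' evaluation protocol, which this record does not describe. The function name `precision_recall_at_k`, the example host names, and the assumption that a set of known mirror pairs is available as ground truth are all hypothetical.

```python
def precision_recall_at_k(ranked_pairs, true_mirrors, k):
    """Precision and recall over the top-k entries of a ranked list of host pairs.

    Assumptions (not from the source): pairs are unordered, so (a, b) and
    (b, a) denote the same pair; ranked_pairs contains no duplicate pairs;
    true_mirrors is a complete set of known mirrored pairs.
    """
    truth = {frozenset(pair) for pair in true_mirrors}
    top = ranked_pairs[:k]
    # Count how many of the top-k candidates are genuine mirror pairs.
    hits = sum(1 for pair in top if frozenset(pair) in truth)
    precision = hits / len(top) if top else 0.0
    recall = hits / len(truth) if truth else 0.0
    return precision, recall


# Hypothetical ranked output of a mirror-detection algorithm, best first.
ranked = [
    ("a.example", "b.example"),
    ("c.example", "d.example"),
    ("e.example", "f.example"),
    ("h.example", "g.example"),
]
# Hypothetical ground truth: three known mirrored pairs.
known = [
    ("b.example", "a.example"),
    ("g.example", "h.example"),
    ("x.example", "y.example"),
]

p, r = precision_recall_at_k(ranked, known, k=4)
print(f"precision={p:.2f} recall={r:.2f}")
```

Under this reading, "considering 100,000 ranked host pairs" would correspond to setting `k` to 100,000; whether the paper measures recall against all true mirrors or only those found by some algorithm is not stated in the abstract.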