We present a technique for identifying repetitive information transfers and use it to analyze the redundancy of network traffic. Our insight is that dynamic content, streaming media and other traffic that is not caught by today's Web caches is nonetheless likely to derive from similar information. We have therefore adapted similarity detection techniques to the problem of designing a system to eliminate redundant transfers. We identify repeated byte ranges between packets to avoid retransmitting the redundant data.
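The core mechanism, finding repeated byte ranges across packet payloads, can be sketched with a rolling fingerprint and a table of previously seen windows. The sketch below is illustrative only: the window size, hash parameters, and value-sampling mask are assumptions, not the parameters used in this work.

```python
# Sketch of packet-payload redundancy detection via a rolling fingerprint.
# WINDOW, BASE, MOD, and SAMPLE_MASK are illustrative assumptions.

WINDOW = 8            # bytes per fingerprinted window (assumed)
BASE = 257            # base of the polynomial rolling hash
MOD = (1 << 31) - 1   # hash modulus
SAMPLE_MASK = 0x3     # index only fingerprints whose low bits are zero

def fingerprints(payload):
    """Yield (offset, fingerprint) for each WINDOW-byte window,
    updating the hash incrementally in O(1) per step."""
    if len(payload) < WINDOW:
        return
    h = 0
    for b in payload[:WINDOW]:
        h = (h * BASE + b) % MOD
    top = pow(BASE, WINDOW - 1, MOD)  # weight of the outgoing byte
    yield 0, h
    for i in range(1, len(payload) - WINDOW + 1):
        h = ((h - payload[i - 1] * top) * BASE + payload[i + WINDOW - 1]) % MOD
        yield i, h

class RedundancyDetector:
    """Remember sampled fingerprints of past payloads and report byte
    ranges of a new payload that repeat earlier traffic."""
    def __init__(self):
        self.table = {}    # fingerprint -> (packet_id, offset)
        self.packets = {}  # packet_id -> payload bytes
        self.next_id = 0

    def process(self, payload):
        pid = self.next_id
        self.next_id += 1
        self.packets[pid] = payload
        matches = []       # (offset, length, old_packet_id, old_offset)
        for off, fp in fingerprints(payload):
            if fp & SAMPLE_MASK:          # value sampling: skip most windows
                continue
            hit = self.table.get(fp)
            if hit is None:
                self.table[fp] = (pid, off)
                continue
            old_id, old_off = hit
            old = self.packets[old_id]
            # Verify byte-for-byte (guards against collisions) and
            # extend the match rightward past the anchor window.
            n = 0
            while (off + n < len(payload) and old_off + n < len(old)
                   and payload[off + n] == old[old_off + n]):
                n += 1
            if n >= WINDOW:
                matches.append((off, n, old_id, old_off))
        return pid, matches
```

A sender using this scheme would replace each matched range with a short (packet id, offset, length) reference; the receiver reconstructs the bytes from its own copy of past traffic. Because matching operates on raw payload bytes, it needs no knowledge of HTTP or any other protocol.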
We find a high level of redundancy and are able to detect repetition that Web proxy caches cannot. In our traces, after Web proxy caching has been applied, an additional 39% of the original volume of Web traffic is found to be redundant. Moreover, because our technique makes no assumptions about HTTP protocol syntax or caching semantics, it provides immediate benefits for other types of content, such as streaming media, FTP traffic, news and mail.