AN INVESTIGATION OF DOCUMENTS FROM THE WORLD-WIDE-WEB

Authors

WOODRUFF A; AOKI PM; BREWER E; GAUTHIER P; ROWE LA

Citation

A. Woodruff et al., AN INVESTIGATION OF DOCUMENTS FROM THE WORLD-WIDE-WEB, Computer networks and ISDN systems, 28(7-11), 1996, pp. 963-980

Citations number

Subject categories

Computer Sciences; System Science; Telecommunications; Engineering, Electrical & Electronic; Computer Science Information Systems

Journal title

Computer networks and ISDN systems

ISSN journal

0169-7552

Volume

28

Issue

7-11

Year of publication

1996

Pages

963 - 980

Database

ISI

SICI code

0169-7552(1996)28:7-11<963:AIODFT>2.0.ZU;2-K

Abstract

We report on our examination of pages from the World Wide Web. We have analyzed data collected by the Inktomi(8) Web crawler (this data currently comprises over 2.6 million HTML documents). We have examined many characteristics of these documents, including: document size; number and types of tags, attributes, file extensions, protocols, and ports; the number of in-links; and the ratio of document size to the number of tags and attributes. For a more limited set of documents, we have examined the number and types of syntax errors, as well as readability scores. These data have been aggregated to create a number of ranked lists, e.g., the ten most-used tags and the ten most common HTML errors.