We report on our examination of pages from the World Wide Web. We have
analyzed data collected by the Inktomi(8) Web crawler, which currently
comprises over 2.6 million HTML documents. We have examined many
characteristics of these documents, including: document size; number
and types of tags, attributes, file extensions, protocols, and ports;
the number of in-links; and the ratio of document size to the number
of tags and attributes. For a smaller set of documents, we have also
examined the number and types of syntax errors, as well as readability
scores. These data have been aggregated into a number of ranked lists,
e.g., the ten most-used tags and the ten most common HTML errors.
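
As a rough illustration of the kind of tally described above, the sketch below counts tags and attributes over a handful of HTML strings, ranks the most-used tags, and computes a size-to-markup ratio. It is not the tooling used in this study: the names `TagCounter` and `summarize` are hypothetical, the parser is Python's standard `html.parser`, and the ratio is computed over the whole collection (total bytes divided by total tags plus attributes), which is an assumption about how that ratio might be defined.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Adds start tags and attribute names to shared counters (hypothetical helper)."""

    def __init__(self, tags, attrs):
        super().__init__()
        self.tags = tags
        self.attrs = attrs

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        self.tags[tag] += 1
        for name, _value in attrs:
            self.attrs[name] += 1


def summarize(documents, top_n=10):
    """Return (ranked tag list, bytes per tag-or-attribute) for a list of HTML strings."""
    tags, attrs = Counter(), Counter()
    total_bytes = 0
    for html in documents:
        # A fresh parser per document keeps unfinished markup from leaking
        # into the next document.
        parser = TagCounter(tags, attrs)
        parser.feed(html)
        parser.close()
        total_bytes += len(html.encode("utf-8"))
    markup_items = sum(tags.values()) + sum(attrs.values())
    # Assumed definition: collection-wide size divided by tags plus attributes.
    ratio = total_bytes / markup_items if markup_items else None
    return tags.most_common(top_n), ratio


if __name__ == "__main__":
    sample = [
        '<html><body><p>Hi</p><a href="x">l</a></body></html>',
        '<p class="c">Yo<br></p>',
    ]
    ranked, ratio = summarize(sample)
    print(ranked)
    print(ratio)
```

A per-document ratio could be obtained by calling `summarize` on one document at a time; which of the two definitions matches the reported figures is not stated here.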