FULL-TEXT INDEXING OF NON-TEXTUAL RESOURCES

Authors

BYERS D

Citation

D. Byers, FULL-TEXT INDEXING OF NON-TEXTUAL RESOURCES, Computer networks and ISDN systems, 30(1-7), 1998, pp. 141-148

Subject categories

Computer Science, Information Systems; Telecommunications; Engineering, Electrical & Electronic

Journal title

Computer networks and ISDN systems

ISSN journal

0169-7552

Volume

30

Issue

1-7

Year of publication

1998

Pages

141 - 148

Database

ISI

SICI code

0169-7552(1998)30:1-7<141:FIONR>2.0.ZU;2-B

Abstract

Full-text indexing of resources on the World Wide Web is limited to simple content types, such as HTML and plain text. More complex content types, such as PostScript, PDF and proprietary word-processing formats, are excluded, despite the fact that such documents are usually rich in content. The reason for excluding these types of resources is simply that it would be too expensive and too difficult to attempt to extract a textual representation from them. The operator of a search engine is simply not motivated to expend the additional resources that would be needed to handle such documents. The gain would be fairly small, and search engines are extremely popular even when they are limited to HTML and plain text documents. The situation is quite different from the point of view of the content provider. A site may have significant amounts of its content in non-textual documents, but despite this the content provider may want to have the documents indexed in normal search engines. In this paper we present several server-side solutions that allow existing indexing software to index the textual representation of non-textual resources. (C) 1998 Published by Elsevier Science B.V. All rights reserved.