FULL-TEXT INDEXING OF NON-TEXTUAL RESOURCES

Authors
D. Byers
Citation
D. Byers, FULL-TEXT INDEXING OF NON-TEXTUAL RESOURCES, Computer Networks and ISDN Systems, 30(1-7), 1998, pp. 141-148
Number of citations
6
Subject Categories
Computer Science, Information Systems; Telecommunications; Engineering, Electrical & Electronic
Journal ISSN
0169-7552
Volume
30
Issue
1-7
Year of publication
1998
Pages
141 - 148
Database
ISI
SICI code
0169-7552(1998)30:1-7<141:FIONR>2.0.ZU;2-B
Abstract
Full-text indexing of resources on the World Wide Web is limited to simple content types, such as HTML and plain text. More complex content types, such as PostScript, PDF and proprietary word-processing formats, are excluded, despite the fact that such documents are usually rich in content. The reason for excluding these types of resources is simply that it would be too expensive and too difficult to attempt to extract a textual representation from them. The operator of a search engine is simply not motivated to expend the additional resources that would be needed to handle such documents. The gain would be fairly small, and search engines are extremely popular even when they are limited to HTML and plain text documents. The situation is quite different from the point of view of the content provider. A site may have significant amounts of its content in non-textual documents, but despite this the content provider may want to have the documents indexed in normal search engines. In this paper we present several server-side solutions that allow existing indexing software to index the textual representation of non-textual resources. (C) 1998 Published by Elsevier Science B.V. All rights reserved.