Full-text indexing of resources on the World Wide Web is limited to
simple content types, such as HTML and plain text. More complex content
types, such as PostScript, PDF, and proprietary word-processing formats,
are excluded, even though such documents are usually rich in
content. The reason for excluding these types of resources is simply
that it would be too expensive and too difficult to attempt to extract
a textual representation from them. The operator of a search engine
is simply not motivated to expend the additional resources that would
be needed to handle such documents. The gain would be fairly small, and
search engines are extremely popular even when they are limited to
HTML and plain-text documents. The situation is quite different from the
point of view of the content provider. A site may hold a significant
amount of its content in non-textual documents, and the
content provider may still want those documents indexed by ordinary search
engines. In this paper we present several server-side solutions that
allow existing indexing software to index the textual representation
of non-textual resources.
© 1998 Published by Elsevier Science B.V. All rights reserved.