CUSTOMIZED INFORMATION EXTRACTION AS A BASIS FOR RESOURCE DISCOVERY

Authors

HARDY DR; SCHWARTZ MF

Citation

D.R. Hardy and M.F. Schwartz, CUSTOMIZED INFORMATION EXTRACTION AS A BASIS FOR RESOURCE DISCOVERY, ACM transactions on computer systems, 14(2), 1996, pp. 171-199

Citations number

Subject categories

Computer Sciences; Computer Science Theory & Methods

Journal title

ACM transactions on computer systems

ISSN journal

0734-2071

Volume

14

Issue

2

Year of publication

1996

Pages

171 - 199

Database

ISI

SICI code

0734-2071(1996)14:2<171:CIEAAB>2.0.ZU;2-K

Abstract

Indexing file contents is a powerful means of helping users locate documents, software, and other types of data among large repositories. In environments that contain many different types of data, content indexing requires type-specific processing to extract information effectively. We present a model for type-specific, user-customizable information extraction, and a system implementation called Essence. This software structure allows users to associate specialized extraction methods with ordinary files, providing the illusion of an object-oriented file system that encapsulates indexing methods within files. By exploiting the semantics of common file types, Essence generates compact yet representative file summaries that can be used to improve both browsing and indexing in resource discovery systems. Essence can extract information from most of the types of files found in common file systems, including files with nested structure (such as compressed "tar" files). Essence interoperates with a number of different search/index systems (such as WAIS and Glimpse), as part of the Harvest system.
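
The idea of type-specific extraction with nested unpacking can be illustrated with a minimal Python sketch. It is not the Essence implementation: the registry `SUMMARIZERS`, the helpers `identify` and `summarize`, the type names, and the summary format (a list of name, type, keyword tuples) are all hypothetical and chosen only for illustration. Each file type gets its own extraction function, and container types (gzip, tar) unpack their contents and pass them back through the same dispatcher.

```python
import gzip
import io
import tarfile

# Hypothetical registry: file type name -> extraction function.
SUMMARIZERS = {}


def summarizer(file_type):
    """Decorator that registers an extraction function for one file type."""
    def register(func):
        SUMMARIZERS[file_type] = func
        return func
    return register


def identify(name, data):
    """Guess the file type from magic bytes first, then from the file name."""
    if data[:2] == b"\x1f\x8b":                      # gzip magic number
        return "gzip"
    if len(data) > 262 and data[257:262] == b"ustar":  # tar header magic
        return "tar"
    if name.endswith(".py"):
        return "python"
    return "text"


def summarize(name, data):
    """Dispatch to the extraction function registered for the file's type."""
    return SUMMARIZERS[identify(name, data)](name, data)


@summarizer("text")
def summarize_text(name, data):
    # Illustrative choice: keep only the first non-empty line.
    text = data.decode("utf-8", errors="replace")
    for line in text.splitlines():
        if line.strip():
            return [(name, "text", line.strip())]
    return [(name, "text", "")]


@summarizer("python")
def summarize_python(name, data):
    # Illustrative choice: keep the names of top-level functions.
    text = data.decode("utf-8", errors="replace")
    names = [line[4:].split("(")[0].strip()
             for line in text.splitlines() if line.startswith("def ")]
    return [(name, "python", " ".join(names))]


@summarizer("gzip")
def summarize_gzip(name, data):
    # Nested structure: decompress, then summarize whatever is inside.
    inner = name[:-3] if name.endswith(".gz") else name
    return summarize(inner, gzip.decompress(data))


@summarizer("tar")
def summarize_tar(name, data):
    # Nested structure: summarize every regular file in the archive.
    summaries = []
    with tarfile.open(fileobj=io.BytesIO(data)) as archive:
        for member in archive.getmembers():
            if member.isfile():
                content = archive.extractfile(member).read()
                summaries.extend(summarize(name + "/" + member.name, content))
    return summaries
```

Calling `summarize("bundle.tar.gz", blob)` on a gzip-compressed tar archive yields one summary tuple per contained file, which a separate indexer could then consume. New file types are supported by registering another function, which loosely mirrors the user-customizable aspect described in the abstract.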