A FAST METHOD OF DETERMINING WEIGHTED COMPOUND KEYWORDS FROM TEXT DATABASES

Authors

FUKETA M MIZOFUCHI S HAYASHI Y AOE JI

Citation

M. Fuketa et al., A FAST METHOD OF DETERMINING WEIGHTED COMPOUND KEYWORDS FROM TEXT DATABASES, Information processing & management, 34(4), 1998, pp. 431-442

Subject categories

Information Science & Library Science; Computer Science, Information Systems

Journal title

Information processing & management

ISSN journal

0306-4573

Volume

34

Issue

4

Year of publication

1998

Pages

431 - 442

Database

ISI

SICI code

0306-4573(1998)34:4<431:AFMODW>2.0.ZU;2-H

Abstract

In document management systems, many freely invented compound words can be keyword candidates. There are two types of criteria for keyword construction: an individual word and a sequence of words. The choice between these criteria depends on the keyword extraction system. Since such a system must perform many operations for appending, separating, or comparing keyword candidates, it is important to have an efficient method for extracting keywords together with information about the relationships among them. This paper presents a technique for storing compound keywords with information about both short component (SC) keywords and long component (LC) keywords by extending the Aho and Corasick (AC) string pattern matching machine for a finite number of keywords. Theoretical analysis verifies that the total cost of the extended AC machine is O(n + k), compared with O(3n) for the original AC machine, where n is the sum of the lengths of the keywords and k is the number of keywords. Simulation results for 38 Japanese text files show that the extended AC machine is about three to six times faster than the original AC machine in SC and LC keyword processing. (C) 1998 Elsevier Science Ltd. All rights reserved.
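
For orientation, the following is a minimal sketch of the standard Aho and Corasick machine that the abstract takes as its baseline, not the extended machine the paper proposes. It builds the goto, failure, and output functions for a keyword set, then runs each keyword through the machine to list the shorter keywords it contains, which is one plausible reading of the component relationships mentioned above. The function names `build_ac`, `find_all`, and `shorter_components`, and the English sample keywords, are assumptions made for illustration only.

```python
from collections import deque


def build_ac(keywords):
    """Build a standard AC machine: goto (trie), failure links, output sets."""
    goto, fail, out = [{}], [0], [set()]
    # Phase 1: insert every keyword into the trie.
    for word in keywords:
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append(set())
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].add(word)
    # Phase 2: breadth-first pass to compute failure links.
    queue = deque(goto[0].values())  # depth-1 states fail to the root
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit keywords that end at the failure state.
            out[nxt] |= out[fail[nxt]]
    return goto, fail, out


def find_all(text, machine):
    """Return (start_index, keyword) for every keyword occurrence in text."""
    goto, fail, out = machine
    state, hits = 0, []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for word in out[state]:
            hits.append((i - len(word) + 1, word))
    return hits


def shorter_components(keywords):
    """Map each keyword to the other keywords that occur inside it."""
    machine = build_ac(keywords)
    return {
        w: sorted({k for _, k in find_all(w, machine) if k != w})
        for w in keywords
    }


if __name__ == "__main__":
    print(shorter_components(["data", "base", "database", "text database"]))
```

In this baseline sketch, every keyword is rescanned through the machine to recover its components. The abstract's claim is that the extended machine stores such relationships more cheaply; how it does so is not specified in this record.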