M. Fuketa et al., A FAST METHOD OF DETERMINING WEIGHTED COMPOUND KEYWORDS FROM TEXT DATABASES, Information processing & management, 34(4), 1998, pp. 431-442
Number of citations
11
Subject Categories
Information Science & Library Science; Computer Science, Information Systems
In document management systems, many freely invented compound words
can be keyword candidates. There are two types of criteria for keyword
construction: an individual word and a sequence of words. The choice
between these criteria depends on the keyword extraction system. Since
such a system must perform many operations for appending, separating,
or comparing keyword candidates, it is important to have an efficient
method for extracting keywords together with information about the
relationships among them. This paper presents a technique for storing
compound keywords with information about both their short component
(SC) keywords and long component (LC) keywords by extending the Aho
and Corasick (AC) string pattern matching machine for a finite number
of keywords. Theoretical analysis verifies that the total cost of the
extended AC machine is O(n + k), compared with the total cost O(3n) of
the original AC machine, where n is the sum of the lengths of the
keywords and k is the number of keywords. Simulation results for 38
Japanese text files show that the extended AC machine is about three
to six times faster than the original AC machine in SC and LC keyword
processing. (C) 1998 Elsevier Science Ltd. All rights reserved.
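
For orientation, the following is a minimal sketch of the classic AC machine that the abstract takes as its starting point, built here over word tokens so that a compound keyword is a sequence of words. It is not the extended machine of the paper, whose construction the abstract does not describe. The function names (build_ac, find_keywords, short_components, long_components) and the sample keywords are hypothetical, and the component relations are recovered simply by re-scanning each keyword, which is an assumption made for illustration only.

```python
from collections import deque


def build_ac(keywords):
    """Build a classic Aho-Corasick machine over word tokens.

    keywords: list of tuples of words, e.g. ("text", "database").
    Returns (goto, fail, output); states are integers and 0 is the root.
    """
    goto = [{}]      # goto[state][word] -> next state
    output = [[]]    # output[state] -> keywords recognised on reaching state
    for kw in keywords:
        state = 0
        for word in kw:
            if word not in goto[state]:
                goto.append({})
                output.append([])
                goto[state][word] = len(goto) - 1
            state = goto[state][word]
        output[state].append(kw)

    # Breadth-first pass to fill in the failure links.
    fail = [0] * len(goto)
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for word, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and word not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(word, 0)
            # Inherit keywords ending at the failure state (proper suffixes).
            output[nxt] = output[nxt] + output[fail[nxt]]
    return goto, fail, output


def find_keywords(machine, words):
    """Return (start_index, keyword) for every keyword occurring in words."""
    goto, fail, output = machine
    state = 0
    hits = []
    for i, word in enumerate(words):
        while state and word not in goto[state]:
            state = fail[state]
        state = goto[state].get(word, 0)
        for kw in output[state]:
            hits.append((i - len(kw) + 1, kw))
    return hits


def short_components(machine, keywords):
    """Map each keyword to the shorter keywords that occur inside it."""
    return {
        kw: [c for _, c in find_keywords(machine, kw) if c != kw]
        for kw in keywords
    }


def long_components(machine, keywords):
    """Map each keyword to the longer keywords that contain it."""
    relation = {kw: [] for kw in keywords}
    for longer, parts in short_components(machine, keywords).items():
        for part in parts:
            relation[part].append(longer)
    return relation


# Hypothetical sample keywords, chosen only to exercise the sketch.
KEYWORDS = [
    ("text",),
    ("database",),
    ("text", "database"),
    ("text", "database", "system"),
]
MACHINE = build_ac(KEYWORDS)
SC = short_components(MACHINE, KEYWORDS)
LC = long_components(MACHINE, KEYWORDS)
```

In this sketch every keyword is scanned again after the machine is built in order to recover its components. The abstract indicates that the extended AC machine instead stores the SC and LC information with the keywords, which is where the reported speed-up over the original machine would come from.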