A FAST METHOD OF DETERMINING WEIGHTED COMPOUND KEYWORDS FROM TEXT DATABASES

Citation
M. Fuketa et al., A FAST METHOD OF DETERMINING WEIGHTED COMPOUND KEYWORDS FROM TEXT DATABASES, Information Processing & Management, 34(4), 1998, pp. 431-442
Number of citations
11
Subject Categories
Information Science & Library Science; Computer Science Information Systems
Journal ISSN
0306-4573
Volume
34
Issue
4
Year of publication
1998
Pages
431 - 442
Database
ISI
SICI code
0306-4573(1998)34:4<431:AFMODW>2.0.ZU;2-H
Abstract
In document management systems, many compound words, which can be invented freely, are keyword candidates. There are two types of criteria for keyword construction: an individual word and a sequence of words. The choice between these criteria depends on the system for extracting keywords. Since the method must perform many operations for appending, separating, or comparing keyword candidates, it is important to prepare an efficient method that extracts keywords together with information about the relationships among them. This paper presents a technique for storing compound keywords with information about both short component (SC) keywords and long component (LC) keywords by extending the Aho and Corasick (AC) string pattern matching machine for a finite number of keywords. Theoretical analysis verifies that the total cost of the extended AC machine is O(n + k), compared with the total cost O(3n) of the original AC machine, where n is the sum of the lengths of the keywords and k is the number of keywords. Simulation results for 38 Japanese text files show that the extended AC machine is about three to six times faster than the original AC machine in SC and LC keyword processing. (C) 1998 Elsevier Science Ltd. All rights reserved.
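
For orientation, the following is a minimal sketch of the classic Aho and Corasick machine that the abstract takes as its baseline. It is not the extended machine described in the paper, whose details are not given in this record. The function names build_ac and search, the table layout, and the sample keywords are assumptions made for illustration only. The sketch shows how a single pass over a compound keyword reports every shorter stored keyword contained in it.

```python
from collections import deque


def build_ac(keywords):
    """Build goto, failure and output tables for a classic AC machine.

    Illustrative only: this is the textbook construction, not the
    extended machine proposed in the paper.
    """
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # failure link of each state
    out = [[]]    # keywords recognized on entering each state

    # Phase 1: insert every keyword into a trie.
    for word in keywords:
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(word)

    # Phase 2: breadth-first pass to compute failure links and
    # merge the outputs reachable through them.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def search(text, machine):
    """Return (start_index, keyword) for every keyword occurrence in text."""
    goto, fail, out = machine
    state = 0
    hits = []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for word in out[state]:
            hits.append((i - len(word) + 1, word))
    return hits


if __name__ == "__main__":
    # Hypothetical keyword set: two single words and one compound.
    machine = build_ac(["information", "retrieval", "information retrieval"])
    for start, word in search("information retrieval system", machine):
        print(start, word)
```

Scanning the compound "information retrieval system" with this machine reports "information", "information retrieval", and "retrieval", which is the kind of component relationship the abstract refers to. How the paper stores those relationships inside the machine to reach O(n + k) is not recoverable from the abstract alone.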