An efficient document clustering algorithm and its application to a document browser

Citation
H. Tanaka et al., An efficient document clustering algorithm and its application to a document browser, INF PR MAN, 35(4), 1999, pp. 541-557
Number of citations
18
Subject Categories
Library & Information Science; Information Technology & Communication Systems
Journal title
INFORMATION PROCESSING & MANAGEMENT
Journal ISSN
0306-4573
Volume
35
Issue
4
Year of publication
1999
Pages
541 - 557
Database
ISI
SICI code
0306-4573(199907)35:4<541:AEDCAA>2.0.ZU;2-V
Abstract
We present an efficient document clustering algorithm that uses a term frequency vector for each document instead of using a huge proximity matrix. The algorithm has the following features: (1) it requires a relatively small amount of memory and runs fast, (2) it produces a hierarchy in the form of a document classification tree and (3) the hierarchy obtained by the algorithm explicitly reveals a collection structure. We confirm these features and thus show the algorithm's feasibility through clustering experiments in which we use two collections of Japanese documents, the sizes of which are 83,099 and 14,701 documents. We also introduce an application of this algorithm to a document browser. This browser is used in our Japanese-to-English translation aid system. The browsing module of the system consists of a huge database of Japanese news articles and their English translations. The Japanese article collection is clustered into a hierarchy by our method. Since each node in the hierarchy corresponds to a topic in the collection, we can use the hierarchy to directly access articles by topic. A user can learn general translation knowledge of each topic by browsing the Japanese articles and their English translations. We also discuss techniques of presenting a large tree-formed hierarchy on a computer screen. (C) 1999 Elsevier Science Ltd. All rights reserved.
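
Illustrative sketch
The abstract does not spell out the algorithm, so the following is not the authors' method. It is a minimal sketch of the general idea the abstract names: building a document hierarchy from per-document term frequency vectors and cluster centroids, without ever storing an N x N proximity matrix. The top-down bisecting strategy, the whitespace tokenizer, and all function names (tf_vector, cosine, centroid, bisect, build_tree) are assumptions made for illustration.

```python
import math
from collections import Counter


def tf_vector(text):
    """Term frequency vector for one document (assumed: lowercase whitespace tokens)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term frequency vectors."""
    dot = sum(v * b.get(t, 0) for t, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def centroid(vectors):
    """Sum of the vectors; cosine ignores scale, so no division is needed."""
    total = Counter()
    for v in vectors:
        total.update(v)
    return total


def bisect(ids, vecs, rounds=5):
    """Split ids in two by comparing each document with two centroids only."""
    seed_a = ids[0]
    # Second seed: the document least similar to the first one.
    seed_b = min(ids[1:], key=lambda i: cosine(vecs[seed_a], vecs[i]))
    cents = [vecs[seed_a], vecs[seed_b]]
    left, right = [], []
    for _ in range(rounds):
        left, right = [], []
        for i in ids:
            if cosine(vecs[i], cents[0]) >= cosine(vecs[i], cents[1]):
                left.append(i)
            else:
                right.append(i)
        if not left or not right:
            break  # degenerate split; the caller keeps the node as a leaf
        cents = [centroid([vecs[i] for i in left]),
                 centroid([vecs[i] for i in right])]
    return left, right


def build_tree(ids, vecs, min_size=2):
    """Recursively bisect until nodes hold at most min_size documents."""
    if len(ids) <= max(min_size, 1):
        return {"docs": ids}
    left, right = bisect(ids, vecs)
    if not left or not right:
        return {"docs": ids}
    return {"children": [build_tree(left, vecs, min_size),
                         build_tree(right, vecs, min_size)]}


docs = [
    "stock market shares fall",
    "market shares rise stock",
    "soccer team wins match",
    "team loses soccer match",
]
vecs = [tf_vector(d) for d in docs]
tree = build_tree(list(range(len(docs))), vecs)
print(tree)
```

In this sketch, memory grows with the number of documents times the size of their sparse vectors, rather than with the square of the number of documents, which is the kind of saving the abstract attributes to avoiding a proximity matrix. Each internal node of the resulting tree groups documents that share vocabulary, loosely mirroring the topic nodes described for the browser.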