An efficient document clustering algorithm and its application to a document browser

Authors

Tanaka, H; Kumano, T; Uratani, N; Ehara, T

Citation

H. Tanaka et al., An efficient document clustering algorithm and its application to a document browser, INF PR MAN, 35(4), 1999, pp. 541-557

Citations number

Subject Categories

Library & Information Science; Information Technology & Communication Systems

Journal title

INFORMATION PROCESSING & MANAGEMENT

ISSN journal

0306-4573

Volume

35

Issue

4

Year of publication

1999

Pages

541 - 557

Database

ISI

SICI code

0306-4573(199907)35:4<541:AEDCAA>2.0.ZU;2-V

Abstract

We present an efficient document clustering algorithm that uses a term frequency vector for each document instead of using a huge proximity matrix. The algorithm has the following features: (1) it requires a relatively small amount of memory and runs fast, (2) it produces a hierarchy in the form of a document classification tree and (3) the hierarchy obtained by the algorithm explicitly reveals a collection structure. We confirm these features and thus show the algorithm's feasibility through clustering experiments in which we use two collections of Japanese documents, the sizes of which are 83,099 and 14,701 documents. We also introduce an application of this algorithm to a document browser. This browser is used in our Japanese-to-English translation aid system. The browsing module of the system consists of a huge database of Japanese news articles and their English translations. The Japanese article collection is clustered into a hierarchy by our method. Since each node in the hierarchy corresponds to a topic in the collection, we can use the hierarchy to directly access articles by topic. A user can learn general translation knowledge of each topic by browsing the Japanese articles and their English translations. We also discuss techniques of presenting a large tree-formed hierarchy on a computer screen. (C) 1999 Elsevier Science Ltd. All rights reserved.
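
Illustrative sketch

The abstract does not spell out the algorithm's steps, so the following is not the authors' method. It is a minimal, generic sketch of the idea the abstract names: building a document hierarchy from per-document term frequency vectors while never materializing a document-by-document proximity matrix. It assumes a simple top-down splitting scheme with cosine similarity to two cluster centroids; the function names (tf_vector, split, build_tree), the seeding rule, and the leaf_size parameter are all hypothetical choices made for illustration.

```python
from collections import Counter
from math import sqrt


def tf_vector(text):
    """Term-frequency vector of a whitespace-tokenized document."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm = sqrt(sum(w * w for w in a.values())) * sqrt(sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0


def centroid(vectors):
    """Sum of term-frequency vectors (cosine only needs the direction)."""
    total = Counter()
    for v in vectors:
        total.update(v)
    return total


def split(ids, vecs, rounds=5):
    """Split a cluster in two using only document-to-centroid comparisons."""
    seed_a = ids[0]
    # Assumed seeding rule: second seed is the document least similar to the first.
    seed_b = min(ids[1:], key=lambda i: cosine(vecs[seed_a], vecs[i]))
    cent_a, cent_b = vecs[seed_a], vecs[seed_b]
    left, right = [], []
    for _ in range(rounds):
        left, right = [], []
        for i in ids:
            if cosine(vecs[i], cent_a) >= cosine(vecs[i], cent_b):
                left.append(i)
            else:
                right.append(i)
        if not left or not right:
            break
        cent_a = centroid(vecs[i] for i in left)
        cent_b = centroid(vecs[i] for i in right)
    return left, right


def build_tree(ids, vecs, leaf_size=2):
    """Recursively split until clusters are small; returns a nested dict."""
    if len(ids) <= max(leaf_size, 1):
        return {"docs": ids}
    left, right = split(ids, vecs)
    if not left or not right:  # the cluster could not be separated further
        return {"docs": ids}
    return {"children": [build_tree(left, vecs, leaf_size),
                         build_tree(right, vecs, leaf_size)]}


# Toy collection (invented for this example, not data from the paper).
docs = [
    "stock market prices fall",
    "stock market rally continues",
    "soccer team wins match",
    "soccer match ends in draw",
]
vecs = [tf_vector(d) for d in docs]
tree = build_tree(list(range(len(docs))), vecs)
print(tree)  # {'children': [{'docs': [0, 1]}, {'docs': [2, 3]}]}
```

Each pass over a cluster compares every document with two centroids only, so memory grows with the size of the term frequency vectors rather than with the square of the number of documents. In a browser like the one described, each node of the resulting tree would stand for a topic under which its documents can be listed.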