Partitioning-based clustering for Web document categorization

Authors

Boley, D; Gini, M; Gross, R; Han, EH; Hastings, K; Karypis, G; Kumar, V; Mobasher, B; Moore, J

Citation

D. Boley et al., Partitioning-based clustering for Web document categorization, DECIS SUP S, 27(3), 1999, pp. 329-341

Number of citations

Subject Categories

AI Robotics and Automatic Control

Journal title

DECISION SUPPORT SYSTEMS

Journal ISSN

0167-9236

Volume

27

Issue

3

Year of publication

1999

Pages

329 - 341

Database

ISI

SICI code

0167-9236(199912)27:3<329:PCFWDC>2.0.ZU;2-Z

Abstract

Clustering techniques have been used by many intelligent software agents in order to retrieve, filter, and categorize documents available on the World Wide Web. Clustering is also useful in extracting salient features of related Web documents to automatically formulate queries and search for other similar documents on the Web. Traditional clustering algorithms either use a priori knowledge of document structures to define a distance or similarity among these documents, or use probabilistic techniques such as Bayesian classification. Many of these traditional algorithms, however, falter when the dimensionality of the feature space becomes high relative to the size of the document space. In this paper, we introduce two new clustering algorithms that can effectively cluster documents, even in the presence of a very high dimensional feature space. These clustering techniques, which are based on generalizations of graph partitioning, do not require pre-specified ad hoc distance functions, and are capable of automatically discovering document similarities or associations. We conduct several experiments on real Web data using various feature selection heuristics, and compare our clustering schemes to standard distance-based techniques, such as hierarchical agglomeration clustering, and Bayesian classification methods, such as AutoClass. (C) 1999 Elsevier Science B.V. All rights reserved.