Linguini: Language identification for multilingual documents

Authors

Prager, JM

Citation

J.M. Prager, Linguini: Language identification for multilingual documents, J MANAG I S, 16(3), 1999, pp. 71-101

Subject categories

Library & Information Science

Journal title

JOURNAL OF MANAGEMENT INFORMATION SYSTEMS

ISSN journal

0742-1222

Volume

16

Issue

3

Year of publication

1999

Pages

71 - 101

Database

ISI

SICI code

0742-1222(199924)16:3<71:LLIFMD>2.0.ZU;2-7

Abstract

Given the vast and still growing availability of electronic documents from around the world, it is becoming increasingly important for managers of the information systems on which these documents are stored to sort or tag these documents so that their end users can most readily access those documents that are of most interest and use to them, which in our context means in a language they can understand. Linguini is a vector-space-based categorizer tailored for high-precision language identification. This paper determines the functional dependencies of Linguini's performance and demonstrates that it can identify the language of documents as short as 5 to 10 percent of the size of average Web documents with 100 percent accuracy. It also describes how to determine if a document is in two or more languages, without incurring any appreciable extra computational overhead. This approach can be applied equally to subject-categorization systems to distinguish between cases where, when the system recommends two or more categories, the document belongs strongly to all or really to none.
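
Illustrative sketch (not part of the record)

The abstract names the general technique (a vector-space categorizer) but gives no implementation details. The Python sketch below is therefore only a generic illustration of vector-space language identification, not Linguini's actual algorithm. It assumes character trigrams as features, cosine similarity as the score, and three tiny hand-written training samples. All names (ngram_vector, rank_languages, looks_multilingual, SAMPLES, PROFILES) and the margin threshold are hypothetical. The bilingual check is a simple heuristic of our own: it asks whether the document is closer to the combination of the two best-scoring language profiles than to the best one alone.

```python
import math
from collections import Counter


def ngram_vector(text, n=3):
    """Count overlapping character n-grams in lowercased, space-padded text."""
    padded = " " + " ".join(text.lower().split()) + " "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as mappings."""
    dot = sum(value * b[gram] for gram, value in a.items() if gram in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def normalized(vec):
    """Scale a sparse vector to unit length."""
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {gram: v / norm for gram, v in vec.items()} if norm else {}


def combine(a, b):
    """Sum of two unit-length vectors: a crude 'mixed-language' profile."""
    out = normalized(a)
    for gram, v in normalized(b).items():
        out[gram] = out.get(gram, 0.0) + v
    return out


def rank_languages(text, profiles):
    """Return (language, score) pairs, best match first."""
    doc = ngram_vector(text)
    scores = {lang: cosine(doc, vec) for lang, vec in profiles.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


def looks_multilingual(text, profiles, margin=0.05):
    """Assumed heuristic: True if the top-two combined profile fits
    the document better than the single best profile by `margin`."""
    ranked = rank_languages(text, profiles)
    (best, best_score), (second, _) = ranked[0], ranked[1]
    doc = ngram_vector(text)
    mixed_score = cosine(doc, combine(profiles[best], profiles[second]))
    return mixed_score > best_score + margin


# Toy training samples (assumption); a real system would train on large corpora.
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog in a language that they can understand",
    "it": "la volpe veloce salta sopra il cane pigro in una lingua che loro possono capire",
    "de": "der schnelle braune fuchs springt und das ist eine sprache die sie verstehen",
}
PROFILES = {lang: ngram_vector(sample) for lang, sample in SAMPLES.items()}

if __name__ == "__main__":
    print(rank_languages("the dog is over the fox", PROFILES))
    print(looks_multilingual(SAMPLES["en"] + " " + SAMPLES["it"], PROFILES))
```

Unlike the approach described in the abstract, this heuristic performs an extra similarity computation for the combined profile, so it makes no claim to match the paper's "no appreciable extra computational overhead" result.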