XTRACT - AN OVERVIEW

Authors
F. Smadja
Citation
F. Smadja, XTRACT - AN OVERVIEW, Computers and the Humanities, 26(5-6), 1992, pp. 399-413
Number of citations
36
Subject Categories
Art & Humanities General; Computer Sciences, Special Topics; Computer Applications & Cybernetics
Journal ISSN
0010-4817
Volume
26
Issue
5-6
Year of publication
1992
Pages
399 - 413
Database
ISI
SICI code
0010-4817(1992)26:5-6<399:X-AO>2.0.ZU;2-5
Abstract
Lexical collocations have particular statistical distributions. We have developed a set of statistical techniques for retrieving and identifying collocations from large textual corpora. The techniques we developed are able to identify collocations of arbitrary length as well as flexible collocations. These techniques have been implemented in a lexicographic tool, Xtract, which is able to automatically acquire collocations with high retrieval performance. Xtract works in three stages. The first stage is based on a statistical technique for identifying word pairs involved in a syntactic relation. The words can appear in the text in any order and can be separated by an arbitrary number of other words. The second stage is based on a technique to extract n-word collocations (or n-grams) in a much simpler way than related methods. These collocations can involve closed class words such as particles and prepositions. A third stage is then applied to the output of stage one and applies parsing techniques to sentences involving a given word pair in order to identify the proper syntactic relation between the two words. A secondary effect of the third stage is to filter out a number of candidate collocations as irrelevant and thus produce higher quality output. In this paper we present an overview of Xtract and we describe several uses for Xtract and the knowledge it retrieves such as language generation and machine translation.
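The idea behind the first stage (word pairs that co-occur in either order, with other words in between) can be illustrated with a small sketch. This is not Xtract's actual statistical technique, which the abstract does not specify; it only counts co-occurrences. The function names `cooccurrence_counts` and `frequent_pairs`, the fixed `window` of 5 tokens, and the `min_count` threshold are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def cooccurrence_counts(sentences, window=5):
    """Count unordered word pairs that co-occur within `window` tokens.

    Hypothetical helper: a plain frequency count, not Xtract's statistics.
    Returns (pair_counts, offset_counts), where offset_counts[pair][d] is
    how often pair[1] appeared d tokens after (d > 0) or before (d < 0)
    pair[0].
    """
    pair_counts = Counter()
    offset_counts = defaultdict(Counter)
    for tokens in sentences:
        for i, w in enumerate(tokens):
            # Look ahead only, so each pair of positions is counted once.
            for j in range(i + 1, min(i + window + 1, len(tokens))):
                v = tokens[j]
                if v == w:
                    continue
                pair = tuple(sorted((w, v)))  # order-independent key
                pair_counts[pair] += 1
                # Signed position of pair[1] relative to pair[0].
                offset = j - i if pair[0] == w else i - j
                offset_counts[pair][offset] += 1
    return pair_counts, offset_counts


def frequent_pairs(pair_counts, min_count=2):
    """Keep pairs seen at least `min_count` times (assumed threshold)."""
    return [(p, c) for p, c in pair_counts.most_common() if c >= min_count]


if __name__ == "__main__":
    corpus = [
        ["make", "a", "decision"],
        ["decision", "to", "make"],
        ["make", "the", "final", "decision"],
    ]
    counts, offsets = cooccurrence_counts(corpus)
    print(frequent_pairs(counts))
    print(dict(offsets[("decision", "make")]))
```

In this toy corpus the pair ("decision", "make") is found in both orders and at different distances, which is the kind of flexible pair the abstract describes. A real system would replace the raw count threshold with a statistical test and, as in the third stage, check the syntactic relation between the two words.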