XTRACT - AN OVERVIEW

Authors

SMADJA F

Citation

F. Smadja, XTRACT - AN OVERVIEW, Computers and the humanities, 26(5-6), 1992, pp. 399-413

Citations number

Subject categories

Art & Humanities General; Computer Sciences, Special Topics; Computer Applications & Cybernetics

Journal title

Computers and the humanities

ISSN journal

00104817

Volume

26

Issue

5-6

Year of publication

1992

Pages

399 - 413

Database

ISI

SICI code

0010-4817(1992)26:5-6<399:X-AO>2.0.ZU;2-5

Abstract

Lexical collocations have particular statistical distributions. We have developed a set of statistical techniques for retrieving and identifying collocations from large textual corpora. The techniques we developed are able to identify collocations of arbitrary length as well as flexible collocations. These techniques have been implemented in a lexicographic tool, Xtract, which is able to automatically acquire collocations with high retrieval performance. Xtract works in three stages. The first stage is based on a statistical technique for identifying word pairs involved in a syntactic relation. The words can appear in the text in any order and can be separated by an arbitrary number of other words. The second stage is based on a technique to extract n-word collocations (or n-grams) in a much simpler way than related methods. These collocations can involve closed class words such as particles and prepositions. A third stage is then applied to the output of stage one and applies parsing techniques to sentences involving a given word pair in order to identify the proper syntactic relation between the two words. A secondary effect of the third stage is to filter out a number of candidate collocations as irrelevant and thus produce higher quality output. In this paper we present an overview of Xtract and we describe several uses for Xtract and the knowledge it retrieves, such as language generation and machine translation.
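
The first stage described in the abstract (finding word pairs that co-occur in any order, possibly separated by other words) can be illustrated with a minimal sketch. This is not Xtract's actual procedure: the abstract does not state the statistics used, so the window size of 5, the mean-plus-k-standard-deviations threshold, and the function names pair_counts and strong_pairs are all assumptions made for illustration.

```python
from collections import Counter
from statistics import mean, pstdev


def pair_counts(sentences, window=5):
    """Count unordered co-occurrences of two distinct words that appear
    at most `window` positions apart (window size is an assumption)."""
    counts = Counter()
    for tokens in sentences:
        for i, w in enumerate(tokens):
            for j in range(i + 1, min(i + window + 1, len(tokens))):
                if tokens[j] != w:
                    counts[frozenset((w, tokens[j]))] += 1
    return counts


def strong_pairs(counts, word, k=1.0):
    """Keep partners of `word` whose count exceeds mean + k * stdev of all
    partner counts (a hypothetical threshold, not the paper's)."""
    partners = {next(iter(p - {word})): c for p, c in counts.items() if word in p}
    if not partners:
        return {}
    mu, sigma = mean(partners.values()), pstdev(partners.values())
    return {w: c for w, c in partners.items() if c > mu + k * sigma}


sentences = [
    ["make", "a", "decision"],
    ["make", "the", "decision"],
    ["make", "a", "quick", "decision"],
    ["make", "tea"],
]
counts = pair_counts(sentences)
print(strong_pairs(counts, "make"))
```

On this toy corpus, "decision" is the only partner of "make" that clears the threshold. A real system would work on a large corpus and, as the abstract notes, follow up with parsing to label the syntactic relation and filter out irrelevant candidates.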