Knowledge discovery in grammatically analysed corpora

Citation
S. Wallis and G. Nelson, Knowledge discovery in grammatically analysed corpora, DATA M K D, 5(4), 2001, pp. 305-335
Number of citations
35
Subject categories
AI Robotics and Automatic Control
Journal title
DATA MINING AND KNOWLEDGE DISCOVERY
ISSN journal
1384-5810
Volume
5
Issue
4
Year of publication
2001
Pages
305 - 335
Database
ISI
SICI code
1384-5810(2001)5:4<305:KDIGAC>2.0.ZU;2-U
Abstract
Collections of grammatically annotated texts (corpora), and in particular, parsed corpora, present a challenge to current methods of analysis. Such corpora are large and highly structured heterogeneous data sources. In this paper we briefly describe the parsed one-million word ICE-GB corpus, and the ICECUP query system. We then consider the application of knowledge discovery in databases (KDD) to text corpora. Following Cupit and Shadbolt (Proceedings 9th European Knowledge Acquisition Workshop, EKAW '96; Berlin: Springer Verlag, pp. 245-261, 1996), we argue that effective linguistic knowledge discovery must be based on a process of redescription or, more precisely, abstraction, based on the research question to be investigated. Abstraction maps relevant elements from the corpus to an abstract model of the research topic. This mapping may be implemented using a grammatical query representation such as ICECUP's Fuzzy Tree Fragments (FTFs). Since this abstractive process must be both experimental and expert-guided, ultimately a workbench is necessary to maintain, evaluate and refine the abstract model. We conclude with a pilot study, employing our approach, into aspects of noun phrase postmodifying clause structure. The data is analysed using the UNIT machine learning algorithm to search for significant interactions between domain variables. We show that our results are commensurable with those published in the linguistics literature, and discuss how the methodology may be improved.