Collections of grammatically annotated texts (corpora), and in particular,
parsed corpora, present a challenge to current methods of analysis. Such
corpora are large and highly structured heterogeneous data sources. In this
paper we briefly describe the parsed one-million word ICE-GB corpus, and the
ICECUP query system. We then consider the application of knowledge
discovery in databases (KDD) to text corpora. Following Cupit and Shadbolt
(Proceedings 9th European Knowledge Acquisition Workshop, EKAW '96; Berlin:
Springer Verlag, pp. 245-261, 1996), we argue that effective linguistic knowledge
discovery must be based on a process of redescription or, more precisely,
abstraction, based on the research question to be investigated. Abstraction
maps relevant elements from the corpus to an abstract model of the
research topic. This mapping may be implemented using a grammatical query
representation such as ICECUP's Fuzzy Tree Fragments (FTFs). Since this abstractive
process must be both experimental and expert-guided, ultimately a
workbench is necessary to maintain, evaluate and refine the abstract model. We
conclude with a pilot study, employing our approach, into aspects of noun
phrase postmodifying clause structure. The data is analysed using the UNIT
machine learning algorithm to search for significant interactions between domain
variables. We show that our results are commensurable with those published
in the linguistics literature, and discuss how the methodology may be
improved.
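
As a rough illustration of the abstraction step described above, the following Python sketch maps toy parse trees to a table of cases and then checks for an interaction between two variables. Everything in it is an assumption made for illustration: the `Node` structure, the labels ("SU", "OD", "NPPO", "CL", "NPHD"), the `form` feature and the helper names are hypothetical stand-ins, the hard-coded pattern is far simpler than a Fuzzy Tree Fragment, and a plain chi-square statistic is used in place of the UNIT algorithm, which is not reproduced here.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Toy parse-tree node; the labels used below are illustrative only."""
    function: str                  # grammatical function, e.g. "SU", "OD"
    category: str                  # grammatical category, e.g. "NP", "CL"
    form: Optional[str] = None     # hypothetical feature: "finite" / "nonfinite"
    children: List["Node"] = field(default_factory=list)


def walk(node):
    """Yield every node in a tree, depth first."""
    yield node
    for child in node.children:
        yield from walk(child)


def abstract_cases(trees):
    """Abstraction: map each NP-postmodifying clause to a row of variables."""
    rows = []
    for tree in trees:
        for node in walk(tree):
            if node.category != "NP":
                continue
            for child in node.children:
                # Hard-coded pattern standing in for a grammatical query.
                if child.function == "NPPO" and child.category == "CL":
                    rows.append({"np_function": node.function,
                                 "clause_form": child.form})
    return rows


def chi_square(rows, var_a, var_b):
    """Pearson chi-square statistic for the contingency table of two variables."""
    counts = Counter((r[var_a], r[var_b]) for r in rows)
    a_vals = sorted({r[var_a] for r in rows})
    b_vals = sorted({r[var_b] for r in rows})
    n = len(rows)
    chi2 = 0.0
    for a in a_vals:
        for b in b_vals:
            row_total = sum(counts[(a, x)] for x in b_vals)
            col_total = sum(counts[(x, b)] for x in a_vals)
            expected = row_total * col_total / n
            chi2 += (counts[(a, b)] - expected) ** 2 / expected
    return chi2


def toy_np(function, form):
    """Build an NP with a head and one postmodifying clause (hypothetical helper)."""
    head = Node("NPHD", "N")
    clause = Node("NPPO", "CL", form=form)
    return Node(function, "NP", children=[head, clause])
```

Calling `chi_square(abstract_cases(trees), "np_function", "clause_form")` on a list of such trees gives a statistic that can be compared against the critical value for the table's degrees of freedom. Keeping the abstraction (`abstract_cases`) separate from the analysis means the pattern and the variables can be revised by the researcher without touching the statistical step.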