IDENTIFYING HIERARCHICAL STRUCTURE IN SEQUENCES - A LINEAR-TIME ALGORITHM

Citation
Cg. Nevillmanning et Ih. Witten, IDENTIFYING HIERARCHICAL STRUCTURE IN SEQUENCES - A LINEAR-TIME ALGORITHM, The journal of artificial intelligence research, 7, 1997, pp. 67-82
Citations number
24
Categorie Soggetti
Controlo Theory & Cybernetics","Computer Science Artificial Intelligence
ISSN journal
10769757
Volume
7
Year of publication
1997
Pages
67 - 82
Database
ISI
SICI code
1076-9757(1997)7:<67:IHSIS->2.0.ZU;2-L
Abstract
SEQUITUR is an algorithm that infers a hierarchical structure from a s equence of discrete symbols by replacing repeated phrases with a gramm atical rule that generates the phrase, and continuing this process rec ursively. The result is a hierarchical representation of the original sequence, which offers insights into its lexical structure. The algori thm is driven by two constraints that reduce the size of the grammar, and produce structure as a by-product. S EQUITUR breaks new ground by . operating incrementally. Moreover, the method's simple structure per mits a proof that it operates in space and time that is linear in the size of the input. Our implementation can process 50,000 symbols per s econd and has been applied to an extensive range of real world sequenc es.