Combination and boundary detection approaches on Chinese indexing

Citation
Cc. Yang et al., Combination and boundary detection approaches on Chinese indexing, J AM S INFO, 51(4), 2000, pp. 340-351
Number of citations
30
Subject categories
Library & Information Science
Journal title
JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE
ISSN journal
0002-8231
Volume
51
Issue
4
Year of publication
2000
Pages
340 - 351
Database
ISI
SICI code
0002-8231(20000301)51:4<340:CABDAO>2.0.ZU;2-M
Abstract
Digital libraries store materials in electronic format. Research and development in digital libraries includes content creation, conversion, indexing, organization, and dissemination. The key technological issues are how to search and display desired selections from and across large collections effectively [Schatz & Chen, 1996]. Digital library research projects (DLI-1) sponsored by NSF/DARPA/NASA have a common theme of bringing search to the net, which is the flagship research effort for the National Information Infrastructure (NII) in the United States. A repository is an indexed collection of objects. Indexing is an important task for searching: the better the indexing, the better the search results. Developing a universal digital library has been the dream of many researchers; however, there are still many problems to be solved before such a vision is fulfilled. The most critical is supporting cross-lingual retrieval or a multilingual digital library. Much work has been done on English information retrieval, but relatively little on Chinese information retrieval. In this article, we focus on Chinese indexing, which is the foundation of Chinese and cross-lingual information retrieval. The smallest indexing units in Chinese digital libraries are words, while the smallest units in a Chinese sentence are characters. However, Chinese text has no delimiter to mark word boundaries as there is in English text. In English and other languages using Roman or Greek-based orthographies, spacing often reliably indicates word boundaries. In Chinese, a number of characters are placed together without any delimiters indicating the boundaries between consecutive words. In this article, we investigate the combination and boundary detection approaches based on mutual information for segmentation. The combination approach combines n-grams to form longer words with more characters. Within the combination approach, Algorithm 1 does not allow overlapping of n-grams, while Algorithm 2 does. The boundary detection approach detects the segmentation points in a sentence based on the values, and the changes in the values, of the mutual information. Experiments are conducted to evaluate the performance of these approaches. An interface of the system is also presented to show how a Chinese web page is downloaded, the text in the page is filtered, and the text is segmented into words. The segmented words can be submitted for indexing, or new unknown words can be identified and submitted to a dictionary.
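The boundary detection idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' algorithm: the abstract does not give the exact rules, so the sketch assumes pointwise mutual information over adjacent character pairs, a fixed threshold on the value, and a strict local minimum as a stand-in for "change of values". The function names `train_counts`, `mutual_information`, and `segment`, and the `threshold` parameter, are hypothetical.

```python
import math
from collections import Counter


def train_counts(corpus):
    """Count single characters and adjacent character pairs in a list of sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus:
        unigrams.update(sentence)
        bigrams.update(sentence[i:i + 2] for i in range(len(sentence) - 1))
    return unigrams, bigrams


def mutual_information(pair, unigrams, bigrams):
    """Pointwise MI, log2(P(ab) / (P(a) * P(b))); -inf when the pair was never seen."""
    if bigrams[pair] == 0:
        return float("-inf")
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    p_ab = bigrams[pair] / n_bi
    p_a = unigrams[pair[0]] / n_uni
    p_b = unigrams[pair[1]] / n_uni
    return math.log2(p_ab / (p_a * p_b))


def segment(sentence, unigrams, bigrams, threshold=0.0):
    """Assumed rule: cut between two characters when their MI is below
    `threshold`, or when it is a strict local minimum between its neighbours."""
    if len(sentence) < 2:
        return [sentence] if sentence else []
    mi = [mutual_information(sentence[i:i + 2], unigrams, bigrams)
          for i in range(len(sentence) - 1)]
    words, start = [], 0
    for i, value in enumerate(mi):
        # A local minimum is only defined where both neighbouring gaps exist.
        is_local_min = (0 < i < len(mi) - 1
                        and value < mi[i - 1] and value < mi[i + 1])
        if value < threshold or is_local_min:
            words.append(sentence[start:i + 1])
            start = i + 1
    words.append(sentence[start:])
    return words
```

With counts trained on a toy corpus such as `["中国人民", "中国银行", "人民银行", "中国", "人民", "银行"]`, calling `segment("中国人民银行", unigrams, bigrams)` cuts at the two low-MI gaps and yields the three two-character words. A real system would estimate the statistics from a large corpus and tune the threshold empirically.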