Digital libraries store materials in electronic format. Research and
development in digital libraries includes content creation, conversion,
indexing,
organization, and dissemination. The key technological issues are how to
search and display desired selections from and across large collections
effectively [Schatz & Chen, 1996]. Digital library research projects
(DLI-1) sponsored by NSF/DARPA/NASA, the flagship research effort for the
National Information Infrastructure (NII) in the United States, share a
common theme of bringing search to the net. A repository is an indexed
collection
of objects. Indexing is an important task for searching: the better the
indexing, the better the search results. Developing a universal digital
library has been the dream of many researchers; however, many problems
remain to be solved before such a vision is fulfilled. The most critical
is support for cross-lingual retrieval or a multilingual digital library.
Much work has been done on English information retrieval, but relatively
little on Chinese information retrieval. In this article, we focus on
Chinese indexing, which is the foundation of Chinese and cross-lingual
information retrieval. The smallest indexing units in Chinese digital
libraries are words, while the smallest units in a Chinese sentence are
characters. However, Chinese text has no delimiter to mark word
boundaries as there
is in English text. In English and other languages using Roman or
Greek-based orthographies, spacing often reliably indicates word
boundaries. In Chinese, characters are written one after another without
any delimiters indicating the boundaries between consecutive words. In
this article, we
investigate combination and boundary detection approaches to
segmentation, both based on mutual information. The combination approach
combines n-grams
to form longer words. In the combination approach, Algorithm 1 does not
allow n-grams to overlap, while Algorithm 2 does. The boundary detection
approach detects the segmentation points in a sentence
based on the values and the change of values of the mutual information.
Experiments are conducted to evaluate their performance. An interface of
the
system is also presented to show how a Chinese web page is downloaded and
how the text in the page is filtered and segmented into words. The
segmented words can be submitted for indexing, or new unknown words can be
identified and submitted to a dictionary.
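
To make the boundary detection idea concrete, the following is a minimal
Python sketch, not the algorithm evaluated in this article. It assumes that
mutual information is computed as the pointwise mutual information of
adjacent character pairs estimated from a small corpus, and that a
segmentation point is placed where this value falls below a threshold or
dips relative to both neighbouring pairs. The function names
(train_counts, mutual_information, segment), the threshold parameter, and
the toy corpus are all hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def train_counts(corpus):
    """Count single characters and adjacent character pairs (bigrams)."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus:
        unigrams.update(sentence)
        bigrams.update(sentence[i:i + 2] for i in range(len(sentence) - 1))
    return unigrams, bigrams


def mutual_information(pair, unigrams, bigrams):
    """Pointwise mutual information log2(P(ab) / (P(a) * P(b))).

    Returns -inf for a pair never seen in the corpus (an assumption of
    this sketch; a real system would likely smooth the estimates).
    """
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    if bigrams[pair] == 0:
        return float("-inf")
    p_ab = bigrams[pair] / n_bi
    p_a = unigrams[pair[0]] / n_uni
    p_b = unigrams[pair[1]] / n_uni
    return math.log2(p_ab / (p_a * p_b))


def segment(sentence, unigrams, bigrams, threshold=0.0):
    """Split a sentence at positions where adjacent characters bind weakly.

    A cut is made between two characters when their mutual information is
    below `threshold` (the value) or is a strict local minimum compared
    with both neighbouring pairs (the change of value).
    """
    if len(sentence) < 2:
        return [sentence] if sentence else []
    mi = [mutual_information(sentence[i:i + 2], unigrams, bigrams)
          for i in range(len(sentence) - 1)]
    words, start = [], 0
    for i, value in enumerate(mi):
        low = value < threshold
        dip = 0 < i < len(mi) - 1 and value < mi[i - 1] and value < mi[i + 1]
        if low or dip:
            words.append(sentence[start:i + 1])
            start = i + 1
    words.append(sentence[start:])
    return words


# Toy corpus (illustrative only): three two-character words in every order.
corpus = ["中国人民", "人民银行", "银行中国", "中国银行", "人民中国", "银行人民"]
unigrams, bigrams = train_counts(corpus)
print(segment("中国人民银行", unigrams, bigrams))  # ['中国', '人民', '银行']
```

In this toy corpus the pairs inside a word occur four times each and the
pairs that straddle two words occur once, so the mutual information dips at
every word boundary. Real text is far noisier, which is why the threshold
and the local-minimum rule would need to be tuned experimentally.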