One of the major drawbacks of current acoustically based speech
recognizers is that their performance deteriorates drastically with
noise. Our focus is to develop a computer system that performs speech
recognition based on visual information concerning the speaker. The
system automatically extracts visual speech features through
image-processing techniques that operate on facial images taken in a
normally illuminated environment. To cope with the dynamic nature of
change in speech patterns with respect to time, as well as the spatial
variations in the individual patterns, the proposed recognition scheme
uses a recurrent neural network architecture. By specifying a certain
behavior when the network is presented with exemplar sequences, the
recurrent network is trained with no more than feedforward complexity.
The network's desired behavior is based on characterizing a given word
by well-defined segments. Adaptive segmentation is employed to segment
the training sequences of a given class. This technique iterates the
execution of two steps. First, the sequences are segmented
individually. Then, a generalized version of dynamic time warping is
used to align the segments of all sequences. At each iteration, the
weights of the distance functions used in the two steps are updated in
a way that minimizes a segmentation error. The system is implemented
and tested on a few words. The results are satisfactory. In particular,
the system is able to distinguish between words with common segments.
Moreover, it largely tolerates variation in duration among words of the
same class. © 1998 SPIE and IS&T. [S1017-9909(98)00701-6].
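
For readers unfamiliar with the alignment step, the following is a minimal sketch of classic dynamic time warping between two sequences. It is not the generalized, weighted variant the abstract refers to, whose details are not given here. The function name `dtw`, the default absolute-difference distance, and the scalar example sequences are all assumptions made for illustration; in a setting like the one described, each element would presumably be a feature vector or a segment and `dist` a weighted distance.

```python
import math

def dtw(a, b, dist=lambda x, y: abs(x - y)):
    """Classic DTW: return (total cost, alignment path) for sequences a and b.

    `dist` is a hypothetical element-wise distance; both sequences are
    assumed to be non-empty.
    """
    n, m = len(a), len(b)
    # cost[i][j] = minimal cost of aligning a[:i] with b[:j]
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(a[i - 1], b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # a advances alone
                                 cost[i][j - 1],      # b advances alone
                                 cost[i - 1][j - 1])  # both advance
    # Backtrack from the end to recover which elements were matched.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        candidates = [
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        ]
        _, i, j = min(candidates)
    path.reverse()
    return cost[n][m], path

# Two renditions of the same pattern with different durations.
distance, path = dtw([1, 2, 3], [1, 2, 2, 3])
```

In this toy case the second sequence holds the middle value for one extra step; the warping path matches that repeated value to the same element of the first sequence, which illustrates how an alignment of this kind can absorb differences in duration.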