Tracking human lips in video is an important but notoriously difficult task. Accurately recovering their 3D motion from arbitrary head poses is more challenging still, yet necessary for natural interaction. Our approach is to build and train 3D models of lip motion to compensate for the information that cannot always be observed during tracking. We use physical models as a prior and combine them with statistical models, showing how the two can be smoothly and naturally integrated into a synthesis method and a MAP estimation framework for tracking. We have found that this approach allows us to track and synthesize the 3D shape of the lips accurately and robustly from arbitrary head poses in a 2D video stream. We demonstrate this with numerical results on reconstruction accuracy, examples of static fits, and audio-visual sequences. (C) 1998 Elsevier Science B.V. All rights reserved.
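The MAP formulation mentioned above fuses a learned prior over lip-shape parameters with noisy image observations. As a minimal sketch of that idea (not the paper's implementation), the scalar Gaussian case below shows how the estimate becomes a precision-weighted blend of prior and observation; all names and values are hypothetical.

```python
# Hypothetical illustration: MAP estimation of one lip-shape parameter,
# combining a Gaussian prior (from training data) with a noisy observation.

def map_estimate(obs, prior_mean, prior_var, obs_var):
    """MAP estimate for a scalar with prior N(prior_mean, prior_var) and
    Gaussian observation noise of variance obs_var. For Gaussians, the
    posterior mode is the precision-weighted mean of prior and observation."""
    precision_prior = 1.0 / prior_var
    precision_obs = 1.0 / obs_var
    return (precision_prior * prior_mean + precision_obs * obs) / (
        precision_prior + precision_obs
    )

# A prior centered at 0 and an observation of 2.0 with equal variances
# yield an estimate halfway between them.
print(map_estimate(2.0, 0.0, 1.0, 1.0))  # -> 1.0
```

When the observation is unreliable (large `obs_var`), the estimate falls back toward the prior, which is precisely how a trained model can fill in lip motion that the camera cannot see at a given head pose.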