S. K. Gupta and J. Schroeter, Pitch-synchronous frame-by-frame and segment-based articulatory analysis-by-synthesis, The Journal of the Acoustical Society of America, 94(5), 1993, pp. 2517-2530
This paper presents a pitch-synchronous analysis-by-synthesis
procedure for estimating model parameters for voiced speech. These
model parameters describe the vocal-tract shape and the time
derivative of the glottal area function. The excitation waveform is
derived from the glottal area function by incorporating source-tract
interaction using the current vocal-tract input impedance. The
corresponding analysis procedure for estimating the model parameters
once every pitch period is outlined. A significant improvement in
quality was obtained for the new pitch-synchronous analysis/synthesis
procedure relative to the fixed-frame-length-based scheme used
previously. It was also found that the new
pitch-synchronous articulatory analysis/synthesis scheme achieves
lower rms spectral distortion values than the 2.4 kb/s Federal standard
LPC-10E algorithm. A segment-based procedure for estimating the
vocal-tract model parameters at a rate much lower than the current pitch is
described. In this segment-based analysis-by-synthesis approach, the
model parameters are estimated every 50-100 ms. The parameters for the
intermediate pitch periods are derived by interpolation. The segments
are selected using a maximum likelihood segmentation algorithm that
segments an utterance into diphonelike units. A segment-based parameter
optimization scheme could lead to a highly economical representation
of the speech signal for potential applications in very low bit rate
speech coding and speech storage. The above schemes were optimized for a
pilot test sentence and then evaluated using eight test sentences for
a log-area and the Coker articulatory model representation of the
vocal tract. Nine listeners were asked to judge the quality of the
synthesis in a paired-comparison test and the results were analyzed using a
nonparametric one-tailed sign test. For the log-area representation of
the vocal tract, we found a significant degradation in speech quality
for the segment-based optimization procedure relative to the
frame-based procedure. However, for the Coker model representation, the
degradation was found to be insignificant. This shows that, unlike
cross-sectional areas, the movement of various articulators in the
vocal tract during speech production can be described with sufficient
accuracy by specifying the position of these articulators and by using
an interpolation function at time intervals much longer than a pitch
period.
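
The segment-based idea of estimating parameters only at sparse anchor times and filling in the intermediate pitch periods by interpolation can be illustrated with a minimal sketch. The abstract does not state which interpolation function was used, so linear interpolation is assumed here purely for illustration. The function name, the two-element parameter vectors, the 80 ms spacing, and the 10 ms pitch period are all hypothetical and are not taken from the paper.

```python
from bisect import bisect_right


def interpolate_parameters(anchor_times, anchor_params, pitch_times):
    """Linearly interpolate parameter vectors at each pitch-period time.

    anchor_times  : increasing times (s) at which parameters were estimated
    anchor_params : one parameter vector (list of floats) per anchor time
    pitch_times   : times (s) of the pitch periods to fill in

    Linear interpolation is an assumption of this sketch, not the
    interpolation function used in the paper.
    """
    result = []
    for t in pitch_times:
        # Clamp to the outer anchors instead of extrapolating.
        if t <= anchor_times[0]:
            result.append(list(anchor_params[0]))
            continue
        if t >= anchor_times[-1]:
            result.append(list(anchor_params[-1]))
            continue
        # Locate the segment [t0, t1] that contains t.
        i = bisect_right(anchor_times, t) - 1
        t0, t1 = anchor_times[i], anchor_times[i + 1]
        w = (t - t0) / (t1 - t0)
        p0, p1 = anchor_params[i], anchor_params[i + 1]
        result.append([(1 - w) * a + w * b for a, b in zip(p0, p1)])
    return result


# Hypothetical values: two anchors 80 ms apart, pitch periods every 10 ms.
anchors_t = [0.00, 0.08]
anchors_p = [[1.0, 0.2], [0.5, 0.6]]
pitch_t = [0.01 * k for k in range(9)]
frames = interpolate_parameters(anchors_t, anchors_p, pitch_t)
```

In this sketch only the two anchor vectors would need to be stored or transmitted. The nine per-pitch-period vectors in `frames` are reconstructed from them, which is the source of the economy the abstract attributes to the segment-based representation.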