Two approaches to detecting and tracking speakers in multispeaker audio are
described. Both approaches use an adapted Gaussian mixture model/universal
background model (GMM-UBM) speaker detection system as the core speaker
recognition engine. In the first approach, the individual log-likelihood
ratio scores, which the GMM-UBM system produces on a frame-by-frame basis,
are used first to partition the speech file into speaker-homogeneous regions
and then to create scores for these regions. We refer to this approach as
internal segmentation. The second approach uses an external segmentation
algorithm, based on blind clustering, to partition the speech file into
speaker-homogeneous regions. The adapted GMM-UBM system then scores each of
these regions as in the single-speaker recognition case. We show that the
external segmentation system outperforms the internal segmentation system
for both detection and tracking. In addition, we show how different
components of the detection and tracking algorithms contribute to the
overall system performance. (C) 2000 Academic Press.
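
The internal segmentation idea can be illustrated with a minimal sketch. It assumes the frame-level log-likelihood ratio scores are already available as a list of floats. The moving-average smoothing, the window length, the zero threshold, and the function names `moving_average` and `segment_by_llr` are illustrative assumptions, not details taken from the systems described above.

```python
from typing import List, Tuple

Region = Tuple[int, int, bool, float]  # (start, end_exclusive, is_target, mean_llr)


def moving_average(scores: List[float], window: int) -> List[float]:
    """Centered moving average; the window shrinks at the file edges."""
    half = window // 2
    out = []
    for i in range(len(scores)):
        lo = max(0, i - half)
        hi = min(len(scores), i + half + 1)
        out.append(sum(scores[lo:hi]) / (hi - lo))
    return out


def segment_by_llr(frame_llr: List[float],
                   window: int = 3,
                   threshold: float = 0.0) -> List[Region]:
    """Partition frames into contiguous regions and score each region.

    Assumption: a frame is labeled "target" when its smoothed
    log-likelihood ratio exceeds `threshold`. Each region's score is the
    mean of the unsmoothed frame scores it contains.
    """
    if not frame_llr:
        return []
    smoothed = moving_average(frame_llr, window)
    labels = [s > threshold for s in smoothed]

    regions: List[Region] = []
    start = 0
    for i in range(1, len(labels) + 1):
        # Close the current region at a label change or at the end of the file.
        if i == len(labels) or labels[i] != labels[start]:
            mean_llr = sum(frame_llr[start:i]) / (i - start)
            regions.append((start, i, labels[start], mean_llr))
            start = i
    return regions
```

Calling `segment_by_llr` on the scores of one file returns a list of regions, each carrying its frame span, a target/non-target label, and a region-level score. A detection decision could then be based on the best-scoring target region, and a tracking output on the region boundaries, though both choices are again assumptions of this sketch.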