The words spoken in an audio-visual document form an obvious and intuitive metadata component. This component is essential to ensure comprehensive coverage of audio-visual content by the MPEG-7 standard. Since manual transcription is prohibitively costly, such metadata will typically be derived from automatic speech recognition systems. The errors inherent in the output of such extraction tools cause particular difficulties for robust retrieval, as well as for interoperability in heterogeneous databases. We describe a structure comprising a probabilistic combined word and phone lattice along with an explanatory metadata header, and detail how this structure avoids or ameliorates these problems.
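Purely as an illustration, the combined word and phone lattice with an explanatory header might be sketched as follows. All names, fields, and values here (the `Arc`/`SpokenContentLattice` classes, the header keys, the example labels and probabilities) are assumptions for exposition, not the structure defined in the paper or the MPEG-7 standard.

```python
from dataclasses import dataclass, field

@dataclass
class Arc:
    start: int    # source node id in the lattice
    end: int      # destination node id
    label: str    # word or phone hypothesis
    kind: str     # "word" or "phone"
    prob: float   # posterior probability assigned by the recognizer

@dataclass
class SpokenContentLattice:
    # Hypothetical explanatory header: records how the lattice was
    # produced, so heterogeneous databases can interpret it consistently.
    header: dict = field(default_factory=dict)
    arcs: list = field(default_factory=list)

    def hypotheses(self, kind):
        """Return all labels of a given kind ('word' or 'phone')."""
        return [a.label for a in self.arcs if a.kind == kind]

# Illustrative instance: word arcs for in-vocabulary terms, with
# parallel phone arcs that a retrieval system could fall back on
# when a query term is out of vocabulary or misrecognized.
lat = SpokenContentLattice(
    header={"recognizer": "hypothetical ASR v1",   # assumed field names
            "phone_set": "SAMPA",
            "word_lexicon_size": 20000},
    arcs=[
        Arc(0, 1, "spoken", "word", 0.62),
        Arc(0, 1, "s", "phone", 0.91),
        Arc(1, 2, "content", "word", 0.74),
    ],
)
print(lat.hypotheses("word"))
```

Keeping both hypothesis streams, each weighted by its posterior probability, is what lets a retrieval engine tolerate recognition errors: a term missed at the word level may still be matched against the phone arcs.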