A content-based video parsing and indexing method is presented in this paper, which analyzes both information sources (auditory and visual) and accounts for their inter-relations and synergy to extract high-level semantic information. Both frame- and object-based access to the visual information is employed. The aim of the method is to extract semantically meaningful video scenes and assign semantic label(s) to them. Due to the temporal nature of video, time must be accounted for; thus, time-constrained video representations and indices are generated. The current approach searches for specific types of content information relevant to the presence or absence of speakers or persons. Audio-source parsing and indexing leads to the extraction of a speaker-label mapping of the source over time. Video-source parsing and indexing results in the extraction of a talking-face shot mapping over time. Integration of the audio and visual mappings, constrained by interaction rules, leads to higher levels of video abstraction and even partial detection of its context.
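The integration step described above can be sketched as an intersection of the two time mappings: a segment receives a combined label when a speaker-label interval from the audio source temporally overlaps a talking-face shot from the video source. The function below is a minimal illustration of that idea, not the authors' implementation; the interval representation and the simple overlap rule are assumptions for illustration.

```python
# Illustrative sketch (not the authors' implementation): integrating a
# time-indexed speaker-label mapping (audio) with a talking-face shot
# mapping (visual) by temporal interval intersection.

def intersect_mappings(speaker_map, face_map):
    """Return segments where a labeled speaker co-occurs with a talking face.

    speaker_map: list of (start, end, speaker_label) from audio parsing
    face_map:    list of (start, end) talking-face shots from video parsing
    """
    segments = []
    for s_start, s_end, label in speaker_map:
        for f_start, f_end in face_map:
            start, end = max(s_start, f_start), min(s_end, f_end)
            if start < end:  # non-empty temporal overlap
                segments.append((start, end, label))
    return sorted(segments)

# Example: two consecutive speakers overlapping one talking-face shot
speaker_map = [(0.0, 5.0, "speaker_A"), (5.0, 9.0, "speaker_B")]
face_map = [(2.0, 7.0)]
print(intersect_mappings(speaker_map, face_map))
# → [(2.0, 5.0, 'speaker_A'), (5.0, 7.0, 'speaker_B')]
```

In a full system the interaction rules would be richer than pure overlap (e.g. tolerating small audio/visual misalignments), but the output has the same shape: a time-constrained index of labeled segments supporting higher-level abstraction.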