Content-based video parsing and indexing based on audio-visual interaction

Citation
S. Tsekeridou and I. Pitas, Content-based video parsing and indexing based on audio-visual interaction, IEEE CIR SV, 11(4), 2001, pp. 522-535
Number of citations
45
Subject categories
Electrical & Electronics Engineering
Journal title
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY
ISSN journal
1051-8215
Volume
11
Issue
4
Year of publication
2001
Pages
522 - 535
Database
ISI
SICI code
1051-8215(200104)11:4<522:CVPAIB>2.0.ZU;2-W
Abstract
A content-based video parsing and indexing method is presented in this paper, which analyzes both information sources (auditory and visual) and accounts for their inter-relations and synergy to extract high-level semantic information. Both frame- and object-based access to the visual information is employed. The aim of the method is to extract semantically meaningful video scenes and assign semantic label(s) to them. Due to the temporal nature of video, time has to be accounted for. Thus, time-constrained video representations and indices are generated. The current approach searches for specific types of content information relevant to the presence or absence of speakers or persons. Audio-source parsing and indexing leads to the extraction of a speaker label mapping of the source over time. Video-source parsing and indexing results in the extraction of a talking-face shot mapping over time. Integration of the audio and visual mappings constrained by interaction rules leads to higher levels of video abstraction and even partial detection of its context.
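To illustrate the integration step the abstract describes, the following is a minimal sketch of how a time-indexed speaker mapping (audio) and a talking-face shot mapping (video) could be combined by a simple interaction rule. The `Segment` representation, the `integrate` function, and the overlap threshold are hypothetical choices for illustration, not the authors' implementation.

```python
# Illustrative sketch only: segment format, rule, and threshold are assumptions.
from dataclasses import dataclass

@dataclass
class Segment:
    start: float  # seconds
    end: float    # seconds
    label: str    # e.g. a speaker identity or "talking_face"

def overlap(a: Segment, b: Segment) -> float:
    """Temporal overlap in seconds between two segments (0 if disjoint)."""
    return max(0.0, min(a.end, b.end) - max(a.start, b.start))

def integrate(speaker_map, face_map, min_overlap=1.0):
    """Interaction rule: a talking-face shot that overlaps a speaker
    segment by at least `min_overlap` seconds is indexed as that
    speaker talking on screen over the overlapping interval."""
    index = []
    for shot in face_map:
        for seg in speaker_map:
            if overlap(shot, seg) >= min_overlap:
                index.append((max(shot.start, seg.start),
                              min(shot.end, seg.end),
                              f"{seg.label} talking on screen"))
    return index

# Example: speaker A is heard during 0-8 s; a talking-face shot spans 2-10 s.
speakers = [Segment(0.0, 8.0, "speaker_A")]
faces = [Segment(2.0, 10.0, "talking_face")]
print(integrate(speakers, faces))  # [(2.0, 8.0, 'speaker_A talking on screen')]
```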