This research explores the interaction of textual and photographic information in image understanding. Specifically, it presents a computational model whereby textual captions are used as collateral information in the interpretation of the corresponding photographs. The final understanding of the picture and caption reflects a consolidation of the information obtained from each of the two sources and can thus be used in intelligent information-retrieval tasks. Building a general-purpose computer vision system without a priori knowledge remains extremely difficult. The concept of using collateral information in scene understanding has been explored in systems that use general scene context in the task of object identification. The work described here extends this notion by incorporating picture-specific information. A multi-stage system, FICTION, which uses captions to identify humans in an accompanying photograph, is described. This provides a computationally less expensive alternative to traditional methods of face recognition. A key component of the system is the utilization of spatial and characteristic constraints (derived from the caption) in labeling face candidates (generated by a face locator).
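The constraint-based labeling step can be illustrated with a minimal sketch. This is not the FICTION implementation; the function names, the candidate representation (bounding-box centers), and the single `left_of` predicate are all illustrative assumptions. The sketch enumerates assignments of caption names to face candidates and keeps only those consistent with the caption-derived spatial constraints.

```python
from itertools import permutations

def left_of(a, b):
    """Hypothetical spatial predicate: candidate a lies to the left of b."""
    return a["x"] < b["x"]

def label_faces(names, candidates, constraints):
    """Return every assignment of names to face candidates that satisfies
    all caption-derived constraints.

    Each constraint is a (predicate, name_a, name_b) triple; candidates
    are dicts holding an id and a bounding-box center (illustrative only).
    """
    labelings = []
    for perm in permutations(candidates, len(names)):
        assignment = dict(zip(names, perm))
        if all(pred(assignment[a], assignment[b])
               for pred, a, b in constraints):
            labelings.append({n: c["id"] for n, c in assignment.items()})
    return labelings

# Hypothetical face-locator output: three candidate boxes, left to right.
candidates = [
    {"id": 0, "x": 40,  "y": 100},
    {"id": 1, "x": 200, "y": 95},
    {"id": 2, "x": 360, "y": 110},
]

# Caption "John (left) and Mary" yields one spatial constraint.
constraints = [(left_of, "John", "Mary")]
print(label_faces(["John", "Mary"], candidates, constraints))
```

In this toy case three of the six possible name-to-candidate assignments survive the single constraint; in practice additional characteristic constraints (e.g., gender or age cues from the caption) would prune the set further, which is why constraint satisfaction over a handful of candidates is far cheaper than full face recognition.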