Grounded language learning from videos described with sentences
Haonan Yu and Jeffrey Mark Siskind
The 51st Annual Meeting of the Association for Computational Linguistics (ACL 2013)
Sofia, Bulgaria, August 4-9, 2013
We present a method that learns representations for word meanings from short video clips paired with sentences. Unlike prior work on learning language from symbolic input in the form of text or logical representations extracted from simple images or synthesized video, such as that of blocks against a uniform background, our input consists of video of people interacting with multiple complex objects in outdoor environments. Unlike prior computer-vision approaches that learn object or event models from training data with word-level labels for events in a video and individual objects in particular frames, our labels consist of sentences for the entire video. These sentences can contain nouns, verbs, prepositions, adjectives, and adverbs. Our method does not require labels indicating which words in the sentence correspond to which concepts in the video: it learns the word-to-meaning mappings in an unsupervised fashion. It does so even when the video depicts multiple simultaneous events described by multiple sentences, and even when a single event can be described with different sentences that highlight different aspects of that event. The learning does not require human annotation of the position or motion of the event participants: it determines that on its own. The learned semantic representations can subsequently be used to automatically generate sentential descriptions of new video.
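To make the unsupervised word-to-meaning mapping problem concrete, here is a minimal cross-situational-learning sketch: the learner sees only (clip, sentence) pairs and must work out which word goes with which concept from co-occurrence statistics across situations. This toy script, with hypothetical data, is an illustration of the learning problem only, not the method of the paper.

```python
from collections import defaultdict

# Toy "videos": each clip is reduced to the set of concepts present in it,
# paired with a whole-sentence label. No word is ever linked to a concept.
# (Hypothetical data for illustration; not from the paper.)
data = [
    ({"person", "backpack", "approach"}, "the person approached the backpack"),
    ({"person", "chair", "approach"},    "the person approached the chair"),
    ({"person", "backpack", "pickup"},   "the person picked up the backpack"),
    ({"person", "chair", "pickup"},      "the person picked up the chair"),
]

# Count, for every word, how often each concept is present when the word is
# uttered, plus the marginal counts of words and concepts.
cooc = defaultdict(lambda: defaultdict(int))
word_count = defaultdict(int)
concept_count = defaultdict(int)
for concepts, sentence in data:
    for c in concepts:
        concept_count[c] += 1
    for w in set(sentence.split()):
        word_count[w] += 1
        for c in concepts:
            cooc[w][c] += 1

def best_concept(word):
    # PMI-style score: concepts that co-occur with this word more often than
    # their overall frequency predicts are favored, which prevents a
    # ubiquitous concept (e.g. "person") from attaching to every word.
    return max(cooc[word],
               key=lambda c: cooc[word][c] / (word_count[word] * concept_count[c]))

for w in ("backpack", "chair", "approached", "picked"):
    print(w, "->", best_concept(w))
```

Across situations, "backpack" occurs with and without "approach" but always with the backpack concept, so the ambiguity present in any single clip is resolved by the corpus as a whole; the same cross-situational pressure is what lets the paper's model succeed without word-level labels.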