Features automatically extracted from images constitute a new and rich source of semantic knowledge that can complement information extracted from text. The convergence between vision- and text-based information can be exploited in scenarios where the two modalities must be combined to solve a target task (e.g., generating verbal descriptions of images, or finding the right images to illustrate a story). However, the potential applications for integrated visual features go beyond mixed-media scenarios: Because of their complementary nature with respect to language, visual features might provide perceptually grounded semantic information that can be exploited in purely linguistic domains.

The tutorial will first introduce basic techniques to encode image contents in terms of low-level features, such as the widely adopted SIFT descriptors. We will then show how these low-level descriptors are used to induce more abstract features, focusing on the well-established bag-of-visual-words method for representing images, but also briefly introducing more recent developments, including the capture of spatial information with spatial pyramid representations, soft visual-word assignment via Fisher encoding, and attribute-based image representations. Next, we will discuss some example applications, and we will conclude with a brief practical illustration of visual feature extraction using a software package we developed.
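To make the pipeline described above more concrete, the following is a minimal sketch of the SIFT plus bag-of-visual-words steps, using OpenCV and scikit-learn rather than the software package presented in the tutorial; the function names, the vocabulary size of 1000, and the input paths are illustrative assumptions, not part of the tutorial materials.

import cv2
import numpy as np
from sklearn.cluster import KMeans

def extract_sift(image_path):
    """Detect keypoints and compute 128-d SIFT descriptors for one image."""
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    sift = cv2.SIFT_create()
    _, descriptors = sift.detectAndCompute(gray, None)
    return descriptors  # shape (n_keypoints, 128), or None if no keypoints found

def build_vocabulary(descriptor_list, n_words=1000):
    """Cluster descriptors pooled over a training set; the cluster
    centres play the role of 'visual words'."""
    pooled = np.vstack([d for d in descriptor_list if d is not None])
    return KMeans(n_clusters=n_words, n_init=4, random_state=0).fit(pooled)

def bovw_histogram(descriptors, vocabulary):
    """Assign each descriptor to its nearest visual word (hard assignment)
    and return the normalised count histogram representing the image."""
    words = vocabulary.predict(descriptors)
    counts = np.bincount(words, minlength=vocabulary.n_clusters).astype(float)
    return counts / counts.sum()

The resulting histogram is the basic bag-of-visual-words vector; spatial pyramids concatenate such histograms computed over image subregions, and Fisher encoding replaces the hard k-means assignment with soft assignment under a Gaussian mixture model.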