RESEARCH NEWS



Multi-Modal Primitives for Scene Analysis


We define local multi-modal visual Primitives that represent scene information in a condensed and sparse way. Each attribute carries a confidence that is adapted by contextual information through recurrent processes across modalities, across spatial distances and across time frames. In this way, through a process of recurrent predictions, the visual information becomes more reliable.

March 18, 2003 — The aim of this work is to compute reliable feature maps from natural scenes. We believe that artificial systems can only perform reliable actions if they are based on reliable features. Because local feature extraction is necessarily ambiguous, such features can only be computed through integration across the spatial and temporal context and across visual modalities. The European Project ECOVISION focuses exactly on this issue, and the work described here is a central pillar of this ongoing project. We have developed a new kind of image representation in terms of local multi-modal Primitives. These Primitives can be characterized by four properties.

Multi-modality: Different domains that describe different kinds of structure in visual data are well established in human vision and computer vision. For example, a local edge can be analyzed by local feature attributes such as orientation or energy in certain frequency bands. In addition, we can distinguish between line-like and step-edge-like structures (contrast transition). Furthermore, color can be associated with the edge. The image patch also changes over time due to ego-motion or object motion, so time-specific features such as a 2D velocity vector (optic flow) can be associated with it as well. In this work we define local multi-modal Primitives that realize these multi-modal relations. The modalities, in addition to the usually applied semantic parameters position and orientation, are contrast transition, color and optic flow.

Adaptability: The interpretation of local image patches in terms of the above-mentioned attributes, as well as classifications such as 'edgeness' or 'junctionness', is necessarily ambiguous when based on local processing alone; stable interpretations can only be achieved through integration, making use of contextual information. Therefore, all attributes of our Primitives are equipped with a confidence that is adaptable according to contextual information and expresses the reliability of that attribute. Furthermore, the feature attributes themselves adapt according to the context.

Condensation: Integration of information requires communication between Primitives expressing spatial and temporal dependencies. This communication necessarily comes at a cost, which can be reduced by limiting the amount of information transferred from one place to the other, i.e., by reducing the bandwidth. Therefore we aim at a condensed representation. For other tasks as well, e.g., learning objects, it is essential to store information in a condensed way to reduce memory requirements.

Meaningfulness: Communication and memorization require more than a mere reduction of information. We want to reduce the amount of information within an image patch while preserving perceptually relevant information. This leads to meaningful descriptors such as our attributes position, orientation, contrast transition, color and optic flow.
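
As a rough illustration of this representation, the following Python sketch shows how a Primitive carrying the attributes named above, each equipped with a confidence, might be organized. The class names, fields and the simple averaging update rule are our own assumptions for the purpose of illustration and are not taken from the ECOVISION implementation.

    from dataclasses import dataclass

    @dataclass
    class Attribute:
        value: object            # e.g. an angle, an (r, g, b) triple, a 2D flow vector
        confidence: float = 0.5  # reliability of this attribute, between 0 and 1

    @dataclass
    class Primitive:
        position: tuple                 # image location (x, y)
        orientation: Attribute          # local edge orientation
        contrast_transition: Attribute  # line-like vs. step-edge-like structure
        color: Attribute                # color associated with the edge
        optic_flow: Attribute           # 2D velocity vector of the patch

        def adapt(self, attribute_name, context_confidences, rate=0.2):
            """Pull one attribute's confidence toward the mean confidence
            reported by contextually related Primitives (a hypothetical rule)."""
            attr = getattr(self, attribute_name)
            if context_confidences:
                context = sum(context_confidences) / len(context_confidences)
                attr.confidence += rate * (context - attr.confidence)

    # Example: neighboring Primitives with consistent, confident orientation
    # estimates increase the confidence of this Primitive's orientation.
    p = Primitive(position=(120, 64),
                  orientation=Attribute(value=0.8),
                  contrast_transition=Attribute(value='step-edge'),
                  color=Attribute(value=(0.3, 0.3, 0.9)),
                  optic_flow=Attribute(value=(1.5, -0.2)))
    p.adapt('orientation', context_confidences=[0.9, 0.8])

In this hypothetical scheme, each attribute's confidence is nudged toward the confidences reported by contextually related Primitives, which mirrors the idea that reliability emerges through recurrent integration rather than from local measurements alone.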



Norbert Krüger
Michael Felsberg
Florentin Wörgötter
Department of Psychology
University of Stirling




