
D1 - 12/12/2000

This document contains information that is the property of France Télécom. Acceptance of this document by its recipient implies, on the recipient's part, acknowledgment of the confidential nature of its contents and a commitment not to reproduce it, transmit it to third parties, disclose it, or use it commercially without the prior written consent of France Télécom R&D.

France Telecom's expectations and research in Object Recognition

Henri Sanson, Christophe Laurent, Olivier Bernier

D2 - 12/12/2000 France Télécom R&D

Outline

France Telecom Markets and context evolution

Visual content indexing: from low-level to semantic description

Overview of current research in Image retrieval and video annotation

Object recognition for Human Computer Interface


France Telecom Markets and Applications

France Telecom is a global telecommunications operator: from telephony to multimedia and audiovisual services
  Fixed networks/services
  Mobile networks/services
  Internet access and services
  IP data and communication services for corporations

2 structuring trends:
  An increasing importance and presence of visual content in the services:
    leveraging the higher bandwidth
    high value-added services (people pay for content and the means to reach it easily)
  A need to compensate for increasing technological and functional complexity by providing natural Human Machine/Service Interfaces
    Vocal interfaces
    Visual interfaces


Visual content indexing: from low-level to semantic description (1)

Context: more visual content, huge data volumes, temporal constraints for videos
Need for efficient indexing methods enabling fast and relevant access

Applications:
  Media asset management: in addition to traditional audiovisual companies (TV, production), more and more enterprises own image or video assets and face management issues.
    Relevance is of prime importance
    But cost effectiveness is becoming more and more acute
  Web search and filtering engines
    Huge volume, very variable: robust automatic indexing is the only solution
    Although surrounding text may be used, the visual content itself is the only reliable source to use
  Video surveillance:
    Specific environment and content type
    Automatic processing



Traditionally, 2 radically opposite approaches:
  Accurate but manual semantic annotation (ontologies)
    Time consuming
    The indexing choices limit possible queries
  Low-level feature-based descriptions, a.k.a. "Color, Texture and Shape" (MPEG-7 Visual framework)
    Automatic processing, but very limited in practice (save for some classification purposes): relies on query-by-example, which is of little practical use

Emerging trend: convergence of both previous paradigms
  Automated knowledge-based semantic indexing using visual recognition
  Many advantages:
    Semantic and automatic
    No language barrier
    The index can always be complemented later
  But still difficult!



Constraints of the application impacting the recognition:
  High variability of shooting conditions: the same objects appear very differently (color, pose, scale, shadows)
  High variability in the content type (indoor, outdoor, news, movies, sports)
  Potentially huge number of objects or object categories to recognize concurrently
  Real-time operation is targeted for video, and much faster processing for still images
  The recognition approach is expected to be as flexible as possible for generic objects

Qualification of the methods must ultimately be done through real field experimentation by true end users, and is measured by their satisfaction rate.


Current work (1)

Research:

Color space invariance w.r.t. shooting conditions

Salient feature-based image retrieval and object description

Face detection and recognition in generic images

Development: Video indexing platform

Video: shot change detection, specific image labelling (news presenter, weather report, commercial jingle), face detection, text detection

Audio/speech: speech/music/other segmentation, keyword recognition, free-vocabulary phonetic search
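
The slides do not say how the platform detects shot changes. As an illustration only, here is a minimal sketch of one common baseline: compare grey-level histograms of consecutive frames and flag a cut when they differ strongly. The function names, the 16-bin histogram and the 0.5 threshold are all assumptions made for this example.

```python
def gray_histogram(frame, bins=16):
    """Normalised grey-level histogram of a frame given as rows of 0..255 ints."""
    hist = [0] * bins
    count = 0
    for row in frame:
        for value in row:
            hist[min(value * bins // 256, bins - 1)] += 1
            count += 1
    return [h / count for h in hist]


def detect_shot_changes(frames, threshold=0.5):
    """Return indices of frames whose histogram differs sharply from the previous one.

    The distance is half the L1 distance between normalised histograms,
    so it lies in [0, 1]. The threshold is a hypothetical value.
    """
    cuts = []
    previous = None
    for index, frame in enumerate(frames):
        hist = gray_histogram(frame)
        if previous is not None:
            distance = 0.5 * sum(abs(a - b) for a, b in zip(hist, previous))
            if distance > threshold:
                cuts.append(index)
        previous = hist
    return cuts


# Three dark frames followed by three bright frames: one cut, at frame 3.
dark = [[10] * 8 for _ in range(8)]
bright = [[200] * 8 for _ in range(8)]
cuts = detect_shot_changes([dark, dark, dark, bright, bright, bright])
```

A global histogram ignores spatial layout, so this baseline misses gradual transitions (fades, dissolves) and can be fooled by flashes; a production system would need more than this.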


Saliency-based Color Image Indexing

Image signature is extracted from a limited number of perceptually important pixels called salient points

Salient points are computed by combining a discrete wavelet transform with a Zerotree representation of wavelet coefficients

Salient points are located mostly on sharp boundaries

The image signature is composed of a color correlogram computed in the neighborhood of each salient point

This signature can be complemented with a texture signature computed around the salient points

An invariant color space (c1c2c3) is used to be robust to imaging conditions
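
To make the signature described above concrete, here is a minimal sketch under stated assumptions. It uses the usual c1c2c3 definition (each channel expressed as an angle against the maximum of the other two) and a color autocorrelogram restricted to a window around one salient point. The quantization (4 bins per channel), the window radius, the distance set and all function names are assumptions for illustration, not the parameters used in the work presented here.

```python
import math

HALF_PI = math.pi / 2


def rgb_to_c1c2c3(r, g, b):
    """c1c2c3 color space: unchanged when R, G and B are scaled by the same factor."""
    return (math.atan2(r, max(g, b)),
            math.atan2(g, max(r, b)),
            math.atan2(b, max(r, g)))


def quantize(c1c2c3, bins=4):
    """Map a c1c2c3 triple (each value in [0, pi/2]) to one index in [0, bins**3)."""
    index = 0
    for value in c1c2c3:
        level = min(int(value / HALF_PI * bins), bins - 1)
        index = index * bins + level
    return index


def local_autocorrelogram(labels, point, radius, distances, n_colors):
    """Autocorrelogram of quantized colors in a square window around `point`.

    result[c][k] estimates the probability that a pixel at chessboard
    distance distances[k] from a window pixel of color c also has color c.
    """
    height, width = len(labels), len(labels[0])
    row0, col0 = point
    same = [[0] * len(distances) for _ in range(n_colors)]
    total = [[0] * len(distances) for _ in range(n_colors)]
    for r in range(max(0, row0 - radius), min(height, row0 + radius + 1)):
        for c in range(max(0, col0 - radius), min(width, col0 + radius + 1)):
            color = labels[r][c]
            for k, d in enumerate(distances):
                for dr in range(-d, d + 1):
                    for dc in range(-d, d + 1):
                        if max(abs(dr), abs(dc)) != d:
                            continue  # keep only the ring at distance d
                        rr, cc = r + dr, c + dc
                        if 0 <= rr < height and 0 <= cc < width:
                            total[color][k] += 1
                            if labels[rr][cc] == color:
                                same[color][k] += 1
    return [[same[c][k] / total[c][k] if total[c][k] else 0.0
             for k in range(len(distances))]
            for c in range(n_colors)]


# Doubling the illumination intensity leaves c1c2c3 unchanged.
dim = rgb_to_c1c2c3(100, 50, 25)
lit = rgb_to_c1c2c3(200, 100, 50)

# A uniformly colored 7x7 patch, described around its center pixel.
flat_color = quantize(lit)
flat = [[flat_color] * 7 for _ in range(7)]
signature = local_autocorrelogram(flat, (3, 3), 2, [1, 2], 64)
```

A full image signature would concatenate or accumulate these local correlograms over all salient points; how that aggregation is done is not specified in the slides.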


Salient Points Extraction
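
As a rough illustration of the idea, the sketch below scores each 2x2 block of a grey-level image by the magnitude of its single-level Haar detail coefficients and keeps the highest-scoring blocks. This is a deliberate simplification: it uses one decomposition level only and does not implement the Zerotree tracking of coefficients across scales used in the actual method. The function name and the top-N selection rule are assumptions.

```python
def haar_salient_points(gray, n_points):
    """Return the top-left (row, col) of the n_points 2x2 blocks with the
    largest single-level Haar detail energy.

    `gray` is a list of rows of grey levels. Simplified sketch: one level,
    no multi-scale (Zerotree) tracking.
    """
    height, width = len(gray), len(gray[0])
    scored = []
    for r in range(0, height - 1, 2):
        for c in range(0, width - 1, 2):
            a, b = gray[r][c], gray[r][c + 1]
            d, e = gray[r + 1][c], gray[r + 1][c + 1]
            horizontal = (a - b + d - e) / 2  # responds to vertical edges
            vertical = (a + b - d - e) / 2    # responds to horizontal edges
            diagonal = (a - b - d + e) / 2
            energy = abs(horizontal) + abs(vertical) + abs(diagonal)
            scored.append((energy, r, c))
    scored.sort(key=lambda item: item[0], reverse=True)
    return [(r, c) for _, r, c in scored[:n_points]]


# 8x8 image with a vertical step edge between columns 2 and 3.
step = [[0, 0, 0, 255, 255, 255, 255, 255] for _ in range(8)]
points = haar_salient_points(step, 4)
```

On this test image, every selected block straddles the step edge, consistent with the observation that salient points concentrate on sharp boundaries.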


Experimental Results

Database containing ~2000 TV images
Extraction of 18 difficult queries
Computation of a ranking metric
Comparison with the MPEG-7 SCD (Scalable Color Descriptor)

Outperforms SCD in 90% of cases!


Object recognition for Human Computer Interface (1)

Context:
  Service functionalities are becoming more sophisticated every day
  End users expect simpler user interfaces
  Visual interaction appears to be a good complement to the more usual vocal interfaces:
    Universality: much less constrained by linguistic variability
    Webcams are widespread
    Possibly less sensitive to environmental noise and capture conditions
    Permits fast interaction: a picture is worth 1000 words


Visual recognition for Human-Computer Interfaces

Face detection and tracking

Neural network based face detection for still images.
Extension to real-time face detection in video streams.
Real-time face tracking for HCI using statistical models (EM, particle filtering).
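
To show what particle filtering means in a tracking context, here is a minimal bootstrap particle filter that follows a single coordinate (for instance the horizontal position of a detected face) from noisy measurements. The random-walk motion model, the Gaussian observation model, the noise levels and the function name are all assumptions for this sketch; the tracker mentioned above works on real video and is certainly richer.

```python
import math
import random


def particle_filter_track(observations, n_particles=500,
                          motion_std=2.0, obs_std=3.0, seed=0):
    """Estimate a 1D position over time from noisy observations.

    Each step: predict (random-walk motion), weight particles by a Gaussian
    observation likelihood, estimate (weighted mean), then resample.
    """
    rng = random.Random(seed)
    particles = [observations[0] + rng.gauss(0, obs_std)
                 for _ in range(n_particles)]
    estimates = []
    for z in observations:
        # Predict: diffuse particles according to the motion model.
        particles = [p + rng.gauss(0, motion_std) for p in particles]
        # Weight: likelihood of the observation given each particle.
        weights = [math.exp(-0.5 * ((z - p) / obs_std) ** 2) for p in particles]
        total = sum(weights)
        if total == 0.0:
            weights = [1.0 / n_particles] * n_particles  # degenerate case
        else:
            weights = [w / total for w in weights]
        # Estimate: weighted mean of the particle positions.
        estimates.append(sum(p * w for p, w in zip(particles, weights)))
        # Resample: draw a new particle set in proportion to the weights.
        particles = rng.choices(particles, weights=weights, k=n_particles)
    return estimates


# A target drifting steadily to the right, observed without noise here.
track = [10 + 2 * t for t in range(20)]
estimates = particle_filter_track(track)
```

Because the motion model has no velocity term, the estimate lags slightly behind a steadily moving target; adding velocity to the particle state is the usual remedy.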

Gesture recognition

Static hand posture recognition based on neural networks.
Dynamic gesture recognition (HMMs, IOHMMs and GIOHMMs).
Body tracking in 2D.
Body tracking in 3D using disparity cameras (Triclops).
