
Speaker Recognition

G. CHOLLET, G. GRAVIER, J. KHARROUBI, D. PETROVSKA-DELACRETAZ

(chollet, kharroub, petrovsk)@tsi.enst.fr, ggravier@infres.enst.fr

ENST/CNRS-LTCI, 46 rue Barrault

75634 PARIS cedex 13

http://www.tsi.enst.fr/~chollet

ENST: Ecole Nationale Supérieure des Télécommunications, http://www.enst.fr

CNRS: Centre National de la Recherche Scientifique, http://www.cnrs.fr

LTCI: Laboratoire de Traitement et Communication de l’Information, http://www.enst.fr/ura/ura.html

Our affiliations

What is ENST?

Ecole Nationale Supérieure des Télécommunications

• classed among the ‘Grandes Ecoles d'Ingénieurs’.
• 250 state-certified engineers each year.
• part of the ‘Groupement des Ecoles de Télécommunications’

Modalities for Identity Verification


• A device you own (key, smart card, …)
• A code you remember (password, …)
  Could be lost or stolen
• Physiological characteristics: face, iris, fingerprint, hand shape, …
  Need special equipment
• Behavioral characteristics: speech, signature, keystroke, …

Speech is the preferred modality over the telephone (but a ‘voice print’ is much more variable than a fingerprint).

Outline

• Where is the information about the speaker identity in the speech signal?
• How well could humans recognize a speaker?
• Applications of Speaker Recognition
• Prior knowledge on what the speaker said
• Combining Speech Recognition and Speaker Verification
• Some research activities at ENST:
  • Speaker verification:
    • The CAVE-PICASSO projects (text dependent)
    • The ELISA consortium, NIST evaluations (text independent)
    • The EUREKA !2340 MAJORDOME project
  • Multimodal Identity Verification: The M2VTS and BIOMET projects
• Perspectives

Speaker Identity in Speech

• Differences in vocal tract shapes and muscular control
• Fundamental frequency (typical values): 100 Hz (male), 200 Hz (female), 300 Hz (child)
• Glottal waveform
• Phonotactics
• Lexical usage

The difference between the voices of twins is a limiting case.

Voices can also be imitated or disguised.

[Figure: spectral envelope of /i:/ for Speaker A and Speaker B; amplitude A versus frequency f]

Speaker Identity

• segmental factors (~30 ms)
  • glottal excitation: fundamental frequency, amplitude, voice quality (e.g., breathiness)
  • vocal tract: formant frequencies and bandwidths
• suprasegmental factors
  • speaking speed (timing and rhythm of speech units)
  • intonation patterns
  • dialect, accent, pronunciation habits

Inter-speaker Variability

[Figure: the utterance "We were away a year ago."]

Intra-speaker Variability

[Figure: the utterance "We were away a year ago."]

Vocal Apparatus

Speech production

Glottal Waveform Modeling

[Figure: amplitude A versus time t; original residual in blue, synthetic residual in red]

• Fitting a glottal pulse model to the excitation waveform allows perceptually relevant modifications to voice quality

Applications of Speaker Recognition

• Identification from an open set (unrealistic)
• Identification from a closed set (who is speaking in a videoconference?)
• Verification of claimed identity (risk of deliberate imposture)

Human performance in speaker recognition is far from perfect (it depends highly on familiarity with the subject).

Speaker Verification

Typology of approaches (EAGLES Handbook)
• Text dependent
  • Public password
  • Private password
  • Customized password
  • Text prompted
• Text independent
• Incremental enrolment
• Evaluation

What are the sources of difficulty ?

• Intra-speaker variability of the speech signal (due to stress, pathologies, environmental conditions, …)
• Recording conditions (filtering, noise, …)
• Temporal drift
• Intentional imposture
• Voice disguise

Text-dependent Speaker Verification

Uses Automatic Speech Recognition techniques (DTW, HMM, …)

Client model adaptation from speaker independent HMM (‘World’ model)

Synchronous alignment of client and world models for the computation of a score.

Dynamic Time Warping (DTW)
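As an illustration of the DTW technique named above, here is a minimal sketch in pure Python. It assumes each utterance is already a sequence of feature vectors and uses a Euclidean local distance with the three classic step directions; these choices, and the names `dtw_distance`, `reference`, `slow_copy` and `other`, are assumptions for illustration and are not taken from the slides.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dtw_distance(ref, test):
    """Cumulative cost of the best monotonic alignment of two sequences."""
    n, m = len(ref), len(test)
    inf = float("inf")
    # cost[i][j] = best cumulative cost of aligning ref[:i] with test[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = euclidean(ref[i - 1], test[j - 1])
            # Allowed moves: repeat a reference frame, repeat a test frame, or advance both.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

# Toy one-dimensional "feature vectors" (hypothetical data).
reference = [[0.0], [1.0], [2.0], [1.0]]
slow_copy = [[0.0], [0.0], [1.0], [2.0], [2.0], [1.0]]  # same shape, spoken more slowly
other = [[3.0], [3.0], [3.0], [3.0]]

print(dtw_distance(reference, slow_copy))  # 0.0: the time-stretched copy aligns perfectly
print(dtw_distance(reference, other))      # 8.0
```

In a text-dependent system the cumulative cost would then be compared with a threshold; in practice it is usually normalised by the path length, which this sketch omits.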

HMM structure depends on the application

Signal detection theory

Score normalisation

World model

Cohort normalisation
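The two normalisations named above can be sketched as follows. This assumes scores are log-likelihoods, that world-model normalisation subtracts the world-model log-likelihood, and that cohort normalisation subtracts the mean score of a cohort of other speaker models (some systems use the maximum instead). The function names and the numbers are hypothetical.

```python
def world_normalised(client_loglik, world_loglik):
    """Client log-likelihood relative to the world model (a log-likelihood ratio)."""
    return client_loglik - world_loglik

def cohort_normalised(client_loglik, cohort_logliks):
    """Client log-likelihood relative to the mean score of a cohort of speaker models."""
    return client_loglik - sum(cohort_logliks) / len(cohort_logliks)

# Hypothetical log-likelihoods of one test utterance under each model.
client = -42.0
world = -45.0
cohort = [-44.0, -46.0, -48.0]

print(world_normalised(client, world))    # 3.0
print(cohort_normalised(client, cohort))  # 4.0
```

Either normalised score makes a single decision threshold more comparable across speakers and recording conditions than the raw client likelihood.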

Discriminant techniques

Detection Error Tradeoff (DET) Curve
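A DET curve plots the false rejection rate against the false acceptance rate as the decision threshold varies. The sketch below computes those operating points and an approximate equal error rate (EER) from two lists of scores. The scores are made-up toy values, the function names are hypothetical, and a real DET plot would additionally use normal-deviate axes.

```python
def error_rates(target_scores, impostor_scores, threshold):
    """Return (FAR, FRR) when accepting every score >= threshold."""
    frr = sum(s < threshold for s in target_scores) / len(target_scores)
    far = sum(s >= threshold for s in impostor_scores) / len(impostor_scores)
    return far, frr

def det_points(target_scores, impostor_scores):
    """One (threshold, FAR, FRR) point per distinct observed score."""
    thresholds = sorted(set(target_scores) | set(impostor_scores))
    return [(t,) + error_rates(target_scores, impostor_scores, t) for t in thresholds]

def approximate_eer(target_scores, impostor_scores):
    """Average of FAR and FRR at the point where they are closest."""
    points = det_points(target_scores, impostor_scores)
    _, far, frr = min(points, key=lambda p: abs(p[1] - p[2]))
    return (far + frr) / 2

# Hypothetical verification scores (higher means more client-like).
targets = [2.0, 1.5, 1.0, 0.2]
impostors = [0.5, 0.1, -0.3, -1.0]

for threshold, far, frr in det_points(targets, impostors):
    print(threshold, far, frr)
print(approximate_eer(targets, impostors))  # 0.25
```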

CAVE – PICASSO

http://www.picasso.ptt-telecom.nl/project/

Incremental enrolment of customised password

The client chooses his password using some feedback from the system.

The system attempts a phonetic transcription of the password.

Incremental enrolment is achieved on further repetitions of that password.

Speaker-independent phone HMMs are adapted with the client enrolment data.

Synchronous alignment likelihood ratio scoring is performed on access trials.

Deliberate imposture

The impostor has some recordings of the target client's voice. He can record the same sentences and align these speech signals with the recordings of the client.

A transformation (Multiple Linear Regression) is computed from these aligned data.

The impostor has heard the target client's password. He records that password and applies the transformation to this recording.

The PICASSO reference system, with less than 1 % EER, is defeated by this procedure (more than 30 % EER).

Speaker Verification (text independent)

• The ELISA consortium: ENST, LIA, IRISA, ...
  http://www.lia.univ-avignon.fr/equipes/RAL/elisa/index_en.html
• NIST evaluations
  http://www.nist.gov/speech/tests/spk/index.htm
• Ergodic HMM
• Gaussian Mixture Model

Gaussian Mixture Model

Parametric representation of the probability distribution of observations:

Gaussian Mixture Models

8 Gaussians per mixture
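A GMM represents the distribution of observation vectors as a weighted sum of Gaussian densities. The following sketch evaluates the log density of one vector under such a model. It assumes diagonal covariance matrices, which the slides do not specify, and all names and parameter values are hypothetical.

```python
import math

def log_gaussian_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at vector x."""
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def gmm_log_density(x, weights, means, variances):
    """log p(x) = log sum_k w_k N(x; mean_k, var_k), computed with log-sum-exp."""
    terms = [
        math.log(w) + log_gaussian_diag(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    top = max(terms)  # subtract the largest term for numerical stability
    return top + math.log(sum(math.exp(t - top) for t in terms))

# Hypothetical two-component mixture over one-dimensional observations.
weights = [0.5, 0.5]
means = [[-1.0], [1.0]]
variances = [[1.0], [1.0]]

print(gmm_log_density([0.0], weights, means, variances))
```

A mixture with 8 Gaussians, as on the slide, would simply use lists of length 8 for `weights`, `means` and `variances`.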

National Institute of Standards & Technology (NIST)

Speaker Verification Evaluations

• Annual evaluation since 1995
• Common paradigm for comparing technologies

GMM speaker modeling

[Diagram: front-end → GMM modeling → world GMM model]

[Diagram: front-end → GMM model adaptation → target GMM model]

Baseline GMM method

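[Diagram: test speech → front-end → hypothesized target GMM model and world GMM model → LLR score]

$$\text{LLR score} = \log\left[\frac{P(x \mid \lambda)}{P(x \mid \bar{\lambda})}\right]$$

where $\lambda$ is the hypothesized target GMM model and $\bar{\lambda}$ is the world GMM model.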
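The LLR score of the baseline GMM method can be sketched as below, starting from per-frame log-likelihoods that are assumed to have been computed already under the target and world GMMs. Averaging over frames and the zero default threshold are assumptions made for this illustration; the function names and numbers are hypothetical.

```python
def llr_score(target_frame_logliks, world_frame_logliks):
    """Average per-frame log-likelihood ratio between target and world models."""
    pairs = list(zip(target_frame_logliks, world_frame_logliks))
    return sum(t - w for t, w in pairs) / len(pairs)

def accept(score, threshold=0.0):
    """Accept the claimed identity when the score reaches the threshold."""
    return score >= threshold

# Hypothetical log-likelihoods for three frames of test speech.
target_ll = [-3.0, -2.5, -4.0]
world_ll = [-3.5, -3.5, -4.5]

score = llr_score(target_ll, world_ll)
print(score, accept(score))
```

The threshold would normally be tuned on development data, for example at the equal error rate point.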

Support Vector Machines and Speaker Verification

A hybrid GMM-SVM system is proposed.

An SVM scoring model is trained on development data to classify true-target speaker accesses and impostor accesses, using a new feature representation based on GMMs.

• Modeling: GMM
• Scoring: SVM

SVM principles

[Diagram: input space X mapped to feature space Φ(X); separating hyperplanes H, with the optimal hyperplane Ho, giving Class(X)]
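The principle of mapping inputs into a feature space where a hyperplane separates the classes can be illustrated with a toy example. The feature map `phi`, the hyperplane parameters `w` and `b`, and the data are all hypothetical; the hyperplane is set by hand halfway between the two classes rather than trained, and this is not the GMM-based feature representation of the hybrid system.

```python
def phi(x):
    """Hypothetical feature map from a 1-D input space to a 2-D feature space."""
    return (x, x * x)

def classify(x, w=(0.0, 1.0), b=-2.5):
    """Class(X) = sign(w . phi(X) + b) for a hyperplane in feature space."""
    value = sum(wi * fi for wi, fi in zip(w, phi(x))) + b
    return 1 if value >= 0 else -1

# In the input space, class -1 lies between the class +1 points,
# so no single threshold on x separates them.
inner = [-1.0, 0.0, 1.0]  # class -1
outer = [-2.0, 2.0, 3.0]  # class +1

print([classify(x) for x in inner])  # [-1, -1, -1]
print([classify(x) for x in outer])  # [1, 1, 1]
```

In the feature space the two classes are split by the line x² = 2.5, which lies midway between the closest points of each class; an SVM would find such a maximum-margin hyperplane from training data.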

Results

Combining Speech Recognition and Speaker Verification.

• Speaker-independent phone HMMs
• Selection of segments or segment classes which are speaker specific
• Preliminary evaluations are performed on the NIST extended data set (one hour of training data per speaker)

Selection of nasals in words in -ing

being, everything, getting, anything, thing, something, things, going

«MAJORDOME»

Unified Messaging System

Eureka Project no. 2340

D. Bahu-Leyser, G. Chollet, K. Hallouli, J. Kharroubi, L. Likforman, D. Mostefa, D. Petrovska, M. Sigelle, P. Vaillant

EDF, Vecsys, KTH, Mensatec, UPC, Airtel, Software602

Majordome’s Functionalities

• Speaker verification

• Dialogue

• Routing

• Updating the agenda

• Automatic summary

Voice

Fax

E-mail

Voice technology in Majordome

• Server-side background tasks:
  • continuous speech recognition applied to voice messages upon reception
  • detection of sender’s name and subject
• User interaction:
  • speaker identification and verification
  • speech recognition (receiving user commands through voice interaction)
  • text-to-speech synthesis (reading text summaries, e-mails or faxes)

BIOMET


An extension of the M2VTS and DAVID projects to include such modalities as signature, finger print, hand shape.

Initial support (two years) is provided by GET (Groupement des Ecoles de Télécommunications)

Emphasis will be on fusion of scores obtained from two or more modalities.
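One simple way to fuse scores from several modalities is a weighted sum. The sketch below assumes each modality's score has already been normalised to a common scale; the weighted-sum rule, the modality names, the weights and the scores are assumptions for illustration, not the method chosen by the project.

```python
def fuse_scores(scores, weights):
    """Weighted average of per-modality scores (assumed to share a common scale)."""
    total_weight = sum(weights[m] for m in scores)
    return sum(weights[m] * scores[m] for m in scores) / total_weight

# Hypothetical normalised scores in [0, 1] and reliability weights.
scores = {"speech": 0.8, "signature": 0.4, "fingerprint": 0.9}
weights = {"speech": 1.0, "signature": 1.0, "fingerprint": 2.0}

fused = fuse_scores(scores, weights)
print(fused)  # about 0.75
```

The fused score is then compared with a single threshold, so a weak score in one modality can be compensated by strong scores in the others.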

Conclusions and Perspectives

Evaluation trials (as conducted by NIST) help improve technology.

A strategy combining speech recognition and segmental scoring seems to be a promising approach for speaker verification.

Whenever possible, text independent speaker verification should be confirmed by text dependent verification.

Whenever possible, fusion of multiple experts (preferably multimodal) should be performed.
