Proceedings of ICSP '97, August 26-28, 1997, Seoul, Korea

SPEECH MORPHING BY PROGRESSIVE

INTERPOLATION OF SPECTRA

Hideki BANNO†, Shoji KAJITA‡, Kazuya TAKEDA‡, Kiyohiro SHIKANO† and Fumitada ITAKURA‡

†Nara Institute of Science and Technology, Takayama 8916-5, Ikoma, Nara 630-01 JAPAN
{hideki-b, shikano}@is.aist-nara.ac.jp

‡Graduate School of Engineering, Nagoya University, Furo-cho 1, Chikusa-ku, Nagoya 464-01 JAPAN
{kaji, takeda, ita}@nuee.nagoya-u.ac.jp

ABSTRACT

A speech morphing algorithm, i.e., continuous alteration of speech waveforms between two different speakers, based on progressive interpolation of the spectral envelope and the source signal is proposed. The basic scheme of the morphing is to 1) find the correspondence of unit waveforms of the original and target speech, 2) separate the speech spectra into envelope and source excitation components, 3) find the correspondence of spectral channels of the original and target speech for each envelope, 4) interpolate both the source signal and the envelope, and 5) construct unit waveforms and generate the morphing speech by PSOLA. In the objective test, the proposed method reduces the spectral distortion by 1.9 dB compared to the method based on progressive substitution of spectra when it is used for interpolating two vowels in real speech. The effectiveness of the method is also confirmed by a subjective test in which more than 90 % (male to female morphing) or 60 % (female to male morphing) of the subjects preferred the proposed method.

1. INTRODUCTION

Morphing is a common technology in the field of computer graphics for altering an image of one object into another object continuously. The same kind of processing of speech, speech morphing, has been introduced as continuous alteration of voice individuality from speaker to speaker [1],[2],[3],[4]. Recently, Abe et al. proposed a speech morphing method based on progressive substitution of spectra (PSS) [5]. In this method, the frequency components higher than a given boundary frequency are substituted by the target speaker's speech, while the boundary frequency is changed continuously as time goes on.
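For reference, the PSS idea described above can be sketched in a few lines. This is an illustrative sketch, not the implementation of [5]: the function name `pss_spectrum`, the linear dependence of the boundary bin on the morphing rate `r`, and the mirrored handling of the upper half of the DFT are all assumptions.

```python
def pss_spectrum(XA, XB, r):
    """Progressive substitution of spectra (hypothetical sketch).

    XA, XB: DFT spectra (equal-length sequences) of original and target.
    r: morphing rate in [0, 1]; 0 keeps speaker A, 1 yields speaker B.
    """
    N = len(XA)
    half = N // 2
    # Assumed rule: the boundary bin falls linearly as r grows.
    kb = int((1.0 - r) * (half + 1))
    out = []
    for k in range(N):
        f = min(k, N - k)  # fold bins above Nyquist onto 0..N/2
        # Bins below the boundary keep A; bins at or above it come from B.
        out.append(XA[k] if f < kb else XB[k])
    return out


print(pss_spectrum([1] * 8, [2] * 8, 0.5))  # [1, 1, 2, 2, 2, 2, 2, 1]
```

The hard switch at bin `kb` is the "simple substitution" referred to in the text: nothing constrains the two spectra to agree at the boundary.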

In order to realize speech morphing, we have to develop a parametric interpolation of speech characteristics that can generate intermediate speech which 1) changes as smoothly as possible in the time domain and 2) has as natural a shape as possible in the frequency domain. From these viewpoints, the substitution of spectra in the PSS method seems too simple to generate smooth and natural spectra.

In this paper, we propose a new algorithm of speech morphing which 1) interpolates the spectral envelope and the source excitation not only independently but also pitch synchronously and 2) associates the original and target unit waveforms in both the time and frequency domains by means of DTW. The results of both objective and subjective tests show that the proposed method can generate better morphing speech than the PSS method.

2. ALGORITHM

The detailed algorithm of the proposed morphing method is described in this section.

First, speech signals of speakers A and B, i.e., the original and target speakers, are recorded for the same utterance. For each speech signal, time marks are given to the peak amplitude locations of all unit waveforms, each corresponding to one pitch period. A unit waveform, which we call a pitch waveform, is extracted by applying a window centered at the pitch mark location and is used as the basic unit of speech morphing. For unvoiced segments, the same kind of waveform unit is also extracted, but fixed-interval windowing is adopted.
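The windowing step can be sketched as follows. This is a minimal sketch under stated assumptions: the helper names are hypothetical, the window is built from two half Hamming windows of different lengths joined at the pitch mark (the paper only states that an asymmetric, variable-length Hamming window is used), and samples outside the signal are treated as zero.

```python
import math


def asymmetric_hamming(left, right):
    """Window with `left` samples before the peak and `right` samples after it."""
    if left < 1 or right < 1:
        raise ValueError("left and right must be at least 1")
    # Rising half: 0.08 at the left edge up to 1.0 at the pitch mark.
    rising = [0.54 - 0.46 * math.cos(math.pi * i / left) for i in range(left + 1)]
    # Falling half: from just after the pitch mark down to 0.08 at the right edge.
    falling = [0.54 + 0.46 * math.cos(math.pi * j / right) for j in range(1, right + 1)]
    return rising + falling


def extract_pitch_waveform(signal, mark, left, right):
    """Cut one windowed unit waveform whose peak sits at index `mark`."""
    window = asymmetric_hamming(left, right)
    out = []
    for offset, w in zip(range(-left, right + 1), window):
        idx = mark + offset
        sample = signal[idx] if 0 <= idx < len(signal) else 0.0  # zero padding
        out.append(w * sample)
    return out
```

In this sketch `left` and `right` would be set from the neighbouring pitch periods, so the two tapers can differ in length from one pitch mark to the next.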

Figure 1: Extracting unit pitch waveform by asymmetric variable length window.

Figure 2: Interpolation of spectral envelope after nonlinear warping of frequency axis.

2.1 Time warping

To find the time domain correspondence between the two signals, dynamic time warping (DTW) is executed by matching the two sequences of pitch waveforms of speakers A and B, using the LPC cepstrum distance as the spectral metric. After time warping, the speech signals are associated on a pitch-waveform-by-pitch-waveform basis. Therefore, interpolating two corresponding pitch waveforms of the original and target speech signals generates a pitch waveform of the morphing speech.

The morphing rate, i.e., the rate of voice individuality of the target speaker, at each waveform is also decided in this stage as a time domain process, taking the morphing process period into account.

Since interpolation of the pitch period is also needed in speech morphing, after finding the waveform association, the pitch waveform to be processed is re-extracted adaptively to the original and target pitch period lengths. As shown in Figure 1, the lengths of the left and right tapered periods are adjusted asymmetrically to the shorter of the original and target periods.

2.2 Separating spectral envelope and source excitation

Separating the spectral envelope and the source excitation of each pitch waveform is done by liftering FFT cepstrum coefficients. The FFT cepstrum coefficients, $c(n)$, are calculated by

$$X(k) = \sum_{t=0}^{N-1} x(t)\, e^{-j\frac{2\pi}{N}kt} \qquad (0 \le k \le N-1), \tag{1}$$

$$c(n) = \frac{1}{N}\sum_{k=0}^{N-1} \log|X(k)|\, e^{j\frac{2\pi}{N}kn} \qquad (0 \le n \le N-1). \tag{2}$$

The lower and higher cepstrum coefficients are used as the cepstrum representations of the spectral envelope, $e(n)$, and the source excitation, $s(n)$, respectively:

$$e(n) = \begin{cases} c(n) & (0 \le n < n_0,\; N - n_0 \le n \le N-1) \\ 0 & (n_0 \le n < N - n_0) \end{cases} \tag{3}$$

$$s(n) = \begin{cases} 0 & (0 \le n < n_0,\; N - n_0 \le n \le N-1) \\ c(n) & (n_0 \le n < N - n_0) \end{cases} \tag{4}$$

2.3 Interpolating spectral envelope

Interpolation of the spectral envelopes of two pitch waveforms is done in the log-spectral representation of $e(n)$, i.e.,

$$E(k) = \sum_{n=0}^{N-1} e(n)\, e^{-j\frac{2\pi}{N}kn} \qquad (0 \le k \le N-1). \tag{5}$$

Before the interpolation, the frequency axis is warped so as to associate the formant frequencies of each speaker. The result of the warping is obtained as a mapping defined by

$$k_B = \kappa(k_A) \qquad (0 \le k_A, k_B \le N-1) \tag{6}$$

where $k_A$ and $k_B$ are the indices of the frequency channels of each speaker. This mapping states that the frequency channel $k$ of speaker A is associated with the channel $\kappa(k)$ of speaker B. Morphing of the spectral envelope can then be realized as the parametric interpolation of the two spectral envelopes $E_A(k)$ and $E_B(k)$ as follows.

$$E_M(k) = (1 - r)\,E_A(k_A) + r\,E_B(\kappa(k_A)) \qquad (0 \le k \le N-1) \tag{7}$$

where $k$ is the correspondingly interpolated frequency, defined by the frequency mapping function $\kappa(k)$ and the morphing rate $r$:

$$k = (1 - r)\,k_A + r\,\kappa(k_A) \qquad (0 \le k_A \le N-1). \tag{8}$$
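The frequency warping and the interpolation of equations (6) to (8) can be sketched as below. Several points are assumptions, not taken from the paper: the local DTW cost (absolute difference of log-envelope values), the symmetric step pattern, averaging when one channel of A is paired with several channels of B, the use of a half-spectrum envelope of equal length for both speakers, and the linear resampling of the warped envelope back onto integer channels. All function names are hypothetical. The same dynamic-programming routine could be reused for the time alignment of Section 2.1 with a cepstral distance as the local cost.

```python
def dtw_path(a, b):
    """Align sequences a and b by dynamic programming; return (i, j) pairs."""
    n, m = len(a), len(b)
    inf = float("inf")
    acc = [[inf] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            local = abs(a[i] - b[j])  # assumed local cost
            if i == 0 and j == 0:
                acc[i][j] = local
                continue
            prev = min(
                acc[i - 1][j - 1] if i > 0 and j > 0 else inf,
                acc[i - 1][j] if i > 0 else inf,
                acc[i][j - 1] if j > 0 else inf,
            )
            acc[i][j] = local + prev
    # Backtrack from the end point to the origin.
    i, j = n - 1, m - 1
    path = [(i, j)]
    while (i, j) != (0, 0):
        steps = []
        if i > 0 and j > 0:
            steps.append((acc[i - 1][j - 1], i - 1, j - 1))
        if i > 0:
            steps.append((acc[i - 1][j], i - 1, j))
        if j > 0:
            steps.append((acc[i][j - 1], i, j - 1))
        _, i, j = min(steps)
        path.append((i, j))
    path.reverse()
    return path


def frequency_mapping(EA, EB):
    """kappa[k_A]: mean of the B channels that DTW pairs with channel k_A."""
    matches = [[] for _ in EA]
    for i, j in dtw_path(EA, EB):
        matches[i].append(j)
    return [sum(js) / len(js) for js in matches]


def interp(x, xs, ys):
    """Piecewise-linear interpolation at x; xs must be nondecreasing."""
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    for i in range(len(xs) - 1):
        if xs[i] <= x <= xs[i + 1]:
            if xs[i + 1] == xs[i]:
                return ys[i]
            t = (x - xs[i]) / (xs[i + 1] - xs[i])
            return ys[i] + t * (ys[i + 1] - ys[i])


def morph_envelope(EA, EB, kappa, r):
    """Eqs. (7)-(8): interpolate values and channel positions, then resample."""
    grid = list(range(len(EB)))
    ks = [(1 - r) * kA + r * kappa[kA] for kA in range(len(EA))]  # eq. (8)
    vs = [(1 - r) * EA[kA] + r * interp(kappa[kA], grid, EB)      # eq. (7)
          for kA in range(len(EA))]
    return [interp(k, ks, vs) for k in range(len(EA))]
```

With a peak at channel 1 for speaker A, a peak at channel 3 for speaker B and a mapping that pairs the two, `morph_envelope(..., r=0.5)` places the peak at channel 2, which is the formant-shifting behaviour the section describes.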

Finally, the spectral envelope of the morphing speech is reconverted into the cepstral representation, $e_M(n)$, by

$$e_M(n) = \frac{1}{N}\sum_{k=0}^{N-1} E_M(k)\, e^{j\frac{2\pi}{N}kn} \qquad (0 \le n \le N-1). \tag{9}$$

2.4 Interpolating source excitation

The source excitation signal of the morphing speech is also calculated in terms of the morphing rate and the source excitation components of speakers A and B as follows.

$$s_M(n) = (1 - r)\,s_A(n) + r\,s_B(n) \qquad (0 \le n \le N-1) \tag{10}$$

2.5 Reconstructing pitch waveform

The cepstrum representation of the morphing speech is obtained by superimposing the envelope and excitation components of the morphing speech by the following equation. (In general, $e_M(n)$ and $s_M(n)$ consist of only the lower and only the higher cepstral coefficients, respectively.)

$$c_M(n) = e_M(n) + s_M(n) \qquad (0 \le n \le N-1). \tag{11}$$

Then, the amplitude spectrum of the morphing speech is obtained by taking the FFT of $c_M(n)$:

$$|X_M(k)| = \exp\left[\sum_{n=0}^{N-1} c_M(n)\, e^{-j\frac{2\pi}{N}kn}\right] \qquad (0 \le k \le N-1). \tag{12}$$

On the other hand, the phase of the morphing speech is generated by concatenating the phase spectra of the lower and higher frequency regions of the original and target speakers, respectively, using

$$X_M(k) = \begin{cases} |X_M(k)|\,\dfrac{X_A(k)}{|X_A(k)|} & (\text{lower frequency channels, up to } k_b) \\[2ex] |X_M(k)|\,\dfrac{X_B(k)}{|X_B(k)|} & (\text{higher frequency channels, above } k_b) \end{cases} \tag{13}$$

where $k_b$ is the boundary frequency. A unit pitch waveform of the morphing speech, $x_M(t)$, can then be obtained by

$$x_M(t) = \frac{1}{N}\sum_{k=0}^{N-1} X_M(k)\, e^{j\frac{2\pi}{N}kt} \qquad (0 \le t \le N-1). \tag{14}$$

Finally, the whole sentence of morphing speech is generated by PSOLA [7].

Table 1: Analysis conditions for experiments

Sampling frequency         8 kHz
Window                     Asymmetric Hamming
Window length (variable)   ~32 ms (256 point)
FFT point                  256 point
Lifter length              1.25 ms (10 point)
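The separation of Section 2.2 and the reconstruction of Sections 2.4 and 2.5 form a round trip that can be checked numerically. The sketch below follows equations (1) to (4) and (10) to (14) under stated assumptions: a naive DFT is used only to keep the example self-contained, the function names are hypothetical, the morphed envelope cepstrum `eM` is taken as given (from equation (9)), every DFT bin of the input is assumed to be nonzero so that the logarithm exists, and the phase rule is one reading of equation (13), with the phase of speaker A up to bin `kb` (mirrored for the upper half of the DFT) and the phase of speaker B above it.

```python
import cmath
import math


def dft(x, inverse=False):
    """Naive O(N^2) DFT; sign and 1/N scaling follow eqs. (1) and (2)."""
    N = len(x)
    sign = 1.0 if inverse else -1.0
    out = [
        sum(x[t] * cmath.exp(sign * 2j * math.pi * k * t / N) for t in range(N))
        for k in range(N)
    ]
    return [v / N for v in out] if inverse else out


def split_cepstrum(x, n0):
    """Eqs. (1)-(4): return spectrum X, envelope part e(n), excitation part s(n)."""
    X = dft(x)
    c = [v.real for v in dft([math.log(abs(v)) for v in X], inverse=True)]
    N = len(c)
    is_low = [n < n0 or n >= N - n0 for n in range(N)]
    e = [c[n] if is_low[n] else 0.0 for n in range(N)]
    s = [0.0 if is_low[n] else c[n] for n in range(N)]
    return X, e, s


def morph_waveform(XA, XB, eM, sA, sB, r, kb):
    """Eqs. (10)-(14): rebuild one pitch waveform from morphed cepstra."""
    N = len(eM)
    sM = [(1 - r) * a + r * b for a, b in zip(sA, sB)]  # eq. (10)
    cM = [e + s for e, s in zip(eM, sM)]                # eq. (11)
    mag = [math.exp(v.real) for v in dft(cM)]           # eq. (12)
    XM = []
    for k in range(N):
        # Assumed reading of eq. (13): phase of A up to bin kb, phase of B above.
        src = XA if min(k, N - k) <= kb else XB
        XM.append(mag[k] * src[k] / abs(src[k]))
    return [v.real for v in dft(XM, inverse=True)]      # eq. (14)


x = [1.0, 0.5, -0.3, 0.2, 0.1, -0.4, 0.25, 0.05]
XA, eA, sA = split_cepstrum(x, 2)
y = morph_waveform(XA, XA, eA, sA, sA, 0.0, len(x) // 2)
print(max(abs(a - b) for a, b in zip(x, y)))  # numerically close to zero
```

Feeding a waveform's own envelope, excitation and phase back in returns the waveform itself, which is a quick sanity check that the sign and scaling conventions of the forward and inverse transforms agree.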

3. EVALUATION TEST

In this section, the proposed method is compared with the conventional PSS method from both objective and subjective standpoints.

3.1 Illustration of spectral envelopes

First, the spectral envelopes of the morphing speech by the PSS and proposed methods are illustrated for comparison. As the speech material, a sentence (/arayuru geNjitsu o subete jibuN no ho:e nejimageta noda/) uttered by two speakers (A and B) is used, and the morphing spectral envelope is calculated for a pitch waveform of /arayuru/ at a 50 % morphing rate. Figure 3 shows the resultant spectra of the proposed method (top) and the PSS method (bottom). From the figure, it is clear that there is a discontinuity between the lower and higher frequency regions in the morphing speech of the PSS method. This is a direct consequence of the simple replacement of spectral information in the PSS method. On the other hand, since the morphing speech of the proposed method is obtained by moving the original formant frequencies, there is no such discontinuity. These results show that the proposed method can generate more natural morphing speech than PSS.

Figure 3: Spectral envelopes of original, target and morphing speech. Top: (a) proposed method. Bottom: (b) PSS method. Horizontal axis: frequency [kHz].

3.2 Objective evaluation by spectral distortion

An objective test of the naturalness of the morphing speech is performed by comparing it with natural coarticulation between phonemes as follows.

1. Extract the pitch waveforms of the initial and final pitch periods of a natural speech segment of coarticulatory change between phonemes. The transitional portion from /u/ to /o/ of /geNjitsu o/ is used.

2. Generate morphing speech by interpolating the above initial and final waveforms while continuously changing the morphing rate.

3. Associate each pitch waveform of the natural coarticulation with one of the morphing speech.

4. Calculate the cepstrum distance defined below between the two associated waveforms.

$$CD = \frac{10}{\ln 10}\sqrt{2\sum_{i=1}^{p}\left(c_i^{(n)} - c_i^{(m)}\right)^2}$$

where $c_i^{(n)}$ and $c_i^{(m)}$ are the $i$-th cepstrum coefficients of the natural and morphing pitch waveforms, respectively, and $p$ is the cepstrum order.

Table 2: Analysis conditions for measuring spectral distortion

Sampling frequency   8 kHz
Window               Asymmetric Hamming
Window length        ~32 ms (256 point)
LPC order            12
Cepstrum order       16

Figure 4 shows the spectrograms of (a) the coarticulatory transition in natural speech, (b) the morphing speech of the proposed method and (c) the morphing speech of the PSS method. From the figure, the smooth change in the proposed method can be confirmed.

The measured cepstral distortion is listed in Table 3. As listed in the table, the resultant distortion of the proposed method is smaller than that of the PSS method by about 2.0 dB. This means that the proposed method can generate more natural morphing speech, i.e., speech more similar to natural coarticulation.

Table 3: Spectral distortion of the proposed method and the PSS method. [Table values illegible in the source.]

3.3 Subjective test

In the subjective test, 11 subjects listened to the morphing speech of the proposed and PSS methods and decided which sounds better. The results are listed in Table 4. In both male-to-female and female-to-male morphing, the proposed method is preferred to the conventional PSS method. This is because, in the PSS method, there is a mismatch between the lower and higher parts of the spectral envelope caused by the simple replacement of spectral components.

Table 4: Results of subjective test. Preference score of the proposed method to the PSS method.

male→female     91 %
female→male     64 %

Figure 4: Spectrograms of natural coarticulatory inter-phoneme transient and morphing speech. (a) Coarticulation in natural utterance. (b) Morphing speech of proposed method. (c) Morphing speech of PSS method.
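The distance computed in step 4 of the objective test of Section 3.2 can be sketched as follows. The formula is the common cepstral distortion measure, which is assumed here to be the one intended by the $(10/\ln 10)$ factor in the definition; the function names are hypothetical, the zeroth coefficient is assumed to be excluded by the caller, and averaging over the associated pairs is also an assumption about how a single figure per method is obtained.

```python
import math


def cepstral_distortion(c1, c2):
    """CD in dB between two cepstrum vectors of equal length."""
    if len(c1) != len(c2):
        raise ValueError("cepstrum vectors must have equal length")
    sq = sum((a - b) ** 2 for a, b in zip(c1, c2))
    return (10.0 / math.log(10.0)) * math.sqrt(2.0 * sq)


def mean_distortion(pairs):
    """Average CD over associated (natural, morphed) cepstrum pairs."""
    return sum(cepstral_distortion(a, b) for a, b in pairs) / len(pairs)
```

With the conditions of Table 2, each vector would hold the 16 LPC cepstrum coefficients of one pitch waveform.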

4. SUMMARY

In this paper, speech morphing, i.e., continuous alteration of a speech signal, is developed based on independent manipulation of the envelope and excitation components. In the proposed method, both time and frequency warpings are also introduced to associate the original and target signals. Thus, the mismatch between the original and target speech characteristics, which causes the degradation of the resultant morphing speech, is reduced compared with the simple replacement of spectral components. The effectiveness of the proposed method is confirmed by both subjective and objective tests.

ACKNOWLEDGEMENT

The authors are grateful to Dr. Abe for his comments on this work. The authors are also grateful to Dr. Kawai for the pitch marking algorithm and the pitch marked speech materials.

REFERENCES

[1] E. Tellman, L. Haken and B. Holloway: "Timbre morphing using the Lemur representation", ICMC94 Proceedings, pp.329-330, 1994.

[2] E. Tellman, L. Haken and B. Holloway: "Timbre morphing of sounds with unequal numbers of features", Journal of the AES, Vol.43, No.9, pp.678-689, 1995.

[3] M. Slaney, M. Covell and B. Lassiter: "Automatic Audio Morphing", Proceedings of 1996 ICASSP, Vol.2, pp.1001-1004, 1996.

[4] N. Osaka: "A voice quality interpolation of speech vowels using a sinusoidal model", Rec. Fall Meeting, ASJ, 2-1-10, pp.263-264, 1995 (in Japanese).

[5] M. Abe: "Speech morphing by gradually changing fundamental frequency and spectra", Rec. Fall Meeting, ASJ, 2-1-8, pp.259-260, 1995 (in Japanese).

[6] H. Kawai and S. Yamamoto: "Constructing a Waveform Inventory for Text-to-Speech Synthesis Taking Account of Fundamental Frequency and Phoneme Duration", Technical Report of IEICE, SP95-7, pp.47-52, 1995 (in Japanese).

[7] F. Charpentier and E. Moulines: "Pitch Synchronous Waveform Processing Techniques for Text-to-Speech Synthesis Using Diphones", Proc. Eurospeech '89, Vol.2, pp.13-19, 1989.

[8] H. Kawai, N. Higuchi, T. Shimizu and S. Yamamoto: "A Study of a Text-to-Speech System based on Waveform Splicing", Technical Report of IEICE, SP93-9, pp.49-54, 1993 (in Japanese).
