MBROLA – Voice Quality Extensions

Internal Report


Department of Phonetics and Dialectology, ISTC-CNR
Institute of Cognitive Sciences and Technology
Italian National Research Council
Via Anghinoni, 10
35121 Padova - ITALY

 

Staff:
Piero Cosi       cosi@pd.istc.cnr.it
Carlo Drioli     drioli@pd.istc.cnr.it
Fabio Tesser     tesser@pd.istc.cnr.it
Graziano Tisato  tisato@pd.istc.cnr.it

Contact: Carlo Drioli


Motivations

The need for voice quality control in MBROLA diphone concatenation synthesis is motivated by recent studies and experiments on emotive speech synthesis. It is widely accepted, in fact, that voice quality plays a relevant role in the transmission of emotions through speech. Previous attempts to model voice quality in the concatenative synthesis framework have been based on recording separate diphone databases for different levels of vocal effort or different voice qualities. However, memory occupation, a complex voice design procedure, and a range of voice quality variation limited to the recorded material are serious drawbacks. We addressed the task instead by allowing online processing of the diphones as an intermediate step of the concatenation procedure (see Fig. 1). This step has been implemented using both spectral processing, based on DFT and inverse-DFT transforms, and time-domain processing for pitch-related effects.

 

 

  

[Block diagram of the extended synthesis engine: voice quality controls Vol, SpTilt, Shim, Jit, AspN, F0Flut, AmpFlut, SpWarp applied within the diphone concatenation chain.]
 

Fig. 1: Extensions to the voice synthesis engine (the Mbrola diphone concatenation synthesizer).

 

Implementation

The MBROLA speech synthesizer, which originally provides controls for pitch and phoneme duration, has been extended to allow control of a set of low-level acoustic parameters that can be combined to produce the desired voice quality effects. The time evolution of each parameter can be controlled over a single phoneme by instantaneous control curves. The extended set includes gain ("Vol"), spectral tilt ("SpTilt"), shimmer ("Shim"), jitter ("Jit"), aspiration noise ("AspN"), F0 flutter ("F0Flut"), amplitude flutter ("AmpFlut"), and spectral warping ("SpWarp"), plus a flutter rate control ("FlutFreq") shared by F0Flut and AmpFlut. Studies on how these low-level effects combine to produce the principal non-modal phonation types encountered in emotive speech are in progress. A brief description of how the low-level acoustic controls are implemented follows; two illustrative processing sketches are given below the list.

- Gain ("Vol", range: [-60,+10]): gain control is obtained by simply rescaling of the spectrum modulus.

- Spectral tilt ("SpTilt", range: [-1,1]): the spectral balance is changed by a frequency-domain reshaping function that enhances or attenuates the low- and mid-frequency regions, thus changing the overall spectral tilt.

 

Fig. 2: Action of the spectral tilt effect (left: SpTilt>0; right: SpTilt<0).

 

 

- Shimmer ("Shim", range: [0,1]): the amplitude difference between consecutive periods. It is reproduced by applying random amplitude modulation to consecutive periods of the voiced part of phonemes.

- Jitter ("Jit", range: [0,1]): the length difference between consecutive periods. It is reproduced by adding random pitch deviations to the pitch control curves computed by Mbrola's prosody matching module.

- Aspiration noise ("AspN", range: [0,1]): for voiced frames, aspiration noise is generated from the frame DFT by inverse transformation of a high-pass filtered version of the spectral magnitude combined with a random spectral phase.

- F0 flutter ("F0Flut", range: [0,1]): random low-frequency fluctuations of the pitch, reproduced as for Jitter. The low-frequency fluctuations are obtained by band-pass filtering random noise.

- Amplitude flutter ("AmpFlut", range: [0,1]): random low-frequency amplitude fluctuations, obtained as for Shimmer. The low-frequency fluctuations are obtained by band-pass filtering random noise.

- Spectral warping ("SpWarp", range: [-1,1]): the raising or lowering of the upper formants is obtained by warping the frequency axis of the spectrum (through a bilinear transformation) and by interpolating the resulting spectrum magnitude at the DFT frequency bins.

- Flutter frequency ("FlutFreq", range: [3.0,50.0]): the rate of the amplitude and frequency fluctuations. It tunes the second-order band-pass filter used by F0Flut and AmpFlut (a sketch of such a flutter generator is given below).
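Both flutter controls rely on the same noise-shaping idea. The following Python fragment is a minimal sketch, under stated assumptions, of how such a fluctuation curve could be generated; it is not the MBROLA implementation, and the function name flutter_signal, the control-signal rate, and the filter bandwidth are illustrative choices.

    import numpy as np
    from scipy.signal import butter, lfilter

    def flutter_signal(n_samples, flut_freq=5.0, depth=0.2, control_rate=200.0, seed=0):
        """Low-frequency fluctuation curve for F0Flut/AmpFlut-style modulation.

        n_samples    : number of control samples (e.g. one per synthesis frame)
        flut_freq    : centre frequency of the fluctuation in Hz (FlutFreq)
        depth        : fluctuation amount (the F0Flut or AmpFlut value)
        control_rate : rate of the control signal, in samples per second (assumed)
        """
        rng = np.random.default_rng(seed)
        noise = rng.standard_normal(n_samples)

        # A first-order Butterworth band-pass (i.e. a second-order IIR filter)
        # centred roughly on flut_freq; the +/-30% bandwidth is an assumption.
        b, a = butter(1, [0.7 * flut_freq, 1.3 * flut_freq], btype="band",
                      fs=control_rate)
        flut = lfilter(b, a, noise)

        # Normalise so that 'depth' bounds the peak deviation.
        return depth * flut / (np.max(np.abs(flut)) + 1e-12)

    # Example: a 2 s fluctuation curve that could perturb the F0 contour (F0Flut)
    # or the per-frame gain (AmpFlut).
    curve = flutter_signal(n_samples=400, flut_freq=5.0, depth=0.2)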

 


Fig. 3: Action of the spectral warping effect (left: SpWarp<0; right: SpWarp>0).
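To make the DFT / inverse-DFT scheme more concrete, the following Python sketch applies three of the controls described above (Vol, SpTilt, AspN) to a single voiced frame. It is an illustration under assumptions rather than the actual MBROLA code: the interpretation of Vol as a dB gain, the linear tilt curve with a +/-12 dB span, and the 2 kHz aspiration-noise cut-off are all assumed.

    import numpy as np

    def process_frame(frame, vol_db=0.0, sp_tilt=0.0, asp_n=0.0, fs=16000):
        """Apply gain, spectral tilt and aspiration noise to one voiced frame."""
        n = len(frame)
        spec = np.fft.rfft(frame)
        mag, phase = np.abs(spec), np.angle(spec)
        freqs = np.fft.rfftfreq(n, d=1.0 / fs)

        # Vol: plain rescaling of the spectrum modulus (vol_db interpreted as dB).
        mag = mag * 10.0 ** (vol_db / 20.0)

        # SpTilt: reshape the spectral balance with a gain ramp along frequency;
        # the linear shape, sign convention and +/-12 dB span are assumptions.
        tilt_db = sp_tilt * 12.0 * (1.0 - freqs / freqs[-1])
        mag = mag * 10.0 ** (tilt_db / 20.0)

        # AspN: inverse transform of a high-pass filtered copy of the magnitude
        # combined with a random phase, mixed into the voiced frame.
        rng = np.random.default_rng(0)
        hp_mag = mag * (freqs > 2000.0)                    # assumed cut-off
        rand_phase = rng.uniform(0.0, 2.0 * np.pi, len(hp_mag))
        noise = np.fft.irfft(hp_mag * np.exp(1j * rand_phase), n)

        voiced = np.fft.irfft(mag * np.exp(1j * phase), n)
        return voiced + asp_n * noise

    # Example: a darker, slightly breathy setting applied to a dummy 150 Hz frame.
    t = np.arange(512) / 16000.0
    frame = np.hanning(512) * np.sin(2.0 * np.pi * 150.0 * t)
    out = process_frame(frame, vol_db=-3.0, sp_tilt=-0.5, asp_n=0.3)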

 

 

 

 

The Mbrola parser has been modified in order to allow the use of the low-level acoustic controls as general commands or as curves specified at the phoneme level (see the example of an extended .pho file in Fig. 4).

 

;Vol=0
;SpTilt=0.0
;Shim=0.0
;Jit=0.0
;AspN=0.0
;F0Flut=0.0
;AmpFlut=0.0
;;SpWarp=0.3
;FlutFreq=5.0


 _ 25 100 143 
 a1 309 5 151 20 142 40 150 60 141 80 126 100 116 Shim 0 0.1 100 0.2 
 v 85.3333 0 112 50 118 100 127 Shim 0 0.3 100 0.2
 a 334 0 127 20 126 40 118.1250 60 113 80 106 100 148 Vol 0 -3 100 -5 Shim 0 0.2 100 0.4 Jit 0 0.06 100 0.06
 _ 10


Fig. 4: Example of an extended .pho file. The spectral warping command affects all phonemes with a constant value of 0.3, whereas different gain, shimmer, and jitter control curves are specified for individual phonemes.
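As an illustration of this line format (a sketch, not the actual modified parser), the following Python fragment splits an extended phoneme line into its pitch points and voice quality curves. The function name and the returned data structure are assumptions; the control names are those listed in the Implementation section.

    # Control names recognised to the right of the pitch pairs.
    CONTROLS = ["Vol", "SpTilt", "Shim", "Jit", "AspN",
                "F0Flut", "AmpFlut", "SpWarp", "FlutFreq"]

    def parse_pho_line(line):
        """Return (phoneme, duration_ms, {curve name: [(position %, value), ...]})."""
        tokens = line.split()
        phoneme, duration = tokens[0], float(tokens[1])
        curves = {}
        current = "Pitch"                      # pitch pairs follow the duration
        pending = []
        for tok in tokens[2:]:
            if tok in CONTROLS:                # switch to a voice quality curve
                current = tok
                pending = []
            else:
                pending.append(float(tok))
                if len(pending) == 2:          # (position %, value) pair complete
                    curves.setdefault(current, []).append(tuple(pending))
                    pending = []
        return phoneme, duration, curves

    # Example, taken from the Fig. 4 line for /v/:
    print(parse_pho_line("v 85.3333 0 112 50 118 100 127 Shim 0 0.3 100 0.2"))
    # ('v', 85.3333, {'Pitch': [(0.0, 112.0), (50.0, 118.0), (100.0, 127.0)],
    #                 'Shim': [(0.0, 0.3), (100.0, 0.2)]})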

 

Usage – rules for writing extended .pho files

If voice quality control is exploited only through commands in the header section of the .pho file, just add a ;;<ControlName>=<value> line to the header section. ControlName must be one of Vol, SpTilt, Shim, Jit, AspN, F0Flut, AmpFlut, SpWarp, or FlutFreq, and value must be within the range corresponding to the control type.

If voice quality control is exploited through phoneme-specific commands, the following rules must be followed (a minimal example is given after the list):

1. A command is appended to the right of the phoneme by specifying the command type and the time trajectory using the same convention used for the pitch (see Fig. 4).


2. When appending commands to the right of the phoneme, the following order must be followed: Vol, SpTilt, Shim, Jit, AspN, F0Flut, AmpFlut, SpWarp, FlutFreq. Note that not all commands need to be specified; e.g., Jit can be used after Vol (but not after AspN).


3. When a command is appended to the right of a phoneme, the time trajectory specification must always begin at the 0% instant and terminate at the 100% instant.
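For instance, the following constructed fragment (phoneme, duration, and values are invented for illustration) respects all three rules: SpTilt is set globally in the header, and on the phoneme line Vol precedes Jit, with both trajectories starting at the 0% instant and ending at the 100% instant.

    ;;SpTilt=-0.3
    a 200 0 120 100 110 Vol 0 -2 100 -2 Jit 0 0.05 100 0.05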

 

Examples

Experiments on the reproduction of typical non-modal phonation modalities (in Italian)

A French example:

 

 

Original Synthesis

New Synthesis