Italian Diphone Database

A new Italian synthesis database with a male speaker (P.C.) has been recorded at ISTC-SPFD CNR, and a similar database with a female speaker (L.P.) has been recorded at ITC-irst. The laryngograph signal (electroglottograph, EGG) was recorded as well, to allow better pitch extraction. The speaker reads a set of carefully designed nonsense or real Italian words embedded in syntactically correct but semantically anomalous sentences, which were constructed to elicit particular phonetic effects. This technique ensures that the collected database contains only the required variability. Various scripts for automatic segmentation, diphone extraction and LPC analysis have been developed to speed up the creation of a new voice.

The database has been formatted in the FESTIVAL, OGI Residual LPC and MBROLA synthesis formats.

Text/Linguistic Analysis

Various modules have been constructed for:

  • input text-string processing;
  • equivalent characters mapping;
  • distinction and processing of numerical data and function words;
  • letter to sound module: phonetic transcription;
  • syllabification;
  • compilation of a lexicon: it contains approximately 500000 word-forms with their part-of-speech (POS) specified.


Prosodic Analysis

The control of prosody plays a central role in TTS synthesis; in fact, one of the most pressing problems in TTS is intonation. The problem divides into two areas: deciding what intonation the system should use for an utterance, and realising that intonation as a fundamental frequency contour. Traditionally, two approaches have been used for the front end (that is, "text" or "linguistic" analysis) of speech synthesizers. The first uses sophisticated rules to parse and tag the text. Although theoretically justified, the algorithms developed to date have proved so unreliable and unwieldy that many have turned to the second approach, in which a front end is assembled pragmatically and very simple (sometimes statistical) rules are used to detect where phrase boundaries and similar events should be placed.

The task of a prosodic module (see the following Figure) in a TTS synthesizer is to compute the values of a set of prosodic variables, starting from the linguistic information contained in the text to be synthesized. In current TTS technologies, synthesis control has focused mainly on phoneme duration and pitch, the two main parameters conveying prosodic information.

[Figure: the prosodic module in a TTS synthesizer]

Rule-based approach
A rule-based prosodic duration module has been designed to assign a specific duration to each diphone. A standard phone duration has been determined for each diphone, and these durations are modified according to the position of the phone within the phrase and the word. Two simple prosodic intonation modules, one for declarative sentences and the other for interrogative sentences, have been built using the stress and function-word cues obtained previously.

The rule-based prosodic module for Italian Festival is quite simple and relies essentially on punctuation marks and function words. Each phoneme is assigned a mean duration, computed statistically from a large corpus of Italian sentences produced by various RAI Italian television announcers. The duration of stressed vowels is increased by 20% relative to the average vowel duration. Pauses between words fall into two categories: short pauses of 250 ms, associated with punctuation marks such as [ ' \ , ; ], and long pauses of 750 ms, associated with the main conclusive punctuation marks such as [ ? . : ! ].

As for intonation, declarative sentences are segmented into intonational phrases, each of which is assigned a baseline starting at 140 Hz and ending at 60 Hz (for a typical male voice). On every stressed syllable the f0 contour is raised by approximately 10 Hz above the baseline, while the last syllable has a steeper slope than the baseline. The baseline is reset at function words.

In interrogative sentences, a falling-rising pattern is associated with the last word. A “Target Point” (TP) is assigned to the last stressed vowel and aligned at 3/4 of its duration: at that point the f0 curve reaches a value corresponding to 80% of the baseline, falling from a value equal to the baseline assigned to the end of the preceding vowel. Starting from TP, f0 rises to f0max with a slope that spans the post-tonic unstressed syllables. The rise is faster on the last syllable.
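The rules above can be sketched in a few lines of Python. This is only a minimal illustration, not the module's actual code: all function names are hypothetical, the baseline is assumed to decline linearly between its start and end values, and the steeper final-syllable slope and the baseline reset at function words are left out.

```python
# Illustrative sketch of the rule-based prosody rules; all names are hypothetical.

SHORT_PAUSE_MS = 250   # after marks such as ' \ , ;
LONG_PAUSE_MS = 750    # after marks such as ? . : !
SHORT_PAUSE_MARKS = set("'\\,;")
LONG_PAUSE_MARKS = set("?.:!")


def pause_after(mark):
    """Return the pause length in ms triggered by a punctuation mark (0 if none)."""
    if mark in LONG_PAUSE_MARKS:
        return LONG_PAUSE_MS
    if mark in SHORT_PAUSE_MARKS:
        return SHORT_PAUSE_MS
    return 0


def vowel_duration(mean_ms, stressed):
    """Stressed vowels are lengthened by 20% relative to the mean vowel duration."""
    return mean_ms * 1.2 if stressed else mean_ms


def declarative_f0(n_syllables, stressed, start_hz=140.0, end_hz=60.0, bump_hz=10.0):
    """One f0 value per syllable: a declining baseline plus a raise on stressed syllables.

    Assumption: the baseline falls linearly from start_hz to end_hz.
    `stressed` is a set of 0-based indices of stressed syllables.
    """
    contour = []
    for i in range(n_syllables):
        t = i / (n_syllables - 1) if n_syllables > 1 else 0.0
        f0 = start_hz + (end_hz - start_hz) * t
        if i in stressed:
            f0 += bump_hz
        contour.append(f0)
    return contour


def question_target_point(vowel_start_ms, vowel_dur_ms, baseline_hz):
    """Target Point on the last stressed vowel: 3/4 into the vowel, at 80% of the baseline."""
    return vowel_start_ms + 0.75 * vowel_dur_ms, 0.8 * baseline_hz
```

For example, `declarative_f0(5, {2})` gives a five-syllable contour falling from 140 Hz to 60 Hz, with the third syllable raised 10 Hz above the baseline.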

Statistical CART-based approach

A CART (Classification and Regression Tree) is a statistical method for predicting data from a set of feature vectors. In particular, a CART is a binary branching tree (see the following Figure) with questions about the influencing factors at the nodes and the best predicted values at the leaves. The tree contains yes/no questions about the features and provides either a probability distribution or a mean and standard deviation. A decision tree is built by repeatedly finding the question that splits the data so as to minimize the mean “impurity” of the partition; impurity is small when the items in a subset are similar.
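The question-selection step can be sketched as follows. This is a generic illustration of impurity-based splitting for a numeric target, not the procedure of any particular tool: impurity is assumed here to be the variance of a subset weighted by its size, and the function names, feature names and toy data are hypothetical.

```python
from statistics import pvariance


def impurity(values):
    """Size-weighted variance of a subset; 0 when all items are identical."""
    return pvariance(values) * len(values) if len(values) > 1 else 0.0


def best_question(samples, questions):
    """Pick the yes/no question whose split has the lowest mean impurity.

    samples:   list of (features_dict, target_value) pairs
    questions: dict mapping a question name to a predicate on features_dict
    Returns (question_name, mean_impurity), or None if no question splits the data.
    """
    best = None
    for name, ask in questions.items():
        yes = [y for feats, y in samples if ask(feats)]
        no = [y for feats, y in samples if not ask(feats)]
        if not yes or not no:
            continue  # this question does not actually partition the data
        score = (impurity(yes) + impurity(no)) / len(samples)
        if best is None or score < best[1]:
            best = (name, score)
    return best


# Toy data: vowel durations (ms) that depend only on stress
samples = [
    ({"stressed": True, "final": False}, 120),
    ({"stressed": True, "final": True}, 120),
    ({"stressed": False, "final": False}, 100),
    ({"stressed": False, "final": True}, 100),
]
questions = {
    "is_stressed": lambda f: f["stressed"],
    "is_final": lambda f: f["final"],
}
print(best_question(samples, questions))
```

On this toy data the stress question is selected, because it separates the durations into two perfectly homogeneous subsets. A full tree builder would apply the same step recursively to each subset until a stopping criterion is met.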

[Figure: a CART as a binary branching tree]

The advantages of CARTs are that standard tools for their generation are widely available and that the computed regression tree is interpretable. The disadvantage is that they need a large amount of training data. For Italian Festival, two CARTs were trained in order to identify correlations between linguistic information and duration and intonation contours from two sets of training data. The duration of phonemes and the f0 values of syllables (at the start and end of the syllable and at the midpoint of the vowel) were predicted independently by two CARTs, using two corpora of different types of natural speech: a news-reading style corpus, spoken by a national TV announcer, and a more elicited child-story-reading style corpus, spoken by an Italian actor. No intonation-type transcription, such as those inspired by the ToBI or Tilt intonation theories, was considered; only text-type segmental, lexical and syntactic information, as indicated in the following Table for duration and for intonation (f0), was used in building the classification trees.

For a homogeneous treatment of the data, that is, to factor out the influence of intrinsic duration and intonation, the absolute values were first converted to z-scores, and the mean and standard deviation of each sound were stored in a separate file.
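The z-score conversion and its inverse can be illustrated with the short sketch below. The function names and the toy durations are hypothetical, and the use of the population standard deviation is an assumption, since the text does not say which estimator was used.

```python
from collections import defaultdict
from statistics import mean, pstdev


def phone_stats(observations):
    """Map each phone to the (mean, standard deviation) of its observed values."""
    by_phone = defaultdict(list)
    for phone, value in observations:
        by_phone[phone].append(value)
    return {p: (mean(v), pstdev(v)) for p, v in by_phone.items()}


def to_zscore(value, phone, stats):
    """Express a value in standard deviations from the phone's own mean."""
    mu, sigma = stats[phone]
    return (value - mu) / sigma if sigma > 0 else 0.0


def from_zscore(z, phone, stats):
    """Turn a predicted z-score back into an absolute value for that phone."""
    mu, sigma = stats[phone]
    return mu + z * sigma


# Toy durations in ms (made-up numbers)
stats = phone_stats([("a", 80), ("a", 120), ("t", 50), ("t", 70)])
print(to_zscore(120, "a", stats), from_zscore(-1.0, "t", stats))
```

Because each sound is normalized by its own statistics, a tree trained on z-scores predicts how much longer or shorter than usual a sound is, and the stored mean and standard deviation are needed again at synthesis time to recover absolute values.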

CARTs were trained on the training corpora with “WAGON”, a tool from the Edinburgh Speech Tools Library that is distributed with the FESTIVAL Speech Synthesis System.

Waveform Synthesizer

Various waveform synthesizers have been used:

  •  FESTIVAL diphone-based residual-excited LPC

FESTIVAL is a diphone-based synthesis system using the Residual-Excited LPC synthesis technique.


  •  OGI diphone-based residual-excited LPC for FESTIVAL

OGI RE-LPC is a diphone-based synthesis system using a new OGI-specific Residual-Excited LPC synthesis engine.


  •  MBROLA diphone-based PCM 16 bit/16 kHz synthesizer

MBROLA is a speech synthesizer based on the concatenation of diphones coded as 16-bit linear PCM signals. It takes a list of phonemes as input, together with prosodic information (phoneme durations and a piecewise linear description of pitch), and produces 16-bit linear speech samples at the sampling frequency of the diphone database used. It is therefore NOT a Text-To-Speech (TTS) synthesizer, since it does not accept raw text as input. This synthesizer is provided free of charge, for non-commercial, non-military applications only.
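MBROLA's input is a plain-text phoneme file in which each line holds a phoneme name, its duration in milliseconds and, optionally, pairs of values (position as a percentage of the duration, pitch in Hz) that define the piecewise linear pitch curve; `_` denotes silence. The sketch below writes such lines. The helper names and the SAMPA-like symbols are illustrative assumptions, since the actual symbol set depends on the diphone database in use.

```python
def pho_line(phoneme, duration_ms, pitch_targets=()):
    """Format one phoneme as: name duration [position% pitchHz]..."""
    fields = [phoneme, str(round(duration_ms))]
    for position_pct, pitch_hz in pitch_targets:
        fields += [str(round(position_pct)), str(round(pitch_hz))]
    return " ".join(fields)


def to_pho(segments):
    """Join a sequence of (phoneme, duration[, pitch_targets]) tuples into file text."""
    return "\n".join(pho_line(*seg) for seg in segments) + "\n"


# Hypothetical rendering of the word "ciao" framed by silences
segments = [
    ("_", 250),
    ("tS", 90, [(0, 140)]),
    ("a", 120, [(50, 130)]),
    ("o", 110, [(100, 100)]),
    ("_", 750),
]
print(to_pho(segments), end="")
```

In a complete system, the durations and pitch targets written here would come from the prosodic module, and the resulting text would be passed to MBROLA together with the chosen diphone database.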


For more information please contact:

Piero Cosi Istituto di Scienze e Tecnologie della Cognizione - Sezione di Padova "Fonetica e Dialettologia"
CNR di Padova (e-mail:

