Overview - Speech Generation Projects

ISTC-SPFD CNR has recently increased its activities in the area of speech generation and is now focusing on two key areas of research and development: small-footprint speech synthesis and very high quality, application-specific synthesis.  By way of introduction, speech generation is generally accomplished by one of the following three methods:

  1. General-purpose concatenative synthesis.  The system translates incoming text into phoneme labels, stress and emphasis tags, and phrase break tags.  This information is used to compute a target prosodic pattern (i.e., phoneme durations and pitch contour).  Finally, signal processing methods retrieve acoustic units (fragments of speech corresponding to short phoneme sequences such as diphones) from a stored inventory, modify the units so that they match the target prosody, and glue and smooth (concatenate) them together to form an output utterance.

  2. Corpus-based synthesis.  Similar to general-purpose concatenative synthesis, except that the inventory consists of a large corpus of labeled speech, and that, instead of modifying the stored speech to match the target prosody, the corpus is searched for phoneme sequences whose prosodic patterns match the target prosody.

  3. Phrase splicing. Stored prompts, sentence frames, and stored items used in the slots of these frames, are glued together.
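
The search step in the second method above can be sketched as follows.  This is only an illustration under stated assumptions, not the method used in any particular system: the Unit record, the prosody_cost function, and its weighting are hypothetical, and prosody is reduced to one duration and one mean pitch value per stored fragment.

```python
from dataclasses import dataclass


@dataclass
class Unit:
    """Hypothetical record for one labeled fragment in the speech corpus."""
    phonemes: tuple      # phoneme labels covered by the fragment
    duration_ms: float   # measured duration of the fragment
    mean_f0_hz: float    # measured mean pitch of the fragment


def prosody_cost(unit, target_duration_ms, target_f0_hz, f0_weight=1.0):
    """Assumed cost: relative duration mismatch plus weighted relative pitch mismatch."""
    duration_error = abs(unit.duration_ms - target_duration_ms) / target_duration_ms
    pitch_error = abs(unit.mean_f0_hz - target_f0_hz) / target_f0_hz
    return duration_error + f0_weight * pitch_error


def select_unit(corpus, phonemes, target_duration_ms, target_f0_hz):
    """Return the stored fragment with the right phonemes and the closest prosody.

    Returns None when the corpus has no fragment for the phoneme sequence,
    which is the coverage problem corpus-based synthesis has to live with.
    """
    candidates = [u for u in corpus if u.phonemes == tuple(phonemes)]
    if not candidates:
        return None
    return min(
        candidates,
        key=lambda u: prosody_cost(u, target_duration_ms, target_f0_hz),
    )


# Toy corpus: two recordings of the same phoneme pair with different prosody.
corpus = [
    Unit(("k", "a"), 180.0, 120.0),
    Unit(("k", "a"), 140.0, 200.0),
    Unit(("s", "a"), 150.0, 130.0),
]

best = select_unit(corpus, ["k", "a"], target_duration_ms=150.0, target_f0_hz=190.0)
missing = select_unit(corpus, ["t", "a"], target_duration_ms=150.0, target_f0_hz=190.0)
```

Here the second ("k", "a") fragment is chosen because its duration and pitch are closer to the target, and nothing is modified by signal processing.  A general-purpose concatenative system would instead keep a single fragment per phoneme sequence and reshape it to the target prosody.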

The strengths and weaknesses of these methods are complementary.  As for speech quality and scope, general-purpose concatenative synthesis is able to handle any input sentence but generally produces mediocre quality.  Corpus-based synthesis can produce very high quality, but only if its speech corpus contains the right phoneme sequences with the right prosody for a given input sentence.  If the corpus contains the right phonemes but with the wrong prosody, the end result may locally (i.e., within the range of a phoneme sequence that was available in the corpus) sound quite good, but the utterance as a whole may have a bizarre sing-song quality with confusing accelerations and decelerations.  Finally, phrase splicing methods produce completely natural speech, but can only say the pre-stored phrases or combinations of sentence frames and slot items; even then, naturalness can be a problem if the slot items are not carefully matched to the sentence frames in terms of prosody.
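
The scope limit of phrase splicing can be made concrete with a small sketch.  The frame, the slot items, and the file names below are invented for illustration; the point is only that a value with no stored recording cannot be spoken at all.

```python
# Hypothetical sentence frame: stored prompts with one slot, written as {slot_name}.
FRAME = ["your_balance_is.wav", "{amount}", "euros.wav"]

# Hypothetical stored slot items: one recording per value the system can say.
SLOT_ITEMS = {
    "amount": {
        "ten": "ten.wav",
        "twenty": "twenty.wav",
    },
}


def splice(frame, slot_values):
    """Return the list of recordings to play back to back for one utterance.

    Raises KeyError when a slot value has no stored recording, because
    phrase splicing has no way to generate speech it has not stored.
    """
    playlist = []
    for piece in frame:
        if piece.startswith("{") and piece.endswith("}"):
            slot = piece[1:-1]
            value = slot_values[slot]
            recordings = SLOT_ITEMS[slot]
            if value not in recordings:
                raise KeyError(f"no stored recording for {slot}={value!r}")
            playlist.append(recordings[value])
        else:
            playlist.append(piece)
    return playlist


playlist = splice(FRAME, {"amount": "ten"})

try:
    splice(FRAME, {"amount": "thirty"})
    out_of_scope = False
except KeyError:
    out_of_scope = True
```

The sketch says nothing about prosody: in practice each slot item would also have to be recorded so that it matches the intonation of the frame position it fills.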

An additional issue to consider is the amount of work required to build a system.  The cost of generating a corpus or an acoustic unit inventory is significant because, in addition to making the speech recordings, each recording has to be analyzed microscopically by hand to determine phoneme boundaries, phoneme labels, and other tags.  Such time-consuming analysis is not necessary for phrase splicing methods.  On the other hand, phrase splicing may be impractical for applications involving names, since every name would have to be recorded (in Italy, there are ??1.5?? million distinct last names!).

A final consideration is size.  Although the prices of memory and disk space are continually dropping, being able to run more channels on a given hardware platform translates directly into increased profits, and there is also an increasing interest in using speech synthesis on handheld devices.  Thus, size still matters.  General-purpose concatenative synthesis has the edge on size.  Moreover, its quality limitations are less of a problem on handheld devices because the acoustic capabilities of these devices are themselves limited.

In other words, each of these methods has problems with quality, scope, the amount of resources required, or size.  ISTC-SPFD CNR is focusing on the following projects:

Our software is integrated into an existing TTS engine that has a sufficiently rich internal data structure, such as FESTIVAL (with the OGI Re-LPC or MBROLA synthesizers for FESTIVAL).


For more information please contact :

Piero Cosi Istituto di Scienze e Tecnologie della Cognizione - Sezione di Padova "Fonetica e Dialettologia"
CNR di Padova (e-mail:

