How Does Text-to-Speech Work?

To render text into speech, a text-to-speech engine must first determine the phonemes needed to vocalize a word and then translate those phonemes into digital-audio data.
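As a rough illustration of that two-stage flow, the Python sketch below uses an invented mini-lexicon and placeholder function names (not any real engine's API): the first stage looks up phonemes for each word, and the second stage is left to the techniques described in the sections that follow.

```python
# Stage 1: text to phonemes; Stage 2: phonemes to digital audio.
# The lexicon and function names here are illustrative placeholders only.

LEXICON = {
    "hello": ["h", "eh", "l", "œ"],   # phonemes as in the example below
}

def text_to_phonemes(text: str) -> list[str]:
    """Determine the phonemes needed to vocalize each word."""
    phonemes: list[str] = []
    for word in text.lower().split():
        word = word.strip(".,!?")
        phonemes.extend(LEXICON.get(word, ["?"]))  # real engines fall back to letter-to-sound rules
    return phonemes

def phonemes_to_audio(phonemes: list[str]) -> bytes:
    """Translate the phoneme sequence into digital-audio data
    (by synthesis or by diphone concatenation, as described below)."""
    raise NotImplementedError

print(text_to_phonemes("Hello!"))   # ['h', 'eh', 'l', 'œ']
```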

Phoneme Translation Methods

Most text-to-speech engines can be categorized by the method used to translate phonemes into audible sound. There are two methods: synthesis and diphone concatenation.

Synthesis

A text-to-speech engine that uses synthesis generates sounds like those created by the human vocal cords and applies various filters to simulate throat length, mouth cavity, lip shape, and tongue position.

The voice produced by current synthesis technology tends to sound less human than a voice produced by diphone concatenation, but it is possible to obtain different qualities of voice by changing a few parameters.
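As a concrete sketch of this source-filter idea, the Python fragment below (using NumPy and SciPy, with illustrative pitch and formant values rather than any engine's actual parameters) excites an impulse train standing in for the vocal cords and shapes it with a few resonant filters. Changing the pitch, formant frequencies, and bandwidths is what changes the character of the voice.

```python
# A minimal source-filter sketch of synthesis; the formant values below very
# roughly approximate an "ah"-like vowel and are assumptions for illustration.
import numpy as np
from scipy.signal import lfilter

FS = 16000                      # sample rate in Hz

def glottal_source(pitch_hz: float, duration_s: float) -> np.ndarray:
    """Impulse train standing in for the vocal-cord excitation."""
    n = int(FS * duration_s)
    source = np.zeros(n)
    period = int(FS / pitch_hz)
    source[::period] = 1.0
    return source

def formant_filter(signal: np.ndarray, freq_hz: float, bandwidth_hz: float) -> np.ndarray:
    """Two-pole resonator simulating one vocal-tract resonance."""
    r = np.exp(-np.pi * bandwidth_hz / FS)
    theta = 2 * np.pi * freq_hz / FS
    a = [1.0, -2.0 * r * np.cos(theta), r * r]   # pole positions set the resonance
    b = [1.0 - r]                                # rough gain normalization
    return lfilter(b, a, signal)

# Excite the "vocal cords", then shape the sound with three formant resonances.
voiced = glottal_source(pitch_hz=120.0, duration_s=0.5)
for freq, bw in [(700, 110), (1220, 120), (2600, 160)]:   # approximate "ah" formants
    voiced = formant_filter(voiced, freq, bw)
voiced /= np.max(np.abs(voiced))   # normalize to [-1, 1] for playback or file output
```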

Diphone Concatenation

A text-to-speech engine that uses diphone concatenation links short digital-audio segments together and performs intersegment smoothing to produce a continuous sound. Each diphone consists of two phonemes, one that leads into the sound and one that finishes the sound.

EXAMPLE

The word "hello" consists of these phonemes: h eh l œ. The corresponding diphones are:

silence-h

h-eh

eh-l

l-œ

œ-silence
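In code, producing that diphone list from a phoneme sequence is just a matter of padding the sequence with silence and pairing neighbouring phonemes. The sketch below reuses the phoneme symbols from the example above and is not tied to any particular engine's phoneme set.

```python
# Sketch: derive the diphone sequence for a word from its phoneme sequence.
def phonemes_to_diphones(phonemes: list[str]) -> list[tuple[str, str]]:
    """Pad the sequence with silence, then pair each phoneme with the next."""
    padded = ["silence"] + phonemes + ["silence"]
    return list(zip(padded, padded[1:]))

print(phonemes_to_diphones(["h", "eh", "l", "œ"]))
# [('silence', 'h'), ('h', 'eh'), ('eh', 'l'), ('l', 'œ'), ('œ', 'silence')]
```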

Diphones are acquired by recording many hours of a human voice and by meticulously identifying the beginning and end of each phoneme. Although this technique can produce a more realistic voice, creating a new voice requires a great deal of work, and the result is not localizable because the recorded phonemes are specific to the speaker's language.

Articulation and Prosody

High-quality speech synthesis is a technically demanding task. A text-to-speech system must model both the generic, phonetic features that make speech intelligible and the idiosyncratic, acoustic characteristics that make it human.

Natural, human speech is the complex product of a complex vocal apparatus. The vocal cords, uvula, soft palate, hard palate, tongue, teeth, lips, diaphragm, and nasal cavity can all be involved in the articulation of a sound. The relative shapes, sizes, and positions of these organs determine both the phonetics and the audio quality of the utterance.

While speech recognition deals with phonetics alone, a speech synthesis system must also consider prosody. The system must know both what it should say (phonetics) and how it should say it (prosody). The elements of prosody (register, accentuation, intonation, and speed of delivery) are barely represented in the orthography of a text, but they are important parts of natural-sounding speech and can be crucial to its correct interpretation.
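One way to picture the distinction is to imagine each phoneme carrying prosodic targets alongside its identity. The field names in the sketch below are illustrative assumptions, not a standard representation used by any particular engine.

```python
# Hypothetical representation: phonetics (which phoneme) plus prosody
# (pitch, duration, stress) attached to each unit to be spoken.
from dataclasses import dataclass

@dataclass
class ProsodicPhoneme:
    phoneme: str          # what to say (phonetics)
    pitch_hz: float       # intonation / register target
    duration_ms: float    # speed of delivery
    stress: bool = False  # accentuation

# "Hello?" spoken as a question: the pitch rises toward the end of the word.
question = [
    ProsodicPhoneme("h",  pitch_hz=110, duration_ms=60),
    ProsodicPhoneme("eh", pitch_hz=115, duration_ms=90, stress=True),
    ProsodicPhoneme("l",  pitch_hz=125, duration_ms=70),
    ProsodicPhoneme("œ",  pitch_hz=150, duration_ms=140),
]
```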

Many text sources also contain abbreviations (mailing lists being a good example). Abbreviations are almost always ambiguous. For instance, is St. an abbreviation of Street or Saint? There is no way of knowing how to pronounce it without context, such as St. Louis or Market St.
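A text normalizer therefore has to look at the surrounding words. The toy heuristic below is a deliberately simplistic illustration, not how any shipping engine actually resolves abbreviations: it expands "St." to "Saint" when it precedes a capitalized name and to "Street" when it follows one.

```python
# Toy context-dependent abbreviation expansion for "St." only.
def expand_st(tokens: list[str], i: int) -> str:
    """Expand "St." using its neighbours: a following capitalized word suggests
    "Saint", a preceding capitalized word suggests "Street"."""
    follows_name = i > 0 and tokens[i - 1][:1].isupper()
    precedes_name = i + 1 < len(tokens) and tokens[i + 1][:1].isupper()
    if precedes_name and not follows_name:
        return "Saint"      # e.g. "St. Louis"
    return "Street"         # e.g. "Market St."

tokens = ["Market", "St."]
print([expand_st(tokens, i) if t == "St." else t for i, t in enumerate(tokens)])
# ['Market', 'Street']
```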

All of the TTS engines supported by VBVoice can alter their default text-synthesis behaviour through control tags, embedded tags, escape sequences, or dedicated fine-tuning tools. Be sure to account for prosody in your system if you want the most natural, human-like vocalization of text.

Read more about fine-tuning your speech output using different TTS engines.