How Speech Recognition Works

Automatic speech recognition (ASR) requires an audio source, a speech recognition engine, and one or more grammars.

When the input arrives via the audio source, the speech recognition engine processes the input and attempts to translate it into text. The engine uses one or more grammars when translating the audio into text.

A grammar defines a set of words and phrases that can be recognized. It may include rules that predict the most likely sequences of words, or it may define a context that identifies the subject of dictation and the expected style of language. If the engine succeeds in translating the audio input into text, it passes the text to the application.
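As a toy illustration of how a grammar constrains the engine's output, the sketch below defines a small set of allowed phrases and picks the one closest to a raw acoustic hypothesis. The `GRAMMAR` set, the `recognize` function, and the use of string similarity as a stand-in for acoustic scoring are all invented for this example; real engines score audio, not text.

```python
import difflib

# A grammar: the complete set of phrases the engine may recognize.
GRAMMAR = ["open the door", "close the door", "turn on the light"]

def recognize(hypothesis):
    """Return the grammar phrase best matching the raw hypothesis, or None.

    String similarity stands in for acoustic matching in this sketch.
    """
    matches = difflib.get_close_matches(hypothesis, GRAMMAR, n=1, cutoff=0.6)
    return matches[0] if matches else None

print(recognize("opin the dor"))   # close to "open the door"
print(recognize("sing me a song"))  # matches nothing in the grammar
```

Because the grammar is small and fixed, the engine never has to consider arbitrary text; it only decides which allowed phrase, if any, was spoken.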


Speech Recognition Modes

All speech recognition involves detecting and recognizing words. Most speech recognition engines can be categorized by how they perform these basic tasks:

Matching Techniques

The method that the engine uses to match a detected word to known words in its vocabulary.

Speaker Dependence

The degree to which the engine is restricted to a particular speaker.

Vocabulary Size

The number of words that the engine searches when looking for a match.

Word Separation

The degree of isolation between words that the engine needs in order to recognize a word.

Matching techniques and word separation are described in more detail below.

Matching Techniques

Speech recognition engines match a detected word to a known word using one of these techniques:

Whole-Word Matching

The engine compares the incoming digital-audio signal against a prerecorded template of the word. This technique requires much less processing than sub-word matching, but it requires that every word be prerecorded for recognition to occur. If the grammar is large (several hundred thousand words, for example), prerecording every template can be time-consuming.

Whole-word templates also require large amounts of storage (between 50 and 512 bytes per word) and are practical only if the recognition vocabulary is known when the application is developed. It is not always easy to predict all combinations of words that a user might utter.
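The whole-word idea can be sketched as follows: each vocabulary word is stored as a prerecorded template, and the engine picks the template closest to the incoming signal. The templates here are short, made-up lists of frame energies and the distance measure is plain Euclidean; real engines use more robust techniques such as dynamic time warping.

```python
import math

# Prerecorded templates: one stored pattern per vocabulary word.
# The numbers are invented frame energies, purely for illustration.
TEMPLATES = {
    "yes": [0.9, 0.7, 0.2, 0.1],
    "no":  [0.3, 0.8, 0.9, 0.4],
}

def match_whole_word(signal):
    """Return the vocabulary word whose template is nearest to the signal."""
    return min(TEMPLATES, key=lambda word: math.dist(signal, TEMPLATES[word]))

print(match_whole_word([0.8, 0.7, 0.3, 0.1]))  # nearest to the "yes" template
```

The storage cost is visible even in this sketch: every word needs its own template, so the dictionary grows linearly with the vocabulary.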

Sub-Word Matching

The engine searches for sub-word matches - usually strings of phonemes - and then performs further pattern recognition. This technique requires more processing than whole-word matching, but it also requires much less storage (between 5 and 20 bytes per word).

Importantly, the pronunciation of a word can be guessed from its text form, so the word does not need to be prerecorded in its entirety beforehand.
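The last point can be illustrated with a toy letter-to-sound converter: instead of storing audio, the vocabulary stores compact phoneme strings, and the pronunciation of a new word is guessed from its spelling. The `LETTER_TO_SOUND` rules and phoneme symbols below are drastically simplified inventions, not a real phoneme inventory.

```python
# A few illustrative letter-to-sound rules (longest pattern wins).
LETTER_TO_SOUND = {"ph": "f", "igh": "ai", "c": "k"}

def guess_phonemes(word):
    """Guess a phoneme string from spelling -- no recording needed."""
    result, i = [], 0
    while i < len(word):
        for pattern in sorted(LETTER_TO_SOUND, key=len, reverse=True):
            if word.startswith(pattern, i):
                result.append(LETTER_TO_SOUND[pattern])
                i += len(pattern)
                break
        else:
            # No rule applies: keep the letter as its own symbol.
            result.append(word[i])
            i += 1
    return "".join(result)

print(guess_phonemes("light"))  # "igh" -> "ai"
print(guess_phonemes("phone"))  # "ph" -> "f"
```

A phoneme string like this is a handful of bytes per word, which is where the storage advantage over whole-word templates comes from.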

Word Separation

Speech recognition engines typically require a specific type of verbal input in order to detect words:

Discrete Speech

Every word must be isolated by a pause before and after the word - typically about a quarter of a second - in order for recognition to occur. Discrete speech recognition requires much less processing than word spotting or continuous speech, but it is less natural and user-friendly.
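With a required pause between words, segmentation becomes simple, which is why discrete speech needs so little processing: the engine can split the signal wherever it sees a run of low-energy (silent) frames. The frame energies, threshold, and minimum pause length below are invented for illustration.

```python
def split_on_pauses(energies, threshold=0.1, min_pause_frames=2):
    """Split a sequence of frame energies into per-word segments."""
    words, current, silent = [], [], 0
    for e in energies:
        if e < threshold:
            silent += 1
            # A long enough silence ends the current word.
            if silent >= min_pause_frames and current:
                words.append(current)
                current = []
        else:
            silent = 0
            current.append(e)
    if current:
        words.append(current)
    return words

# Two words separated by a clear pause:
print(split_on_pauses([0.5, 0.6, 0.02, 0.01, 0.7, 0.4]))
```

Without the enforced pauses, this simple energy test would fail, which previews why continuous speech is so much harder.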

Word Spotting

A series of words may be spoken in a continuous utterance, with no discrete pauses, but the engine recognizes only one word or phrase.

EXAMPLE

A word-spotting engine listens for the word "time". The user says "Tell me the time" or "Time to go", but the engine recognizes only the word "time".

Word spotting is useful when a limited number of commands or answers are expected from the user and the way that the user utters the commands is unpredictable or unimportant.
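The example above can be sketched in a few lines: the engine scans the whole utterance but reports only the keyword it was told to listen for. The `spot` function and its transcript input are illustrative stand-ins for the acoustic search a real spotter performs.

```python
def spot(utterance, keyword="time"):
    """Return the keyword if it occurs anywhere in the utterance, else None."""
    words = utterance.lower().split()
    return keyword if keyword in words else None

print(spot("Tell me the time"))  # only "time" is recognized
print(spot("Time to go"))        # same result, different phrasing
print(spot("Hello there"))       # keyword absent: nothing recognized
```

Everything around the keyword is ignored, which is exactly why the user's phrasing can be unpredictable without hurting recognition.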

Continuous Speech

The engine encounters a continuous utterance with no discrete pauses between words, but it recognizes all the uttered words. Continuous speech recognition is the best technology in terms of usability, because it allows for the most natural speaking style. However, it is also the most computationally intensive because of the difficulty in identifying the beginning and ending of words.
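The boundary-finding difficulty can be made concrete with a toy segmenter: given a pause-free stream (a character string standing in for audio) and a known vocabulary, it enumerates every way to carve the stream into words. The `VOCAB` set is invented; real engines search over acoustic hypotheses, not characters, and must weigh many competing segmentations.

```python
# An illustrative vocabulary for the segmentation sketch.
VOCAB = {"wreck", "a", "nice", "beach", "recognize", "speech"}

def segmentations(stream):
    """Return every way to split the stream into known vocabulary words."""
    if not stream:
        return [[]]
    results = []
    for i in range(1, len(stream) + 1):
        head = stream[:i]
        if head in VOCAB:
            # The prefix is a word; segment the remainder recursively.
            for rest in segmentations(stream[i:]):
                results.append([head] + rest)
    return results

# Two pause-free streams and their recovered word boundaries:
print(segmentations("recognizespeech"))
print(segmentations("wreckanicebeach"))
```

With a realistic vocabulary, many streams admit several segmentations, and the engine must rank them all, which is where the heavy computation goes.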