Digitizing Voice Files

Overview

Sound is perceived as pressure variations that can be represented as a continuous waveform with peaks and valleys deviating from the average pressure level. The height of the peaks or the depth of the valleys relative to the average level is the amplitude of the pressure variation. The perceived volume of the sound is roughly proportional to the logarithm of the amplitude. The number of peaks passing in a unit of time is the frequency of the sound, measured in cycles per second, or hertz (Hz).

Sounds are generally composed of many superimposed frequencies. While the human ear can detect sounds from roughly 20 Hz to 20 kHz, it is generally accepted that human speech can be clearly transmitted in the 300 Hz to 3 kHz band.

To store sound on digital media, it is necessary to digitize the continuous waveform. The waveform is sampled at discrete intervals and a measure of the amplitude for each sample is recorded. These discrete samples are used to recreate the continuous waveform and to play the file.
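The sampling step can be sketched in a few lines of Python. This is only an illustration: the function name sample_sine and its parameters are made up for the example, and a pure sine tone stands in for a real voice signal.

```python
import math

def sample_sine(freq_hz, sample_rate_hz, duration_s, amplitude=1.0):
    """Return amplitude samples of a sine tone taken at regular intervals."""
    n_samples = round(sample_rate_hz * duration_s)
    return [
        # Sample n is taken at time n / sample_rate_hz seconds.
        amplitude * math.sin(2 * math.pi * freq_hz * n / sample_rate_hz)
        for n in range(n_samples)
    ]

# One millisecond of a 1000 Hz tone sampled at 8000 Hz gives 8 samples.
samples = sample_sine(freq_hz=1000, sample_rate_hz=8000, duration_s=0.001)
```

Each entry in samples is the waveform's amplitude at one sampling instant. Playing the file back amounts to reconstructing a smooth waveform through these points.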

Sampling Rate

The sampling rate, measured in samples per second or Hz, is the number of waveform samples taken per second to digitize the continuous waveform.

Note the distinction between the sampling rate and the audio frequency, both of which are measured in hertz. The audio frequency is the number of pressure fluctuations per second in the sound itself. The sampling rate is the number of times per second that the waveform is measured.

The higher the sampling rate, the better the sound quality (at the expense of disk space). A waveform sampled at a given rate can represent audio frequencies up to half that rate, so 8000 Hz sampling covers the speech band described above. Because 8000 Hz is also the standard sample rate used by telephone companies for digital voice transmission, there is no advantage in recording voice files at rates above 8000 Hz for telephony applications. The most common voice card sample rates are 6000 Hz and 8000 Hz.

Sample Size

The sample size is the number of bits used to represent the amplitude of each sample. The more bits used, the better the sound quality. The amplitude measure may be the complete amplitude of the sample, or it may be the change in amplitude compared to the previous sample. The common sample sizes used by voice cards are 4 and 8 bits. Sound cards usually use 8-bit or 16-bit samples.
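The effect of sample size on quality can be seen with a small quantization sketch. It assumes samples are normalized to the range -1.0 to 1.0 and uses simple uniform rounding. The names quantize and dequantize are hypothetical and do not correspond to any voice card API.

```python
def quantize(sample, bits):
    """Map a sample in [-1.0, 1.0] to an integer code with 2**bits levels."""
    levels = 2 ** bits
    clipped = max(-1.0, min(1.0, sample))
    # Scale [-1, 1] onto [0, levels - 1] and round to the nearest level.
    return round((clipped + 1.0) / 2.0 * (levels - 1))

def dequantize(code, bits):
    """Map an integer code back to an approximate amplitude in [-1.0, 1.0]."""
    levels = 2 ** bits
    return code / (levels - 1) * 2.0 - 1.0

# The same amplitude stored with 4 bits and with 8 bits.
error_4bit = abs(dequantize(quantize(0.3, 4), 4) - 0.3)
error_8bit = abs(dequantize(quantize(0.3, 8), 8) - 0.3)
```

With more bits there are more levels to choose from, so the stored value lands closer to the original amplitude and error_8bit comes out smaller than error_4bit.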

Voice Transmission Rate

The voice transmission rate, measured in bits per second or thousands of bits per second (kbps), is the sample size multiplied by the sample rate. For example, 4-bit samples taken at 8000 Hz give a transmission rate of 32 kbps.

Required disk storage is determined by the transmission rate times the duration of the recording. For example, one hour (3600 seconds) of 32 kbps voice requires 115.2 Mbits or 14.4 Mbytes.
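The two calculations above can be checked with a short script. The function names are illustrative only, and the sketch assumes 1 kbit = 1000 bits and 1 Mbyte = 1,000,000 bytes, which matches the figures in the text.

```python
def bit_rate_bps(sample_bits, sample_rate_hz):
    """Transmission rate in bits per second: sample size times sample rate."""
    return sample_bits * sample_rate_hz

def storage_bytes(rate_bps, duration_s):
    """Storage in bytes: rate times duration, at 8 bits per byte."""
    return rate_bps * duration_s / 8

rate = bit_rate_bps(sample_bits=4, sample_rate_hz=8000)   # 32000 bps = 32 kbps
size = storage_bytes(rate, duration_s=3600)               # 14400000 bytes = 14.4 Mbytes
```

Changing sample_bits to 8 doubles both the rate and the storage, which is why the compression schemes described in the next section matter for long recordings.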

Compression

How can you reduce storage requirements?

Amplitude conversion schemes are used to compress each sample to reduce storage requirements. The conversion schemes most commonly used by sound cards and voice cards for measuring and saving the amplitude of waveform samples are:

  1. linear pulse code modulation (PCM)

  2. non-linear PCM

  3. adaptive differential pulse code modulation (ADPCM)

Compression Types

Linear PCM

This is a linear measure of the amplitude. Each sample is represented by its complete magnitude. Wave (.WAV) files generally use this format (but they can be in different formats as long as the standard wave header accompanies the file).

Non-linear PCM

Because the human ear is more sensitive to variations in amplitude at the low end, non-linear compression schemes produce better results by using more bits to represent small amplitudes and fewer bits to represent larger amplitudes.

Two non-linear schemes are commonly used: mu-law, used mainly in North America and Japan, and A-law, used mainly in Europe.
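As an illustration of non-linear amplitude coding, the sketch below applies the continuous mu-law companding curve with mu = 255 to samples normalized to the range -1.0 to 1.0. It is a simplified floating-point form, not the segmented 8-bit encoding used by telephony hardware, and the function names are hypothetical.

```python
import math

MU = 255  # companding constant commonly used with mu-law

def mu_law_compress(x, mu=MU):
    """Compress a sample in [-1.0, 1.0]; small amplitudes are stretched out."""
    return math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)

def mu_law_expand(y, mu=MU):
    """Invert mu_law_compress and recover the original amplitude."""
    return math.copysign(math.expm1(abs(y) * math.log1p(mu)) / mu, y)

quiet = mu_law_compress(0.01)   # a quiet sample, 1% of full scale
loud = mu_law_compress(1.0)     # a full-scale sample
```

After compression the quiet sample occupies more than a fifth of the output range, so when the compressed value is quantized to a fixed number of bits, far more levels are available for small amplitudes than for large ones.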

ADPCM

Differential Pulse Code Modulation (DPCM) exploits the fact that successive samples are usually close in magnitude. It records each sample as the change in amplitude relative to earlier samples, rather than as the complete magnitude of the sample.
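A minimal sketch of the differential idea, assuming integer samples and the simplest possible predictor (the previous sample). The differences are stored exactly here, whereas a practical DPCM coder would also quantize them to fewer bits. The function names are illustrative.

```python
def dpcm_encode(samples):
    """Store each sample as its difference from the previous sample."""
    previous = 0
    deltas = []
    for s in samples:
        deltas.append(s - previous)
        previous = s
    return deltas

def dpcm_decode(deltas):
    """Rebuild the samples by accumulating the stored differences."""
    previous = 0
    samples = []
    for d in deltas:
        previous += d
        samples.append(previous)
    return samples

original = [100, 102, 105, 103, 103]
deltas = dpcm_encode(original)
```

After the first value, the differences are much smaller than the samples themselves, which is what allows them to be stored in fewer bits.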

Adaptive Differential Pulse Code Modulation (ADPCM) uses a more complex algorithm, based on the history of previous samples, to determine the next sample. Several variations of ADPCM are in use, each with a slightly different compression algorithm.
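The adaptive idea can be illustrated with a toy coder that stores each difference as a 4-bit signed code and adjusts its step size from the codes it has already produced. This is a teaching sketch only: the step-size rule, the starting step of 16, and all names are assumptions made for the example, and it does not implement any standard or vendor ADPCM variant.

```python
def _adapt(step, code, min_step=1, max_step=2048):
    """Grow the step after large codes and shrink it after small ones."""
    if abs(code) >= 4:
        return min(step * 2, max_step)
    if abs(code) <= 1:
        return max(step // 2, min_step)
    return step

def toy_adpcm_encode(samples):
    """Encode integer samples as 4-bit signed codes in the range -8..7."""
    predicted, step = 0, 16
    codes = []
    for s in samples:
        code = max(-8, min(7, round((s - predicted) / step)))
        codes.append(code)
        predicted += code * step   # track what the decoder will reconstruct
        step = _adapt(step, code)
    return codes

def toy_adpcm_decode(codes):
    """Rebuild approximate samples using the same adaptation rule."""
    predicted, step = 0, 16
    out = []
    for code in codes:
        predicted += code * step
        out.append(predicted)
        step = _adapt(step, code)
    return out

codes = toy_adpcm_encode([0, 40, 90, 130, 150, 155, 150, 120])
decoded = toy_adpcm_decode(codes)
```

Because the decoder applies the same adaptation rule to the same codes, it follows the encoder's step size without the step ever being stored in the file. The reconstruction is approximate, which is the price paid for the smaller sample size.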

What formats does VBVoice support?