Speech Recognition: Classification

Classification

Speech Recognition (SR) can broadly be classified into two categories:

1. Small Vocabulary / Large User Base: Well suited to automated tele-services such as voice-activated dialing and IVR, but the usable vocabulary is limited to a small set of specific commands.

2. Large Vocabulary / Small User Base: Suited to environments where a small group of people is involved. It requires more rigorous training for that particular user group, however, and gives erroneous results for anyone outside it.

Current methods rely on mathematically analyzing the digitized sound waves and their spectral properties. The sound waves spoken into the microphone are converted into a digital signal through sampling (at 16 kHz) and quantization, following the Nyquist-Shannon sampling theorem. Simply put, the theorem requires at least one sample for each compression and each rarefaction of the wave, which means that the sampling frequency must be at least twice the highest frequency component in the signal. The speech recognition program then applies various algorithms and models to account for variations and to compress the raw speech signal, which simplifies processing. This initial compression can be achieved through several methods, including Fourier transforms, Perceptual Linear Prediction, Linear Predictive Coding, and Mel-Frequency Cepstral Coefficients.
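The sampling rule described above can be illustrated with a short Python sketch. It is a minimal illustration and not the method of any particular recognizer: the function names, the 16-bit sample depth, and the test tones are assumptions chosen for the example. It computes the minimum sampling rate for a given bandwidth, samples and quantizes a pure tone, and shows how a tone above half the sampling rate is folded down to a lower apparent frequency.

```python
import math


def min_sampling_rate(max_freq_hz):
    """Lowest sampling rate that satisfies the 'twice the highest frequency' rule."""
    return 2.0 * max_freq_hz


def sample_and_quantize(freq_hz, sample_rate_hz, duration_s, bits=16):
    """Sample a unit-amplitude sine tone and round each sample to an integer level.

    The 16-bit depth is an assumption for this example.
    """
    max_level = 2 ** (bits - 1) - 1  # 32767 for 16-bit samples
    n_samples = round(sample_rate_hz * duration_s)
    return [
        round(max_level * math.sin(2 * math.pi * freq_hz * n / sample_rate_hz))
        for n in range(n_samples)
    ]


def alias_frequency(freq_hz, sample_rate_hz):
    """Apparent frequency of a pure tone once it has been sampled."""
    folded = freq_hz % sample_rate_hz
    return min(folded, sample_rate_hz - folded)


SAMPLE_RATE_HZ = 16_000  # the rate quoted in the text

# A 16 kHz rate can represent components up to 8 kHz.
print(min_sampling_rate(8_000))

# A 3 kHz tone is preserved, while a 10 kHz tone is folded down to 6 kHz.
print(alias_frequency(3_000, SAMPLE_RATE_HZ))
print(alias_frequency(10_000, SAMPLE_RATE_HZ))

# One millisecond of a 1 kHz tone gives 16 quantized samples.
samples = sample_and_quantize(1_000, SAMPLE_RATE_HZ, 0.001)
print(len(samples), samples[:5])
```

The folding behaviour is why the signal is normally band-limited before sampling: any component above half the sampling rate would otherwise be indistinguishable from a lower frequency.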

Speech recognition approaches are commonly grouped into four concepts:

1. Template Based: Predefined templates or samples are created and stored. Whenever a user utters a word, it is correlated with every template, and the one with the highest correlation is selected as the spoken word. This approach is not flexible enough to cope with varied voice patterns. Dynamic Time Warping can be considered one of these techniques.

2. Knowledge Based: These methods analyze spectrograms of the voice to collect data and derive rules that are indicative of the uttered command. They do not draw on a language knowledge base or on speech variations, and are generally used for command-based systems.

3. Stochastic: Speech is a highly random phenomenon, but it can be treated as a piecewise stationary process to which stochastic models can be applied. As stated earlier, this is one of the most popular methods used by commercial programs. Hidden Markov Models are an example of stochastic methods.

4. Connectionist: Artificial Neural Networks with multilayered structures are used to store and extract various coefficients from the speech data and to deduce the spoken word.
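As an illustration of the template-based approach in item 1, the following Python sketch matches an utterance against stored templates using the time warping technique mentioned there, commonly known as Dynamic Time Warping. It is a simplified sketch under stated assumptions: each utterance is reduced to a short list of made-up numeric features, the template words and values are hypothetical, and the best match is the template with the lowest warped distance, which stands in for "highest correlation".

```python
def dtw_distance(a, b):
    """Dynamic Time Warping distance between two 1-D feature sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = cheapest alignment of the first i items of a with the first j of b
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # a advances, b repeats its current item
                cost[i][j - 1],      # b advances, a repeats its current item
                cost[i - 1][j - 1],  # both advance together
            )
    return cost[n][m]


def recognize(utterance, templates):
    """Return the template word whose sequence is closest to the utterance."""
    return min(templates, key=lambda word: dtw_distance(utterance, templates[word]))


# Hypothetical templates: one made-up feature value per frame.
TEMPLATES = {
    "yes": [1.0, 3.0, 4.0, 3.0, 1.0],
    "no": [4.0, 2.0, 1.0, 2.0, 4.0],
}

# The same contour as "yes", but spoken more slowly (frames repeated).
slow_yes = [1.0, 1.0, 3.0, 3.0, 4.0, 4.0, 3.0, 1.0]

print(recognize(slow_yes, TEMPLATES))
```

Because the alignment may repeat frames on either side, a slower or faster rendition of the same word can still match its template, which a rigid frame-by-frame comparison would not allow.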

Performance is generally measured in terms of accuracy and speed. The usual measures are the Single Word Error Rate, which captures the misunderstanding of a single word in a spoken sentence, and the Command Success Rate, which captures the accurate interpretation of the spoken command. Different methods give varying results, which further depend on various external factors.
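The two measures can be sketched in Python as follows. The text does not give formulas, so this rests on assumptions: the word error rate is computed in the commonly used way, as the word-level edit distance (substitutions, insertions, and deletions) divided by the number of reference words, and the command success rate is taken as the fraction of commands recognized exactly. The function names and sample sentences are hypothetical, and both functions assume non-empty input.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    # prev[j] = edit distance between the reference words seen so far
    # and the first j hypothesis words
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def command_success_rate(expected, recognized):
    """Fraction of commands that were interpreted exactly as expected."""
    correct = sum(e == r for e, r in zip(expected, recognized))
    return correct / len(expected)


# One misheard word out of four.
print(word_error_rate("call john smith now", "call jon smith now"))

# Three of four commands interpreted correctly.
print(command_success_rate(
    ["dial", "hang up", "redial", "mute"],
    ["dial", "hang up", "dial", "mute"],
))
```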