
Speech Recognition

Written By: Anshul Thakur

Remember C-3PO, the talking human-like robot from the Star Wars saga? The gold-plated humanoid could do just about anything he was told to. Most importantly, he could understand and respond in over six million forms of communication. Six million forms may still be a far-fetched goal, but the possibility of total hands-free vocal control of cognitive machines beckons scientists and corporations alike.

Fig. 1: Representational Image Showing Speech Recognition Technology
Speech recognition technology refers to the recognition of human speech by computers, followed by the execution of a voice-initiated program or function. Interpreting speech across variations in accent, pitch, tone, articulation, nasality and pronunciation is handled effortlessly by the human brain, but is a genuine challenge for a computer. Moreover, natural voice production in humans is a non-linear process that is not entirely under conscious control and varies with factors as diverse as gender, upbringing and emotional state. The pattern is further distorted by noise and echoes in the surrounding environment.
 
Another challenge is that speech is seldom discrete; it is a continuous stream of words, and the pauses between them are hard to discern. The classic example is to say the words "recognize speech" at varying speeds: without appropriate pauses, it sounds like "wreck a nice beach". The presence of homophones further aggravates the situation. All this gives processors plenty of numbers to crunch, and gives innovators and scientists ample food for thought in devising new ways of improving the prevailing technologies.
 
Fig. 2: Summarizing Classification of American English into Sound Classes
 
History
The first attempt at speech recognition was made at least 50 years before digital computers were invented. Alexander Graham Bell, in an attempt to help his deaf wife understand what people said, tried to build a device that would produce visual images of the words spoken into it. While he managed to produce spectrographic images of sound, his wife could not decipher them. The work, however, led to the invention of the telephone.
 
It was not until the advent of digital computers that further serious attempts at speech recognition were made. In 1952, Bell Labs introduced the first 'Automatic Speech Recognizer', named 'Audrey'. It could recognize only the ten digits, with 97 to 99% accuracy, provided the speaker was male, spoke with a 350 ms pause between words, and Audrey had been adjusted to the user's speech profile. In other cases, accuracy fell to about 60%. The principle behind Audrey, of recognizing phonemes, served as a reference model for the barely successful research of many years to come. It was the collective work of Noam Chomsky and Morris Halle in phonology, built on the idea of generative grammar, that language could be analyzed programmatically, which led mainstream linguistics to switch from phonemes to breaking the sound pattern down into smaller, more discrete features.
 
Years of fruitless work led to a shutdown of further research at Bell Labs for almost ten years. However, the defense research agency ARPA continued funding work during that time, and under its sponsorship 'Harpy' was born at Carnegie Mellon University. Though it was slow, far from real time, and required training, it did recognize connected speech within a vocabulary of about 1,000 words. It used Hidden Markov Models, which remain the most popular model for speech recognition.
 
In the 1980s and 1990s, DARPA (previously ARPA) floated the same challenge with more stringent performance rules, and the results brought the word error rate down from about 10% to a few percent. Another well-established school of thought, that of artificial neural networks, held that speech recognition was essentially pattern recognition and that brain-like models could possibly lead to brain-like performance, opening up another dimension of research on speech recognition.
 
Microsoft released a speech recognition system compatible with Office XP. It too required training and a quiet, static environment, and worked for a single user. Further, during a demonstration of the speech recognition capabilities of Windows Vista, the system performed well when opening and accessing files, but was not very accurate at transcribing documents. As the field continues to thrive, more and more companies have emerged. Dragon NaturallySpeaking from Nuance is a popular speech-to-text package. Other companies competing in this technology include NICE Systems, Verint Systems, Vlingo, Unisys, ChaCha, SpeechCycle, Klausner Technologies and Sensory.
 
Classification
Speech Recognition (SR) can broadly be classified into two categories:
1.      Small Vocabulary/ Large User Base: Good for automated tele-services such as voice-activated dialing and IVR, but the usable vocabulary is highly limited in scope to certain specific commands.
2.      Large Vocabulary/ Small User Base: Suited for environments where a small group of people is involved. It requires more rigorous training for that particular user group, however, and gives erroneous results for anyone outside that group.
Fig. 3: Typical Block Diagram Showing Working of Speech Recognition System
 
The current methods rely on mathematically analyzing digitized sound waves and their spectral properties. The sound waves spoken into the microphone are converted into a digital signal (typically sampled at 16 kHz) through sampling and quantization, following the Nyquist-Shannon sampling theorem, which, simply put, requires at least one sample to be collected for each compression and rarefaction; that is, the sampling frequency must be at least twice the highest frequency component in the signal. The speech recognition program then applies various algorithms and models to account for variation and to compress the raw speech signal into features that simplify processing. This initial compression may be achieved through many methods, including Fourier transforms, Perceptual Linear Prediction, Linear Predictive Coding and Mel-Frequency Cepstral Coefficients.
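To make this front end concrete, here is a minimal Python sketch (not taken from any particular toolkit): it assumes a 16 kHz signal, slices it into short frames, windows each frame and takes its log power spectrum with an FFT. The 25 ms frame length, 10 ms hop and the synthetic test tone are illustrative assumptions; a real system would go on to apply a mel filterbank and DCT to obtain MFCCs.

```python
import numpy as np

SAMPLE_RATE = 16_000          # Hz; twice the ~8 kHz band of interest (Nyquist)
FRAME_LEN   = 400             # 25 ms frames at 16 kHz
HOP_LEN     = 160             # 10 ms hop between successive frames

def log_power_spectrum(signal: np.ndarray) -> np.ndarray:
    """Return one row of log power spectrum per 25 ms frame."""
    frames = []
    for start in range(0, len(signal) - FRAME_LEN + 1, HOP_LEN):
        frame = signal[start:start + FRAME_LEN] * np.hamming(FRAME_LEN)
        spectrum = np.abs(np.fft.rfft(frame)) ** 2
        frames.append(np.log(spectrum + 1e-10))   # small offset avoids log(0)
    return np.array(frames)

# Example: one second of a synthetic 440 Hz tone standing in for recorded speech.
t = np.arange(SAMPLE_RATE) / SAMPLE_RATE
features = log_power_spectrum(np.sin(2 * np.pi * 440 * t))
print(features.shape)         # (number of frames, FRAME_LEN // 2 + 1)
```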
 
Fig. 4: Graph Showing Segmentation and Labelling for Word Sequence Seven-Six in Speech Recognition
 
There are four common approaches around which speech recognition is built:
1.      Template Based: Predefined templates or samples are created and stored. Whenever a user utters a word, it is correlated with all the templates, and the one with the highest correlation is selected as the spoken word. This approach is not flexible enough to handle variations in voice patterns. Dynamic Time Warping may be considered one of these techniques (a minimal correlation sketch follows this list).
2.      Knowledge based: These analyze spectrograms of the voice to collect data and derive rules indicative of the uttered command. They do not use a language knowledge base or model speech variations, and are generally used for command-based systems.
3.      Stochastic: Speech, being a highly random phenomenon, can be treated as a piecewise stationary process to which stochastic models can be applied. As stated earlier, this is one of the most popular approaches used by commercial programs; Hidden Markov Models are an example of stochastic methods.
4.      Connectionist: Artificial neural networks are used to store and extract various coefficients from the speech data across multilayered structures, with various neural nets used to deduce the spoken word.
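As a rough illustration of the template-based idea in item 1, the Python sketch below (with made-up feature vectors and a hypothetical recognise helper, not from the article) picks the stored template whose features have the highest normalised correlation with the utterance. Real template matchers compare whole feature sequences, typically via Dynamic Time Warping, which is covered further down.

```python
import numpy as np

def cosine_similarity(a, b):
    """Normalised correlation between two feature vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def recognise(utterance, templates):
    """Return the template word whose features correlate best with the utterance."""
    return max(templates, key=lambda word: cosine_similarity(utterance, templates[word]))

# Toy example with made-up 4-dimensional "feature vectors".
templates = {"yes": np.array([1.0, 0.2, 0.1, 0.0]),
             "no":  np.array([0.1, 0.9, 0.8, 0.1])}
print(recognise(np.array([0.9, 0.3, 0.2, 0.1]), templates))   # -> "yes"
```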
 
Performance is generally measured in terms of accuracy and speed. The common scales are the single word error rate, which captures the misrecognition of individual words in a spoken sentence, and the command success rate, which measures the accurate interpretation of spoken commands. Different methods give varying results, which further depend on various external factors.
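Word error rate is conventionally computed as an edit distance between the recognised and reference word sequences. The short Python sketch below illustrates that calculation; the function name and the "wreck a nice beach" example are only for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance (substitutions + insertions + deletions) over reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Classic dynamic-programming edit-distance table.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("recognize speech", "wreck a nice beach"))  # 4 edits / 2 words = 2.0
```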
 

Different Models

Dynamic Time Warping
Dynamic Time Warping (DTW) is one of the oldest and most important methods of speech recognition. The underlying philosophy is template matching. It too needs pre-processing steps to align timing and other constraints with the template before the real comparison is made, and it belongs to the general class of algorithms known as dynamic programming. Variations in speaking speed and pauses are accounted for. The basic form of DTW finds the optimal alignment path through a grid of (often Euclidean) frame-to-frame distances between the reference template and the speech segment; the template with the lowest accumulated score is taken as the best match. Many variations of the algorithm exist.
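A minimal Python sketch of the basic DTW recursion is shown below, assuming each utterance has already been reduced to a sequence of feature vectors (for instance one MFCC vector per frame). Practical implementations add path and slope constraints, but the core idea is the same: accumulate local distances along the cheapest alignment path.

```python
import numpy as np

def dtw_distance(a: np.ndarray, b: np.ndarray) -> float:
    """a, b: arrays of shape (frames, features). Returns the alignment cost."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])   # local Euclidean frame distance
            cost[i, j] = d + min(cost[i - 1, j],      # stretch the template
                                 cost[i, j - 1],      # stretch the utterance
                                 cost[i - 1, j - 1])  # advance both together
    return float(cost[n, m])

# The utterance is assigned to whichever stored template gives the lowest score.
```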
 
Hidden Markov Model
Fig. 5: Statistical Models of a Hidden Markov Model in Speech Recognition
 
These are statistical models in which phonemes are treated as links in a chain, and the completed chain comprises a word. The chain branches like a tree in various directions according to the possibilities of different word formations, and best-path algorithms such as the Viterbi algorithm form the 'most likely' estimate of what the next phoneme might be. The number of possibilities grows very rapidly as the vocabulary increases; for example, a 60,000-word dictionary can yield on the order of 216 trillion possibilities, which makes the process extremely computation-intensive. The approach uses language and acoustic models developed during training, which are available worldwide from organizations such as NIST and the Linguistic Data Consortium, and also in the form of toolkits.
 
HMMs are like finite-state models in which each state has a statistical distribution over the sounds likely to be heard in that segment of speech, expressing the likelihood of their occurrence. These distributions may be modeled with simple shapes such as Gaussians, whose parameters are then estimated. Each phoneme has its own distribution. The chain is formed by concatenating the individual HMMs for separate words. Cepstral normalization may be used to counter speaker diversity, vocal tract length normalization to handle male and female voices, and maximum likelihood linear regression for general speaker adaptation.
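The Python sketch below illustrates the Viterbi search over such a chain, assuming the per-frame log emission probabilities have already been computed from the Gaussian models described above. The toy sizes and probabilities are placeholders, not taken from any particular recogniser.

```python
import numpy as np

def viterbi(log_start, log_trans, log_emit):
    """log_start: (S,), log_trans: (S, S), log_emit: (T, S) log-probabilities.
    Returns the most likely state sequence for the T observed frames."""
    T, S = log_emit.shape
    score = np.zeros((T, S))
    back = np.zeros((T, S), dtype=int)
    score[0] = log_start + log_emit[0]
    for t in range(1, T):
        step = score[t - 1][:, None] + log_trans        # (S, S): previous -> next state
        back[t] = np.argmax(step, axis=0)               # best predecessor for each state
        score[t] = step[back[t], np.arange(S)] + log_emit[t]
    # Trace back the best path from the highest-scoring final state.
    path = [int(np.argmax(score[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]

# Toy 2-state example with 3 frames of precomputed log-emission scores.
log_start = np.log([0.6, 0.4])
log_trans = np.log([[0.7, 0.3], [0.4, 0.6]])
log_emit  = np.log([[0.9, 0.1], [0.2, 0.8], [0.3, 0.7]])
print(viterbi(log_start, log_trans, log_emit))   # -> [0, 1, 1]
```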
 

Connectionism

After much success in classifying speech segments as voiced/unvoiced or nasal/plosive, researchers moved on to phoneme classification, which achieved very competitive results. There are two neural-network approaches to the speech recognition problem: static and dynamic.
 
In the static approach, the whole voice segment is considered at once and a decision is made. Inputs are applied to multilayer perceptrons with hidden units, which also function as feature detectors and help classify important classes of sound such as vowels and consonants with high accuracy. The classification decision is then taken as the output.
Fig. 6: Dynamic Approach (Time Delay Neural Network) to the Speech Recognition Problem
 
In the dynamic approach, methods such as Time Delay Neural Networks (TDNN) and recurrent neural networks are used. Here the network makes a local decision over a small window of frames, in contrast to static methods where the complete segment is used, and the local decisions are then integrated into a global one. Where the static approach gives good results in phoneme classification, the dynamic approach fares better with words and sentences.
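The Python sketch below illustrates the time-delay idea behind a TDNN: the same weights are applied to a small sliding window of frames, producing the local decisions mentioned above. The layer size, context width and random weights are illustrative placeholders; a real network would learn the weights and stack several such layers before integrating the outputs over time.

```python
import numpy as np

rng = np.random.default_rng(0)

def time_delay_layer(frames: np.ndarray, context: int = 2, units: int = 32) -> np.ndarray:
    """frames: (T, F) feature matrix. Returns (T - 2*context, units) activations."""
    T, F = frames.shape
    # Placeholder weights; a trained TDNN would learn these from data.
    W = rng.standard_normal((units, (2 * context + 1) * F)) * 0.01
    b = np.zeros(units)
    outputs = []
    for t in range(context, T - context):
        window = frames[t - context:t + context + 1].ravel()   # local window in time
        outputs.append(np.tanh(W @ window + b))
    return np.array(outputs)

# Toy run on 100 frames of 13-dimensional features.
print(time_delay_layer(rng.standard_normal((100, 13))).shape)   # (96, 32)
```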

Current Scenario

Speech recognition was initially intended to do the work of a medical transcriptionist. Unsurprisingly, that was not possible at the time given the infrastructure and the limited state of the technology, but the field now seems to be gathering steam among developers again, especially in the military. Various militaries are putting great effort not only into improving the technology for medical purposes but also into gaining a tactical edge in combat machinery. Fighter jet cockpits are being fitted with SR devices that help the pilot perform various non-critical tasks through vocal commands, and performance under high g-forces is being tested and improved.
 
The US F-16, the French Mirage, the UK's Eurofighter Typhoon and the Swedish Gripen are all examples of aircraft where such technology is being deployed. In helicopters, the pilot seldom wears a face mask, which means more background noise and makes it harder for the SR system to interpret commands, an additional challenge in building a robust system. Air traffic control trainers are employing such techniques to replace the actual pilot who would otherwise have to interact with ATC trainees, reducing the workforce required for such tasks. Microsoft's Tellme and Yahoo's oneSearch have been steadily providing improved voice search capabilities in some parts of the world, and IVR systems and desktops with SR capabilities are constantly evolving.
 
Modern vehicles are being fitted with SR systems to provide enhanced accessibility. Ibn Sina, a multilingual talking humanoid, is being developed in an advanced research lab in the UAE. The horizons widen with each improvement, and the day may soon come when a tin box actually holds a heart-to-heart conversation with you.

 
