Dynamic Time Warping
Dynamic time warping (DTW) is one of the oldest and most important methods of speech recognition. Its underlying philosophy is template matching: an incoming speech segment is compared against stored reference templates. It too requires pre-processing steps, so that the input satisfies the timing and other constraints of the template before the actual comparison is made. DTW belongs to the general class of algorithms known as dynamic programming. By stretching or compressing the time axis, it accounts for variations in speaking rate and for pauses. In its basic form, DTW finds the optimal (lowest-cost) path across a plane whose two axes are the reference template and the speech segment, where each point on the plane carries the local distance, typically Euclidean, between one template frame and one segment frame. The template with the lowest accumulated score is taken as the best match. Many variations of the algorithm exist.
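The basic form described above can be sketched in a few lines of Python. This is a minimal illustration, not a production recognizer: it assumes each frame is a single number (real systems compare feature vectors), uses the absolute difference as the local distance, and applies no path constraints. The names `dtw_distance`, `best_match`, and the toy templates are hypothetical.

```python
import math

def dtw_distance(template, segment):
    """Accumulated cost of the cheapest warping path between two sequences."""
    n, m = len(template), len(segment)
    # cost[i][j] = cheapest cost of aligning the first i template frames
    # with the first j segment frames.
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(template[i - 1] - segment[j - 1])  # assumed local distance
            cost[i][j] = local + min(
                cost[i - 1][j],      # template advances, segment frame repeats
                cost[i][j - 1],      # segment advances, template frame repeats
                cost[i - 1][j - 1],  # both advance together
            )
    return cost[n][m]

def best_match(templates, segment):
    """Return the name of the template with the lowest DTW score."""
    return min(templates, key=lambda name: dtw_distance(templates[name], segment))

# Toy data (made up for illustration): the segment is a slower "yes".
templates = {"yes": [1, 3, 4, 3, 1], "no": [2, 2, 1, 0, 0]}
segment = [1, 1, 3, 4, 4, 3, 1]
print(best_match(templates, segment))
```

Because repeated frames can be absorbed by the warping path, the slowed-down segment still aligns with the "yes" template at zero cost, which is how differences in speaking rate are tolerated.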
Hidden Markov Model
Hidden Markov models (HMMs) are statistical models in which phonemes are treated as links in a chain, and the completed chain forms a word. The chain branches like a tree in various directions according to the different words that could be formed, and best-path algorithms such as the Viterbi algorithm produce the most likely estimate of what the next phoneme might be. The number of possibilities grows very rapidly as the vocabulary grows; for example, with a 60,000-word dictionary, a sequence of three words can be any of 216 trillion (60,000^3) possibilities. This makes the process extremely computation-intensive. The recognizer relies on language and acoustic models developed during training; such models are available worldwide from organizations like NIST and the Linguistic Data Consortium, and also in the form of toolkits.
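To make the best-path idea concrete, here is a minimal sketch of the Viterbi algorithm over a toy three-phoneme chain. Everything in the model is an assumption made up for illustration: the states `k`, `ae`, `t`, the discrete acoustic labels `K`, `A`, `T`, and all probabilities. Real recognizers work with continuous features and far larger state spaces.

```python
import math

def safe_log(p):
    """Log probability, mapping zero probability to negative infinity."""
    return math.log(p) if p > 0 else -math.inf

def viterbi(observations, states, log_start, log_trans, log_emit):
    """Return the most likely state path and its log probability."""
    # best[t][s] = log probability of the best path that ends in state s at time t
    best = [{s: log_start[s] + log_emit[s][observations[0]] for s in states}]
    back = [{}]
    for obs in observations[1:]:
        scores, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda p: best[-1][p] + log_trans[p][s])
            scores[s] = best[-1][prev] + log_trans[prev][s] + log_emit[s][obs]
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    # Trace the back-pointers from the best final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for pointers in reversed(back[1:]):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, best[-1][last]

# Hypothetical left-to-right model for the word "cat".
states = ["k", "ae", "t"]
start = {"k": 1.0, "ae": 0.0, "t": 0.0}
trans = {
    "k": {"k": 0.5, "ae": 0.5, "t": 0.0},
    "ae": {"k": 0.0, "ae": 0.5, "t": 0.5},
    "t": {"k": 0.0, "ae": 0.0, "t": 1.0},
}
emit = {
    "k": {"K": 0.8, "A": 0.1, "T": 0.1},
    "ae": {"K": 0.1, "A": 0.8, "T": 0.1},
    "t": {"K": 0.1, "A": 0.1, "T": 0.8},
}
log_start = {s: safe_log(p) for s, p in start.items()}
log_trans = {s: {t: safe_log(p) for t, p in row.items()} for s, row in trans.items()}
log_emit = {s: {o: safe_log(p) for o, p in row.items()} for s, row in emit.items()}

path, log_prob = viterbi(["K", "K", "A", "T", "T"], states, log_start, log_trans, log_emit)
print(path, math.exp(log_prob))
```

Working in log probabilities avoids numerical underflow on long utterances, and the dynamic-programming table keeps the cost proportional to the number of frames times the square of the number of states rather than enumerating every possible chain.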
HMMs resemble finite-state models in which each state has a statistical distribution over the sounds likely to be heard in that segment of speech, thus expressing the likelihood of their occurrence. These distributions may be modeled with simple shapes such as Gaussian surfaces, whose parameters are then estimated from training data. Each phoneme has its own, different distribution. The chain for an utterance is formed by concatenating the individual HMMs of the separate phonemes and words. Cepstral normalization may be used to counter the diversity of speakers, vocal tract length normalization to compensate for male-female voice differences, and maximum likelihood linear regression for general speaker adaptation.
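As a rough illustration of a per-state Gaussian distribution, the sketch below estimates the parameters (mean and variance) from a few training frames and then scores a new frame against two states. It assumes a single Gaussian with diagonal covariance per state and two-dimensional features; the function names and all numbers are hypothetical, and practical systems generally use richer models.

```python
import math

def fit_gaussian(frames):
    """Estimate per-dimension mean and variance from a list of feature frames."""
    n = len(frames)
    dims = len(frames[0])
    mean = [sum(f[d] for f in frames) / n for d in range(dims)]
    # A small floor keeps the variance from collapsing to zero on tiny samples.
    var = [
        max(sum((f[d] - mean[d]) ** 2 for f in frames) / n, 1e-6)
        for d in range(dims)
    ]
    return mean, var

def log_likelihood(frame, mean, var):
    """Log density of one frame under a diagonal-covariance Gaussian."""
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (x - m) ** 2 / v)
        for x, m, v in zip(frame, mean, var)
    )

# Made-up training frames for two phoneme states.
state_a = fit_gaussian([[1.0, 2.0], [1.2, 1.8], [0.8, 2.2]])
state_i = fit_gaussian([[3.0, 0.5], [3.2, 0.4], [2.8, 0.6]])

frame = [1.1, 2.0]
score_a = log_likelihood(frame, *state_a)
score_i = log_likelihood(frame, *state_i)
print("a" if score_a > score_i else "i")
```

Scores of this kind play the role of the emission term when a best-path search runs over the chain of states.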