Speech, as we know, is the primary avenue of communication between people. But neurodegenerative diseases, strokes, and brain injuries can rob people of the ability to speak, leaving patients unable to communicate even though their brains’ speech centers remain intact. Fortunately, neuroscientists have developed synthetic speech algorithms that generate natural-sounding speech by using brain activity to control an anatomically detailed computer simulation of a virtual vocal tract, including the movements of the lips, jaw, tongue, and larynx.
Speech synthesis has emerged as an exciting new frontier for the brain-computer interface (BCI). Computer simulation of human speech has long served basic tasks such as converting text to audio, music generation, voice-enabled services, navigation systems, and accessibility for visually impaired people. Researchers now predict it can go much further: people who have lost the ability to speak due to spinal cord injury, locked-in syndrome, ALS, or another paralyzing condition may well get their voices back. New studies center on decoding electrical activity in the brain to synthesize speech from the sound representations embedded in the brain’s cortical areas. The aim is to bring together expertise from neuroscience, linguistics, and machine learning to help patients who have been deprived of speech by paralysis or other forms of brain damage. Just as people with paralyzed limbs can control robotic limbs with their brains, people with speech disabilities may be able to speak again using a brain-controlled artificial vocal tract.
Speech decoding then and now
Rapid advances in speech synthesis mark significant breakthroughs over past efforts, which focused on harnessing brain activity to let patients spell out words one letter at a time. Earlier devices allowed people with severe speech disabilities to spell out their thoughts letter by letter using trackers of tiny eye or facial muscle movements. While these technologies proved useful, communicating with them was very time consuming.
Producing text or synthesized speech with such devices was arduous, error-prone, and painfully slow, typically yielding around 10 words per minute, compared with the 100-150 words per minute of natural speech.
To overcome these limitations, researchers have developed a BCI that can translate activity in the speech centers of the brain into natural-sounding speech. A study conducted by UC San Francisco neuroscientists drew on data from patients who were being monitored for epileptic seizures, with stamp-size arrays of electrodes placed directly on the surfaces of their brains. The experiment marks the latest step in a rapidly developing effort to map the brain and engineer methods of decoding its activity.
How the BCI works in speech synthesis
The BCI has the potential to help speech-impaired patients to “speak” and to serve as a stepping stone toward neural speech prostheses. The device monitors the user’s brain activity and converts it into natural-sounding speech via a virtual vocal tract. Though the way the speech centers coordinate the movements of the vocal tract is complicated, the system is geared toward creating a synthesized version of a person’s voice controlled by the activity of their brain’s speech centers.
The electrodes monitor slight fluctuations in the brain’s voltage, which computer models learn to correlate with the patient’s speech. This mapping is achieved by connecting brain activity to a detailed simulation of a vocal tract, a setup that builds on recent studies of how the brain’s speech centers encode the movements of the lips, tongue, and jaw.
How the virtual vocal tract leads to naturalistic speech synthesis
The human brain’s speech centers choreograph the movements of the lips, jaw, tongue, and other vocal tract components to produce fluent speech. Because these centers encode movements rather than sounds, researchers decode the signals the same way. Building on linguistic principles, they reverse engineer the vocal tract movements needed to produce each sound: pressing the lips together, tightening the vocal cords, shifting the tip of the tongue to the roof of the mouth and then relaxing it, and so on.
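The idea of a sound-to-movement inventory can be sketched in a few lines. The phoneme symbols and gesture names below are illustrative assumptions, not the researchers’ actual feature set; the point is only that each sound expands into an ordered plan of vocal-tract movements.

```python
# Hypothetical phoneme-to-gesture inventory (invented for illustration).
ARTICULATORY_GESTURES = {
    "b": ["press lips together", "release with voicing"],
    "t": ["raise tongue tip to alveolar ridge", "release without voicing"],
    "a": ["lower jaw", "relax tongue", "vibrate vocal cords"],
}

def gestures_for_word(phonemes):
    """Expand a phoneme sequence into an ordered list of vocal-tract movements."""
    plan = []
    for p in phonemes:
        plan.extend(ARTICULATORY_GESTURES.get(p, ["<unknown gesture for %r>" % p]))
    return plan

plan = gestures_for_word(["b", "a", "t"])  # the movement plan for "bat"
```

A real system would learn this mapping from data rather than look it up in a table, but the decoded output has the same shape: a time-ordered sequence of articulator movements, not sounds.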
This anatomical mapping of sound lets scientists create an authentic virtual vocal tract for every participant. Controlled by the user’s brain activity, it consists of two neural network machine learning algorithms: a decoder that transforms brain activity patterns produced during speech into movements of the virtual vocal tract, and a synthesizer that transforms these vocal tract movements into a synthetic approximation of the participant’s voice.
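The two-stage data flow can be sketched as follows. The real decoder and synthesizer are trained recurrent neural networks; here, random linear maps with invented dimensions stand in for learned weights, purely to show that brain activity passes through an intermediate kinematic representation before becoming sound features.

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented sizes: electrode channels, articulator parameters, acoustic features.
N_ELECTRODES, N_ARTICULATORS, N_ACOUSTIC = 256, 33, 32
W_decode = rng.normal(size=(N_ELECTRODES, N_ARTICULATORS))  # stand-in for decoder
W_synth = rng.normal(size=(N_ARTICULATORS, N_ACOUSTIC))     # stand-in for synthesizer

def synthesize(neural_activity):
    """Stage 1: neural activity -> vocal-tract movements.
    Stage 2: movements -> acoustic features of the voice."""
    movements = np.tanh(neural_activity @ W_decode)  # decoded articulator kinematics
    acoustics = movements @ W_synth                  # synthesized sound features
    return movements, acoustics

activity = rng.normal(size=(100, N_ELECTRODES))      # 100 timesteps of recordings
movements, acoustics = synthesize(activity)
```

The design choice worth noting is the explicit intermediate: decoding to movements first, then to sound, is what distinguishes this approach from decoding audio directly from brain activity.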
The near-natural speech generated by these algorithms is significantly superior to synthetic speech decoded directly from participants’ brain activity without a simulation of the speaker’s vocal tract. Notably, the algorithms can produce sentences that hundreds of human listeners could understand in crowdsourced transcription tests conducted on web platforms.
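Intelligibility in transcription tests of this kind is commonly scored with word error rate: the minimum number of word insertions, deletions, and substitutions needed to turn the listener’s transcript into the reference sentence, normalized by its length. This is a generic scoring sketch, not the study’s exact protocol.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance, normalized by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

wer = word_error_rate("the cat sat on the mat", "the cat sat in the mat")  # 1/6
```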
Different systems of speech synthesis
Speech synthesis, the automatic generation of speech waveforms, has been under development for several decades. Neuroscientist Frank Guenther of Boston University developed the first speech BCI in 2007. Electrodes implanted in the brain of a man with locked-in syndrome eavesdropped on the motor cortex’s intention to speak, using signals corresponding to the movements of the tongue, lips, larynx, jaw, and cheeks that would produce particular phonemes, though the study did not get beyond vowels.
Recent progress in speech synthesis has produced synthesizers with very high intelligibility though the sound quality and naturalness remain challenging propositions.
The application field of synthetic speech is expanding fast, increasing possibilities for people with communication difficulties. Synthesized speech gives the vocally handicapped a way to communicate with people who do not understand sign language, and tools such as HAMLET (Helpful Automatic Machine for Language and Emotional Talk) let them convey emotions such as happiness, sadness, urgency, or friendliness by voice.
Most new approaches to speech synthesis involve deep learning. WaveNet is a neural network that produces raw audio closely resembling a human voice. The model is trained on voice samples, learning to predict each audio sample from the ones that precede it. It has been evaluated on multi-speaker speech generation, text-to-speech, and music audio modeling, with quality measured by MOS (Mean Opinion Score), a standard subjective rating of voice quality.
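WaveNet’s sample-by-sample prediction rests on causal dilated convolutions: the output at time t may only see inputs at t and earlier, and dilation spaces the taps apart so stacked layers cover a long audio history cheaply. The sketch below shows just that building block, with WaveNet’s gating, residual connections, and training all omitted.

```python
import numpy as np

def causal_dilated_conv(x, weights, dilation):
    """1-D causal convolution: output at time t depends only on
    x[t], x[t - dilation], x[t - 2*dilation], ... (never on the future)."""
    k = len(weights)
    pad = (k - 1) * dilation
    xp = np.concatenate([np.zeros(pad), x])  # left-pad so no future samples leak in
    return np.array([
        sum(weights[i] * xp[t + pad - i * dilation] for i in range(k))
        for t in range(len(x))
    ])

x = np.array([1.0, 2.0, 3.0, 4.0])
# With two taps and dilation 1, each output is x[t] + x[t-1].
y = causal_dilated_conv(x, weights=[1.0, 1.0], dilation=1)  # -> [1, 3, 5, 7]
```

Doubling the dilation at each layer (1, 2, 4, 8, ...) is what lets a real WaveNet condition each predicted sample on thousands of past samples with only a handful of layers.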
Another speech synthesis model is Tacotron, which synthesizes speech directly from paired text and audio, making it very adaptable to new datasets. The model comprises an encoder, an attention-based decoder, and a post-processing net.
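The attention-based decoder works by scoring every encoder state against the decoder’s current query, softmaxing the scores into weights, and summing the states into a context vector. Tacotron itself uses a more elaborate additive attention, but the weight-and-sum pattern, sketched here with plain dot-product scoring and toy values, is the same.

```python
import numpy as np

def attention_step(query, encoder_states):
    """One decoder step: score each encoder state, softmax into weights,
    and return the weighted-sum context vector."""
    scores = encoder_states @ query              # one score per encoder timestep
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                     # softmax over timesteps
    context = weights @ encoder_states           # weighted sum of encoder states
    return context, weights

enc = np.eye(3)                       # 3 encoder timesteps, 3-dim states (toy)
query = np.array([10.0, 0.0, 0.0])    # query aligned with the first timestep
context, weights = attention_step(query, enc)
```

Because the query matches the first encoder state, nearly all attention weight lands on timestep 0; over a whole utterance these weights trace the alignment between input characters and output audio frames.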
Tacotron 2, an advanced neural network architecture for speech synthesis directly from text, combines the best features of Tacotron and WaveNet.
Deep Voice 1 is a text-to-speech system built from deep neural networks. It synthesizes audio by combining the outputs of grapheme-to-phoneme, phoneme duration, and fundamental frequency prediction models.
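How those three models compose can be shown with placeholders. Each trained network is replaced by a hand-written lookup (the lexicon, durations, and pitch values below are invented), just to make the staging concrete: text becomes phonemes, each phoneme gets a duration and a pitch, and the resulting parameters drive the audio synthesis model.

```python
# Invented stand-ins for Deep Voice 1's three prediction models.
G2P = {"hello": ["HH", "AH", "L", "OW"]}                  # grapheme-to-phoneme
DURATION_MS = {"HH": 60, "AH": 80, "L": 70, "OW": 120}    # phoneme duration
F0_HZ = {"HH": 0, "AH": 140, "L": 130, "OW": 120}         # fundamental frequency
                                                          # (0 = unvoiced)

def front_end(word):
    """Combine the three predictions into per-phoneme synthesis parameters."""
    phonemes = G2P[word]
    return [(p, DURATION_MS[p], F0_HZ[p]) for p in phonemes]

params = front_end("hello")  # [(phoneme, duration_ms, f0_hz), ...]
```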
Deep Voice 2 is a multi-speaker method that augments neural text-to-speech with low-dimensional trainable speaker embeddings, producing various voices from a single model. It represents a significant improvement in audio quality over Deep Voice 1 and can learn hundreds of unique voices from less than half an hour of data per speaker.
Then we have Deep Voice 3, which introduces a fully convolutional attention-based neural text-to-speech (TTS) system. Its fully convolutional character-to-spectrogram architecture enables fully parallel computation and can transform textual features such as characters, phonemes, and stresses into a variety of vocoder parameters.
Facebook AI Research has developed voice fitting and synthesis via a phonological loop. The company’s “thought to typing” BCI research aims to develop a silent speech interface that would let you produce text five times faster than typing, or 100 words per minute. The company is studying whether high-quality neural signals detected noninvasively can be accurately decoded into phonemes. A future step could be to feed the signals into a database that pairs phoneme sequences with words, then use language-specific probability data to predict which words the signals most likely mean (similar to auto-fill in Gmail). Its VoiceLoop model relies on a memory buffer instead of conventional RNNs; the memory is shared among all computations, which are carried out by shallow, fully connected networks.
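The memory buffer that replaces the RNN state can be sketched as a fixed-size shift register: at every step the oldest entry is discarded, the rest shift by one slot, and a new representation is written into slot 0. The slot count and vector size below are illustrative; in the real model the new vector is computed by shallow fully connected networks rather than passed in directly.

```python
import numpy as np

def update_buffer(buffer, new_vector):
    """VoiceLoop-style buffer update: drop the oldest entry, shift the rest,
    and write the newest representation into slot 0."""
    shifted = np.roll(buffer, 1, axis=0)  # slot i moves to slot i+1 (last wraps)
    shifted[0] = new_vector               # overwrite the wrapped slot with new data
    return shifted

buf = np.zeros((5, 4))                    # 5 memory slots, 4-dim representations
buf = update_buffer(buf, np.ones(4))      # step 1
buf = update_buffer(buf, np.full(4, 2.0)) # step 2: newest entry now in slot 0
```

Keeping the whole recent history explicitly addressable, instead of compressing it into a single recurrent state, is what lets the model get away with shallow networks.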
Significantly, researchers all over the world are busy experimenting with higher-density electrode arrays and more advanced machine learning algorithms. These activities are expected to improve the synthesized speech even further.
Efforts are underway to not only restore fluent communication to individuals with a severe speech disability but also reproduce some of the musicality of the human voice that could express the speaker’s emotions and personality.
New anatomically based systems have the advantage of decoding (articulating words from brain signals) new sentences from participants’ brain activity nearly as well as the sentences the algorithm was trained on. In one instance, even when a participant simply mouthed sentences without sound, the system was still able to generate comprehensible synthetic versions of the mimed sentences in the speaker’s voice.
It has also been found that the neural code for vocal movements partially overlaps across participants, and that one research subject’s vocal tract simulation can respond to neural instructions recorded from another participant’s brain. Such findings suggest that individuals with speech loss due to neurological impairment may be able to learn to control a speech prosthesis modeled on the voice of someone with intact speech. Not surprisingly, neuroscientists are collaborating with electrical engineers to develop a system of implants, decoders, and speech synthesizers that would decipher a person’s intended words, as encoded in brain signals, and convert them into audible speech.