Advances in AI technology for voice recognition and synthesis over the last 18 months are poised to revolutionize how we interact with machines and each other.
What’s New
Revolutionary changes in Artificial Intelligence (AI), Machine Learning (ML), processors, and architecture are enabling a new wave of capabilities. What voice recognition and speech synthesis have been promising for years is finally a reality. SARAH, the smart house from the TV show Eureka, recognizes the inhabitant’s moods, has a voice indistinguishable from a human’s, and anticipates needs. SARAH is possible with today’s technology.
NVIDIA Research on Conversational AI
In September 2021, NVIDIA researchers shared some of their latest work at the Interspeech 2021 conference. The most significant development NVIDIA shared was a breakthrough in speech synthesis that came from a change in perspective. Rather than focusing on the words themselves, NVIDIA researchers began training the AI to treat speech like music. Like speech, music has a flow, with changes in inflection, timbre, tone, and pacing. These qualities are what make speech feel natural, while so many generated voices sound monotone and rigid.
NVIDIA’s goal is to create Conversational AI, a voice assistant that can engage in human-like dialogue, capturing context and providing intelligent responses, like SARAH.
Cheaper and Better GPU Processors
Where a Central Processing Unit (CPU) may have four to sixteen cores on which separate operations can run in parallel, a Graphics Processing Unit (GPU) may contain thousands of cores. Leveraging this massively parallel computing power is what allows games like Fortnite to render such engaging 3D environments on the fly. Although GPUs were originally created to accelerate the rendering of 3D graphics for video and gaming, their impressive processing power attracted the attention of scientists and engineers working on AI.
Today’s GPUs are more programmable than ever before, allowing a broad range of applications that go beyond traditional graphics rendering (Intel).
AI and ML can leverage parallel processing the same way graphics rendering does: both break the work down into many small, independent computations and run as many of them as possible at the same time.
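As a rough illustration (not NVIDIA’s code), the sketch below uses NumPy to show how this kind of workload decomposes into many independent element-wise operations; on a GPU library such as CuPy or PyTorch, the same vectorized expression would fan out across thousands of cores instead of a handful.

```python
import numpy as np

# A toy "inference" step: apply the same math to a million samples.
# Each element's result is independent of the others, so the work can
# be split across as many cores as the hardware offers.
samples = np.random.rand(1_000_000).astype(np.float32)

# Serial view: one core walks the array element by element.
serial = np.empty_like(samples)
for i in range(len(samples)):
    serial[i] = samples[i] * 2.0 + 1.0

# Parallel view: the whole array is handed off as one vectorized
# operation. On a GPU framework this single line is spread across
# thousands of cores.
parallel = samples * 2.0 + 1.0

assert np.allclose(serial, parallel)
```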
The availability of GPU processing is increasing at the same time that the cost is dropping. GPU-dedicated servers designed for AI and ML are now available from all the major cloud providers for less than $1 per hour. That’s not bad, considering that processing a typical voice snippet takes about 150ms.
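For a sense of scale, here is a back-of-the-envelope calculation using those two figures; it ignores idle time, batching, and network overhead, so treat it as a rough upper bound on cost per snippet.

```python
# Rough cost per voice snippet, using the figures above:
# a GPU server at under $1/hour and ~150 ms of processing per snippet.
server_cost_per_hour = 1.00          # USD, upper bound from cloud pricing
processing_time_per_snippet = 0.150  # seconds

snippets_per_hour = 3600 / processing_time_per_snippet   # 24,000 snippets
cost_per_snippet = server_cost_per_hour / snippets_per_hour

print(f"{snippets_per_hour:,.0f} snippets/hour, "
      f"~${cost_per_snippet:.6f} per snippet")           # ~$0.000042 each
```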
Streamlined Architecture That Skips Text
The typical design for conversational AI has three parts:
- Voice recognition receives the audio and converts the speech to text.
- Natural Language Understanding (NLU) receives the text as input, works to understand the speaker’s intent, and formulates a proper response as text.
- Speech synthesis takes the text from the NLU and converts it to a voice.
Old Flow: Automatic Speech Recognition > Text > Natural Language Processing > Text > Speech Synthesis
The latest developments in conversational AI architecture eliminate the text. What the speaker said is passed from voice recognition to the NLU without ever being converted to written text. Once the NLU completes its analysis and formulates a response, that response is delivered directly as voice, with no separate text-to-speech step. Eliminating the text conversions decreases the time needed for the response.
New Flow: Automatic Speech Recognition > Natural Language Processing > Speech Synthesis
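A minimal sketch of the two flows makes the difference concrete. The function names below are hypothetical stand-ins for the real models, stubbed out so the structural change is visible.

```python
# Hypothetical stand-ins for the real models; each stage is a stub
# so the difference between the two flows is easy to see.

def speech_to_text(audio: bytes) -> str:
    return "<transcript of audio>"          # ASR produces written text

def text_to_speech(text: str) -> bytes:
    return b"<synthesized audio>"           # synthesis reads text aloud

def understand_and_respond(message):
    return message                          # NLU: intent -> response

def encode_speech(audio: bytes) -> list:
    return [0.1, 0.2, 0.3]                  # acoustic/semantic features

def synthesize_speech(features: list) -> bytes:
    return b"<synthesized audio>"           # voice generated directly


def old_pipeline(audio_in: bytes) -> bytes:
    """Classic flow: every hop passes through text."""
    text_in = speech_to_text(audio_in)
    text_out = understand_and_respond(text_in)
    return text_to_speech(text_out)


def new_pipeline(audio_in: bytes) -> bytes:
    """Streamlined flow: no intermediate transcripts, fewer conversions."""
    features = encode_speech(audio_in)
    response = understand_and_respond(features)
    return synthesize_speech(response)
```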
Conversational AI is incredibly sensitive to processing time. The average pause between a comment and a response in a conversation between two humans is only about 300ms. In conversational AI, that means there are only 300ms to transmit the speaker’s voice to the machine, for the machine to process a response, and to transmit that response back to the speaker. In this cycle, every millisecond counts.
Generative Model Codecs
A codec is an algorithm that compresses an audio message to be as small as possible for transmission. On a digital network such as the internet, the message is divided into packets, and each packet is transmitted separately. The larger the message, the more packets it takes, and the more likely it is that a packet will be delayed or lost along the way.
The latest codecs now coming into use are based on a generative model. Most codecs compress a message by leaving out something that can be restored on the other end, such as silent spaces between words. A generative codec takes this to the extreme: it reduces the message to a compact mathematical description, and a predictive AI on the receiving end uses that description to reconstruct how the message sounded. The first of these to come into widespread use, and perhaps a future standard, is Lyra, developed by Google.
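The sketch below is a conceptual outline only, not the actual Lyra API: the encoder sends a small feature description instead of the waveform, and a generative decoder on the far end regenerates audio from that description. The frame size, feature count, and "synthesis" step are all placeholders.

```python
import numpy as np

FRAME = 320          # e.g., 20 ms of 16 kHz audio
N_FEATURES = 16      # compact description per frame (illustrative number)

def encode_frame(frame: np.ndarray) -> np.ndarray:
    """Reduce a frame of samples to a small feature vector (a stand-in
    for the learned features a codec like Lyra would extract)."""
    spectrum = np.abs(np.fft.rfft(frame))
    bands = np.array_split(spectrum, N_FEATURES)
    return np.array([band.mean() for band in bands])   # 320 samples -> 16 numbers

def decode_frame(features: np.ndarray) -> np.ndarray:
    """Stand-in for the generative decoder: a real codec feeds the features
    to a trained model that predicts a natural-sounding frame."""
    rng = np.random.default_rng(0)
    return rng.standard_normal(FRAME) * features.mean()  # placeholder synthesis

frame = np.sin(np.linspace(0, 20 * np.pi, FRAME)).astype(np.float32)
sent = encode_frame(frame)        # only 16 values cross the network
received = decode_frame(sent)     # audio is regenerated on the far end
print(f"compression: {FRAME} samples -> {sent.size} features")
```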
Historically, the lower the bitrate for an audio codec, the less intelligible and more robotic the voice signal becomes. Furthermore, while some people have access to a consistently high-quality, high-speed network, this level of connectivity isn’t universal, and even those in well-connected areas at times experience poor quality, low bandwidth, and congested network connections. Alejandro Luebs, Software Engineer, and Jamieson Brettle, Product Manager, Chrome
High-efficiency codecs, like Lyra, attempt to mitigate the inherent latency and normal disruption present on the internet. For conversational AI to be effective, every millisecond counts.
What to Look for From the Latest Technologies
With processing fast enough to work in real time as you speak and the ability to understand context, these cutting-edge technologies enable an experience much more like talking to a human.
When you speak to a virtual assistant like Siri or Alexa, it can usually understand what you are saying one sentence at a time, but you can’t carry on an actual conversation. The IVR at your bank that asks you to press one for checking can’t infer from your voice that you are frustrated, adjust its tone to be more sympathetic, or offer to cut the menu short and connect you to a live person. These are just a couple of the things the latest technology makes possible.
Multi-Turn Conversations
A conversation with another person has a back-and-forth that previous AIs could not manage. Those AIs can’t follow the flow of the conversation, recall what was said two sentences earlier, and understand what that means for the current sentence.
It turns out that, when we’re talking or writing or conveying anything in human language, we depend on background information a lot. It’s not just general facts about the world but things like how I’m feeling or how well defined something is. Diwank Tomer, CEO of Whitehead AI
Improvements in NLU will allow AIs to consider more context when formulating a response. This will make the back-and-forth, which is easy for humans, smoother and more natural for the AI.
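A minimal sketch of that idea, with an illustrative stand-in for the real NLU model: the assistant keeps a short history of recent turns so a follow-up question can be resolved against what came before.

```python
# Multi-turn context, illustrated: the assistant keeps recent turns so a
# follow-up like "What about Saturday?" is interpreted against the
# earlier question. The "model" here is just a placeholder.

history: list[str] = []

def respond(utterance: str) -> str:
    history.append(utterance)
    context = " | ".join(history[-5:])   # the last few turns inform the answer
    # A real NLU model would generate the reply from `context`;
    # here we simply show what it would see.
    return f"(answering with context: {context})"

print(respond("What's the weather tomorrow?"))
print(respond("What about Saturday?"))   # resolved against the prior turn
```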
Voice Skins
In some online games, players can purchase or win “skins”. A skin changes the appearance of a player’s avatar in the game. Real-time voice processing allows something similar for voices: for example, you will be able to speak in the voice of your favorite Avenger. You can sound like another person, but the voice carries your intonation, so it still sounds like a real person. The company Modulate is currently beta testing the technology in games and video conferencing, with paid actors supplying the voices. No celebrities yet.
Accent Neutralization
A consistent problem with some remote call centers is accents. It can be difficult for customers to understand call center agents working outside their native language, and it can be equally difficult for an agent to understand the customer. Understanding someone through a heavy accent can be a problem in many situations, such as international business meetings or conversations with a healthcare worker. With the latest voice processing technology, it is possible to soften or even remove an accent. It is still the speaker’s voice, but the accent is lessened. Accent neutralization can make understanding quicker and easier.
Voice Changers
Digital voice changers that make a speaker sound like a robot or an alien have been around for a long time. With the latest voice synthesis technology, however, a voice changer can change the quality of your voice while you still sound like a natural human speaker.
In a 2020 study, the Anti-Defamation League reported that 41% of female players and 37% of LGBTQ players of online multiplayer games had been harassed based on their gender or sexual orientation. Anna Donlon, the executive producer of the game Valorant, told Wired that she avoids voice chat while playing her own game because of sexist harassment. Trans players are often harassed because their voice does not match the gender of their avatar or profile.
Of course, everyone should be treated respectfully while playing a game, but until that happens, one solution is the latest voice-changing technology. It can change the gender presentation of a voice to whatever the speaker would like. A woman faced with harassment could use this technology to present as male online. A trans man could lower his higher-pitched voice to match his gender. It would still be the speaker’s voice, just different.
The same technology can be used to disguise a voice. Suppose a celebrity with a highly recognizable voice wanted to play an online game. The celebrity could change their voice in the game, and other players would neither recognize them nor be able to tell that the voice was not a natural human voice.
Voice Chat Moderation
Another solution to the harassment problem is voice chat moderation. Automated text chat moderation is standard in most competitive gaming, but voice chat is more complicated: it moves faster and is harder to interrupt in a timely way. An AI moderator listens in on all the chat sessions. Using the latest voice recognition and sentiment analysis, covered below, the AI moderator can step in and interrupt harassment while it is still happening. The AI can choose from a range of responses, from muting offensive comments or inserting a reminder of the rules, up to permanently expelling a player from the game.
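A rough sketch of that escalation logic is below. The function, score, and thresholds are purely illustrative, not taken from any real moderation product; the inputs would come from speech recognition plus the sentiment analysis described next.

```python
# Hypothetical escalation policy for an AI voice-chat moderator.

def moderate(utterance_text: str, toxicity_score: float, prior_offenses: int) -> str:
    """Return the action to take for one flagged utterance."""
    if toxicity_score < 0.5:
        return "no_action"
    if toxicity_score < 0.8 and prior_offenses == 0:
        return "warn_player"          # insert a reminder of the rules
    if prior_offenses < 3:
        return "mute_comment"         # suppress the offending audio
    return "expel_player"             # repeat offenders are removed

print(moderate("example flagged line", toxicity_score=0.9, prior_offenses=4))
```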
Sentiment Analysis
Sentiment analysis is an extension of NLU. An AI analyzes what is being said and makes judgments about the speaker’s mental state and the intent of their words. It bases these judgments on more than just the words used: it also considers tone of voice and pacing, and it may draw on the speaker’s profile and history of previous interactions.
In a remote call center, a virtual or live agent can use this information to adjust their responses to a customer and improve the experience. In the voice chat moderation described above, the AI would consider not just the words used but also the context of the conversation, a derisive tone, and any history of harassment to detect the speaker’s intent.
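A simplified sketch of how those signals might be combined follows; the feature names and weights are purely illustrative, not drawn from a trained model.

```python
from dataclasses import dataclass

@dataclass
class VoiceSignals:
    word_sentiment: float   # -1 (negative) .. +1 (positive), from the words alone
    pitch_variance: float   # raised, uneven pitch can signal agitation
    speech_rate: float      # words per second; rushed pacing can signal frustration
    past_complaints: int    # from the speaker's interaction history

def frustration_score(s: VoiceSignals) -> float:
    """Blend the words with tone, pacing, and history into one judgment.
    The weights here are illustrative only."""
    score = 0.0
    score += 0.5 * max(0.0, -s.word_sentiment)      # negative wording
    score += 0.2 * min(s.pitch_variance, 1.0)       # agitated tone
    score += 0.2 * min(s.speech_rate / 4.0, 1.0)    # rushed pacing
    score += 0.1 * min(s.past_complaints / 3.0, 1.0)
    return min(score, 1.0)

caller = VoiceSignals(word_sentiment=-0.7, pitch_variance=0.8,
                      speech_rate=3.5, past_complaints=2)
print(f"frustration: {frustration_score(caller):.2f}")  # high -> route to a live agent
```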
The Internet: The Last Obstacle to Unleashing the Full Potential of Conversational AI
The network limits the true potential of conversational AI. In natural speech between humans, the average time between responses is about 300ms. We see how network speed affects conversations on news programs when the reporter is in the field with a satellite link to the studio. The news anchor in the studio asks a question, and there is an awkward pause while the reporter waits to receive the question and then delivers the answer back to the studio.
The average lag in a satellite uplink is 638ms, more than double the time we expect during a face-to-face conversation. The time it takes for conversational AI to process speech and respond is about 150ms. That leaves 150ms for transmission across the network without exceeding our 300ms natural response time.
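The budget works out as a simple subtraction; the sketch below just restates the numbers from this section.

```python
# Latency budget for a natural-feeling reply, using the figures above.
natural_pause_ms = 300     # average gap between turns in human conversation
ai_processing_ms = 150     # speech processing + response generation
satellite_uplink_ms = 638  # average satellite lag, for comparison

network_budget_ms = natural_pause_ms - ai_processing_ms
print(f"Time left for the network round trip: {network_budget_ms} ms")   # 150 ms
print(f"A satellite link alone overshoots the whole budget by "
      f"{satellite_uplink_ms - natural_pause_ms} ms")                    # 338 ms
```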
The public internet was not built for this kind of instant communication. If your email takes a few seconds to cross the internet, or a web page takes three seconds to load, no one notices. Even if your streaming movie buffers for a few seconds before it starts, that is acceptable. The internet was built for that kind of bulk transfer, where seconds of lag go unnoticed; it was built to move a lot of data, not to move data quickly. An average trip across the internet typically takes between 100 and 300ms. That means that sometimes, even if nothing goes wrong, your call or video conference will have distorted voices, dropped audio, and freezing video. Even without conversational AI, we have all experienced these things. But there is a solution.
Subspace is an alternative to the public internet. Subspace was specifically built from the ground up to transport real-time data like phone calls, gaming, and video conferencing. For these applications, every millisecond counts. When you are publishing a game, setting up a remote call center, operating a video conferencing application, or building SARAH, Subspace has the speed to accommodate the needs of real-time applications and plenty of room left to support the next generation of speech processing.