The invention of the telephone nearly 150 years ago triggered a revolution in communications. Today, the voice communications revolution is in the midst of a new quantum leap, as new classes of smart devices make it possible for artificial intelligence (AI) to extract meaning from sound and give people more intuitive ways to interact with their world. This article examines where we are today and previews technologies that will make ubiquitous voice assistants a natural part of our lives.

“Mr. Watson, come here….” 

The famous words uttered by Alexander Graham Bell in 1876 marked the first time that sound was electrically transmitted. This world-changing innovation remains at the center of dramatic changes in how we work, live, and play — and is an integral part of new breakthroughs in how we interact with the world around us.

In its first century, the wired telephone network connected people around the world. Then the electronics revolution of the last 50 years made voice and video conversation wireless and portable. In this decade, we have moved from hands-free telephone conversations between people to conversations with machines. While still rudimentary, this new type of human-machine interaction is driving the next leap in innovation.

Computers, smartphones, and smart speakers now feature built-in voice assistants that use cloud-based deep learning systems to let us ask questions and program actions. The same capability will soon be integrated into other devices we use every day. According to Statista, as many as 1.8 billion people will have access to a voice assistant by 2020, on devices they carry and on other types of platforms in their homes and even in business environments.

Yet, the success of voice assistant systems is still challenged by limitations in today’s technologies. Advances in AI, specialized processors, and more sensitive microphones will enhance the performance of voice assistants and accelerate market adoption.

Making conversations human

One challenge facing voice assistant systems is that human conversations are incredibly rich and interactive. Sometimes, a friend may respond to your statements before you even finish a sentence. In technical terms, response times when people talk to each other are measured in tens of milliseconds. While an occasional slow, thoughtful response is very natural when you talk with friends, imagine how awkward your daily interactions would be if the normal conversational gap included delays of up to several seconds or frequent needs to restate a question or command.

The slow pace of voice-assistant “conversation” is related to several aspects of the underlying technology. The algorithms that power voice recognition and response require a lot of processing power, so today’s smartphone and smart speaker systems record speech and then relay it to computing resources in the cloud. To minimize transmission delays, systems typically transmit low-quality audio files, which leads to high error rates. And the Internet itself is a variable-speed medium, so transmission time can change from one request to the next. Together, low audio quality and unpredictable transmission speed will always affect the quality of voice assistants that rely on the cloud to do the heavy lifting of voice recognition.

Even with these drawbacks, consumers clearly are excited about the technology. Sales of smart speaker systems, the first entirely new product after smartphones to offer voice assistants, are growing at a rate not seen since the first smartphones were introduced. In the U.S., the number of smart speaker owners jumped by 40% in 2018 to 66.4 million, a little more than 26% of U.S. adults, and the number of smart speakers in use reached 133 million, according to voicebot.ai.

It also is inevitable that voice assistants will continue to get better at emulating conversation. Conversational delay will shrink and improving algorithms will make the interaction seem more like human interaction. A big part of these improvements will come from bringing processing closer to the user.  

Bringing conversation to the edge

The technology that makes cloud-based voice assistants a reality now is advancing at a pace that will make these devices far more personal. Current voice assistants relay information to and from the cloud. Tomorrow, the AI that makes this possible will reside in the edge device, providing benefits in privacy, power consumption, and the responsiveness of the system. In short, edge computing promises to make voice assistants more effective by moving AI from the cloud to our homes, our workplaces, and other devices embedded in the world around us. In a step toward this future, Infineon recently demonstrated the world’s lowest-power edge keyword recognition solution.

One area of great promise for smarter voice assistants is in medical and personal health monitoring. For example, a high-sensitivity microphone can monitor breathing sounds while sleeping and predict the onset of sleeping disorders such as sleep apnea. Many people may be uncomfortable having this type of personal health information transmitted to the cloud for processing. Edge processing will make it possible to monitor and analyze this information by localizing audio capture, computation, and storage of the analyzed data. Users then will be able to manage how and when the data is shared. A voice assistant that assures higher levels of privacy will make people more comfortable with monitoring for heart and respiratory health, sleep states, and overall wellness.

The advances in AI that we see today are driven by deep learning research and new types of hardware used to build specialized deep learning systems. Infineon’s partner, Syntiant, a pioneer in this area, is building a new class of chips that bring deep learning to edge devices. Within just a few years, human-machine interaction aided by voice assistant technology will be an everyday occurrence for billions of people. And the technology developed for smarter voice assistants will consume little enough power to allow small, battery-powered intelligent audio recognition in many other applications. To forecast where else the technology has value, consider how the sounds you hear affect the way you interact with the world. Outside of the view of everyday users, voice assistant technology will become a part of the sensor suite in smart machines operating in the Internet of Things (IoT) and as part of Industry 4.0.

Autonomous vehicles will also use audio input in combination with other sensors to detect and respond to the surrounding environment. The sounds of bicycles, trains, other traffic, and shouting children are all inputs to the AI network that will enable cars to “see” objects around corners. In a factory, the sounds of operating machines can be used in smart control networks that diagnose potential problems before they happen. Smart city systems will “hear” unusual events such as glass breaking or a vehicle accident and alert the proper authorities. And future generations of robots will employ audio systems as part of the sensor network supporting intelligent operation and interaction. Indeed, the list of potential applications is endless.

MEMS microphones surpass human hearing

Human hearing and cognitive processing are part of an incredibly rich sensory system. Yet, we are heading to a day when AI-based voice assistants exceed human capabilities in some respects. In today’s voice assistants, arrays of tiny microphones work with smart chips to accurately detect and interpret incoming sounds. A key part of this is far-field recognition, where sensitive MEMS microphones and voice processor chips use advanced audio processing algorithms to hear even whispered speech across a room. Other algorithms help the microphone array distinguish the specific voice issuing commands in a room with multiple sound sources, including other people, television, and radio. At Infineon, we’ve developed a demonstration system that combines microphones and a voice processor with tiny radar chips to further improve presence detection and focus (Photo 1).

Photo 1. The sensor fusion of Infineon’s radar and MEMS microphone with audio processors from XMOS provides a new building block for voice assistant platforms. (Image: Infineon Technologies)
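
The text above does not name the algorithms a microphone array uses to favor one talker over other sound sources. One widely used technique is delay-and-sum beamforming, sketched below in Python as an illustration only, not as a description of the Infineon or XMOS implementation. The two-microphone geometry, the sample delays, and the tone periods are all hypothetical, and the interferer is deliberately placed so that it arrives exactly out of phase; in a real room the attenuation would be only partial.

```python
import math

def delay_and_sum(channels, delays):
    """Advance each channel by its integer sample delay, then average."""
    length = min(len(ch) - d for ch, d in zip(channels, delays))
    return [
        sum(ch[n + d] for ch, d in zip(channels, delays)) / len(channels)
        for n in range(length)
    ]

def rms(signal):
    """Root-mean-square level of a list of samples."""
    return math.sqrt(sum(x * x for x in signal) / len(signal))

def tone(period, shift, count=400):
    """Unit-amplitude sine with the given period, delayed by `shift` samples."""
    return [math.sin(2 * math.pi * (n - shift) / period) for n in range(count)]

# Hypothetical geometry: the target talker reaches mic 1 three samples
# after mic 0; the interfering source reaches mic 1 eleven samples after mic 0.
steering_delays = [0, 3]            # aligns the two copies of the target
target = [tone(40, 0), tone(40, 3)]
interferer = [tone(16, 0), tone(16, 11)]

target_out = rms(delay_and_sum(target, steering_delays))
interferer_out = rms(delay_and_sum(interferer, steering_delays))
print(f"target RMS after beamforming:     {target_out:.3f}")
print(f"interferer RMS after beamforming: {interferer_out:.3f}")
```

Because the steering delays line up the target's two copies, they add coherently and keep their level, while the interferer's copies are misaligned and largely cancel. Adding more microphones narrows the direction from which sound is favored.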

Emile Berliner, the inventor of the microphone that made the phone practical, would be amazed at the miniaturization of the technology, but he still would recognize the operating principles of sound capture and playback. The MEMS microphones available today use the same principle as the first practical microphones developed by Berliner: the air pressure of sound waves is sensed by a thin membrane that converts the pressure to electrical signals. MEMS microphones detect audible sound from the faintest levels near 0 dB SPL (sound pressure level), the nominal threshold of human hearing, to live rock music levels of 120 dB SPL. The dB scale is logarithmic, which means that the sound intensity at 120 dB SPL is twelve orders of magnitude (1 trillion times) greater than at 0 dB SPL.
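
The orders-of-magnitude figure follows directly from the definition of the decibel. As a quick check, the short Python sketch below converts a level difference in dB into a ratio. The function names are illustrative. Note that the trillion-fold figure applies to sound intensity (power); the corresponding sound pressure ratio is a million-fold, because intensity scales with the square of pressure.

```python
def intensity_ratio(db_difference):
    """Ratio of sound intensities (power) for a level difference in dB."""
    return 10 ** (db_difference / 10)

def pressure_ratio(db_difference):
    """Ratio of sound pressures (amplitude) for a level difference in dB."""
    return 10 ** (db_difference / 20)

span_db = 120 - 0  # from the threshold of hearing to live rock music
print(f"intensity ratio: {intensity_ratio(span_db):.0e}")
print(f"pressure ratio:  {pressure_ratio(span_db):.0e}")
```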

In many applications, the sensitivity of state-of-the-art MEMS microphones exceeds what the typical human ear can hear. The latest generation devices offered by Infineon operate with up to 10 dB superior performance in the signal-to-noise ratio (SNR) when compared to alternative microphones of similar size (Photo 2). For audio processing in next generation systems, this translates to a cleaner audio signal that can improve overall sensitivity and reduce error rates.

Photo 2. Infineon’s dual backplate MEMS technology uses a membrane embedded within two backplates, thus generating a truly differential signal. The SNR is improved by 6 dB to 70 dB, which is equivalent to doubling the distance from which a user can give a voice command that is captured by the microphone. (Source: Infineon Technologies)
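
The link between 6 dB and a doubling of distance can be checked with a few lines of Python. This is a minimal sketch under simplifying assumptions that are not stated in the caption: a point source in free field (no room reflections), so sound pressure falls in inverse proportion to distance, and a microphone noise floor that does not change. The function names are illustrative.

```python
import math

def distance_factor(snr_gain_db):
    """How much farther away a talker can be for the same captured SNR,
    assuming sound pressure falls as 1/distance (free field)."""
    return 10 ** (snr_gain_db / 20)

def level_drop_db(distance_ratio):
    """Drop in sound pressure level when distance grows by this ratio."""
    return 20 * math.log10(distance_ratio)

print(f"6 dB of extra SNR -> {distance_factor(6):.2f}x the distance")
print(f"doubling the distance costs {level_drop_db(2):.2f} dB")
```

In a reverberant room the benefit is typically smaller than this free-field estimate, because reflected sound and background noise do not follow the simple 1/distance rule.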

Smart today, smarter tomorrow

Audio processing technologies used today typically apply such concepts as echo cancellation and active filtering to suppress unwanted noise and isolate the target audio signal for speech recognition. In effect, this type of audio recognition treats noise information as background. Next generation neural net AI processors will take a different approach and learn to tell the difference between noise and useful signal. Infineon is working with partners today on the microphone and hardware combinations that will make this possible. These collaborations also are aimed at providing the development tools required to design AI-powered audio sensing and speech recognition systems for industrial, commercial, and consumer products.

In the near future, voice assistant technology will allow purposeful conversations between people and the machines we use, even with no connectivity to the cloud. And the same smart audio technology will be integrated into sensor systems that monitor our health and safety. It’s all part of a continuing revolution in voice communications that will allow humans to interact with machines in new ways and allow machines to sense and respond to the environment in which they are used.

— Pradyumna Mishra is entrepreneur-in-residence, Infineon Technologies