Training your own voice assistant

Sanyam Bhutani

How do voice-controlled devices work?

Voice recognition is the ability of a software program or hardware device to decode the human voice. It is commonly used to operate a device, perform commands, or write without using a keyboard, mouse, or buttons. Today, this is done on a computer with automatic speech recognition (ASR) software. Many ASR programs require the user to "train" the program to recognize their voice so that it can convert speech to text more accurately. A voice command device (VCD) is a device controlled by the human voice. By removing the need for buttons, dials, and switches, VCDs let consumers operate appliances with their hands full or while doing other tasks.

Let us consider 3 such VCDs:

Apple’s Siri

Siri can be set up to only turn on when the user pushes a button, or to also respond to the phrase “Hey, Siri.” The program taps information from the user’s device, including the user's name, location, the contents of the music library, and names and relationships in Contacts to facilitate answering questions. Some simple commands, such as “send a text,” are handled on the device itself and not sent up to the cloud to be acted upon. More complex actions are recorded, sent to the cloud, translated into words, and then acted upon. A user’s voice recordings are saved for six months so that the recognition system can utilize them to better understand the user’s voice. After six months, another copy is saved, without its identifier, for use by Apple in improving and developing Siri for up to two years.

Google’s Voice Services

Only devices that have Voice and Audio Activity turned on listen to the sound data flowing through a “hot word detection buffer” on the device. The buffer can hold only a few seconds of data. When the device hears “OK, Google,” or the user touches a microphone icon, the app records the speech and audio, plus a few seconds before, and sends it to Google’s cloud. The recordings are saved only if the user is signed in to their Google account and has Voice and Audio Activity on. They are encrypted while on Google’s servers. Users can choose to delete their audio records. These continue to be stored by the company but are no longer linked to the user’s name, only to an anonymous identifier.
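The buffering behaviour described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Google's implementation: audio frames are stood in for by short strings, and the names HotwordBuffer, capture, is_hotword, and max_frames are hypothetical.

```python
from collections import deque


class HotwordBuffer:
    """Fixed-size buffer: keeps only the most recent frames, drops older ones."""

    def __init__(self, max_frames):
        # A deque with maxlen silently discards the oldest item when full.
        self.frames = deque(maxlen=max_frames)

    def push(self, frame):
        self.frames.append(frame)

    def snapshot(self):
        return list(self.frames)


def capture(stream, is_hotword, max_frames=3):
    """Return the frames that would be uploaded, or None if no hot word is heard.

    The upload contains the buffered frames from just before the hot word,
    the hot word frame itself, and everything that follows it.
    """
    buffer = HotwordBuffer(max_frames)
    frames = iter(stream)
    for frame in frames:
        if is_hotword(frame):
            return buffer.snapshot() + [frame] + list(frames)
        buffer.push(frame)
    return None


# Hypothetical stream: each string stands in for a chunk of audio.
stream = ["a", "b", "c", "d", "ok google", "weather", "today"]
uploaded = capture(stream, lambda f: f == "ok google", max_frames=3)
print(uploaded)
```

Because the buffer has a fixed length, anything older than the last few frames is gone by the time the hot word arrives, which matches the "few seconds before" behaviour described in the text.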

Amazon’s Echo

When the device hears its wake-up word, it begins sending a stream of audio to the cloud to be converted into text that the program can understand and act upon. Say “Alexa, what’s the weather today?” and it will give you a weather report for your location. Because only the words after the wake-up word are recorded and sent to the cloud, Amazon has access only to those, which it keeps in order to increase the accuracy of the system. Audio is always being fed through the Echo's buffer, which is constantly listening for the wake-up word. It holds a small amount of that audio, disposing of it when new sounds come in. It sends to the cloud only the commands and questions spoken after the wake-up word. Amazon allows users to go in and erase their voice recordings. It’s also possible to turn Echo’s microphones off so it’s not listening.
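As a rough sketch of this gating, the snippet below works on already-transcribed text rather than audio, which is a simplification made here for readability. The class WakeWordGate and its process method are hypothetical names, not part of any Amazon API.

```python
class WakeWordGate:
    """Forwards only what is said after the wake word; ignores everything else."""

    def __init__(self, wake_word="alexa"):
        self.wake_word = wake_word
        self.muted = False  # models the microphone-off switch

    def process(self, utterance):
        """Return the text that would be sent to the cloud, or None."""
        if self.muted:
            return None
        # Normalise case and treat commas as word separators.
        words = utterance.lower().replace(",", " ").split()
        if self.wake_word not in words:
            return None
        start = words.index(self.wake_word) + 1
        command = " ".join(words[start:])
        return command or None


gate = WakeWordGate()
print(gate.process("Alexa, what's the weather today?"))
print(gate.process("turn on the lights"))
```

Speech without the wake word, speech before it, and anything said while the microphones are off never leave the device in this sketch.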

As observed, the activation mechanism is broadly similar in all three cases, and the storage of recordings is a primary concern in each. To improve AI-driven VCDs, it is crucial to improve the latter.

Different levels of sophistication in voice-controlled devices

Some of the first examples of VCDs can be found in home appliances with washing machines that allow consumers to operate washing controls through vocal commands and mobile phones with voice-activated dialing. Newer VCDs are speaker-independent, so they can respond to multiple voices, regardless of accent or dialectal influences. They are also capable of responding to several commands at once, separating vocal messages, and providing appropriate feedback, accurately imitating a natural conversation. They can understand around 50 different commands and retain up to 2 minutes of vocal messages.

ASR is just one example of voice recognition. Below are other examples of voice recognition systems:

Speaker dependent system - The voice recognition requires training before it can be used, which requires you to read a series of words and phrases.
Speaker independent system - The voice recognition software recognizes most users’ voices with no training.
Discrete speech recognition - The user must pause between each word so that the speech recognition can identify each separate word.
Continuous speech recognition - The voice recognition can understand a normal rate of speaking.
Natural language - The speech recognition not only can understand the voice but also return answers to questions or other queries that are being asked.
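To see why discrete speech recognition needs pauses, consider a minimal sketch that finds word boundaries purely from silence. It assumes the audio has already been reduced to one loudness value per frame. The function name split_on_pauses and the threshold and min_pause parameters are hypothetical, and real recognizers use far more than loudness.

```python
def split_on_pauses(energies, threshold=0.1, min_pause=2):
    """Group frames into word segments separated by pauses.

    energies: one loudness value per audio frame.
    A pause is min_pause or more consecutive frames below threshold.
    Returns (start, end) index pairs, end exclusive.
    """
    segments = []
    start = None  # index where the current word began, if any
    quiet = 0     # consecutive quiet frames seen inside the current word
    for i, energy in enumerate(energies):
        if energy >= threshold:
            if start is None:
                start = i
            quiet = 0
        elif start is not None:
            quiet += 1
            if quiet >= min_pause:
                # The word ended where the quiet run began.
                segments.append((start, i - quiet + 1))
                start = None
                quiet = 0
    if start is not None:
        # Close a word that runs to the end, trimming trailing quiet frames.
        segments.append((start, len(energies) - quiet))
    return segments


energies = [0, 0.5, 0.6, 0, 0, 0.7, 0.8, 0.9, 0, 0.4, 0]
print(split_on_pauses(energies))
```

In this example the two-frame gap separates the first word from the second, while the single quiet frame later on is too short to count as a pause, so the frames around it are treated as one word. Continuous speech recognition has to find word boundaries without relying on such gaps.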

With the rise of AI, the tech world is looking forward to natural language voice recognition. Start learning the basics today with Natural Language Processing.

About the author
Sanyam Bhutani

Sanyam Bhutani is a Deep Learning and Computer Vision practitioner. He has worked on end-to-end AI-based industrial and research projects at Tech Mahindra, ONGC, IIT-Madras, and IIT-Roorkee.