Using Speech to Text in voice assistants is the common approach. The Conventional Spoken Language Understanding method transcribes speech with Speech to Text, then extracts meaning, i.e., intent, by processing the transcribed text with Natural Language Understanding. The accuracy of these voice assistants relies on the performance of independently trained Speech to Text and Natural Language Understanding modules. Erroneous Speech to Text outputs lead to incorrect Natural Language Understanding predictions, and generic Speech to Text models have known limitations. Thus, finding the best Speech to Text for voice assistants is challenging.
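
As a rough illustration of how a transcription error carries through this two-step pipeline, here is a minimal Python sketch. The keyword table, intent names, and both functions are hypothetical stand-ins, not a real Speech to Text or Natural Language Understanding engine. The Speech to Text step is simulated by passing in the transcript it would have produced.

```python
# Toy illustration only: every name below is hypothetical.
# A tiny keyword-based "NLU" that maps a transcript to an intent.
INTENT_KEYWORDS = {
    "repair_carburetor": {"carburetor"},
    "cook_carbonara": {"carbonara"},
}


def toy_nlu(transcript: str) -> str:
    """Return the first intent whose keywords appear in the transcript."""
    words = set(transcript.lower().split())
    for intent, keywords in INTENT_KEYWORDS.items():
        if words & keywords:
            return intent
    return "unknown"


def cascaded_slu(stt_output: str) -> str:
    """Conventional pipeline: the NLU only ever sees the STT transcript."""
    return toy_nlu(stt_output)


# What the technician actually said, transcribed correctly.
correct = cascaded_slu("how do I fix a carburetor")

# The same utterance with a single misrecognized word.
misheard = cascaded_slu("how do I fix a carbonara")

print(correct)   # repair_carburetor
print(misheard)  # cook_carbonara
```

The Natural Language Understanding step has no access to the audio, so it cannot recover from the transcript error. It confidently returns the wrong intent.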

Open-domain voice assistants, such as Alexa, Siri, and Google Assistant, use the Conventional Spoken Language Understanding approach. One reason is dataset availability. Open-domain voice assistants multitask, from retrieving information about weather, nutrition, or history to taking actions such as setting a timer or playing fun sounds and songs. Text-based Natural Language Understanding has been around longer than speech-based Natural Language Understanding and hence has richer datasets, making it a more suitable solution for open-domain use. However, not every voice assistant operates in the open domain. Technicians at an auto shop don’t need a voice assistant that can help fix an airplane or ship, let alone play their favorite song.

We expect technicians to fix a “carburetor” and chefs to fix “carbonara,” so they get trained accordingly. Voice assistants should also be trained for their domain to improve productivity and minimize errors. Otherwise, their value-add, and hence adoption, would be limited. Yet the go-to speech technology for many developers, generic Speech to Text, does not have this specialization, i.e., context awareness. Alexa can explain how to fix both “carburetor” and “carbonara,” but it can sometimes mix up the terms because they sound similar, and it may pull information from unreliable sources. That might not be an issue for someone asking questions for fun at home. However, time, accuracy, and precision are valuable in auto shops or commercial kitchens.

After listening to the challenges developers face in finding the best Speech to Text for voice assistants, we decided to take a different path and build a solution that addresses the need more directly: Speech-to-Intent. Speech-to-Intent is a context-aware alternative to Speech to Text. It combines Speech to Text and Natural Language Understanding, resulting in more accurate and faster voice assistants within their domain. The downside? It’s not fit for open-domain voice assistants. A voice assistant specializing in auto repairs is a “professional” technician’s helper only. It can’t be a technician’s helper, singer, nutritionist, timer, and door opener all at once.

If you need a more accurate and responsive alternative to Speech to Text for your domain-specific voice assistant, try Rhino Speech-to-Intent. Rhino Speech-to-Intent is six times more accurate than Big Tech alternatives (Google Dialogflow, Amazon Lex, Microsoft LUIS, and IBM Watson), as shown by an open-source benchmark. If you need to discuss your specific use case with an expert, leverage Picovoice’s Consulting Services.
