End-to-End Intent Inference from Speech

August 29, 2019

Inferring intent from spoken commands is at the core of any modern voice user interface (VUI). Typically the inference is performed within a well-defined context. For example:

  • [Music Streaming] : Play Tom Sawyer album by Rush

  • [Retail] : Search for sandals that are under $40

  • [Conferencing] : Call John Smith on his cell phone

Current solutions work in two steps. First, speech is converted into text by a speech-to-text (STT) engine. Second, the intent is extracted from the transcribed text using a natural language understanding (NLU) engine. NLU implementations can vary from simple regular expressions to complex probabilistic models or a mix thereof. This approach requires significant compute resources because the STT stage is inherently large-vocabulary. Moreover, it gives suboptimal performance, as errors introduced by STT impair NLU performance.
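To make the error-propagation point concrete, here is a minimal sketch of the second step only, assuming a regular-expression NLU for the conferencing example. The pattern, the `infer_intent` helper, and the slot names are hypothetical illustrations, not the grammar of any real product, and the transcripts are hard-coded stand-ins for STT output.

```python
import re
from typing import Optional

# Hypothetical pattern for the [Conferencing] example above.
# A real NLU engine would cover many more phrasings.
CALL_PATTERN = re.compile(
    r"^call (?P<contact>[a-z ]+?) on (?:his|her|their) "
    r"(?P<device>cell phone|office phone)$"
)


def infer_intent(transcript: str) -> Optional[dict]:
    """Extract an intent and its slots from an STT transcript, or return None."""
    match = CALL_PATTERN.match(transcript.strip().lower())
    if match is None:
        return None
    return {"intent": "call", "slots": match.groupdict()}


# A correct transcript is understood.
correct = infer_intent("Call John Smith on his cell phone")

# A single STT substitution ("cell" heard as "sell") defeats the NLU step,
# even though the spoken command was identical.
misheard = infer_intent("Call John Smith on his sell phone")

print(correct)
print(misheard)
```

In this sketch the NLU step has no access to the audio, so it cannot recover from the transcription error. That is the weakness the two-step design carries regardless of how the NLU itself is implemented.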

Speech to Text to Intent

Picovoice’s Speech-to-Intent engine takes advantage of contextual information to create a bespoke, jointly optimized STT and NLU engine for the domain of interest. The result of this optimization is an end-to-end model that outperforms alternatives in accuracy and runtime efficiency. Additionally, it can run on-device and offline, which improves privacy, latency, and cost-effectiveness.

Speech to Intent


We have benchmarked the accuracy of Picovoice Speech-to-Intent against Google Dialogflow. Dialogflow is a cloud-based service for building conversational interactions, including VUIs. The test use case is a VUI for a coffee maker. The speech data is crowd-sourced, and both engines are given the same number of training examples. In summary, Picovoice achieves 95% accuracy across a variety of acoustic environments vs a 75% command acceptance rate for Dialogflow. The data, code, and test setup for the benchmark are open-source and available here.

Picovoice Rhino accuracy vs Dialogflow

This significant accuracy improvement stems from the joint optimization performed when training the end-to-end intent inference model. It is worth noting that even Dialogflow performs a weak form of joint optimization: it uses advanced features of the Google STT API to supply the custom vocabulary on which it operates. Using a generic STT engine with no joint optimization (context-awareness) results in even lower detection rates.

Runtime Efficiency

Efficiency is crucial for offline on-device execution. The Picovoice Speech-to-Intent model is less than 3 MB in size (for thousands of spoken commands/words) and uses 5% of a single CPU core on a Raspberry Pi 3. The closest alternatives require orders of magnitude more CPU and memory resources. For example, see here.


During our tests, Dialogflow’s average API response time was around 2,100 ms, while Picovoice’s response time is less than 60 ms, even on tiny microcontrollers. Such delay is an inherent limitation of cloud-based solutions, as the total delay is unpredictable and depends on the quality of the network connection and server-side load.

Cost Effectiveness

Cloud computing is not an affordable option for voice processing at scale. At the time of writing, Google Dialogflow charges $0.004 per API call, which amounts to $14.60 per annum for a voice-enabled device with 10 voice interactions per day. This unbounded operating cost can be prohibitive for device builders and app developers. By tapping into readily available on-device compute resources, Picovoice can offer a usage-independent plan supporting our licensees’ business growth.
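The per-device figure above can be reproduced with a few lines of arithmetic. The sketch below uses the price and usage quoted in the text. The `annual_cloud_cost` helper and the fleet size of 100,000 devices are hypothetical, added only to illustrate how a per-call fee scales with deployment size.

```python
def annual_cloud_cost(
    price_per_call: float, calls_per_day: int, days_per_year: int = 365
) -> float:
    """Yearly per-device cost of a per-call cloud API, in dollars."""
    return price_per_call * calls_per_day * days_per_year


# Figures quoted in the text: $0.004 per call, 10 interactions per day.
per_device = annual_cloud_cost(price_per_call=0.004, calls_per_day=10)

# Hypothetical fleet size, assumed here only to show the scaling.
fleet_size = 100_000
fleet_cost = per_device * fleet_size

print(f"Per device: ${per_device:.2f} per year")
print(f"Fleet of {fleet_size:,} devices: ${fleet_cost:,.0f} per year")
```

Because the cost is linear in both the number of devices and the interactions per device, it grows with every unit shipped and every additional command a user speaks, which is what makes it unbounded from the device builder's point of view.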