Rhino Speech-to-Intent Engine FAQ
What's intent recognition?
Intent recognition, also known as intent classification or intent detection, is the process of analyzing a user's written or spoken input to determine their underlying goal or purpose. By identifying the intent behind utterances, systems such as AI agents can respond effectively and interact naturally with humans. Intent recognition is crucial in applications such as customer service, sales automation, menu navigation, and smart devices, streamlining interactions and enhancing user experience.
How can we convert speech into an action using Rhino Speech-to-Intent?
Rhino Speech-to-Intent infers intents and slots from utterances, which can then be used to trigger an action.
Below is an example output of Rhino Speech-to-Intent. Developers can use this information to pour a “large skim milk americano with one pump of sugar” if they're building a smart coffee machine, or to send the order to a barista if they're building phone-ordering software.
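The shape of such an output and one way to act on it can be sketched as follows. This is a hedged illustration in plain Python: the intent name (`orderBeverage`), the slot names, and the `handle_inference` helper are all hypothetical, not from a real Rhino context.

```python
# Hypothetical inference result in the general shape Rhino produces
# (intent and slot names below are illustrative, not from a real context):
inference = {
    "is_understood": True,
    "intent": "orderBeverage",
    "slots": {
        "size": "large",
        "milk": "skim milk",
        "beverage": "americano",
        "sugar": "one pump",
    },
}

def handle_inference(inference):
    """Map an inferred intent to an application action (hypothetical helper)."""
    if not inference["is_understood"]:
        return "Sorry, I didn't catch that."
    if inference["intent"] == "orderBeverage":
        slots = inference["slots"]
        return f"Making a {slots['size']} {slots['milk']} {slots['beverage']}..."
    return "Unsupported intent."

print(handle_inference(inference))
```

A smart coffee machine would replace the returned strings with calls into its brewing logic; a phone-ordering system would forward the structured slots to the order backend instead.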
You can learn more about Picovoice's approach to End-to-End Intent Inference from Speech.
How many commands (expressions) can the Rhino Speech-to-Intent engine understand?
There is no technical limit on the number of commands (expressions) or slot values Rhino can understand. However, on platforms with limited memory (MCUs or DSPs), the total number of commands and the vocabulary size are dictated by the amount of available memory (flash). You can talk to Picovoice Engineering to discuss your use case requirements and hardware limitations.
What is the intent detection accuracy?
Picovoice has done rigorous performance benchmarking of its Rhino Speech-to-Intent engine and has open-sourced and published the results of its Natural Language Understanding Benchmark, comparing Amazon Lex, Google Dialogflow, Microsoft LUIS, IBM Watson, and Picovoice Rhino, to help enterprises choose the best natural language understanding engine. The audio data, code, and models used for benchmarking have been made publicly available under the Apache 2.0 license to facilitate reproducibility.
The Rhino Speech-to-Intent engine can extract intents from spoken utterances with higher than 99% accuracy in clean (noise-free) environments, and 97% accuracy in noisy environments with a signal-to-noise ratio of 9 dB at the microphone level.
Can Rhino Speech-to-Intent infer phone numbers, time of day, dates, alphanumerics, etc. from speech?
Yes, Rhino can accurately understand numbers, alphanumerics, and similarly challenging parameters. Watch this demo of a phone-dialing interaction running on an ARM Cortex-M4 microcontroller, simulating a hearable application.
Does Picovoice Speech-to-Intent software work in my target environment and noise conditions?
The overall performance depends on various factors such as speaker distance, level/type of noise, room acoustics, quality of the microphone, and audio frontend algorithms used (if any). It is usually best to try out our technology in your target environment using freely available sample models. Additionally, we have published an open-source benchmark of our Speech-to-Intent software in a noisy environment, which can be used as a reference.
Does Picovoice Speech-to-Intent software work in the presence of noise and reverberation?
Yes, Rhino is resilient to noise, reverberation, and other acoustic artifacts. We have done rigorous performance benchmarking of Rhino and published the results publicly. The audio data and code used for benchmarking have also been made publicly available under the Apache 2.0 license so the results can be reproduced.
I need to use Speech-to-Intent software in an Interactive Voice Response (IVR) application. Is that possible?
Yes, Rhino is a powerful tool for building IVR applications. However, please note that Picovoice software works well only on 16 kHz audio and does not perform optimally in telephony applications that use 8 kHz audio.
Does Picovoice Speech-to-Intent engine perform end-pointing?
Yes, Rhino performs endpointing automatically. You can also set the endpoint duration manually; check out the API of your choice to learn how.
What's sensitivity value? What should I set the sensitivity value to?
The sensitivity parameter controls the trade-off between the miss rate and the false alarm rate: a higher sensitivity value gives a lower miss rate at the expense of a higher false alarm rate. Pick the sensitivity value that suits your application's requirements.
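The trade-off can be illustrated with a toy model in plain Python. This is not Picovoice's internal algorithm; it simply treats sensitivity as lowering a detection-score threshold, with made-up scores for genuine commands and background noise.

```python
# Toy illustration of the sensitivity trade-off (not Rhino's internals):
# sensitivity in [0, 1] is modeled as lowering a detection-score threshold.

def detect(score, sensitivity):
    """Accept a detection when its score clears the threshold implied by sensitivity."""
    threshold = 1.0 - sensitivity  # higher sensitivity -> lower threshold
    return score >= threshold

true_commands = [0.9, 0.7, 0.55]   # made-up scores for genuine commands
noise_events = [0.5, 0.35, 0.2]    # made-up scores for background noise

for sensitivity in (0.25, 0.5, 0.75):
    hits = sum(detect(s, sensitivity) for s in true_commands)
    alarms = sum(detect(s, sensitivity) for s in noise_events)
    print(f"sensitivity={sensitivity}: {hits}/3 commands detected, {alarms}/3 false alarms")
```

At low sensitivity the toy detector misses two of the three commands but raises no false alarms; at high sensitivity it catches every command but also fires on noise, which is exactly the trade-off to tune for your application.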
Does my application need to listen to a wake word before processing the audio with Rhino?
No. Besides using Porcupine Wake Word, you can implement a physical or digital button (e.g., touch-to-talk or a push-to-talk switch) or use Cobra Voice Activity Detection, depending on your requirements.
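Whichever trigger you choose, the pattern is the same: only feed audio frames to the intent engine while the trigger is active. A minimal sketch in plain Python, assuming a hypothetical `gate_frames` helper and made-up frame labels (not a real Picovoice API):

```python
# Minimal sketch of gating audio without a wake word (hypothetical helper):
# forward frames to the intent engine only while the trigger is active,
# whether the trigger is a held push-to-talk button or a VAD decision.

def gate_frames(frames, is_active):
    """Keep only the frames captured while the trigger was active."""
    return [frame for frame, active in zip(frames, is_active) if active]

frames = ["f0", "f1", "f2", "f3", "f4"]

# e.g., a push-to-talk button held during frames 1-3:
button_held = [False, True, True, True, False]
print(gate_frames(frames, button_held))

# e.g., made-up per-frame voice probabilities from a VAD, thresholded at 0.5:
voice_probs = [0.1, 0.8, 0.9, 0.7, 0.2]
vad_active = [p >= 0.5 for p in voice_probs]
print(gate_frames(frames, vad_active))
```

In a real application the gated frames would be passed to Rhino's processing call instead of being collected into a list.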
What's the advantage of using Picovoice Speech-to-Intent software instead of using speech-to-text and inputting the transcribed text into a natural language understanding (NLU) engine to extract intents?
Using a generic speech-to-text engine with an NLU engine usually results in suboptimal accuracy without any tuning. Introduction to Spoken Language Understanding discusses these two approaches in detail. We have also benchmarked natural language understanding engines, comparing the performance of Picovoice Rhino with Google Dialogflow, Amazon Lex, IBM Watson, and Microsoft LUIS.