Rhino Speech-to-Intent Engine FAQ

See also the General FAQ

How many commands (expressions) can Picovoice speech-to-intent software understand?

There is no technical limit on the number of commands (expressions) or slot values Picovoice speech-to-intent software can understand. However, on platforms with limited memory (MCUs or DSPs), the total number of commands and vocabulary is dictated by the amount of available memory. Roughly speaking, for each 100 commands and unique words, you should allocate around 50 KB of additional memory.
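As a quick sanity check, the rule of thumb above can be turned into a back-of-the-envelope budget. The sketch below is illustrative only; the 50 KB-per-100-words figure is the approximation stated in this FAQ, not an exact measurement, and actual model size varies by context:

```python
def estimate_context_memory_kb(unique_words: int, kb_per_100_words: float = 50.0) -> float:
    """Approximate additional memory (in KB) for a context, per the rule of thumb above."""
    return unique_words / 100.0 * kb_per_100_words

# e.g. a context with 250 unique words/commands:
budget = estimate_context_memory_kb(250)
print(f"Allocate roughly {budget:.0f} KB")  # → Allocate roughly 125 KB
```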

Which natural languages does Rhino speech-to-intent support?

At the moment, we only support the English language. For significant commercial opportunities, we may be able to prioritize support for new natural languages, partially reinvesting commercial license fees to do so.

What is Rhino speech-to-intent detection accuracy?

Picovoice has done rigorous performance benchmarking on its Rhino speech-to-intent engine and published the results publicly here. In addition, the audio data and the code used for benchmarking have been made publicly available under the Apache 2.0 license to allow the results to be reproduced.

The Rhino speech-to-intent engine can extract intents from spoken utterances with higher than 97% accuracy in clean (noise-free) environments, and 95% accuracy in noisy environments with a signal-to-noise ratio of 9 dB at the microphone level.

Can Rhino understand phone numbers, time of day, dates, alphanumerics, etc?

Yes, Rhino can accurately understand numbers, alphanumerics, and similar challenging parameters. Here is a demo of a phone-dialing interaction running on an ARM Cortex-M4 processor, simulating a hearable application.

Which platforms does Rhino speech-to-intent engine support?

Rhino speech-to-intent is supported on Raspberry Pi (all models), BeagleBone, Android, iOS, Linux, macOS, Windows, and modern web browsers (WebAssembly). Additionally, we support various ARM Cortex-A and ARM Cortex-M (M4/M7) MCUs from NXP and STMicroelectronics.

As part of our professional services, we can port our software to other proprietary platforms such as DSP cores or neural network accelerators, depending on the size of the commercial opportunity. Such engagements typically warrant non-recurring engineering fees in addition to prepaid commercial royalties.

Does Picovoice speech-to-intent software work in my target environment and noise conditions?

The overall performance depends on various factors such as speaker distance, level/type of noise, room acoustics, quality of microphone, and audio frontend algorithms used (if any). It is usually best to try out our technology in your target environment using freely available sample models. Additionally, we have published an open-source benchmark of our speech-to-intent software in a noisy environment here, which can be used as a reference.

Does Picovoice speech-to-intent software work in the presence of noise and reverberation?

Yes, the Picovoice speech-to-intent engine is resilient to noise, reverberation, and other acoustic artifacts. We have done rigorous performance benchmarking on the Rhino speech-to-intent engine and published the results publicly here. In addition, the audio data and the code used for benchmarking have been made publicly available under the Apache 2.0 license so the results can be reproduced. The results show 92% accuracy in a noisy environment with a signal-to-noise ratio of 9 dB at the microphone level.

Is there a limit on the number of slot values?

There is no technical limit on the number of slot values Picovoice speech-to-intent software can understand. However, on platforms with limited memory (particularly MCUs), the total number is dictated by the amount of available memory. Roughly speaking, for each 100 unique words/phrases, you should allocate around 50 KB of additional memory.

Are there any best practices for designing speech-to-intent context (Interaction model)?

The design process for the Picovoice speech-to-intent interaction model (or context) is similar to designing Alexa skills. In general, you have to make sure your context follows the common patterns for situational design:

  • Adaptability: Let users speak in their own words.
  • Personalization: Individualize your entire interaction.
  • Availability: Collapse your menus; make all options top-level.
  • Relatability: Talk with them, not at them.

I need to use speech-to-intent software in an Interactive Voice Response (IVR) application. Is that possible?

Yes. Picovoice speech-to-intent software is a powerful tool for building IVR applications. However, note that Picovoice software operates on 16 kHz audio and does not perform optimally in telephony applications that use 8 kHz audio.

Does Picovoice speech-to-intent software perform endpointing?

Yes, Picovoice speech-to-intent software performs endpointing automatically.

Does my application need to listen to a wake word before processing the audio with speech-to-intent software?

Speech-to-intent software requires a method of initiation to start listening when the user is about to speak. This can be implemented either with a push-to-talk switch or with the Picovoice wake word detection engine, depending on customer requirements.
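The initiation pattern above can be sketched as a small gatekeeper that routes audio frames to the intent engine only after a wake word (or push-to-talk press) arrives. The `wake_word_engine` and `intent_engine` objects here are hypothetical stand-ins, not the actual Picovoice APIs; in a real application they would wrap the Porcupine and Rhino engines respectively:

```python
class Gatekeeper:
    """Routes audio frames to the intent engine only after initiation."""

    def __init__(self, wake_word_engine, intent_engine):
        self._wake = wake_word_engine
        self._intent = intent_engine
        self._listening = False  # becomes True on wake word or push-to-talk

    def push_to_talk(self):
        """Manual initiation, e.g. from a hardware button."""
        self._listening = True

    def process(self, frame):
        """Feed one audio frame; returns an inference once one is finalized."""
        if not self._listening:
            if self._wake.process(frame):  # wake word detected
                self._listening = True
            return None
        if self._intent.process(frame):    # inference finalized
            self._listening = False        # return to idle until re-initiated
            return self._intent.get_inference()
        return None
```

The key design point is that the gatekeeper returns to the idle state after each finalized inference, so the device never streams audio to the intent engine unattended.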

How do I develop a speech-to-intent context model (.rhn) file?

Use the Picovoice Console to build a context and then train a model file. Capture all anticipated expressions users might say to convey the intents your context handles. Expressions are written using the Rhino syntax. The Rhino grammar provides a set of features that allow you to express many combinations of phrases, such as (optional phrases) and [choices, of, phrases].

For example, in a smart lighting application, the user might say:

  • "[set, change, switch, make, turn] (the) $room:room1 (light) (to) $color:color1"
  • "[set, change, switch, make, turn] (the) color in $room:room1 (to) $color:color1"

See this Picovoice Console tutorial for how to build a context.
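To build intuition for how the bracket/parenthesis syntax multiplies out into concrete phrases, here is a simplified, hypothetical expansion in Python. It handles only flat [choices] and (optionals), not slots like $color:color1 or the rest of the real Rhino grammar:

```python
import itertools
import re

def expand(expression: str) -> list:
    """Enumerate the concrete phrases a simplified Rhino-style expression matches."""
    # Tokenize into [choices], (optionals), and literal words.
    tokens = re.findall(r"\[[^\]]*\]|\([^)]*\)|\S+", expression)
    alternatives = []
    for tok in tokens:
        if tok.startswith("["):   # choice: exactly one of the alternatives
            alternatives.append([t.strip() for t in tok[1:-1].split(",")])
        elif tok.startswith("("): # optional: present or absent
            alternatives.append([tok[1:-1], ""])
        else:                     # literal word
            alternatives.append([tok])
    # Cartesian product of all alternatives, dropping empty optionals.
    return [" ".join(w for w in combo if w)
            for combo in itertools.product(*alternatives)]

print(expand("[turn, switch] (the) light off"))
# → ['turn the light off', 'turn light off', 'switch the light off', 'switch light off']
```

This also illustrates why vocabulary grows quickly: each choice multiplies and each optional doubles the number of phrases the context must cover.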

What’s the advantage of using Picovoice speech-to-intent software over using Speech-to-Text and feeding the transcribed text into an NLU engine to extract intents?

Using a generic speech-to-text engine with NLU usually results in suboptimal accuracy without tuning. We have benchmarked the performance of Picovoice Rhino against several alternatives, including Google Dialogflow, Amazon Lex, IBM Watson, and Microsoft LUIS here.