How to Benchmark a Rhino Context
- Basic Python and knowledge of the command line
- No machine learning knowledge needed
- A modern web browser
- A working microphone
- Ubuntu on x86_64 (Tested on v18.04)
- Python 3 and pip installed locally
In Voice User Interfaces (VUI), accuracy is essential. A rigorous and reproducible way of testing accuracy can be achieved by recording utterances from a variety of speakers with diverse voices and replaying them together with environment noise in a controlled fashion. This way of benchmarking provides objectivity and confidence in the accuracy of a VUI.
In this tutorial, we will build a VUI for a voice-enabled coffee maker and benchmark the underlying technology: the Picovoice Rhino Speech-to-Intent Engine. Rhino extracts intent directly from spoken utterances—without the error-prone and expensive Speech-to-Text intermediate step typically paired with natural language understanding (NLU). This direct approach allows for dramatically better accuracy and run-time efficiency.
Rhino requires you to express a targeted domain of interest, referred to as a "context." The Picovoice Console lets you design the context and test it in-browser as you iterate. Once you have finalized your design, you can train and export your context to run it offline on the targeted platform.
Test Speech Dataset
In our benchmark, we recorded sample audio commands from 50 unique speakers. Each speaker contributed about 10-15 different commands. Collectively, 619 audio recordings are used in this benchmark. We sourced these audio commands by hiring voice actors. Alternatively, you can ask your colleagues or friends to each contribute a set of sample utterances, or crowd-source utterances through services like Amazon Mechanical Turk. For benchmarking, you should use audio files recorded from diverse speakers with different accents that best represent the population of the target end users.
Here is a sample of one of our clean speech commands:
Give me an eight ounce triple shot espresso with a little bit of sweetener
Creating a Context
For a more in-depth tutorial on how to use Picovoice Console to create a Speech-to-Intent context from scratch, see the Rhino quick start.
Picovoice Console Account
Before we begin, you will need a Picovoice Console account.
After you create an account, log in to the Picovoice Console and go to the Rhino section.
Select the "Coffee Maker" template, give the context a name (for example, “CoffeeMaker”), and click "Create context". Then click the newly created context in the table below to open the context editor.
You can see the complete coffee maker context we have designed. It includes the “orderDrink” intent and the slots “coffeeDrink”, “numberOfShots”, “roast”, “size”, “milkAmount”, and “sugarAmount”. The intent has many expressions that include some of these slots. If any of these expressions occurs in a speech utterance, the “orderDrink” intent is returned, along with the specific slot values. In this context, we use 48 expressions that capture many different ways of interacting with the coffee maker.
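To make the shape of such a result concrete, here is a minimal sketch of what an inference for the sample command from the Test Speech Dataset section might look like as plain Python data. The dictionary layout, the slot value strings, and the helper `unknown_slots` are assumptions made for illustration; they are not taken from the Rhino API or from the benchmark's label files.

```python
# Slot names listed in the coffee maker context description.
CONTEXT_SLOTS = {
    "coffeeDrink", "numberOfShots", "roast", "size", "milkAmount", "sugarAmount",
}

# Hypothetical result for the utterance:
#   "Give me an eight ounce triple shot espresso with a little bit of sweetener"
# The slot value strings below are assumed, not copied from real engine output.
inference = {
    "intent": "orderDrink",
    "slots": {
        "size": "eight ounce",
        "numberOfShots": "triple shot",
        "coffeeDrink": "espresso",
        "sugarAmount": "a little bit of sweetener",
    },
}


def unknown_slots(result, context_slots=CONTEXT_SLOTS):
    """Return the slot names in `result` that the context does not define."""
    return sorted(set(result["slots"]) - set(context_slots))


print(unknown_slots(inference))
```

An expression does not need to fill every slot; here “roast” and “milkAmount” are simply absent from the result.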
Test and Submit for Training
Save your context, and try it out in the browser using the microphone button in the right column.
Once you are satisfied with the context, submit it for training and choose “Linux (x86_64)” as your Model Platform. When training has finished, we can use the resulting context model file with Rhino.
Getting Started with Rhino
We highly recommend completing this tutorial in a virtual environment. The code and the audio files used in this tutorial are made available here on GitHub. Let's first clone the speech to intent benchmark repository and its submodules:
git clone --recurse-submodules https://github.com/Picovoice/speech-to-intent-benchmark.git
Install dependencies including numpy, soundfile, and matplotlib using pip:
pip install -r requirements.txt
Next, install libsndfile using the package manager:
sudo apt-get install libsndfile1
Mix Clean Speech Data with Noise
Real-world scenarios usually have ambient noise. For example, in a cafe environment, there are noises from conversations, machines, dishes, etc. To simulate such a noisy environment, we’re going to add two background noise audio samples representing cafe and kitchen environments. These noise samples were downloaded from Freesound. You can record or source other types of background noise that match the environment of your end application.
We are going to mix clean speech audio files with these two noises. To do that, first make sure you are in the root directory. Then, run the following commands:
python benchmark/mixer.py cafe
python benchmark/mixer.py kitchen
This creates noisy speech commands by mixing background noise with clean audio files (located here) at signal-to-noise ratios (SNRs) of 24, 21, 18, 15, 12, 9, and 6 decibels (dB). These noisy audio files appear in new folders under
The lower the SNR, the noisier the audio data will be. For example, an SNR of 0 dB means that the signal level and the noise level are equal, which makes it very difficult, if not impossible, for humans to understand speech. For home assistants, standard stress testing is performed inside sound chambers at 5 to 6 dB SNR, which emulates a typical noisy household environment with the TV on and some kitchen appliances running at a moderate distance.
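The relationship between SNR and noise level can be sketched in a few lines. The example below is a simplified illustration, not necessarily how benchmark/mixer.py works: it assumes the speech and noise are already loaded as sequences of float samples at the same sample rate, and the function names `mean_power`, `mix_at_snr`, and `measured_snr_db` are hypothetical. It scales the noise so that the ratio of speech power to noise power matches the requested SNR, then adds the two signals.

```python
import math


def mean_power(samples):
    """Average power (mean of squared amplitudes) of a sequence of samples."""
    return sum(s * s for s in samples) / len(samples)


def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so the speech-to-noise power ratio is `snr_db`, then add it."""
    if len(noise) < len(speech):
        raise ValueError("noise must be at least as long as speech")
    noise = noise[:len(speech)]
    # SNR_dB = 10 * log10(P_speech / P_noise)
    # => target P_noise = P_speech / 10 ** (SNR_dB / 10)
    target_noise_power = mean_power(speech) / (10 ** (snr_db / 10))
    gain = math.sqrt(target_noise_power / mean_power(noise))
    return [s + gain * n for s, n in zip(speech, noise)]


def measured_snr_db(speech, mixed):
    """Recover the SNR of a mix, given the clean speech it was built from."""
    residual = [m - s for m, s in zip(mixed, speech)]
    return 10 * math.log10(mean_power(speech) / mean_power(residual))


# Synthetic stand-ins for real recordings: a 440 Hz tone and a deterministic
# pseudo-noise pattern, both treated as 16 kHz audio.
speech = [math.sin(2 * math.pi * 440 * i / 16000) for i in range(1600)]
noise = [((i * 37) % 101) / 50 - 1 for i in range(1600)]

for snr in (24, 6, 0):
    mixed = mix_at_snr(speech, noise, snr)
    print(snr, round(measured_snr_db(speech, mixed), 3))
```

At 0 dB the gain makes the noise power equal to the speech power, which matches the description above. A real mixer would typically operate on NumPy arrays and guard against clipping after the addition.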
To get a feel for what these different noise levels sound like, let's listen to some samples.
These next two audio files are the same clean speech command from the Test Speech Dataset section after mixing with cafe noise and kitchen noise at an SNR of 24 dB. Although these files are almost as clear as the clean audio, the noise is still noticeable:
The SNR of these next two audio files is 6 dB. Notice how much noisier these speech commands are:
Process Files with Picovoice Rhino
Let’s use Rhino to process the noisy speech commands we just generated.
If you created your own custom context, don’t forget to change the
_context_path attribute accordingly in
Process the audio files mixed with cafe noise and kitchen noise.
python benchmark/benchmark.py --engine_type PICOVOICE_RHINO --noise cafe
python benchmark/benchmark.py --engine_type PICOVOICE_RHINO --noise kitchen
The Python script creates a Rhino instance, processes each speech command, compares the detected intent and slot values with the provided labels, and outputs Rhino’s accuracy (measured by command acceptance rate) at each SNR. Processing all files (roughly 620 recordings × 7 SNR levels ≈ 4,340) took approximately 10 minutes (on the order of 100 milliseconds per file) on a mid-range laptop.
Let’s evaluate how well Rhino performed. We’re going to measure accuracy in terms of Command Acceptance Rate (CAR). A command is accepted if the intent and all of the slots are understood and detected correctly; if any error occurs in understanding the expression, intent, or slot values, the command is rejected. CAR is the ratio of accepted commands to the total number of commands. Here's an example of a command that was rejected:
Can I have a twenty ounce double shot cappuccino with a bit of brown sugar?
Rhino misunderstood the number of shots, returning triple shot instead of double shot. Thus, this command was rejected.
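The acceptance rule and the per-SNR tally can be sketched as follows. This is an illustration under stated assumptions rather than the code in benchmark/benchmark.py: the dictionary layout of labels and predictions, the use of `None` for a command the engine did not understand, and the function names `is_accepted` and `car_by_snr` are all hypothetical. The label below is also abbreviated and its slot strings are assumed.

```python
from collections import defaultdict


def is_accepted(expected, predicted):
    """Accept a command only if the intent and every slot value match the label."""
    if predicted is None:  # assumption: None means the engine did not understand
        return False
    return (predicted["intent"] == expected["intent"]
            and predicted["slots"] == expected["slots"])


def car_by_snr(results):
    """Map each SNR (dB) to its Command Acceptance Rate.

    `results` is an iterable of (snr_db, expected, predicted) tuples.
    """
    accepted = defaultdict(int)
    total = defaultdict(int)
    for snr_db, expected, predicted in results:
        total[snr_db] += 1
        accepted[snr_db] += is_accepted(expected, predicted)
    return {snr: accepted[snr] / total[snr] for snr in total}


# Abbreviated, illustrative label for the rejected command above.
label = {
    "intent": "orderDrink",
    "slots": {
        "size": "twenty ounce",
        "numberOfShots": "double shot",
        "coffeeDrink": "cappuccino",
    },
}
# Same prediction, except the shot count was misrecognized.
wrong = {
    "intent": "orderDrink",
    "slots": {**label["slots"], "numberOfShots": "triple shot"},
}

print(car_by_snr([(24, label, label), (6, label, label), (6, label, wrong)]))
# prints {24: 1.0, 6: 0.5}
```

Because the comparison is all-or-nothing, a single wrong slot value rejects the whole command, which makes CAR a stricter measure than per-slot accuracy.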
For the full results, below is a graph showing Rhino's detection accuracy at different SNRs in cafe and kitchen environments:
The y-axis indicates accuracy measured in terms of Command Acceptance Rate (CAR). The red and green lines show CAR at each background noise intensity level (SNR) for the kitchen and cafe noise environments respectively, and the dotted blue line shows the average of the two.
Rhino achieves 98.5% and 93% accuracy at 24 dB and 6 dB SNR, respectively. It retains high accuracy even in noisy environments.
You may wonder how Rhino's accuracy and performance compare to other natural language understanding engines. In subsequent articles, we will compare benchmark results against other on-device and cloud-based NLU implementations. A separate article benchmarks and compares performance results against the Google Dialogflow cloud-based NLU solution.
To learn more about the Rhino Speech-to-Intent engine, see the Rhino documentation.