Tutorial: How to Benchmark a Rhino Context

  • VUI
  • NLU
  • Speech to Intent
  • Picovoice Rhino

Introduction

Skill Level

  • Basic Python and knowledge of the command line
  • No machine learning knowledge needed

Prerequisites

  • A modern web browser
  • A working microphone
  • Ubuntu x86 (Tested on v18.04)
  • Python 3 and pip installed locally

In Voice User Interfaces (VUI), accuracy is essential. A rigorous and reproducible way of testing accuracy can be achieved by recording utterances from a variety of speakers with diverse voices and replaying them together with environment noise in a controlled fashion. This way of benchmarking provides objectivity and confidence in the accuracy of a VUI.

In this tutorial, we will build a VUI for a voice-enabled coffee maker and benchmark the underlying technology: the Picovoice Rhino™ Speech-to-Intent engine. Rhino extracts intent directly from spoken utterances—without the error-prone and expensive Speech-to-Text intermediate step typically paired with natural language understanding (NLU). This direct approach allows for dramatically better accuracy and run-time efficiency.

Rhino requires you to express a targeted domain of interest, referred to as a "context." The Picovoice Console allows you to design the context and test it in-browser as you iterate. Once you have finalized your design, you can train and export your context to run it offline on the targeted platform.

Test Speech Dataset

In our benchmark, we recorded sample audio commands from 50 unique speakers. Each speaker contributed about 10-15 different commands. Collectively, 619 audio recordings are used in this benchmark. We sourced these audio commands by hiring voice actors. Alternatively, you can ask your colleagues or friends to each contribute a set of sample utterances, or crowd-source utterances through services like Amazon Mechanical Turk. For benchmarking, you should use audio files recorded from diverse speakers with different accents that best represent the population of the target end users.

Here is a sample of one of our clean speech commands:

Give me an eight ounce triple shot espresso with a little bit of sweetener

Creating a Context

If you’d like to skip this section, you can use the pre-trained Coffee Maker context model located here, and skip ahead to Getting Started with Rhino.

For a more in-depth tutorial on how to use Picovoice Console to create a Speech-to-Intent context from scratch, check out this tutorial.

Picovoice Console Account

Before we begin, you will need a Picovoice Console account. Personal accounts are available for free, and Enterprise accounts are available with a company email address under a 30-day free trial. You need an Enterprise account to access the advanced Speech-to-Intent context templates, which we use to train our coffee maker model here.

After you create an Enterprise account, log in to the Picovoice Console and go to the Speech-To-Intent page.

Select the Coffee Maker (Advanced) template from the drop-down menu under Template, set a name (for example, “CoffeeMakerTemplate”), and create the context. Then, click on the created context to go to the context editor, where you can see the complete coffee maker context we have designed.

It includes an “orderDrink” intent and the slots “coffeeDrink”, “numberOfShots”, “roast”, “size”, “milkAmount”, and “sugarAmount”. The intent has many expressions that include some of these slots. If any of these expressions occurs in a spoken utterance, the “orderDrink” intent is returned along with the detected slot values. This context uses 48 expressions that capture many different ways of interacting with the coffee maker.
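To make the intent and slot structure concrete, here is a minimal sketch of how a ground-truth label for one benchmark utterance could be represented and checked against the context. This is plain Python, not Rhino's API: the dictionary layout, the validate_label helper, and the slot value strings are assumptions made for illustration. Only the intent and slot names come from the context described above.

```python
# Intent and slot names taken from the coffee maker context described above.
CONTEXT_INTENT = "orderDrink"
CONTEXT_SLOTS = {
    "coffeeDrink", "numberOfShots", "roast", "size", "milkAmount", "sugarAmount",
}


def validate_label(label):
    """Raise ValueError if a label uses an intent or slot the context lacks."""
    if label["intent"] != CONTEXT_INTENT:
        raise ValueError(f"unknown intent: {label['intent']}")
    unknown = set(label["slots"]) - CONTEXT_SLOTS
    if unknown:
        raise ValueError(f"unknown slots: {sorted(unknown)}")
    return label


# Hypothetical label for "Give me an eight ounce triple shot espresso with a
# little bit of sweetener". The slot value strings are illustrative only.
sample_label = validate_label({
    "intent": "orderDrink",
    "slots": {
        "size": "eight ounce",
        "numberOfShots": "triple shot",
        "coffeeDrink": "espresso",
        "sugarAmount": "a little bit of sweetener",
    },
})
print(sorted(sample_label["slots"]))
```

An utterance only needs to fill the slots that its expression mentions, which is why the label above omits “roast” and “milkAmount”.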

Figure: Loading the Coffee Maker template

Test and Submit for Training

Save your context, and try it out in the browser using the microphone button in the right column.

Once you have finished testing, submit your context to be trained and choose “Linux (x86_64)” as your Model Platform. When training is finished, we can use the resulting context model file with Rhino.

Getting Started with Rhino

We highly recommend completing this tutorial in a virtual environment. The code and the audio files used in this tutorial are made available here on GitHub. Let's first clone the speech to intent benchmark repository and its submodules:

git clone --recurse-submodules https://github.com/Picovoice/speech-to-intent-benchmark.git

Install dependencies including numpy, soundfile, and matplotlib using pip:

pip install -r requirements.txt

Next, install libsndfile using the package manager:

sudo apt-get install libsndfile1

Mix Clean Speech Data with Noise

Real-world scenarios usually have ambient noise. For example, in a cafe environment, there are noises from conversations, machines, dishes, etc. To simulate such a noisy environment, we’re going to add two background noise audio samples representing cafe and kitchen environments. These noise samples were downloaded from Freesound. You can record or source other types of background noise applicable to your end application environment.

We are going to mix clean speech audio files with these two noises. To do that, first make sure you are in the root directory of the repository. Then, run the following commands:

python benchmark/mixer.py cafe
python benchmark/mixer.py kitchen

This creates noisy speech commands by mixing background noise with clean audio files (located here) at signal-to-noise ratios (SNRs) of 24, 21, 18, 15, 12, 9, and 6 decibels (dB). The lower the SNR, the noisier the audio data will be. These noisy audio files appear in new folders under /data/speech.

For example, an SNR value of 0 dB means that the signal level and noise level are equal. It is nearly impossible for humans to understand speech in such an environment. For home assistants, standard stress testing is performed inside sound chambers at 5 to 6 dB SNR, which emulates a typical noisy household environment with the TV on and some kitchen appliances running at a moderate distance.
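The idea of mixing at a target SNR can be sketched in a few lines. The example below is not the repository's mixer.py; it is a dependency-free illustration that assumes SNR is defined on average power, SNR(dB) = 10 * log10(P_speech / P_noise), and uses synthetic signals in place of real recordings. The function names power, mix_at_snr, and measured_snr_db are hypothetical.

```python
import math
import random


def power(samples):
    """Mean squared amplitude of a signal."""
    return sum(s * s for s in samples) / len(samples)


def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so the speech-to-noise power ratio is `snr_db`, then add it.

    Returns the mixed signal and the gain that was applied to the noise.
    """
    # SNR(dB) = 10 * log10(P_speech / P_noise)
    # => target P_noise = P_speech / 10^(SNR / 10)
    target_noise_power = power(speech) / (10 ** (snr_db / 10))
    gain = math.sqrt(target_noise_power / power(noise))
    mixed = [s + gain * n for s, n in zip(speech, noise)]
    return mixed, gain


def measured_snr_db(speech, scaled_noise):
    """SNR in dB computed from the speech and the already-scaled noise."""
    return 10 * math.log10(power(speech) / power(scaled_noise))


# Synthetic stand-ins for real recordings: one second of audio at 16 kHz.
rng = random.Random(0)
speech = [math.sin(2 * math.pi * 220 * t / 16000) for t in range(16000)]
noise = [rng.gauss(0.0, 0.3) for _ in range(16000)]

for snr_db in (24, 21, 18, 15, 12, 9, 6):
    mixed, gain = mix_at_snr(speech, noise, snr_db)
    achieved = measured_snr_db(speech, [gain * n for n in noise])
    print(f"target {snr_db:>2} dB -> achieved {achieved:.2f} dB")
```

A production mixer would also need to read and write audio files, match the noise length to each utterance, and guard against clipping after the sum; those steps are left out here.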

To get a feel for how these different noise levels sound, let's listen to some samples.

These next two audio files are the same clean speech commands from Test Speech Dataset after mixing with cafe noise and kitchen noise at an SNR of 24 dB. Although these files are almost as clear as the clean audio, the noise is still noticeable:

The SNR for these next two audio files is 6 dB. Notice how much noisier these speech commands are:

Process Files with Picovoice Rhino

Let’s use Rhino to process the noisy speech commands we just generated.

If you created your own custom context, remember to update the _context_path attribute accordingly in engine.py, located at /benchmark/engine.py.

Process the audio files mixed with cafe noise and kitchen noise:

python benchmark/benchmark.py --engine_type PICOVOICE_RHINO --noise cafe
python benchmark/benchmark.py --engine_type PICOVOICE_RHINO --noise kitchen

The Python script creates a Rhino instance, processes each speech command, compares the detected intent and slot values with the provided label values, and outputs Rhino’s accuracy (measured by command acceptance rate) at each SNR. Processing all files (620x7=4340) took approximately 10 minutes (on the order of 100 milliseconds per file) on a mid-range laptop.

Results

Let’s evaluate how well Rhino performed. We’re going to measure accuracy in terms of Command Acceptance Rate (CAR). A command is accepted if the intent and all of the slots are understood and detected correctly; if any error occurs in understanding the expression, intent, or slot values, the command is rejected. CAR is the ratio of accepted commands to the total number of commands. Here's an example of a command that was rejected:

Can I have a twenty ounce double shot cappuccino with a bit of brown sugar?

Rhino incorrectly understood the slot value for numberOfShots as triple shot instead of double shot. Thus, this command was rejected.
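The acceptance rule can be expressed as a short sketch. This is not the code in benchmark.py; it assumes each prediction and label is a plain dictionary with an "intent" string and a "slots" mapping, and that a command Rhino did not understand is represented as None. The helper names and the slot value strings are illustrative.

```python
def is_accepted(predicted, expected):
    """A command is accepted only if the intent and every slot match the label."""
    return (
        predicted is not None
        and predicted["intent"] == expected["intent"]
        and predicted["slots"] == expected["slots"]
    )


def command_acceptance_rate(results):
    """Ratio of accepted commands; `results` holds (predicted, expected) pairs."""
    accepted = sum(is_accepted(p, e) for p, e in results)
    return accepted / len(results)


# Hypothetical label for "Can I have a twenty ounce double shot cappuccino
# with a bit of brown sugar?"
label = {
    "intent": "orderDrink",
    "slots": {
        "size": "twenty ounce",
        "numberOfShots": "double shot",
        "coffeeDrink": "cappuccino",
        "sugarAmount": "a bit of brown sugar",
    },
}

# Same as the label except for the misheard number of shots.
wrong_shots = {
    "intent": "orderDrink",
    "slots": dict(label["slots"], numberOfShots="triple shot"),
}

results = [
    (wrong_shots, label),  # rejected: one slot value differs
    (label, label),        # accepted: intent and all slots match
]
print(command_acceptance_rate(results))  # 0.5
```

Because the comparison is all-or-nothing, a single wrong slot value counts the same as a completely misunderstood command, which makes CAR a strict measure.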

Accuracy

For the full results, below is a graph showing Rhino's detection accuracy at different SNRs in cafe and kitchen environments:

Picovoice Rhino V1.4

The y-axis indicates accuracy measured in terms of Command Acceptance Rate (CAR). The red and green lines show CAR at each background noise intensity level (SNR) for kitchen and cafe noise environments respectively, and the dotted blue line shows the average of the two.

Rhino achieves 98.5% and 93% accuracy at 24 dB and 6 dB SNR, respectively. It does exceptionally well, even in noisy environments.

What's next?

You may wonder how Rhino's accuracy and performance compare to other natural language understanding engines. In subsequent articles, we will compare benchmark results against other on-device and cloud-based NLU implementations. In a companion article, we benchmark and compare performance results against the Google Dialogflow cloud-based NLU solution.

Additional Resources

To learn more about the Rhino Speech-to-Intent engine, check out the Rhino SDK on GitHub.

You can create your own always-listening smart coffee maker (see our Barista demo) by combining Rhino with Porcupine, Picovoice’s wake word engine.

To see Rhino handle different contexts, try our additional demos such as Elevator and Conference Telephone.