Tutorial: Benchmarking Rhino Speech-to-Intent against Amazon Lex

  • Rhino
  • Speech-to-Intent
  • Amazon Lex
  • AWS
  • Voice commands
  • On-device voice recognition

Introduction

Skill Level

  • Basic knowledge of Python and the command line
  • No machine learning knowledge needed

Prerequisites

  • Ubuntu x86 (Tested on version 18.04)
  • Python 3 and pip installed locally
  • An AWS account or an IAM user with AWS Management Console permissions

In an earlier tutorial, we demonstrated how to benchmark and objectively measure the accuracy of an example Rhino™ speech-to-intent context designed for a voice-enabled coffee maker.

We’re going to continue our analysis in this tutorial and compare the performance of Picovoice’s domain-specific natural language understanding (NLU) engine, Rhino™ Speech-to-Intent, with the Amazon Lex cloud-based NLU solution.

Amazon Lex is an AWS service that enables developers to build natural language chatbots. For speech inputs, Amazon Lex uses Amazon Transcribe behind the scenes to convert speech to text, and then processes the text to understand the user's intent. Lex uses the knowledge learned from sample utterances provided during the training phase to detect the user intent and generate a response.
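To make this concrete, here is a minimal sketch of sending one audio command to a published Lex bot with boto3. This is not the benchmark's actual code; the bot name, alias, and file name are placeholders:

import json

import boto3
import soundfile

client = boto3.client('lex-runtime')

# Lex expects raw 16 kHz, 16-bit, mono PCM for this content type, so decode
# the WAV container first ('command.wav' is a hypothetical file).
pcm, sample_rate = soundfile.read('command.wav', dtype='int16')

response = client.post_content(
    botName='barista',      # placeholder bot name
    botAlias='prod',        # placeholder alias
    userId='benchmark-user',
    contentType='audio/l16; rate=16000; channels=1',
    accept='text/plain; charset=utf-8',
    inputStream=pcm.tobytes(),
)

slots = response.get('slots') or {}
if isinstance(slots, str):  # slots may arrive as a JSON string
    slots = json.loads(slots)
print(response.get('intentName'), slots)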

For comparison, we will use the same example context that models a user's interaction with a voice-enabled coffee maker. For the sake of brevity, we are not repeating the steps to design and train the context model file for Rhino using the Picovoice Console. Instead, we will use the results: the trained model file and exported config/setup files.


Test speech dataset

Our speech dataset consists of a total of 619 sample audio commands from 50 unique speakers, each of whom contributed about 10-15 different commands. We hired freelance voice-over professionals with a variety of accents online for the data collection task. For objective results, it’s important to gather audio from a range of voices that represents a sample of the user population. Here is an example of a clean speech command:

To better simulate real-world settings, we will create noisy voice samples by mixing our clean speech audio with background noise. Since a coffee maker is typically used in a kitchen or coffee shop, we chose background noises representing cafe and kitchen environments (obtained from Freesound).

We are then going to detect the “orderDrink” intent and its slots, including “coffeeDrink”, “numberOfShots”, “roast”, “size”, “milkAmount”, and “sugarAmount”, from these noisy speech commands. Here’s an example of the intent and slots detected in an utterance:

"I’d like a dark roast medium coffee with a little bit of brown sugar and almond milk."

{
  "intent": "orderDrink",
  "slots": {
    "milkAmount": "almond milk",
    "sugarAmount": "a little bit of brown sugar",
    "coffeeDrink": "coffee",
    "roast": "dark roast",
    "size": "medium"
  }
}
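For reference, this is what obtaining the same kind of inference looks like on-device with the pvrhino Python package. A minimal sketch, assuming a trained context file named coffee_maker.rhn; the AccessKey is a placeholder, and the API shown is the current pvrhino one, which may differ from the version this benchmark was originally written against:

import pvrhino
import soundfile

# 'coffee_maker.rhn' and the AccessKey are placeholders.
rhino = pvrhino.create(access_key='${ACCESS_KEY}', context_path='coffee_maker.rhn')

audio, _ = soundfile.read('command.wav', dtype='int16')  # 16 kHz, 16-bit, mono

# Feed the audio to Rhino one frame at a time until it finalizes an inference.
for start in range(0, len(audio) - rhino.frame_length, rhino.frame_length):
    if rhino.process(audio[start:start + rhino.frame_length]):
        inference = rhino.get_inference()
        if inference.is_understood:
            print(inference.intent, inference.slots)
        break

rhino.delete()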

Getting Started

We recommend following this tutorial in a Python virtual environment. The code and the audio files used in this tutorial are available on GitHub. To begin, we will first need to clone the speech-to-intent benchmark repository and its submodules from GitHub:

git clone --recurse-submodules https://github.com/Picovoice/speech-to-intent-benchmark.git

The command above clones over HTTPS, which requires no authentication for this public repository; if you prefer SSH, make sure you have SSH keys set up with your GitHub account. Next, install the dependencies (including numpy, soundfile, boto3, and matplotlib) using pip:

pip install -r requirements.txt

Then, install libsndfile using apt:

sudo apt-get install libsndfile1

Mix Clean Speech Data with Noise

To more accurately represent real-world situations, we want to evaluate detection accuracy in the presence of ambient noise. To do so, we’re going to mix the clean speech audio files with our cafe and kitchen noise samples at different intensity levels. First, make sure you are in the root of the repository. Then, mix the clean speech files under /data/speech/clean with noise from the two noise environments, cafe and kitchen:

python benchmark/mixer.py cafe
python benchmark/mixer.py kitchen

These commands mix background noise with the clean audio files at signal-to-noise ratios (SNRs) of 24, 21, 18, 15, 12, 9, and 6 decibels (dB) and save the generated audio under /data/speech/.
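Conceptually, mixing at a target SNR amounts to scaling the noise so that the speech-to-noise power ratio hits the desired value. Here is a minimal sketch of the idea (not the repository's mixer.py, and with hypothetical file paths):

import numpy as np
import soundfile

def mix_at_snr(speech_path, noise_path, snr_db, out_path):
    # A sketch of SNR-controlled mixing; real code would also guard against clipping.
    speech, sample_rate = soundfile.read(speech_path)
    noise, _ = soundfile.read(noise_path)

    noise = np.resize(noise, speech.shape)  # loop/trim the noise to the speech length

    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)

    # Scale the noise so that 10 * log10(speech_power / scaled_noise_power) == snr_db.
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    soundfile.write(out_path, speech + scale * noise, sample_rate)

mix_at_snr('speech.wav', 'cafe.wav', 6, 'speech_cafe_6db.wav')  # hypothetical paths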

The lower the SNR, the noisier the audio data. For example, an SNR of 0 dB means that the signal and noise levels are equal; it’s difficult for algorithms, and almost impossible for humans, to understand speech in such an environment. At 3 dB, the signal power is double the noise power. Industry-standard stress testing for home assistants is performed inside sound chambers at 5-6 dB SNR, emulating a typical noisy household with the TV on and some kitchen appliances running at a moderate distance. Listen to these sample noisy speech commands to get a feel for what 6 dB SNR sounds like:

The following sample noisy speech commands are at 24 dB SNR. The noise is noticeable, but significantly less prominent than in the 6 dB SNR samples.

Process utterances with Picovoice Rhino

In a previous article, we processed the audio files of sample utterances with Rhino and analyzed the results. For brevity, we include only the end result below.

Detection accuracy in presence of background noise with varying intensity levels

The blue line shows the accuracy measured in terms of Command Acceptance Rate (CAR) at each background noise intensity level (SNR) averaged over cafe and kitchen noise environments. A command is accepted if the intent and all of the slots are understood and detected correctly. The command is rejected if any error occurs in understanding the expression, intent, or slot values.
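In code, this acceptance test boils down to a strict comparison between the expected and detected inferences. A minimal sketch, where expected and detected are hypothetical lists of dicts shaped like the earlier example:

def is_accepted(expected_inference, detected_inference):
    # Accept only if the intent and every slot value match exactly.
    return (
        detected_inference is not None
        and detected_inference.get('intent') == expected_inference['intent']
        and detected_inference.get('slots') == expected_inference['slots']
    )

matches = [is_accepted(e, d) for e, d in zip(expected, detected)]
car = 100.0 * sum(matches) / len(matches)  # Command Acceptance Rate, in percent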

Later, we will process the same commands with Amazon Lex and compare our results.

Process Files with Amazon Lex

Unlike Rhino, which takes a syntactic approach to formulating a domain-specific NLU, Amazon Lex is a statistical engine. To build a context with Rhino, you provide expressions and use syntactic rules to represent variations of those expressions. Amazon Lex accepts no such compact rules; instead, you need to provide a set of sample utterances for training an agent. We’re going to build two bots (or agents), train them with different numbers of sample utterances, and investigate how the size of the training set affects the detection accuracy of the bot. To begin, you need an AWS account.

Setup Amazon Lex Bots

We are going to use the previously exported zip files barista_50.zip and barista_432.zip to create and train each bot.

Each barista zip file contains a JSON file that defines our “barista” bot. It contains our “orderDrink” intent along with the sample utterances used to train the bot. It also lists each slot type and its possible slot values. barista_50.zip trains the bot on a random subset of 50 unique sample utterances, whereas barista_432.zip contains 432 unique training sample utterances that trigger the “orderDrink” intent.
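For orientation, a Lex V1 export JSON is shaped roughly like the following. This is abridged and illustrative, reconstructed from memory of the export format rather than the exact contents of the zips:

{
  "metadata": {
    "schemaVersion": "1.0",
    "importType": "LEX",
    "importFormat": "JSON"
  },
  "resource": {
    "name": "barista",
    "intents": [
      {
        "name": "orderDrink",
        "sampleUtterances": ["I would like a {size} {coffeeDrink}"],
        "slots": [
          {
            "name": "coffeeDrink",
            "slotType": "coffeeDrink",
            "slotConstraint": "Optional"
          }
        ]
      }
    ],
    "slotTypes": [
      {
        "name": "coffeeDrink",
        "enumerationValues": [{"value": "coffee"}, {"value": "cappuccino"}]
      }
    ]
  }
}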

In the Amazon Lex console, import barista_50.zip from /data/amazonlex.

Import Amazon Lex bot

Then, build and publish your bot with an alias name of your choosing by following this guide.

AWS Authentication setup

Follow these steps to create access keys for your IAM user.

Then, follow this guide to set up authentication credentials either by using AWS CLI or by creating the credential file.
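For reference, the credentials file lives at ~/.aws/credentials and, together with a region setting in ~/.aws/config, looks like the following (substitute your own keys and a region of your choosing):

# ~/.aws/credentials
[default]
aws_access_key_id = YOUR_ACCESS_KEY_ID
aws_secret_access_key = YOUR_SECRET_ACCESS_KEY

# ~/.aws/config
[default]
region = us-east-1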

Process files

Run the noisy speech commands through Amazon Lex:

python benchmark/benchmark.py --engine_type AMAZON_LEX --noise cafe
python benchmark/benchmark.py --engine_type AMAZON_LEX --noise kitchen

Repeat the last three steps, but this time import barista_432.zip instead of barista_50.zip. Proceed only when you are completely finished with the first bot, because importing barista_432.zip will overwrite any existing intents and slot types.

Results

In this section, we compare the results gathered from running both engines and look at accuracy, latency, and processing fees. Let’s look at an example command:

“brew a light roast double shot cappuccino with brown sugar”

When given this command in an utterance at 6 dB SNR, Amazon Lex detects only the “sugarAmount” and “roast” slot values:

{
  "intent": "orderDrink",
  "slots": {
    "sugarAmount": "brown sugar",
    "roast": "light roast"
  }
}

A deeper look at the speech-to-text output Lex produces before the NLU step reveals several transcription errors:

"light roast duck with brown sugar"

These transcription errors result in the “numberOfShots” and “coffeeDrink” slot values not being captured.

Accuracy

Below is a graph comparing the performance of Rhino and Amazon Lex where we’ve averaged their respective command acceptance rates (CAR) on noisy commands mixed with cafe and kitchen noise.

Comparing Picovoice Rhino and Amazon Lex detection accuracy in presence of background noise with varying intensity levels

These results show that Rhino can be significantly more accurate than Amazon Lex: Rhino consistently achieves close to 98% accuracy, whereas Amazon Lex reaches at most 87%.

We also see that the number of sample utterances in the training set makes a significant difference for Amazon Lex, unlike what we saw with Google Dialogflow in a previous tutorial. With only 50 sample utterances, Amazon Lex plateaus at around 55% accuracy; with all 432 sample utterances included, accuracy increases to up to 87%.

Latency

It took approximately 45 minutes for Amazon Lex to process the 619 requests at each noise level, which averages out to a little over 4 seconds per API call. The total processing time for all 7 noise levels was 315 minutes. In contrast, it took approximately 10 minutes for Rhino to process all of the commands locally on a mid-range consumer laptop.

Cost

There are quota limits, fees, and restrictions associated with using Amazon Lex. Starting from the date you first use Amazon Lex, you can process up to 5,000 speech requests per month for free during the first year. At the time of writing, each voice API call costs $0.00475 beyond this threshold. Thus, it costs about $15 to process all 619 speech requests at 7 different signal-to-noise ratios in 2 different noise environments. In contrast, Rhino Speech-to-Intent does not incur a fee per user interaction; instead, the cost is bounded per device install.

Other considerations

Amazon Lex has two options for Slot Resolution: “Restrict to Slot Values and Synonyms” and “Expand Values.” We obtained the results above with Slot Resolution set to the first option, which fills a slot only when the user input exactly matches one of the provided slot values. The second option fills a slot when the value in the input utterance is merely similar to the provided values. We did not observe a significant increase in accuracy with “Expand Values” enabled, but this option may be useful depending on your use case.
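If you want to experiment with this setting programmatically, it maps to the valueSelectionStrategy field on each slot type in the Lex V1 model-building API: to the best of our knowledge, TOP_RESOLUTION corresponds to “Restrict to Slot Values and Synonyms” and ORIGINAL_VALUE to “Expand Values.” A minimal sketch with boto3, where the slot type name is a placeholder:

import boto3

models = boto3.client('lex-models')

# Fetch the current definition, then re-upload it with the other strategy.
slot_type = models.get_slot_type(name='coffeeDrink', version='$LATEST')
models.put_slot_type(
    name='coffeeDrink',
    checksum=slot_type['checksum'],
    enumerationValues=slot_type['enumerationValues'],
    valueSelectionStrategy='ORIGINAL_VALUE',  # i.e., "Expand Values"
)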

Comparison with other solutions

In a previous tutorial, we benchmarked the performance of Google Dialogflow NLU and compared it against Rhino. To put things in perspective, the chart below compares the command acceptance rates of Picovoice Rhino, Google Dialogflow, and Amazon Lex.

Comparing Picovoice Rhino, Google Dialogflow, and Amazon Lex detection accuracy in presence of background noise with varying intensity levels

Both the Amazon Lex and Google Dialogflow agents were trained on the same set of 432 sample utterances. Amazon Lex delivers 87% accuracy in the least noisy condition (24 dB SNR), while Google Dialogflow’s CAR is 82%. Both engines’ accuracy deteriorates at almost the same rate as the noise level increases.