Benchmarking a Speech-to-Intent Context against Google Dialogflow

  • Rhino
  • Speech-to-Intent
  • Google Dialogflow
  • Speech recognition
  • Local commands

Introduction

Skill Level

  • Basic knowledge of Python and the command line
  • No machine learning knowledge needed

Prerequisites

  • Ubuntu x86 (Tested on version 18.04)
  • Python 3 and pip installed locally
  • A Google Account

In a previous benchmark tutorial, we demonstrated how to benchmark the accuracy of an example Rhino™ speech-to-intent context designed to control a coffee maker by voice.

In this tutorial, we will continue our investigation and compare the performance of Picovoice’s domain-specific natural language understanding (NLU) engine, Rhino™ Speech-to-Intent, against Google Dialogflow.

Google Dialogflow is a cloud-based natural language understanding engine for building chatbots. You train an agent by providing examples of what a user might say when interacting with it; the agent then analyzes utterances and infers the user's intent from that previously provided knowledge. Behind the scenes, Dialogflow uses Google Speech-to-Text to first transcribe the speech and then analyzes the extracted text to derive the user's intent.

For comparison, we will use the same example context that models a user's interaction with a voice-enabled coffee maker. For the sake of brevity, we are not repeating the steps to design and train the context model file for Rhino using the Picovoice Console. Instead, we will use the results: the trained model file and exported config/setup files.

Picovoice Console

Test speech dataset

For benchmarking, we recorded sample audio commands from 50 unique speakers. It is important to use speech from a variety of speakers for an unbiased result. Each speaker contributed about 10-15 different commands. Collectively 619 audio recordings are used in this benchmark. Here is a sample clean speech command:

We are going to mix our clean audio with background noise samples representing cafe and kitchen environments to create noisy speech commands. These samples were obtained from Freesound.

We are then going to detect the intent and slots from these noisy speech commands. The only intent here is “orderDrink”. The slots include “coffeeDrink”, “numberOfShots”, “roast”, “size”, “milkAmount”, and “sugarAmount”. There is a fixed set of possible values for each slot. Here’s an example of the intent and slots detected in an utterance:

"Can I have a small light roast house coffee with some cream?"

{
  intent: "orderDrink",
  slots: {
    milkAmount: "cream",
    coffeeDrink: "house coffee",
    roast: "light roast",
    size: "small"
  }
}

Getting Started

We highly recommend completing this tutorial in a virtual environment. The code and the audio files used in this tutorial are available on GitHub.

To begin, we will first need to clone the speech to intent benchmark repository and its submodules from GitHub:

git clone --recurse-submodules https://github.com/Picovoice/speech-to-intent-benchmark.git

Install dependencies including numpy, soundfile, dialogflow, and matplotlib using pip:

pip install -r requirements.txt

Next, install libsndfile using the package manager:

sudo apt-get install libsndfile1

Mix Clean Speech Data with Noise

It is important to evaluate detection accuracy in the presence of ambient noise, as it more accurately represents real-world scenarios. We selected two background noise audio samples, representing cafe and kitchen environments, and will mix the clean speech audio files with them at different intensity levels.

First, make sure you are in the root directory. Then, mix clean speech data files under speech/clean with noise from two different noise environments, cafe and kitchen.

python benchmark/mixer.py cafe
python benchmark/mixer.py kitchen

This creates noisy speech commands by mixing background noise into the clean audio files at signal-to-noise ratios (SNRs) of 24, 21, 18, 15, 12, 9, and 6 decibels (dB). The generated audio files will appear under /data/speech/.

The lower the SNR, the noisier the audio data. For example, an SNR of 0 dB means that the signal and noise levels are equal; it's difficult for algorithms and almost impossible for humans to understand speech in such an environment. At 3 dB, the signal power is roughly double the noise power. Industry-standard stress testing for voice assistants is performed inside sound chambers at 5-6 dB SNR, which emulates a typical noisy household: the TV is on and some kitchen appliances are running, at a moderate distance.
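If you're curious how a target SNR is achieved, the noise can be rescaled before being added so that the ratio of speech power to noise power matches the desired level. Below is a minimal sketch of this general technique; it is not the exact implementation in benchmark/mixer.py, and the file paths are placeholders.

import numpy as np
import soundfile as sf

def mix_at_snr(speech, noise, snr_db):
    # Repeat or trim the noise so it covers the whole utterance.
    if len(noise) < len(speech):
        noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
    noise = noise[:len(speech)]

    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)

    # Scale the noise so that 10 * log10(speech_power / scaled_noise_power)
    # equals the requested SNR in dB.
    gain = np.sqrt(speech_power / (noise_power * (10 ** (snr_db / 10))))
    return speech + gain * noise

speech, sample_rate = sf.read('data/speech/clean/example.wav')  # placeholder path
noise, _ = sf.read('data/noise/cafe.wav')                       # placeholder path
sf.write('example_cafe_6db.wav', mix_at_snr(speech, noise, 6), sample_rate)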

To get a feel for what a 24 dB SNR sounds like, listen to these sample noisy speech commands:

These audio samples are almost as clear as the clean speech command from the Test Speech Dataset section, but the noise is still noticeable.

The SNR for the next two audio files is 6 dB. Notice how much noisier these speech commands are:

Process utterances with Picovoice Rhino

We processed the audio files of sample utterances with Rhino and analyzed the results in detail in the previous article. For brevity, we are only including the end result below.

Detection accuracy in presence of background noise with varying intensity levels

The y-axis indicates accuracy measured in terms of Command Acceptance Rate (CAR). A command is accepted if the intent and all of the slots are understood and detected correctly. If any error occurs in understanding the expression, intent, or slot values, the command is rejected. The blue line shows the CAR at each background noise intensity level (SNR) averaged over cafe and kitchen noise environments.
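To make the acceptance criterion concrete, here is a small sketch of how CAR can be computed. It illustrates the metric itself rather than reproducing the benchmark repository's code, and the example labels are made up.

def is_accepted(expected, inferred):
    # A command is accepted only if the intent matches and the inferred
    # slots are exactly the expected ones (no missing or extra values).
    return (expected['intent'] == inferred['intent']
            and expected['slots'] == inferred['slots'])

# Made-up example: the inferred result misses the 'size' slot, so it is rejected.
results = [
    (
        {'intent': 'orderDrink', 'slots': {'coffeeDrink': 'latte', 'size': 'small'}},
        {'intent': 'orderDrink', 'slots': {'coffeeDrink': 'latte'}},
    ),
]

car = sum(is_accepted(expected, inferred) for expected, inferred in results) / len(results)
print(f'Command acceptance rate: {car:.0%}')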

Later we will compare these results with those from Google Dialogflow's attempt to process the same commands from the speech dataset.
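For reference, here is a rough sketch of how a single command file can be run through Rhino with the pvrhino Python package. The API shown reflects recent SDK versions (which require a Picovoice AccessKey) and may differ from the version pinned in the benchmark repository; the paths and key are placeholders.

import pvrhino
import soundfile as sf

# Placeholder AccessKey and context path; older SDK versions (as used at the
# time of this benchmark) did not require an AccessKey.
rhino = pvrhino.create(
    access_key='${PICOVOICE_ACCESS_KEY}',
    context_path='coffee_maker_linux.rhn')

pcm, sample_rate = sf.read('data/speech/clean/example.wav', dtype='int16')
assert sample_rate == rhino.sample_rate

# Feed the audio to Rhino frame by frame until it finalizes an inference.
for start in range(0, len(pcm) - rhino.frame_length + 1, rhino.frame_length):
    if rhino.process(pcm[start:start + rhino.frame_length]):
        inference = rhino.get_inference()
        if inference.is_understood:
            print(inference.intent, inference.slots)
        break

rhino.delete()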

Process utterances with Dialogflow

To evaluate Google Dialogflow's ability to process the speech commands, we will build two agents, train them with different numbers of sample utterances (called “training phrases” in Dialogflow), and investigate how the size of the training set affects the detection accuracy of each agent. To begin, sign up for Dialogflow here using your Google Account and create a new agent called ‘barista_50’.

Agent setup

We are going to use previously exported zip files that contain the information to create and train each agent. Each zip file contains JSON files that define our agent, an “intents” folder, and an “entities” folder. The “orderDrink” intent and the sample utterances used to train the agent are both found in the “intents” folder. Each slot type and its possible slot values are found in the “entities” folder.

Our files are barista_50.zip and barista_432.zip. The only difference between the two zip files is that barista_50 contains a random subset of 50 unique sample utterances, whereas barista_432 contains all 432 unique sample utterances that we expect to trigger the “orderDrink” intent. The entities are identical between the two zip files.

Now import the barista_50.zip file found under speech-to-intent-benchmark/data/dialogflow.

Import agent file

Authentication setup

Now that we've finished building our agent, we need to set up authentication to be able to call the Dialogflow API. Follow Dialogflow’s “Setting up Authentication” documentation. You will need the path to the file that contains your service account key, as well as your project ID, to call the API.

Process files

Run the noisy spoken commands through the Dialogflow API:

python benchmark/benchmark.py --engine_type GOOGLE_DIALOGFLOW \
    --gcp_credential_path ${GOOGLE_CLOUD_PLATFORM_CREDENTIAL_PATH} \
    --gcp_project_id ${GOOGLE_CLOUD_PLATFORM_PROJECT_ID} \
    --noise cafe

python benchmark/benchmark.py --engine_type GOOGLE_DIALOGFLOW \
    --gcp_credential_path ${GOOGLE_CLOUD_PLATFORM_CREDENTIAL_PATH} \
    --gcp_project_id ${GOOGLE_CLOUD_PLATFORM_PROJECT_ID} \
    --noise kitchen

Repeat the above steps for the second agent, but this time import the barista_432.zip file instead of the barista_50.zip file.
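For reference, a single detect-intent request against the Dialogflow API looks roughly like the following with the dialogflow Python client. This is a hedged sketch, not the benchmark script's implementation; the project ID, session ID, file path, and sample rate are assumptions.

import os

import dialogflow

# The client reads the service account key from this environment variable.
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = '/path/to/service_account_key.json'

session_client = dialogflow.SessionsClient()
session = session_client.session_path('your-gcp-project-id', 'benchmark-session')

with open('data/speech/cafe/example_6db.wav', 'rb') as f:  # placeholder path
    input_audio = f.read()

audio_config = dialogflow.types.InputAudioConfig(
    audio_encoding=dialogflow.enums.AudioEncoding.AUDIO_ENCODING_LINEAR_16,
    language_code='en-US',
    sample_rate_hertz=16000)
query_input = dialogflow.types.QueryInput(audio_config=audio_config)

response = session_client.detect_intent(
    session=session, query_input=query_input, input_audio=input_audio)

print(response.query_result.intent.display_name)  # e.g. "orderDrink"
print(response.query_result.parameters)           # detected slot values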

Results

In this section, we compare the results we gathered from running both engines. In particular, we are going to look at accuracy, latency, and processing fees.

Accuracy is measured as the percentage of correctly understood speech commands out of all commands (i.e. the Command Acceptance Rate, or CAR). Misunderstood speech commands are those with an incorrect or missing intent or slot values. Let’s look at an example:

"get me a twelve ounce double shot light roast latte with sugar"

Feeding this speech command to the Rhino Speech-to-Intent engine provides us with the following result:

{
  intent: "orderDrink",
  slots: {
    size: "twelve ounce",
    sugarAmount: "sugar",
    numberOfShots: "double shot",
    coffeeDrink: "latte",
    roast: "light roast"
  }
}

The result identifies the intent and all of the slot values correctly, and it is accepted. However, Dialogflow yields the following result to the same speech command input:

{
  "intent": "orderDrink",
  "slots": {
    "size": "twelve ounce",
    "sugarAmount": "sugar",
    "numberOfShots": "double shot"
  }
}

It turns out that Dialogflow detected the intent correctly but only some of the slot values; it missed the coffeeDrink and roast values. Therefore, the result is rejected.

A deeper look into the activity logs reveals why this is happening. Under the hood, Dialogflow uses Google Speech-to-Text to transcribe the input speech audio into text. It then processes the transcript with its natural language understanding engine. In this particular case, the transcription includes several errors:

"get me a 12-ounce double shot by proslot table sugar"

These transcription errors propagate through the NLU stage that follows and ultimately produce the errors we saw in the output.

Accuracy

Below is a graph comparing the detection accuracy of Rhino and Dialogflow where we’ve averaged their respective command acceptance rates (CAR) on noisy commands mixed with cafe and kitchen noise. The purple and red lines show CAR for Google Dialogflow agents that are trained with 432 and 50 sample utterances, respectively.

Detection accuracy in presence of background noise with varying intensity levels

These results show that Rhino can be significantly more accurate than Dialogflow. We can see that Rhino can reach up to 98% accuracy, whereas Dialogflow can only reach up to 82% accuracy. Additionally, we see that the number of sample utterances that we used for training had little effect on the accuracy of our Dialogflow agents.

Latency

Dialogflow took about 50 minutes to process 620 requests for each noise environment, which averages to almost 5 seconds per API call. The total processing time for all 7 noise environments was 350 minutes. In contrast, it took approximately 10 minutes for Rhino to finish processing all commands locally on a mid-range consumer laptop.

Cost

We used the standard edition of Google Dialogflow in this tutorial. Although the standard edition is free, Google limits the number of API calls to 1,000 per day, which won't be sufficient beyond a prototype. At the time of writing, Dialogflow's enterprise edition charges roughly $0.008 per API call (including audio and text processing). If we had used the enterprise edition for this tutorial, it would have cost 620 x 7 x $0.008 = $34.72 per agent, or $69.44 for benchmarking both agents. In contrast, Rhino Speech-to-Intent does not incur a fee per user interaction; instead, the cost is bounded per device install.

Other considerations

Dialogflow has an option called “fuzzy matching”, a type of parameter extraction that matches an entity approximately rather than exactly. On average, we saw a 2% increase in accuracy with this option enabled when testing the 24 dB speech commands. Although this is not a significant increase, it may be worth considering.