Tutorial: Benchmarking Rhino Speech-to-Intent against Microsoft LUIS

Introduction

Skill level

  • Basic knowledge of Python and the command line
  • No machine learning knowledge needed

Prerequisites

  • Ubuntu x86 (Tested on version 18.04)
  • Python 3 and pip installed locally
  • Microsoft account or GitHub account

In a previous tutorial, we showed how to benchmark the accuracy of an example Rhino™ speech-to-intent context designed for a voice-enabled coffee maker.

We will continue our investigation in this tutorial and compare Picovoice’s domain-specific natural language understanding (NLU) engine, Rhino™ Speech-to-Intent, with Microsoft’s cloud-based NLU solution: LUIS.

Microsoft Azure has two Cognitive Services which can be used to detect intent and entities from speech. Azure Language Understanding (LUIS) is a cloud service that enables you to build custom language models that can extract intent and entity information from textual conversational phrases. You can use Azure Speech to Text to transcribe audio files and even create customizable models to improve accuracy on domain-specific speech.

Test speech dataset

Our speech dataset contains 619 sample audio commands from 50 unique speakers who each contributed about 10-15 different commands. We hired freelance voice-over professionals with different accents online for the data collection task. To ensure the benchmark results closely resemble in-the-field performance, it is important to gather audio from a variety of voices that constitute a representative sample of the target user population.

Here is an example of a clean speech command:

We will create noisy speech commands by mixing clean audio with background noise samples to simulate the real-world environment. Because a coffee maker is typically used in a kitchen or coffee shop, we selected background noise representing cafe and kitchen environments (obtained from Freesound):

The NLU engine’s task is to detect the “orderDrink” intent and slots including “coffeeDrink”, “milkAmount”, “numberOfShots”, “roast”, “size”, and “sugarAmount” from these noisy speech commands. Here is an example of intent and slots detected in an utterance:

"Get me a triple shot espresso with some sweetener and a bit of cream."

{
  intent: "orderDrink",
  slots: {
    milkAmount: "a bit of cream",
    coffeeDrink: "espresso",
    sugarAmount: "some sweetener",
    numberOfShots: "triple shot"
  }
}

Getting Started

We recommend completing this tutorial in a Python virtual environment. The code and the audio files used are available on GitHub. To begin, clone the speech to intent benchmark repository and its submodules from GitHub:

git clone --recurse-submodules https://github.com/Picovoice/speech-to-intent-benchmark.git

Make sure you have SSH keys set up with your GitHub account; otherwise, you will need to switch to HTTPS password-based authentication. Install the dependencies, including numpy, soundfile, and the azure packages, using pip:

pip install -r requirements.txt

Then, install the following packages using APT:

sudo apt-get install libsndfile1
sudo apt-get install build-essential libssl1.0.0 libasound2

Mix Clean Speech Data with Noise

We want to evaluate detection accuracy in the presence of ambient noise that is representative of real-world use. To do that, we will mix the clean speech audio files with our cafe and kitchen noise samples at different intensity levels by running these commands from the repository's root directory:

python benchmark/mixer.py cafe
python benchmark/mixer.py kitchen

These commands mix background noise with the clean audio files under /speech/clean at signal-to-noise ratios (SNRs) of 24, 21, 18, 15, 12, 9, and 6 decibels (dB) and save the generated audio files under /data/speech.

The lower the SNR, the noisier the audio data. An SNR value of 0 dB, for instance, means that the signal level and noise level are equal; it is difficult for algorithms and almost impossible for humans to comprehend speech in such an environment. At 3 dB, the signal power is roughly double the noise power. As a point of reference, home assistants are stress-tested inside sound chambers at ~5-6 dB SNR, which emulates a typical noisy household environment with the TV on and some kitchen appliances running at a moderate distance.
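To make the decibel figures concrete, here is a minimal sketch of how noise can be scaled to reach a target SNR before it is added to clean speech. It is an illustration under our own assumptions, not the code in benchmark/mixer.py: the function names are hypothetical, and audio is represented as plain Python lists of samples rather than arrays read from files.

```python
import math


def snr_db_to_power_ratio(snr_db):
    """Convert an SNR in decibels to a signal-to-noise power ratio."""
    return 10 ** (snr_db / 10)


def mean_power(samples):
    """Average power of a signal: the mean of its squared sample values."""
    return sum(s * s for s in samples) / len(samples)


def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so speech power / noise power equals the target SNR,
    then add the scaled noise to `speech` sample by sample."""
    target_noise_power = mean_power(speech) / snr_db_to_power_ratio(snr_db)
    gain = math.sqrt(target_noise_power / mean_power(noise))
    return [s + gain * n for s, n in zip(speech, noise)]


def measured_snr_db(speech, noisy):
    """Recover the SNR of a mix by subtracting the known clean speech."""
    residual = [m - s for m, s in zip(noisy, speech)]
    return 10 * math.log10(mean_power(speech) / mean_power(residual))


# Toy signals standing in for real audio (hypothetical values).
speech = [1.0, -1.0, 1.0, -1.0]
noise = [0.5, 0.5, -0.5, -0.5]

noisy = mix_at_snr(speech, noise, snr_db=6)
print(round(snr_db_to_power_ratio(3), 3))  # power ratio at 3 dB, close to 2
print(round(measured_snr_db(speech, noisy), 2))
```

Because decibels are logarithmic, each 3 dB step down roughly doubles the noise power relative to the speech, which is why the 6 dB samples sound so much noisier than the 24 dB ones.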

Listen to these sample noisy speech commands for an idea of what 6 dB SNR sounds like:

The following sample noisy speech commands are at 24 dB SNR. The noise is noticeable, but significantly less than the 6 dB SNR samples.

Process Files with Picovoice Rhino

We processed audio files of the sample utterances with Rhino in a previous article and analyzed the results there. For brevity, we include only the end result below.

Detection accuracy in the presence of background noise with varying intensity levels

The blue line shows the accuracy measured in terms of Command Acceptance Rate (CAR) at each background noise intensity level (SNR) averaged over cafe and kitchen noise environments. A command is accepted if the intent and all of the slots are understood and detected correctly. The command is rejected if any error occurs in understanding the expression, intent, or slot values.
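The acceptance rule above can be sketched in a few lines of Python. This is an illustrative scorer under our own assumptions, not the code in the benchmark repository: the function names are hypothetical, and each result is assumed to be a dictionary with "intent" and "slots" keys, mirroring the example shown earlier.

```python
def is_accepted(expected, predicted):
    """A command is accepted only if the intent and every slot match exactly.

    `predicted` may be None when the engine failed to understand the command.
    """
    if predicted is None:
        return False
    return (
        predicted.get("intent") == expected["intent"]
        and predicted.get("slots", {}) == expected["slots"]
    )


def command_acceptance_rate(pairs):
    """Fraction of (expected, predicted) pairs that are accepted."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("no commands to score")
    return sum(is_accepted(e, p) for e, p in pairs) / len(pairs)


# Hypothetical results for three commands.
label = {
    "intent": "orderDrink",
    "slots": {"coffeeDrink": "espresso", "numberOfShots": "triple shot"},
}
results = [
    (label, label),                                        # exact match
    (label, {"intent": "orderDrink",
             "slots": {"coffeeDrink": "espresso"}}),       # missing slot
    (label, None),                                         # not understood
]
print(command_acceptance_rate(results))
```

Comparing the whole slot dictionary at once means that a missing slot, an extra slot, or a wrong slot value all count as a rejection, which matches the strict definition used in this benchmark.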

We will process the same commands with Microsoft LUIS and compare our results.

Process Files with Microsoft LUIS

Sign up for Azure

Sign up for an Azure account using your Microsoft account or GitHub account.

Create a Language Understanding resource

In the Azure portal, create a new Language Understanding resource. While creating this resource, also create a new resource group. Set both the authoring location and prediction location to “(US) West US” and choose the free pricing tier.

Create a Language Understanding resource

Sign up for LUIS

Sign up for a LUIS account. Choose “Use Existing Authoring Resource,” and select the LUIS authoring resource corresponding to the Language Understanding resource you just made.

Create a LUIS app

Once you’re in the LUIS console, import /data/luis/barista.json as your new app. This file contains the intent and list entities needed to define a coffee maker context.

Create a LUIS app by importing intent and entities

Refer to this document for more information on how to customize a LUIS app in the console.

Train and publish your LUIS app

Train and optionally test your LUIS app until satisfied. Publish your app and choose “Staging Slot” as your publishing slot.

Train and publish LUIS app

Retrieve endpoint key

Authoring Resources and Prediction Resources provide authentication to your LUIS app. One of each was created when you created your Language Understanding resource group. The Authoring Resource limits you to 1000 endpoint requests per month. To be able to make more requests, we will use our Prediction Resource key for authentication.

Select “Manage,” and navigate to “Azure Resources.” Select “Add prediction resource,” so that your existing Prediction Resource becomes assigned to this LUIS app. Copy either the Primary Key or the Secondary Key into /data/luis/credentials.env as the LUIS_PREDICTION_KEY. Copy the Endpoint URL into the same credential file as the LUIS_ENDPOINT_URL.

Add a prediction resource

Retrieve App ID

Navigate to “Settings” and copy your App ID into the credential file as your LUIS_APP_ID.
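To show how these three values fit together, here is a sketch that assembles a prediction request URL for the staging slot. The URL layout follows the LUIS v3 prediction REST format as we understand it, so please verify it against the current Azure documentation. The helper name and placeholder values are hypothetical, and benchmark.py may use the Azure SDK rather than raw URLs.

```python
from urllib.parse import urlencode


def build_prediction_url(endpoint_url, app_id, prediction_key, query,
                         slot="staging"):
    """Assemble a LUIS v3 prediction URL (assumed format) for one text query."""
    base = endpoint_url.rstrip("/")
    params = urlencode({"subscription-key": prediction_key, "query": query})
    return f"{base}/luis/prediction/v3.0/apps/{app_id}/slots/{slot}/predict?{params}"


# Placeholder credentials; substitute the values from your credential file.
url = build_prediction_url(
    endpoint_url="https://westus.api.cognitive.microsoft.com/",
    app_id="YOUR_APP_ID",
    prediction_key="YOUR_PREDICTION_KEY",
    query="a small latte",
)
print(url)
```

The slot defaults to "staging" because that is the publishing slot chosen earlier; an app published to the production slot would use "production" instead.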

LUIS app ID

Create a Speech resource

Now we need to feed text output from Speech Services to the LUIS App we just created. To do that, we will create a Speech resource by following this document.

Create a custom speech model

We will create a custom speech model to improve the accuracy of the Microsoft speech-to-text output that is fed to LUIS. Training on sample utterances tunes the speech-to-text language model to better recognize context-specific vocabulary and how it appears in utterances.

In the Speech Studio, create a new project.

Train speech model

Custom speech models can be trained on two different data types: audio with human-labeled transcripts, and related text. We provide related text data as a corpus that contains sample utterances with domain-specific phrases.

Upload the provided corpus under /data/watson/corpus.txt as training data. You can add a pronunciation file if your domain includes terms with non-standard pronunciations.

Upload sample utterances as training data

When your speech data finishes processing, submit your model for training.

Submit your model for training

Deploy and get credentials

Deploy your speech model. After successfully deploying, add your Subscription key and Endpoint ID to /data/luis/credentials.env as SPEECH_KEY and SPEECH_ENDPOINT_ID respectively.
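Before running the benchmark, it can help to confirm that the credential file contains all five values collected so far. The sketch below assumes the file uses simple KEY=VALUE lines, which is our guess at the format rather than something the repository guarantees; the helper names are hypothetical.

```python
REQUIRED_KEYS = (
    "LUIS_PREDICTION_KEY",
    "LUIS_ENDPOINT_URL",
    "LUIS_APP_ID",
    "SPEECH_KEY",
    "SPEECH_ENDPOINT_ID",
)


def parse_env(text):
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(values):
    """Return the required keys that are absent or empty."""
    return [key for key in REQUIRED_KEYS if not values.get(key)]


# Example file contents with placeholder values and one key left empty.
sample = """
# LUIS
LUIS_PREDICTION_KEY=placeholder-key
LUIS_ENDPOINT_URL=https://westus.api.cognitive.microsoft.com/
LUIS_APP_ID=placeholder-app-id
# Speech
SPEECH_KEY="placeholder-speech-key"
SPEECH_ENDPOINT_ID=
"""
print(missing_keys(parse_env(sample)))
```

To check your own file, read it with open() and pass its contents to parse_env; an empty list from missing_keys means every value is filled in.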

Deploy your speech model and retrieve credentials

Optionally, check your endpoint by uploading an audio file and verifying that your model can recognize the audio.

Check your speech model’s endpoint

Process files

Run the noisy speech commands through Microsoft LUIS:

python benchmark/benchmark.py --engine_type MICROSOFT_LUIS --noise cafe
python benchmark/benchmark.py --engine_type MICROSOFT_LUIS --noise kitchen

Results: Picovoice Rhino vs Microsoft LUIS

Accuracy

The graph below compares the accuracy of Rhino to LUIS where we have averaged their respective accuracies on noisy commands mixed with cafe and kitchen background noise.

Comparing Picovoice Rhino and Microsoft LUIS detection accuracy in the presence of background noise with varying intensity levels

The results show that under low-noise conditions Microsoft LUIS achieves a 93% command acceptance rate (CAR) and Rhino achieves 98%. The performance gap widens slightly as the noise intensity increases: in a 6 dB SNR environment, Microsoft LUIS achieves 85% CAR and Rhino achieves 93%.

Latency

LUIS takes about 65 minutes to process 619 requests for each noise environment, which averages to about 6 seconds per request. The total processing time for all 7 noise environments was 455 minutes. In contrast, it took approximately 10 minutes (roughly 45x faster) for Rhino to finish processing all commands locally on a mid-range consumer laptop.
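The arithmetic behind these latency figures can be reproduced with a short calculation. The variable names are ours, and the speed ratio assumes that Rhino's 10-minute figure covers the same workload as the 455-minute LUIS total.

```python
requests_per_run = 619    # audio commands per noise environment
minutes_per_run = 65      # LUIS processing time per noise environment
runs = 7                  # number of noise environments processed
rhino_total_minutes = 10  # Rhino processing time for all commands

seconds_per_request = minutes_per_run * 60 / requests_per_run
luis_total_minutes = minutes_per_run * runs
speed_ratio = luis_total_minutes / rhino_total_minutes

print(f"{seconds_per_request:.1f} s per LUIS request")
print(f"{luis_total_minutes} min total for LUIS")
print(f"{speed_ratio:.1f}x ratio of LUIS time to Rhino time")
```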

Costs

An Azure free account includes access to services that are free for the first 12 months and access to services that are always free. See this page for more information on specific products. This tutorial uses only the free subscription version of Azure Speech to Text and Azure LUIS, but at the time of writing, each new Azure account includes a $200 credit that can be used towards a paid subscription of any Azure service within the next 30 days.

With a free subscription, you can transcribe up to 5 hours of audio per month and host 1 custom speech model per month. You can make 10,000 requests per month with LUIS.

Under the standard plan, it costs $1.40 per hour to transcribe speech using a custom speech model and $0.0538 per hour to host the custom model. For LUIS, it costs $1.50 per 1000 text requests.

The 619 audio files are collectively about 1.5 hours long. The speech transcription costs 2 x 7 x 1.5 x (1.40 + 0.0538) = $30.53. Processing the resulting text requests with LUIS costs about 2 x 7 x 619 x 1.50 / 1000 = $13.00. Thus, it costs a total of $43.53 to process every speech command in 2 different noise environments at 7 different SNRs with the standard subscription.
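The same cost estimate can be written as a short script, which makes it easy to plug in a different dataset size or updated prices. The variable names are ours, and the prices are the ones quoted above, so they may no longer match current Azure pricing.

```python
noise_types = 2           # cafe and kitchen
snr_levels = 7            # 24, 21, 18, 15, 12, 9, and 6 dB
audio_hours = 1.5         # total length of the 619 audio files
num_files = 619

transcribe_per_hour = 1.40   # custom speech transcription, USD per hour
hosting_per_hour = 0.0538    # custom model hosting, USD per hour
luis_per_1000 = 1.50         # LUIS text requests, USD per 1000

runs = noise_types * snr_levels
speech_cost = runs * audio_hours * (transcribe_per_hour + hosting_per_hour)
luis_cost = runs * num_files * luis_per_1000 / 1000
total_cost = speech_cost + luis_cost

print(f"Speech: ${speech_cost:.2f}")
print(f"LUIS:   ${luis_cost:.2f}")
print(f"Total:  ${total_cost:.2f}")
```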

Other Considerations

LUIS None intent

The None intent is automatically created and is initially empty. It is intended to be the fallback intent. According to Microsoft, it should contain utterances that are outside of the app domain, ideally 10% of the total utterances.

The LUIS app we provide as /data/luis/barista.json contains an empty None intent. As a result, LUIS may assign the orderDrink intent to an utterance that would otherwise be out of context. However, after adding 43 out-of-context utterances to the None intent, we saw no change in accuracy when testing with speech at an SNR of 24 dB.

LUIS entity types

LUIS has 3 other entity types: machine learned, regex, and pattern.any. Depending on your data, you may want to consider using these in addition to or instead of list entities.

Comparison with other solutions

Below is a summary graph comparing the overall performance of Picovoice Rhino, Google Dialogflow, Amazon Lex, IBM Watson, and Microsoft LUIS.

Average Accuracy of Microsoft LUIS against other Natural Language Understanding engines

You can access the benchmarking tutorial for each NLU engine included in this graph through the following links: