Benchmarking Rhino Speech-to-Intent against Microsoft LUIS
Introduction
Skill level
- Basic knowledge of Python and the command line
- No machine learning knowledge needed
Prerequisites
- Ubuntu x86 (Tested on version 18.04)
- Python 3 and pip installed locally
- Microsoft account or GitHub account
In a previous benchmark tutorial, we showed how to benchmark the accuracy of an example Rhino™ speech-to-intent context designed for a voice-enabled coffee maker.
We will continue our investigation in this tutorial and compare Picovoice’s domain-specific natural language understanding (NLU) engine, Rhino™ Speech-to-Intent, with Microsoft’s cloud-based NLU solution: LUIS.
Microsoft Azure has two Cognitive Services which can be used to detect intent and entities from speech. Azure Language Understanding (LUIS) is a cloud service that enables you to build custom language models that can extract intent and entity information from textual conversational phrases. You can use Azure Speech to Text to transcribe audio files and even create customizable models to improve accuracy on domain-specific speech.
Test speech dataset
Our speech dataset contains 619 sample audio commands from 50 unique speakers who each contributed about 10-15 different commands. For the data collection task, we hired freelance voice-over professionals online with a variety of accents. To ensure the benchmark results closely resemble in-the-field performance, it is important to gather audio from a variety of voices that constitute a representative sample of the target user population.
Here is an example of a clean speech command:
We will create noisy speech commands by mixing clean audio with background noise samples to simulate the real-world environment. Because a coffee maker is typically used in a kitchen or coffee shop, we selected background noise representing cafe and kitchen environments (obtained from Freesound):
The NLU engine’s task is to detect the “orderDrink” intent and slots including “coffeeDrink”, “milkAmount”, “numberOfShots”, “roast”, “size”, and “sugarAmount” from these noisy speech commands. Here is an example of intent and slots detected in an utterance:
"Get me a triple shot espresso with some sweetener and a bit of cream."
{intent: "orderDrink",slots: {milkAmount: "a bit of cream",coffeeDrink: "espresso",sugarAmount: "some sweetener",numberOfShots: "triple shot"}}
Getting Started
We recommend completing this tutorial in a Python virtual environment. The code and the audio files used are available on GitHub. To begin, clone the speech to intent benchmark repository and its submodules from GitHub:
git clone --recurse-submodules https://github.com/Picovoice/speech-to-intent-benchmark.git
Make sure you have SSH keys set up with your GitHub account; otherwise, you will need to switch to HTTPS password-based authentication.
Install dependencies, including numpy, soundfile, and the Azure packages, using pip:
pip install -r requirements.txt
Then, install the following packages using APT:
sudo apt-get install libsndfile1
sudo apt-get install build-essential libssl1.0.0 libasound2
Mix Clean Speech Data with Noise
We want to evaluate the detection accuracy in the presence of ambient noise that is representative of the real-world scenario. To do that, we will mix the clean speech audio files with our cafe and kitchen noise samples at different intensity levels by running these commands from the root of the repository:
python benchmark/mixer.py cafe
python benchmark/mixer.py kitchen
These commands mix background noise with the clean audio files under /speech/clean at signal-to-noise ratios (SNRs) of 24, 21, 18, 15, 12, 9, and 6 decibels (dB) and save the generated audio files under /data/speech.
The lower the SNR, the noisier the audio data. An SNR value of 0 dB, for instance, means that the signal level and noise level are equal; it is difficult for algorithms and almost impossible for humans to comprehend speech in such an environment. At 3 dB, the signal power is double the noise power. As a point of reference, home assistants are stress-tested inside sound chambers at roughly 5-6 dB SNR, which emulates a typical noisy household with the TV on and some kitchen appliances running at a moderate distance.
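For reference, here is a minimal sketch of how clean speech can be mixed with noise at a target SNR using numpy and soundfile; the actual logic lives in benchmark/mixer.py, and the function name and file paths below are only illustrative.

import numpy as np
import soundfile as sf

def mix_at_snr(clean_path, noise_path, snr_db, out_path):
    # Load clean speech and background noise (assumed 16 kHz, mono).
    clean, sample_rate = sf.read(clean_path)
    noise, _ = sf.read(noise_path)

    # Tile or trim the noise so it covers the entire clean clip.
    noise = np.resize(noise, clean.shape)

    # Scale the noise so that 10 * log10(P_signal / P_noise) equals snr_db.
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2)
    scale = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))

    sf.write(out_path, clean + scale * noise, sample_rate)

# Example: mix_at_snr('clean.wav', 'cafe.wav', snr_db=6, out_path='noisy_6db.wav')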
Listen to these sample noisy speech commands for an idea of what 6 dB SNR sounds like:
The following sample noisy speech commands are at 24 dB SNR. The noise is noticeable, but significantly less than the 6 dB SNR samples.
Process Files with Picovoice Rhino
We processed audio files of the sample utterances with Rhino in a previous article; we include only the end result below for brevity.

The blue line shows the accuracy measured in terms of Command Acceptance Rate (CAR) at each background noise intensity level (SNR) averaged over cafe and kitchen noise environments. A command is accepted if the intent and all of the slots are understood and detected correctly. The command is rejected if any error occurs in understanding the expression, intent, or slot values.
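As a rough sketch of what the benchmark does for each file, the snippet below runs a single WAV file through Rhino with the pvrhino Python package and applies the acceptance criterion described above. The AccessKey, context file, audio file name, and expected labels are placeholders; the repository's benchmark script is the authoritative implementation.

import pvrhino
import soundfile as sf

# Placeholders: supply your own Picovoice AccessKey and the barista context file.
rhino = pvrhino.create(access_key='${ACCESS_KEY}', context_path='barista.rhn')

# Rhino expects 16 kHz, 16-bit, single-channel audio.
audio, _ = sf.read('noisy_command.wav', dtype='int16')

inference = None
for i in range(len(audio) // rhino.frame_length):
    frame = audio[i * rhino.frame_length:(i + 1) * rhino.frame_length]
    if rhino.process(frame):
        inference = rhino.get_inference()
        break

# A command is accepted only if the intent and every slot match the expected label.
expected = {'intent': 'orderDrink', 'slots': {'coffeeDrink': 'espresso'}}  # example label
accepted = (
    inference is not None
    and inference.is_understood
    and inference.intent == expected['intent']
    and dict(inference.slots) == expected['slots']
)
print('accepted' if accepted else 'rejected')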
We will process the same commands with Microsoft LUIS and compare our results.
Process Files with Microsoft LUIS
Sign up for Azure
Sign up for an Azure account using your Microsoft account or GitHub account.
Create a Language Understanding resource
In the Azure portal, create a new Language Understanding resource. While creating this resource, also create a new resource group. Set both the authoring location and prediction location to “(US) West US” and choose the free pricing tier.

Sign up for LUIS
Sign up for a LUIS account. Choose “Use Existing Authoring Resource,” and select the LUIS authoring resource corresponding to the Language Understanding resource you just made.
Create a LUIS app
Once you’re in the LUIS console, import /data/luis/barista.json as your new app. This file contains the intent and list entities needed to define a coffee maker context.

Refer to this document for more information on how to customize a LUIS app in the console.
Train and publish your LUIS app
Train and optionally test your LUIS app until satisfied. Publish your app and choose “Staging Slot” as your publishing slot.

Retrieve endpoint key
Authoring Resources and Prediction Resources provide authentication to your LUIS app. One of each was created when you created your Language Understanding resource group. The Authoring Resource limits you to 1000 endpoint requests per month. To be able to make more requests, we will use our Prediction Resource key for authentication.
Select “Manage,” and navigate to “Azure Resources.” Select “Add prediction resource,” so that your existing Prediction Resource becomes assigned to this LUIS app. Copy either the Primary Key or the Secondary Key into /data/luis/credentials.env as the LUIS_PREDICTION_KEY. Copy the Endpoint URL into the same credential file as the LUIS_ENDPOINT_URL.

Retrieve App ID
Navigate to “Settings” and copy your App ID into the credential file as your LUIS_APP_ID.
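To verify the credentials before running the full benchmark, you can send a single query to the published staging slot. Below is a minimal sketch assuming the LUIS v3 prediction REST API; the query text and placeholder values are only examples.

import requests

# Placeholders: use the values you copied into /data/luis/credentials.env.
LUIS_ENDPOINT_URL = 'https://westus.api.cognitive.microsoft.com'
LUIS_APP_ID = 'your-app-id'
LUIS_PREDICTION_KEY = 'your-prediction-key'

url = f"{LUIS_ENDPOINT_URL}/luis/prediction/v3.0/apps/{LUIS_APP_ID}/slots/staging/predict"
params = {
    'subscription-key': LUIS_PREDICTION_KEY,
    'query': 'get me a triple shot espresso with some sweetener',
}

prediction = requests.get(url, params=params).json()['prediction']
print(prediction['topIntent'])   # e.g. "orderDrink"
print(prediction['entities'])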

Create a Speech resource
Now we need to feed text output from Speech Services to the LUIS App we just created. To do that, we will create a Speech resource by following this document.
Create a custom speech model
We will create a custom speech model to improve Microsoft’s speech-to-text accuracy with LUIS. By training on sample utterances, the speech-to-text language model is tuned to better recognize context-specific vocabulary and how it appears in utterances.
In the Speech Studio, create a new project.
Train speech model
Custom speech models can be trained on two different data types: audio with human-labeled transcripts, and related text. We provide related text data as a corpus that contains sample utterances with domain-specific phrases.
Upload the provided corpus under /data/watson/corpus.txt as training data. You can add a pronunciation file if your domain includes terms with non-standard pronunciations.

When your speech data finishes processing, submit your model for training.

Deploy and get credentials
Deploy your speech model. After it deploys successfully, add your Subscription key and Endpoint ID to /data/luis/credentials.env as SPEECH_KEY and SPEECH_ENDPOINT_ID, respectively.

Optionally, check your endpoint by uploading an audio file and verifying that your model can recognize the audio.
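If you prefer to check the endpoint from code, the sketch below transcribes one WAV file with the azure-cognitiveservices-speech SDK pointed at the custom model; the key, region, endpoint ID, and file name are placeholders, and the region must match your Speech resource.

import azure.cognitiveservices.speech as speechsdk

# Placeholders: use the values you copied into /data/luis/credentials.env.
SPEECH_KEY = 'your-speech-subscription-key'
SPEECH_ENDPOINT_ID = 'your-custom-model-endpoint-id'

speech_config = speechsdk.SpeechConfig(subscription=SPEECH_KEY, region='westus')
speech_config.endpoint_id = SPEECH_ENDPOINT_ID  # route requests to the custom model

audio_config = speechsdk.audio.AudioConfig(filename='noisy_command.wav')
recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config, audio_config=audio_config)

result = recognizer.recognize_once()
print(result.text)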

Process files
Run the noisy speech commands through Microsoft LUIS:
python benchmark/benchmark.py --engine_type MICROSOFT_LUIS --noise cafe
python benchmark/benchmark.py --engine_type MICROSOFT_LUIS --noise kitchen
Results: Picovoice Rhino vs Microsoft LUIS
Accuracy
The graph below compares the accuracy of Rhino to LUIS where we have averaged their respective accuracies on noisy commands mixed with cafe and kitchen background noise.

The results show that Microsoft LUIS achieves a 93% and Rhino a 98% command acceptance rate under low-noise conditions. The performance gap widens slightly as the noise intensity increases, with Microsoft LUIS achieving 85% and Rhino achieving 93% CAR in a 6 dB SNR environment.
Latency
LUIS takes about 65 minutes to process 619 requests for each noise environment, averaging about 6 seconds per request. The total processing time for all 7 noise environments was 455 minutes. In contrast, it took approximately 10 minutes (roughly 45x faster) for Rhino to finish processing all commands locally on a mid-range consumer laptop.
Costs
An Azure free account includes access to services that are free for the first 12 months and access to services that are always free. See this page for more information on specific products. This tutorial uses only the free subscription version of Azure Speech to Text and Azure LUIS, but at the time of writing, each new Azure account includes a $200 credit that can be used towards a paid subscription of any Azure service within the next 30 days.
With a free subscription, you can transcribe up to 5 hours of audio per month and host 1 custom speech model per month. You can make 10,000 requests per month with LUIS.
Under the standard plan, it costs $1.40 per hour to transcribe speech using a custom speech model and $0.0538 per hour to host the custom model. For LUIS, it costs $1.50 per 1000 text requests.
The 619 audio files are collectively about 1.5 hours long. The speech transcription costs 2 x 7 x 1.5 x (1.40 + 0.0538) = $30.53. Processing these text requests with LUIS costs about 2 x 7 x 619 x 1.50 / 1000 = $13.00. Thus, it costs a total of $43.53 to process all of the speech commands in 2 different noise environments at 7 different SNRs with the standard subscription.
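The same estimate, spelled out as a quick calculation using the standard-tier rates quoted above:

HOURS_OF_AUDIO = 1.5   # total length of the 619 commands
NOISE_TYPES = 2        # cafe, kitchen
SNR_LEVELS = 7         # 24, 21, 18, 15, 12, 9, and 6 dB
NUM_COMMANDS = 619

transcription_cost = NOISE_TYPES * SNR_LEVELS * HOURS_OF_AUDIO * (1.40 + 0.0538)
luis_cost = NOISE_TYPES * SNR_LEVELS * NUM_COMMANDS * 1.50 / 1000

print(f"Speech to Text: ${transcription_cost:.2f}")              # ~$30.53
print(f"LUIS:           ${luis_cost:.2f}")                       # ~$13.00
print(f"Total:          ${transcription_cost + luis_cost:.2f}")  # ~$43.53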
Other Considerations
LUIS None intent
The None intent is automatically created and is initially empty. It is intended to be the fallback intent. According to Microsoft, it should contain utterances that are outside of the app domain, ideally 10% of the total utterances.
The LUIS app we provide to you as /data/luis/barista.json contains an empty None intent. As a result, LUIS may assign the orderDrink intent to an utterance that would otherwise be out of context. However, after including 43 out-of-context utterances, we observed no change in accuracy when testing with speech at an SNR of 24 dB.
LUIS entity types
LUIS has 3 other entity types: machine learned, regex, and pattern.any. Depending on your data, you may want to consider using these in addition to or instead of list entities.
Comparison with other solutions
Below is a summary graph comparing the overall performance of Picovoice Rhino, Google Dialogflow, Amazon Lex, IBM Watson, and Microsoft LUIS.

You can access the benchmarking tutorial for each NLU engine included in this graph through the following links: