Tutorial: Benchmarking Rhino Speech-to-Intent against IBM Watson

  • Rhino
  • Speech-to-Intent
  • IBM Watson
  • Natural Language Understanding
  • Knowledge Studio
  • Voice Interface
  • Objective Testing

Introduction

Skill level

  • Basic knowledge of Python and the command line
  • No machine learning knowledge needed

Prerequisites

  • Ubuntu x86 (Tested on version 18.04)
  • Python 3 and pip installed locally
  • IBM Cloud account

In a previous tutorial, we demonstrated how to benchmark the accuracy of an example Rhino™ speech-to-intent context designed for a voice-enabled coffee maker.

We will continue our investigation in this tutorial and compare Picovoice’s domain-specific natural language understanding (NLU) engine, Rhino™ Speech-to-Intent, with IBM’s Watson cloud-based NLU solution.

The IBM Watson Natural Language Understanding service is a cloud offering that is capable of extracting metadata from text. For speech input, you can use Watson Speech to Text to transcribe audio files. For domain-specific contexts, you can use the customization interface to create a custom language model to improve speech recognition performance. Watson Knowledge Studio is used together with the NLU service to create custom rule-based or machine-learning NLU models.

Test speech dataset

Our speech dataset contains 619 sample audio commands from 50 unique speakers, each of whom contributed about 10-15 different commands. For the data collection task, we hired freelance voice-over professionals online with a range of accents. To ensure the benchmark results closely resemble in-the-field performance, it’s important to gather audio from a variety of voices that constitute a representative sample of the target user population.

Here is an example of a clean speech command:

We will create noisy speech commands by mixing clean audio with background noise samples to simulate the real-world environment. Since a coffee maker is typically used in a kitchen or coffee shop, we selected background noise representing cafe and kitchen environments (obtained from Freesound):

The NLU engine’s task is to detect the “orderDrink” intent and slots including “coffeeDrink”, “numberOfShots”, “roast”, “size”, “milkAmount”, and “sugarAmount” from these noisy speech commands. Here’s an example of intent and slots detected in an utterance:

"I want a sixteen-ounce single-shot mocha with lots of skim milk."

{
  intent: "orderDrink",
  slots: {
    milkAmount: "lots of skim milk",
    coffeeDrink: "mocha",
    size: "sixteen ounce",
    numberOfShots: "single shot"
  }
}

Getting Started

We recommend completing this tutorial in a Python virtual environment. The code and the audio files used are available on GitHub.

To begin, clone the speech-to-intent benchmark repository and its submodules from GitHub:

git clone --recurse-submodules https://github.com/Picovoice/speech-to-intent-benchmark.git

Make sure you have SSH keys set up with your GitHub account; otherwise, you will need to switch to HTTPS password-based authentication. Install the dependencies, including numpy, soundfile, ibm-watson, and matplotlib, using pip:

pip install -r requirements.txt

Then, install libsndfile using the package manager:

sudo apt-get install libsndfile1

Mix Clean Speech Data with Noise

We want to evaluate detection accuracy in the presence of ambient noise that is representative of real-world scenarios. To do that, we will mix the clean speech audio files with our cafe and kitchen noise samples at different intensity levels by running the following commands in the root directory:

python benchmark/mixer.py cafe
python benchmark/mixer.py kitchen

These commands mix background noise with the clean audio files under /speech/clean at signal-to-noise ratios (SNRs) of 24, 21, 18, 15, 12, 9, and 6 decibels (dB) and save the generated audio files under /data/speech.

The lower the SNR, the noisier the audio data. An SNR value of 0 dB, for instance, means that the signal level and noise level are equal; it is difficult for algorithms and almost impossible for humans to comprehend speech in such an environment. At 3 dB, the signal strength is double the noise level. For example, home assistants are stress-tested inside sound chambers at roughly 5-6 dB SNR, which emulates a typical noisy household with the TV on and some kitchen appliances running at a moderate distance.
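For illustration, here is a minimal sketch of how clean speech and noise could be mixed at a target SNR. It uses numpy and soundfile (both in requirements.txt), but the actual mixer.py implementation may differ in details, and the file paths below are placeholders.

import numpy as np
import soundfile

def mix_at_snr(speech_path, noise_path, snr_db, output_path):
    # Load clean speech and noise (assumed mono with the same sample rate).
    speech, sample_rate = soundfile.read(speech_path)
    noise, _ = soundfile.read(noise_path)

    # Loop or truncate the noise so it covers the whole utterance.
    noise = np.resize(noise, speech.shape)

    # Scale the noise so that speech power / noise power equals the target SNR.
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    scale = np.sqrt(speech_power / (noise_power * (10 ** (snr_db / 10))))

    soundfile.write(output_path, speech + scale * noise, sample_rate)

# e.g., create a 6 dB SNR version of a clean command (illustrative paths)
mix_at_snr('clean_command.wav', 'cafe_noise.wav', 6, 'noisy_command_6db.wav')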

Listen to these sample noisy speech commands for an idea of what 6 dB SNR sounds like:

The following sample noisy speech commands are at 24 dB SNR. The noise is noticeable, but significantly less prominent than in the 6 dB SNR samples.

Process Files with Picovoice Rhino

In a previous article, we processed audio files of the sample utterances with Rhino and analyzed the end result. We include only the end result below for brevity.

Detection accuracy in the presence of background noise with varying intensity levels

The blue line shows the accuracy measured in terms of Command Acceptance Rate (CAR) at each background noise intensity level (SNR) averaged over cafe and kitchen noise environments. A command is accepted if the intent and all of the slots are understood and detected correctly. The command is rejected if any error occurs in understanding the expression, intent, or slot values.
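
As a rough sketch of how such a metric can be computed (the scoring code in the benchmark repository may differ in details), a command only counts toward CAR when the intent and every expected slot value match exactly:

def command_accepted(expected, detected):
    # Accept only if the intent matches and every slot has exactly the expected value.
    if detected is None or detected.get('intent') != expected['intent']:
        return False
    return detected.get('slots', {}) == expected['slots']

def command_acceptance_rate(expected_results, detected_results):
    accepted = sum(
        command_accepted(e, d) for e, d in zip(expected_results, detected_results))
    return accepted / len(expected_results)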

We will next process the same commands with IBM Watson and compare our results.

Process Files with IBM Watson

Create a Natural Language Understanding service

Create an NLU service here.

On the "Manage" page, download your credentials. The file is called ibm-credentials.env by default.

Download your Natural Language Understanding service credentials

Create a Speech to Text service

Create a standard plan Speech to Text service here.

A paid plan such as the pay-as-you-go plan is required to use a custom language model.

On the “Manage” page, download your credentials and copy the contents of this file into ibm-credentials.env from the previous step.

Create a custom language model for speech-to-text

The IBM Speech to Text service offers a customization interface that we can use to improve the accuracy and latency of speech recognition requests by customizing a base language model for our domain.

A custom language model is used to expand the vocabulary of the base model to include domain-specific language. We supply the Speech to Text service with a text document containing terminology from the domain in context (a corpus), and the service will extract terms from the corpus to build the model’s vocabulary. You can also add custom words to the model individually and specify the pronunciation of these words and the pronunciation of words extracted from corpora.

This doc outlines the steps for creating a custom language model, and this doc shows how to use it in speech recognition requests.

In engine.py under /benchmark, we specify the base model to be customized, add a corpus containing sample utterances, and use it to train a custom language model. When we process the noisy speech commands, we automatically use this new language model with the Speech to Text service.
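
In outline, creating and training a custom language model with the ibm-watson Python SDK looks roughly like the following; the API key, service URL, and model name are placeholders, and engine.py may structure the calls differently.

from ibm_cloud_sdk_core.authenticators import IAMAuthenticator
from ibm_watson import SpeechToTextV1

stt = SpeechToTextV1(authenticator=IAMAuthenticator('YOUR_SPEECH_TO_TEXT_APIKEY'))
stt.set_service_url('YOUR_SPEECH_TO_TEXT_URL')

# Create a custom language model on top of a base model.
model = stt.create_language_model('barista-model', 'en-US_BroadbandModel').get_result()
customization_id = model['customization_id']

# Add a corpus of sample utterances; in practice, poll get_corpus() until the
# corpus has been analyzed before starting training.
with open('data/watson/corpus.txt', 'rb') as corpus:
    stt.add_corpus(customization_id, 'barista-corpus', corpus)

stt.train_language_model(customization_id)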

Create a Knowledge Studio service

Create a Knowledge Studio service here, and create a new Workspace.

Create entity types

In your new Workspace, upload the previously created type system entity_types.json found in /data/watson/.

Import Entity Types

Create classes for rule-based model

In the “Rules” page under “Rule-based Model”, create a class for each entity type.

Create a class for each entity type

Import dictionaries

In the “Dictionaries” page, import barista_dictionaries.zip. Select the corresponding entity type and rule class for each dictionary.

Upload dictionaries and match each dictionary with its entity type and rule class

If you return to the “Rules” page under “Rule-based Model” and add sample utterances as a “Document,” you’ll see that your model is able to match each phrase to its corresponding class. Sample utterances are available in /data/watson/corpus.txt.

The rule-based model matches each phrase to its corresponding class

Rule-based model type mapping

In the “Versions” page, go to the “Rule-based Model Type Mapping” tab and map each entity type to the corresponding class.

Map each entity type to its corresponding class

Deploy model to Natural Language Understanding

Return to the “Rule-based Model” page and save for deployment. You should see a model with version number 1.0. Deploy this model to Natural Language Understanding, and take note of your model ID.

Save and deploy your rule-based model to your Natural Language Understanding service

If your model is deployed to “NLU” and its status is “available,” then it deployed successfully and is ready to be used.

Process files

Run the noisy speech commands through IBM Watson:

If you created your own custom Speech to Text language model, include its ID using the ibm_custom_id argument. Otherwise, a new custom language model will be created automatically.

python benchmark/benchmark.py \
    --engine_type IBM_WATSON \
    --ibm_credential_path ${IBM_CREDENTIAL_PATH} \
    --ibm_model_id ${IBM_MODEL_ID} \
    --noise cafe

python benchmark/benchmark.py \
    --engine_type IBM_WATSON \
    --ibm_credential_path ${IBM_CREDENTIAL_PATH} \
    --ibm_model_id ${IBM_MODEL_ID} \
    --noise kitchen
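
Under the hood, each request amounts to transcribing a noisy command with the custom language model and running the transcript through the deployed rule-based model. A simplified sketch with the ibm-watson SDK is shown below; the API keys, URLs, and version date are placeholders, and benchmark.py adds scoring and error handling on top of this.

from ibm_cloud_sdk_core.authenticators import IAMAuthenticator
from ibm_watson import NaturalLanguageUnderstandingV1, SpeechToTextV1
from ibm_watson.natural_language_understanding_v1 import EntitiesOptions, Features

stt = SpeechToTextV1(authenticator=IAMAuthenticator('YOUR_SPEECH_TO_TEXT_APIKEY'))
stt.set_service_url('YOUR_SPEECH_TO_TEXT_URL')

nlu = NaturalLanguageUnderstandingV1(
    version='2020-08-01',
    authenticator=IAMAuthenticator('YOUR_NLU_APIKEY'))
nlu.set_service_url('YOUR_NLU_URL')

def process(audio_path, customization_id, model_id):
    # Transcribe the noisy command using the custom language model.
    with open(audio_path, 'rb') as audio:
        stt_result = stt.recognize(
            audio=audio,
            content_type='audio/wav',
            language_customization_id=customization_id).get_result()
    results = stt_result.get('results', [])
    transcript = results[0]['alternatives'][0]['transcript'] if results else ''

    # Extract entities with the rule-based model deployed from Knowledge Studio.
    nlu_result = nlu.analyze(
        text=transcript,
        features=Features(entities=EntitiesOptions(model=model_id))).get_result()
    return transcript, nlu_result['entities']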

Results: Picovoice Rhino vs IBM Watson

The graph below compares the accuracy of Rhino to Watson where we have averaged their respective accuracies on noisy commands mixed with cafe and kitchen background noise.

Comparing Picovoice Rhino and IBM Watson detection accuracy in the presence of background noise with varying intensity levels

In low-noise scenarios, both engines deliver similar performance, with IBM Watson achieving 96% and Rhino achieving 98% CAR. However, the performance gap widens quickly as the noise intensity increases, with IBM Watson achieving 60% and Rhino achieving 93% CAR in a 6 dB SNR environment.

Latency

IBM Watson takes about 100 minutes to process 619 requests for each noise environment, averaging almost 10 seconds per request. The total processing time for all 7 noise environments was 700 minutes. In contrast, it took Rhino approximately 10 minutes (70x faster) to process all commands locally on a mid-range consumer laptop.

Cost

At the time of writing, custom Speech to Text language models aren’t available in the Lite tier; they are available under the Standard and Premium plans. To access the Standard tier, you can upgrade to a Pay-as-you-go account and get started with a $200 credit that is valid for 35 days.

Standard tier pricing with a custom language model is the base price of $0.02 per minute plus $0.03 per minute for using a custom model, for a total of $0.05 per minute. The 619 audio files are collectively about 89 minutes long. Processing all of the speech commands in 2 noise environments at 7 different SNRs therefore costs 2 x 7 x 89 x $0.05 = $62.30.

Other Considerations

IBM Speech to Text customization

The IBM Speech to Text customization interface offers other options to enhance speech recognition capabilities: grammars and custom acoustic models.

A custom language model can be used together with a grammar. A grammar restricts the words the Speech to Text service can recognize from its base vocabulary, enabling faster and more accurate results.

A custom acoustic model can improve speech recognition when the acoustic environment is unique or noisy, speakers’ speech patterns are atypical, or speakers’ accents are pronounced. When you create a custom acoustic model and add audio data that matches the kind of audio you plan to transcribe, the customization interface adapts the base model to your environment and speakers, enabling more accurate results.

Custom acoustic models can be used alone or together with custom language models and grammars.

Comparison with other solutions

For the sake of comparison, the graph below shows the overall performance of Picovoice Rhino, Google Dialogflow, Amazon Lex, and IBM Watson.

Average Accuracy of IBM Watson against other Natural Language Understanding engines

Through the following links, you can access the benchmarking tutorial for each NLU engine included in this graph: