Natural Language Understanding Benchmark

Understanding voice commands is the core functionality of a voice assistant. The dominant approach breaks this task into two steps: speech-to-text (STT) followed by natural language understanding (NLU). Amazon Lex, Google Dialogflow, IBM Watson, and Microsoft LUIS use this strategy. Picovoice’s Rhino Speech-to-Intent takes a different approach. It fuses the two steps into a single end-to-end model that infers meaning, including intent and entities, directly from voice commands. We claim that this end-to-end approach gives a significant accuracy boost and substantially reduces the operating cost of voicebots and voice user interfaces (VUIs).

Below is a series of benchmarks to support these claims and track their validity over time. The benchmark is open source and reproducible.

Methodology

Speech Corpus

We consider a voice-enabled coffee maker as the test case. We crowd-sourced spoken utterances from 50 speakers. Each speaker contributed between 10 and 15 voice commands. The speakers have diverse accents, and the corpus is gender-balanced. The utterances were recorded in quiet environments.

Noise

To simulate real-world environments, we mix the utterances with noise before feeding them to the NLU engines. The noise is mixed in at various signal-to-noise ratios (SNRs) to study the effect of noise level on the accuracy of voice assistant APIs.
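As a rough illustration of mixing at a target SNR, the sketch below scales a noise signal so that the ratio of speech power to noise power matches a requested value in dB. It is not taken from the benchmark code: the function name `mix_at_snr` and the representation of audio as plain lists of float samples are assumptions made for this example.

```python
import math


def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech so that the mix has the requested SNR (in dB).

    `speech` and `noise` are sequences of float samples (an assumption of
    this sketch; real audio would normally be read from files).
    """
    if len(noise) < len(speech):
        raise ValueError("noise must be at least as long as speech")
    noise = noise[:len(speech)]

    # Average power of each signal.
    speech_power = sum(s * s for s in speech) / len(speech)
    noise_power = sum(n * n for n in noise) / len(noise)
    if speech_power == 0 or noise_power == 0:
        raise ValueError("speech and noise must not be silent")

    # SNR_dB = 10 * log10(P_speech / (gain^2 * P_noise)), solved for gain.
    gain = math.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(speech, noise)]


# Toy signals standing in for a recorded command and a noise clip.
speech = [1.0, -1.0, 1.0, -1.0]
noise = [0.5, 0.5, -0.5, -0.5]
noisy = mix_at_snr(speech, noise, snr_db=6)
```

A lower `snr_db` yields a larger noise gain, so repeating the mix over a list of SNR values gives progressively harder test conditions.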

Metrics

Command Acceptance Rate

The accuracy of a VUI can be measured in many ways, including precision, recall, F-score, and confusion matrices. We decided to use an intuitive metric that we call "Command Acceptance Rate". It is the percentage of voice commands whose intent and slots are all correctly inferred by the voice assistant. Hence, an incorrect intent or even a single incorrect slot value counts as an error.
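The metric defined above can be sketched as follows. This is a minimal illustration, not the benchmark's own implementation: the dictionary layout, the function name `command_acceptance_rate`, and the intent and slot names in the sample data are hypothetical. A prediction of `None` stands for a command the engine failed to understand.

```python
def command_acceptance_rate(expected, predicted):
    """Percentage of commands whose intent and every slot value match.

    Each item is a dict with "intent" and "slots" keys (a layout assumed
    for this sketch). A prediction of None counts as an error.
    """
    if len(expected) != len(predicted):
        raise ValueError("expected and predicted must have the same length")
    if not expected:
        raise ValueError("at least one command is required")

    accepted = sum(
        1
        for label, guess in zip(expected, predicted)
        if guess is not None
        and guess["intent"] == label["intent"]
        and guess["slots"] == label["slots"]
    )
    return 100.0 * accepted / len(expected)


# Hypothetical labels and engine outputs for four commands.
labels = [
    {"intent": "orderDrink", "slots": {"size": "large", "drink": "latte"}},
    {"intent": "orderDrink", "slots": {"size": "small", "drink": "espresso"}},
    {"intent": "orderDrink", "slots": {"drink": "americano"}},
    {"intent": "orderDrink", "slots": {"size": "medium", "drink": "mocha"}},
]
outputs = [
    {"intent": "orderDrink", "slots": {"size": "large", "drink": "latte"}},   # correct
    {"intent": "orderDrink", "slots": {"size": "large", "drink": "espresso"}},  # one wrong slot
    {"intent": "orderDrink", "slots": {"drink": "americano"}},                # correct
    None,                                                                      # not understood
]
rate = command_acceptance_rate(labels, outputs)
```

Comparing the whole slot dictionary at once means a missing slot, an extra slot, or a wrong value all reject the command, which matches the strict definition above.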

Operational Cost

If the voice interface of a coffee maker costs $10 a month, that coffee maker won't become a household item. Hence, we compare the annual operational cost of the NLU engines per user. For all of the NLU APIs, this cost is a function of the number of user interactions per day (i.e. usage) and is essentially uncapped, as is common in Software-as-a-Service (SaaS) business models. This is different from Picovoice Rhino, which offers unlimited voice interactions and predictable pricing.
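To make the dependence on usage concrete, here is a small sketch of how a per-request price turns into an annual per-user cost. The function name and the price used in the example are placeholders chosen for illustration only; they are not any vendor's actual rates.

```python
def annual_cost_per_user(interactions_per_day, price_per_interaction,
                         days_per_year=365):
    """Annual cost for one user under usage-based (per-request) pricing."""
    if interactions_per_day < 0 or price_per_interaction < 0:
        raise ValueError("usage and price must be non-negative")
    return interactions_per_day * price_per_interaction * days_per_year


# Placeholder price per request, NOT a real vendor rate.
HYPOTHETICAL_PRICE = 0.004

# Cost grows linearly with usage, so it has no upper bound.
for usage in (5, 10, 50):
    cost = annual_cost_per_user(usage, HYPOTHETICAL_PRICE)
    print(f"{usage:>3} interactions/day -> ${cost:.2f} per user per year")
```

Under this model, doubling daily usage doubles the annual cost, which is the sense in which usage-based pricing is uncapped.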

Results

Command Acceptance Rate

The figure below shows the accuracy of each engine averaged over all utterances and all SNRs.

For a more detailed view, the SNR-dependent figure below shows how each engine copes with noise.

NLU accuracy comparison across different SNRs
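Both views of the results, the overall average and the per-SNR breakdown, can be derived from the same list of trial outcomes. The sketch below assumes each trial is stored as a `(snr_db, accepted)` pair; that layout and the function names are assumptions for this example, and the sample values are made up rather than measured.

```python
from collections import defaultdict


def acceptance_by_snr(results):
    """Map each SNR (dB) to the acceptance rate, in percent, at that SNR."""
    counts = defaultdict(lambda: [0, 0])  # snr_db -> [accepted, total]
    for snr_db, accepted in results:
        counts[snr_db][0] += int(accepted)
        counts[snr_db][1] += 1
    return {snr: 100.0 * ok / total for snr, (ok, total) in sorted(counts.items())}


def overall_acceptance(results):
    """Acceptance rate, in percent, over all trials regardless of SNR."""
    results = list(results)
    if not results:
        raise ValueError("at least one trial is required")
    return 100.0 * sum(int(accepted) for _, accepted in results) / len(results)


# Made-up trial outcomes for one engine: (snr_db, command accepted?).
trials = [(6, True), (6, False), (24, True), (24, True)]
per_snr = acceptance_by_snr(trials)
overall = overall_acceptance(trials)
```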

Usage

The data and code used to create this benchmark are available on GitHub under the permissive Apache 2.0 license. Detailed instructions for benchmarking individual engines are provided in the following documents:

Different Use Cases

Voice assistants can do much more than brewing java. They are already used in IVR, customer service, sales, finance, healthcare, touchless interfaces, and many more verticals that can benefit from conversational AI. If you have a different use case, you can still use this framework to benchmark vendors and make a data-driven decision. All you need to do is replace the data and labels in the GitHub repository, retrain the NLU engines for your domain of interest, and run the benchmark again.
