Wake Word Benchmark

A wake word engine is a speech recognition algorithm that detects utterances of a given keyword (or keyphrase) within a stream of audio. Its most common use case today is voice activation (i.e. wake phrase detection), but it can, and sometimes should, be used to implement always-listening voice commands and voice search. Keyword spotting extends beyond voice assistants to voice user interfaces (VUIs), interactive voice response (IVR) systems, voice analytics (e.g. call centres), content moderation (e.g. online games), and many more.

The major obstacle to the adoption of wake word engines is their reliance on massive data gathering to train each new model. Picovoice Porcupine removes this need: you can train custom branded wake word models in Picovoice Console by typing the phrase you want, and a production-ready model is available in a few seconds.

Below is a series of benchmarks to back our claims. They also empower customers to make data-driven decisions using the datasets that matter to their business.

Methodology

Speech Corpus

The test_clean portion of LibriSpeech is used as the background dataset. It can be downloaded from OpenSLR.

In addition, more than 300 crowd-sourced recordings of six keywords (alexa, computer, jarvis, smart mirror, snowboy, and view glass) from more than 50 distinct speakers are used. The recordings are stored within the repository here.

Resilience to Noise

To simulate real-world conditions, the data is mixed with noise at 10 dB SNR. For this purpose, we use the DEMAND dataset, which contains noise recordings from 18 different environments (e.g. kitchen, office, traffic). It can be downloaded from Kaggle.
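As an illustration of mixing at a fixed SNR, the sketch below scales a noise signal so that the ratio of speech power to noise power equals a target value in dB, then adds it to the speech. It is a minimal example and not the benchmark's actual code. The function names (`mean_power`, `mix_at_snr`) are hypothetical, and it assumes both signals are already decoded to lists of float samples at the same sample rate.

```python
import math


def mean_power(samples):
    """Mean power (average of squared samples) of a sequence of floats."""
    return sum(s * s for s in samples) / len(samples)


def mix_at_snr(speech, noise, snr_db):
    """Return speech plus noise, with the noise scaled to the requested SNR.

    Hypothetical helper: assumes `speech` and `noise` are non-empty
    sequences of float samples recorded at the same sample rate.
    """
    if len(noise) < len(speech):
        # Loop the noise so that it covers the whole speech signal.
        repeats = math.ceil(len(speech) / len(noise))
        noise = list(noise) * repeats
    noise = noise[:len(speech)]

    speech_power = mean_power(speech)
    noise_power = mean_power(noise)
    if noise_power == 0:
        raise ValueError("noise signal is silent; cannot scale to an SNR")

    # SNR(dB) = 10 * log10(speech_power / scaled_noise_power), so the
    # scaled noise power must equal speech_power / 10^(snr_db / 10).
    scale = math.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return [s + scale * n for s, n in zip(speech, noise)]
```

Calling `mix_at_snr(speech, noise, 10.0)` corresponds to the 10 dB condition described above. A real pipeline would also need to guard against clipping when writing the result back to fixed-point audio.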

Metrics

We compare the accuracy of the engines under test at a fixed false alarm rate of 1 per 10 hours. For runtime, we consider CPU usage on a Raspberry Pi 3.
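One simple way to read the accuracy metric is sketched below: run the engine at several sensitivity settings, discard every setting whose false alarm rate exceeds the budget (1 per 10 hours, i.e. 0.1 per hour), and report the best detection rate among the remainder. This is an assumption about how such a comparison could be computed, not a description of the benchmark's own scripts. The function name, the tuple layout, and the numbers in the usage example are all hypothetical.

```python
def accuracy_at_false_alarm_rate(operating_points, num_keywords,
                                 audio_hours, max_fa_per_hour=0.1):
    """Pick the best detection rate within a false alarm budget.

    Hypothetical helper. `operating_points` is a list of
    (sensitivity, num_detected, num_false_alarms) tuples, one per
    sensitivity setting tried. Returns (sensitivity, accuracy) for the
    best qualifying setting, or None if none meets the budget.
    """
    best = None
    for sensitivity, num_detected, num_false_alarms in operating_points:
        fa_per_hour = num_false_alarms / audio_hours
        if fa_per_hour > max_fa_per_hour:
            continue  # too many false alarms at this setting
        accuracy = num_detected / num_keywords
        if best is None or accuracy > best[1]:
            best = (sensitivity, accuracy)
    return best


# Illustrative, made-up numbers: 200 keyword utterances in 20 hours of audio.
points = [(0.3, 150, 0), (0.5, 180, 1), (0.7, 192, 5)]
result = accuracy_at_false_alarm_rate(points, num_keywords=200, audio_hours=20)
```

With these made-up numbers, the setting with 5 false alarms in 20 hours (0.25 per hour) is rejected, so the 0.5 setting is selected. Fixing the false alarm rate in this way lets engines with different sensitivity scales be compared at the same operating point.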

Results

Accuracy

The figure below shows the accuracy of each engine at a false alarm rate of 1 per 10 hours, in the presence of noise (10 dB SNR) and background speech.

Efficiency

The figure below depicts CPU usage (single-core) of each engine on a Raspberry Pi 3.
