Benchmarking a Wake Word Detection Engine

  • Hotword
  • Wake up word
  • False Alarm
August 14, 2019

Wake word detection is used in almost every application with a Voice User Interface (VUI) supporting hands-free operation. Wake phrases, such as “Hey Siri” or “OK Google”, mimic the notion of calling a person’s name to grab their attention and start a conversation.

Although their function is seemingly simple, implementing a robust wake word algorithm is a surprisingly daunting challenge. Companies like Apple, Google, and Amazon have teams of engineers working on wake word detection algorithms.

Below, we look at the most important parameters for objectively benchmarking wake word engines.

Miss Rate & False Alarm per Hour

The accuracy of a binary classifier (any decision-making algorithm with a yes/no output) can be measured by two parameters: false rejection rate (FRR) and false acceptance rate (FAR). A wake word detector is a binary classifier. Hence, we can use these metrics to benchmark it.

The detection threshold of binary classifiers can be tuned to balance FRR and FAR. A lower detection threshold yields higher sensitivity. A highly-sensitive classifier has a high FAR and a low FRR (i.e. it accepts almost everything). A receiver operating characteristic (ROC) curve plots FRR values against corresponding FAR values for varying sensitivity values, as in the figure below.
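The trade-off can be seen in a few lines of code. Below is a minimal sketch that computes FRR and FAR for a given threshold from a set of hypothetical detector confidence scores (the scores and labels are made up for illustration; FAR here is a per-sample ratio rather than the per-hour figure discussed later):

```python
def frr_far(scores, labels, threshold):
    """Compute (FRR, FAR) for a binary detector at a given threshold."""
    positives = [s for s, l in zip(scores, labels) if l]       # true wake words
    negatives = [s for s, l in zip(scores, labels) if not l]   # everything else
    false_rejects = sum(1 for s in positives if s < threshold)
    false_accepts = sum(1 for s in negatives if s >= threshold)
    return false_rejects / len(positives), false_accepts / len(negatives)

# Hypothetical confidence scores from a detector and their true labels.
scores = [0.9, 0.7, 0.4, 0.8, 0.2, 0.6]
labels = [True, True, True, False, False, False]

# A low threshold is more sensitive: it misses nothing (FRR = 0)
# but accepts more non-wake-word audio (higher FAR).
print(frr_far(scores, labels, 0.3))   # FRR 0.0, FAR ~0.67
print(frr_far(scores, labels, 0.75))  # FRR ~0.67, FAR ~0.33
```

Sweeping the threshold across its range and plotting each (FAR, FRR) pair traces out the ROC curve.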

Receiver Operating Characteristic (ROC) curve

Better algorithms have a lower false rejection rate at any given false acceptance rate. To combine these two metrics into one, ROC curves are sometimes compared by their Area Under Curve (AUC). A smaller AUC indicates superior performance. In the figure below, algorithm C has better performance (and a smaller area under its curve) than B, and B is better than A.
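Given the (FAR, FRR) points from a threshold sweep, the AUC can be approximated with the trapezoidal rule. The sketch below uses two made-up curves to show that the curve hugging the origin yields the smaller area:

```python
def roc_auc(points):
    """Trapezoidal area under a ROC curve given as (FAR, FRR) pairs."""
    pts = sorted(points)  # order by increasing FAR
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# Two illustrative curves: curve_b bends toward the origin, so it is better.
curve_a = [(0.0, 1.0), (0.5, 0.5), (1.0, 0.0)]
curve_b = [(0.0, 1.0), (0.1, 0.1), (1.0, 0.0)]
print(roc_auc(curve_a))  # 0.5
print(roc_auc(curve_b))  # 0.1 -- smaller AUC, superior detector
```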

Area under curve of ROC

For a given sensitivity value, FRR is measured by playing a set of sample audio files that include the utterance of the wake word, and then calculating the ratio of rejections to the total number of samples. FAR is usually measured by playing a long background audio file which must not include any utterance of the wake word, but instead includes noise, speech, music, etc. The FAR is calculated by dividing the number of false acceptances by the length of the background audio in hours.
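The measurement procedure above can be sketched as follows. The `detects_wake_word` callable is a hypothetical wrapper around whatever engine is under test; it is not part of any particular SDK:

```python
def measure_frr(wake_word_files, detects_wake_word):
    """FRR: fraction of wake word utterances the engine fails to detect."""
    rejections = sum(1 for f in wake_word_files if not detects_wake_word(f))
    return rejections / len(wake_word_files)

def measure_far_per_hour(num_false_accepts, background_audio_seconds):
    """FAR: false accepts per hour of wake-word-free background audio."""
    return num_false_accepts / (background_audio_seconds / 3600.0)

# Example: the engine misses 1 of 4 test utterances -> FRR = 0.25,
# and fires 3 times over 2 hours of background audio -> 1.5 false alarms/hour.
frr = measure_frr(["u1.wav", "u2.wav", "u3.wav", "u4.wav"],
                  lambda path: path != "u3.wav")
far = measure_far_per_hour(3, 2 * 3600)
print(frr, far)  # 0.25 1.5
```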

We have previously benchmarked the Picovoice wake word detection software against alternative solutions and shared the results publicly. The figure below shows ROC curves for the wake word “Jarvis”, comparing the Picovoice Porcupine wake word engine against Snowboy (KITT.AI) and PocketSphinx.

Jarvis keyword ROC for Porcupine, PorcupineCompressed, PocketSphinx, and Snowboy

Based on the ROC curves above, the Porcupine standard model achieves the best accuracy.

Resource Utilization

Since a wake word detection algorithm is always listening, it must be resource-efficient. On battery-powered devices such as laptops and smartphones, higher CPU usage directly increases power consumption and drains the battery more quickly.
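One portable way to quantify this cost is the real-time factor: CPU seconds spent per second of audio processed. The sketch below measures it for a hypothetical `process_frame` callback using only the standard library; a lower value means a cheaper, more battery-friendly engine:

```python
import time

def real_time_factor(process_frame, frames, frame_length, sample_rate):
    """CPU seconds consumed per second of audio; lower is cheaper."""
    start = time.process_time()
    for frame in frames:
        process_frame(frame)
    cpu_seconds = time.process_time() - start
    audio_seconds = len(frames) * frame_length / sample_rate
    return cpu_seconds / audio_seconds

# Example: 100 frames of 512 samples at 16 kHz is 3.2 s of audio.
frames = [b"\x00" * 1024] * 100
rtf = real_time_factor(lambda frame: None, frames, 512, 16000)
```

Multiplying the real-time factor by 100 gives an approximate average CPU usage percentage for an always-listening engine on a single core.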

Picovoice technology leverages efficient meta-learning strategies to train compressed speech DNN models. The models are also optimized for fixed-point implementation. On some processors, architecture-dependent instruction sets are leveraged to further reduce the number of CPU cycles needed to perform multiply and accumulate operations. For example, the Picovoice wake word detection software uses NEON™ technology on Arm® Cortex®-A, a Single Instruction Multiple Data (SIMD) architecture extension, to lower CPU usage and accelerate processing.

The figure below compares the benchmark results for CPU usage of the Picovoice wake word software against Snowboy and PocketSphinx on Raspberry Pi 3 (Arm® Cortex®-A53).

Average CPU usage on Raspberry Pi 3


When comparing the performance of different wake word engines, consider plotting their ROC curves and measuring their resource utilization. To help you with this process, we have open-sourced our benchmarking framework and sample audio files here.