Benchmarking a Wake Word Detection Engine

Wake word detection is used in almost every application with a Voice User Interface (VUI) that supports hands-free operation. Wake phrases, such as “Hey Siri” or “OK Google”, mimic the notion of calling a person’s name to grab their attention and start a conversation.

Although their function is seemingly simple, implementing a robust wake word algorithm is a surprisingly daunting challenge. Companies like Apple, Google, and Amazon have teams of engineers working on wake word detection algorithms.

Below, we look at the most important parameters for objectively benchmarking wake word engines.

Miss Rate & False Alarm per Hour

The accuracy of a binary classifier (any decision-making algorithm with a yes/no output) can be measured by two parameters: false rejection rate (FRR) and false acceptance rate (FAR). A wake word detector is a binary classifier. Hence, we can use these metrics to benchmark it.

The detection threshold of a binary classifier can be tuned to balance FRR and FAR. A lower detection threshold yields higher sensitivity. A highly sensitive classifier has a high FAR and a low FRR (i.e. it accepts almost everything). A receiver operating characteristic (ROC) curve [1] plots FRR values against corresponding FAR values for varying sensitivity values, as in the figure below.
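As a rough illustration of how such a curve is produced, the sketch below sweeps a range of sensitivity values and records the resulting (FAR, FRR) pair at each step. The `measure_rates` helper is purely hypothetical; in a real benchmark it would run the engine over the test and background audio at the given sensitivity.

```c
/* Minimal sketch of collecting ROC points by sweeping detector sensitivity.
 * measure_rates() is a hypothetical placeholder, not a real engine API. */
#include <stdio.h>

typedef struct {
    double far; /* false alarms per hour of background audio */
    double frr; /* fraction of wake word utterances that were missed */
} roc_point_t;

/* Placeholder returning made-up numbers; a real benchmark would run the
 * wake word engine over the evaluation data at this sensitivity. */
static roc_point_t measure_rates(double sensitivity) {
    roc_point_t point = { .far = 2.0 * sensitivity, .frr = 0.5 * (1.0 - sensitivity) };
    return point;
}

int main(void) {
    /* Higher sensitivity -> fewer misses (lower FRR) but more false alarms (higher FAR). */
    for (int i = 1; i <= 9; i++) {
        const double sensitivity = i / 10.0;
        const roc_point_t point = measure_rates(sensitivity);
        printf("sensitivity %.1f -> FAR %.2f/hr, FRR %.2f\n", sensitivity, point.far, point.frr);
    }
    return 0;
}
```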

Better algorithms have a lower false rejection rate for any given false acceptance rate. To combine these two metrics into one, ROC curves are sometimes compared by their Area Under the Curve (AUC) [2]. Since FRR is plotted against FAR, a smaller AUC indicates superior performance. In the figure below, algorithm C has better performance (and a smaller area under the curve) than B, and B is better than A.
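Given a set of (FAR, FRR) points from such a sweep, the area under the curve can be approximated with the trapezoidal rule. A minimal sketch, using made-up points rather than measured results:

```c
/* Approximate the area under a ROC curve (FRR vs. FAR) with the trapezoidal rule.
 * Points are assumed to be sorted by increasing FAR. */
#include <stdio.h>

static double roc_auc(const double *far, const double *frr, int num_points) {
    double area = 0.0;
    for (int i = 1; i < num_points; i++) {
        area += (far[i] - far[i - 1]) * (frr[i] + frr[i - 1]) / 2.0;
    }
    return area;
}

int main(void) {
    /* Hypothetical example points, not measured results. */
    const double far[] = { 0.0, 0.2, 0.5, 1.0 };     /* false alarms per hour */
    const double frr[] = { 0.30, 0.15, 0.08, 0.04 }; /* miss rate */
    printf("approximate AUC: %.3f\n", roc_auc(far, frr, 4));
    return 0;
}
```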

For a given sensitivity value, FRR is measured by playing a set of sample audio files that include utterances of the wake word and calculating the ratio of rejections to the total number of samples. FAR is usually measured by playing a long background audio file that must not include any utterance of the wake word but does include noise, speech, music, etc. The FAR is calculated by dividing the number of false acceptances by the length of the background audio in hours.
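In other words, both metrics reduce to simple ratios over the benchmark counts. A minimal example, with all numbers made up for illustration:

```c
/* Compute miss rate (FRR) and false alarms per hour (FAR) from raw counts.
 * All numbers below are made up for illustration. */
#include <stdio.h>

int main(void) {
    const int total_utterances = 500;     /* test clips containing the wake word */
    const int missed = 25;                /* clips on which the engine did not trigger */
    const int false_alarms = 3;           /* detections on wake-word-free background audio */
    const double background_hours = 24.0; /* length of the background audio */

    const double miss_rate = (double) missed / total_utterances;
    const double false_alarms_per_hour = false_alarms / background_hours;

    printf("miss rate: %.1f%%\n", 100.0 * miss_rate);
    printf("false alarms per hour: %.3f\n", false_alarms_per_hour);
    return 0;
}
```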

We have previously benchmarked the Picovoice wake word detection software against alternative solutions and published the results publicly [3]. The figure below shows ROC curves for the wake word “Jarvis”, comparing Picovoice Porcupine wake word software against Snowboy (KITT.AI) and PocketSphinx.

Based on the ROC curves above, the Porcupine standard model achieves the best accuracy.

Resource Utilization

Since a wake word detection algorithm is always listening, it must be resource-efficient. On battery-powered devices such as laptops and smartphones, higher CPU usage directly increases power consumption and drains the battery more quickly.

Picovoice technology leverages efficient meta-learning strategies to train compressed speech DNN models. The models are also optimized for fixed-point implementation. On some processors, architecture-dependent instruction sets are leveraged to further reduce the number of CPU cycles needed to perform multiply-accumulate operations. For example, the Picovoice wake word detection software uses NEON™ technology on Arm® Cortex®-A CPUs, a Single Instruction Multiple Data (SIMD) architecture extension, to lower CPU usage and accelerate processing.
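To make the idea concrete, below is a simplified sketch (not Picovoice’s actual code) of a Q15 fixed-point dot product, the core multiply-accumulate pattern inside a DNN layer, first in portable C and then with NEON intrinsics that perform four multiply-accumulates per instruction.

```c
/* Illustrative sketch of a Q15 fixed-point dot product; not Picovoice's implementation.
 * The NEON variant processes four 16-bit lanes per multiply-accumulate instruction. */
#include <stdint.h>
#include <stdio.h>
#ifdef __ARM_NEON
#include <arm_neon.h>
#endif

/* Portable reference: accumulate 16-bit x 16-bit products into a 64-bit accumulator. */
int64_t dot_q15(const int16_t *a, const int16_t *b, int n) {
    int64_t acc = 0;
    for (int i = 0; i < n; i++) {
        acc += (int32_t) a[i] * b[i];
    }
    return acc;
}

#ifdef __ARM_NEON
/* NEON variant: vmlal_s16 widens and accumulates four int16 products into int32 lanes.
 * For large n, a real implementation would widen the accumulator periodically
 * to avoid overflow; this sketch keeps it simple. */
int64_t dot_q15_neon(const int16_t *a, const int16_t *b, int n) {
    int32x4_t acc = vdupq_n_s32(0);
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        acc = vmlal_s16(acc, vld1_s16(a + i), vld1_s16(b + i));
    }
    int64_t sum = (int64_t) vgetq_lane_s32(acc, 0) + vgetq_lane_s32(acc, 1)
                + (int64_t) vgetq_lane_s32(acc, 2) + vgetq_lane_s32(acc, 3);
    for (; i < n; i++) {
        sum += (int32_t) a[i] * b[i];
    }
    return sum;
}
#endif

int main(void) {
    const int16_t a[8] = { 16384, -8192, 4096, 1024, 512, -256, 128, 64 };
    const int16_t b[8] = { 8192, 8192, -4096, 2048, 1024, 512, -256, 128 };
    printf("reference: %lld\n", (long long) dot_q15(a, b, 8));
#ifdef __ARM_NEON
    printf("neon:      %lld\n", (long long) dot_q15_neon(a, b, 8));
#endif
    return 0;
}
```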

The figure below compares the benchmark results for CPU usage of the Picovoice wake word software against Snowboy and PocketSphinx on a Raspberry Pi 3 (Arm® Cortex®-A53).
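One simple, engine-agnostic way to measure this yourself is the real-time factor: the time spent processing a stretch of audio divided by that audio’s duration. A rough sketch, where `process_frame` is a hypothetical stand-in for the engine under test:

```c
/* Rough sketch: estimate CPU cost of an always-listening engine as a
 * real-time factor (processing time / audio duration). process_frame()
 * is a hypothetical placeholder for the engine being benchmarked. */
#include <stdio.h>
#include <time.h>

#define FRAME_LENGTH 512    /* samples per frame */
#define SAMPLE_RATE  16000  /* Hz */
#define NUM_FRAMES   10000

static void process_frame(const short *pcm) {
    (void) pcm; /* placeholder: the real engine's frame processing goes here */
}

int main(void) {
    static short pcm[FRAME_LENGTH] = { 0 };

    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (int i = 0; i < NUM_FRAMES; i++) {
        process_frame(pcm);
    }
    clock_gettime(CLOCK_MONOTONIC, &end);

    const double processing_sec =
        (end.tv_sec - start.tv_sec) + (end.tv_nsec - start.tv_nsec) * 1e-9;
    const double audio_sec = (double) NUM_FRAMES * FRAME_LENGTH / SAMPLE_RATE;

    /* A real-time factor of 0.05 corresponds to roughly 5% of one core. */
    printf("real-time factor: %.4f\n", processing_sec / audio_sec);
    return 0;
}
```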

Conclusion

When comparing the performance of different wake word engines, consider plotting their ROC curves and measuring their resource utilization. To help you with this process, we have open-sourced our benchmarking framework and sample audio files here.
