Voice Activity Detection Benchmark
Voice activity detection (VAD) is the detection of human speech within a stream of audio. It is one of the main building blocks of speech-enabled applications, and its accuracy has a compounding effect on overall system performance because many downstream speech processing blocks depend on it. VoIP, IVR, telemarketing, and security systems all incorporate voice activity detection.
Picovoice’s Cobra VAD engine is a cross-platform, extremely efficient, and real-time VAD that achieves best-in-class accuracy. Below is a series of benchmarks to validate the accuracy claims.
Methodology
Engines
We compare the accuracy of Cobra against the voice activity detector that Google developed for the WebRTC project, which is reportedly one of the best available.
Speech Corpus
We use LibriSpeech (test-clean portion) as the speech corpus. It provides a diverse set of speakers and is gender-balanced.
Noise
The real challenge in building a performant VAD is resilience to noise. To test the effect of noise, we mix noise with the speech data before feeding it to the VAD engines. For this purpose, we use the DEMAND dataset, which contains noise recordings from diverse environments.
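As an illustration of the mixing step, here is a minimal sketch that scales a noise clip to a chosen signal-to-noise ratio and adds it to a speech clip. The function name `mix_at_snr`, the `snr_db` parameter, and the choice to loop short noise clips are assumptions made for this example; the benchmark code on GitHub may mix audio differently.

```python
import math


def mix_at_snr(speech, noise, snr_db):
    """Add `noise` to `speech` so the speech-to-noise power ratio is `snr_db` dB.

    Hypothetical helper for illustration only. Both inputs are sequences of
    float samples at the same sample rate.
    """
    if not speech or not noise:
        raise ValueError("speech and noise must be non-empty")

    # Repeat the noise clip if it is shorter than the speech, then trim it.
    repeats = -(-len(speech) // len(noise))  # ceiling division
    noise = (list(noise) * repeats)[: len(speech)]

    # Average power of each signal.
    speech_power = sum(s * s for s in speech) / len(speech)
    noise_power = sum(n * n for n in noise) / len(noise)
    if noise_power == 0:
        return list(speech)  # silent noise clip: nothing to add

    # Gain that brings the noise to the requested level relative to speech.
    gain = math.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(speech, noise)]


if __name__ == "__main__":
    clean = [1.0, -1.0, 1.0, -1.0]
    background = [0.5, 0.5]
    print(mix_at_snr(clean, background, snr_db=20))
```

Lower `snr_db` values give a noisier mixture, so sweeping this parameter is one way to observe how a detector degrades as conditions get harder.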
Metric
We use the receiver operating characteristic (ROC) curve. The ROC curve is a standard tool for inspecting the performance of a binary classifier across different decision thresholds. It lets the designer study the trade-off between detection rate (true positive rate) and false positive rate.
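To make the metric concrete, the sketch below computes ROC points from per-frame scores and ground-truth labels. It assumes each engine can be reduced to one score per frame, where a higher score means speech is more likely, and that a frame is classified as speech when its score meets the threshold. The function name `roc_points` and the toy data are hypothetical and are not taken from the benchmark code.

```python
def roc_points(labels, scores, thresholds):
    """Return (false_positive_rate, true_positive_rate) for each threshold.

    labels: 1 for frames that contain speech, 0 otherwise.
    scores: per-frame detector output, higher means more speech-like.
    """
    positives = sum(1 for y in labels if y)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one speech and one non-speech frame")

    points = []
    for t in thresholds:
        # Frames scored at or above the threshold are predicted as speech.
        true_pos = sum(1 for y, s in zip(labels, scores) if y and s >= t)
        false_pos = sum(1 for y, s in zip(labels, scores) if not y and s >= t)
        points.append((false_pos / negatives, true_pos / positives))
    return points


if __name__ == "__main__":
    frame_labels = [1, 1, 0, 0]
    frame_scores = [0.9, 0.4, 0.6, 0.1]
    for fpr, tpr in roc_points(frame_labels, frame_scores, [0.0, 0.5, 1.0]):
        print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
```

Each threshold yields one point on the curve. A detector whose curve sits closer to the top-left corner achieves a higher detection rate for the same false positive rate.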
Results
Usage
The data and code used to create this benchmark are available on GitHub under the permissive Apache 2.0 license. Detailed instructions for benchmarking individual engines are in the following documents: