Voice Activity Detection (VAD) is a binary classifier that detects the presence of human speech in audio.
VAD is one of the main building blocks of many Speech Recognition and Speaker Recognition systems.
- VAD compresses segments of audio that do not contain voice.
- In Voice User Interfaces, VAD initiates the system after detecting voice activity.
- VAD is used to mark the end of the utterance.
- In media applications, VAD finds voiced segments within large audio files.
The main challenge of VAD is to distinguish noise from the human voice. In a quiet environment, it is trivial to
recognize voice activity based on the loudness level. The addition of background noise, e.g. a fan, makes VAD harder.
Digital Signal Processing (DSP) algorithms help by either denoising the signal or providing the ability to inspect
spectral features. When the noise resembles human speech, e.g. babble noise, DSP algorithms fail.
A Deep Learning model can learn the subtle differences between voice-like noises and human voice activity.
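To make the loudness-based approach concrete, here is a minimal sketch of an energy-threshold detector. The frame length of 512 samples, the 16 kHz sample rate, the -30 dBFS threshold, and the function names are all illustrative assumptions, not values taken from any particular engine. The last synthetic frame mimics a steady fan-like noise and shows how such a detector mistakes loud noise for voice.

```python
import math

def frame_level_db(frame):
    """Return the mean-square level of a frame of float samples in dBFS (full scale = 1.0)."""
    if not frame:
        return -math.inf
    mean_square = sum(s * s for s in frame) / len(frame)
    if mean_square == 0.0:
        return -math.inf
    return 10.0 * math.log10(mean_square)

def energy_vad(samples, frame_length=512, threshold_db=-30.0):
    """Label each full frame True (voice) if its level exceeds threshold_db.

    Assumption: loudness alone separates voice from silence, which only
    holds in a quiet environment.
    """
    decisions = []
    for start in range(0, len(samples) - frame_length + 1, frame_length):
        frame = samples[start:start + frame_length]
        decisions.append(frame_level_db(frame) > threshold_db)
    return decisions

# Synthetic input: one silent frame, one loud tone standing in for speech,
# and one frame of steady noise at about -20 dBFS standing in for a fan.
silence = [0.0] * 512
tone = [0.5 * math.sin(2 * math.pi * 200 * n / 16000) for n in range(512)]
fan_like = [0.1 if n % 2 == 0 else -0.1 for n in range(512)]

print(energy_vad(silence + tone + fan_like))  # [False, True, True]
```

The third decision is a false positive: the noise frame contains no voice, but it is louder than the threshold. Raising the threshold would suppress it at the cost of missing quiet speech.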
We measure the accuracy of a VAD using two metrics: True Positive Rate (TPR) and False Positive Rate (FPR).
The Receiver Operating Characteristic (ROC) curve depicts these two metrics in one graph. The ROC curve is a
well-known tool for inspecting the performance of binary classifiers across different decision thresholds.
It allows the designer to study the trade-off between detection rate and false positive rate.
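As a rough illustration of how these two metrics are computed, the sketch below assumes the VAD outputs a per-frame voice probability and that ground-truth labels are available for each frame. TPR is the fraction of voiced frames that are detected, and FPR is the fraction of non-voiced frames that are wrongly flagged. The scores, labels, and function names are made up for the example.

```python
def tpr_fpr(scores, labels, threshold):
    """Return (TPR, FPR) when frames with score >= threshold are called voice."""
    tp = fp = fn = tn = 0
    for score, is_voice in zip(scores, labels):
        predicted_voice = score >= threshold
        if predicted_voice and is_voice:
            tp += 1
        elif predicted_voice and not is_voice:
            fp += 1
        elif is_voice:
            fn += 1
        else:
            tn += 1
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return tpr, fpr

def roc_points(scores, labels, thresholds):
    """Return one (FPR, TPR) point per decision threshold."""
    points = []
    for threshold in thresholds:
        tpr, fpr = tpr_fpr(scores, labels, threshold)
        points.append((fpr, tpr))
    return points

# Hypothetical per-frame voice probabilities and ground-truth labels.
scores = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]
labels = [True, True, False, True, False, False]

for fpr, tpr in roc_points(scores, labels, [0.0, 0.25, 0.5, 0.75, 1.01]):
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
```

Sweeping the threshold from low to high moves the operating point from the top-right corner of the ROC plot (everything is called voice) to the bottom-left corner (nothing is).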
The figure below compares the accuracy of the Picovoice Cobra VAD engine against WebRTC VAD. You can see that Cobra gives
a higher TPR at any given FPR, and hence is more accurate.
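One way to read such a comparison numerically is to look up the TPR of each curve at the same FPR. The sketch below does this with linear interpolation between ROC points. The two curves are invented placeholders named `engine_a` and `engine_b`; they are not measurements of Cobra or WebRTC VAD.

```python
def tpr_at_fpr(roc, target_fpr):
    """Linearly interpolate the TPR at target_fpr.

    roc is a list of (FPR, TPR) points sorted by increasing FPR.
    """
    for (f0, t0), (f1, t1) in zip(roc, roc[1:]):
        if f0 <= target_fpr <= f1:
            if f1 == f0:
                return max(t0, t1)
            weight = (target_fpr - f0) / (f1 - f0)
            return t0 + weight * (t1 - t0)
    raise ValueError("target_fpr is outside the range covered by the curve")

def dominates(roc_a, roc_b, fpr_grid):
    """True if curve A has TPR >= curve B at every FPR in fpr_grid."""
    return all(tpr_at_fpr(roc_a, f) >= tpr_at_fpr(roc_b, f) for f in fpr_grid)

# Hypothetical ROC curves, for illustration only.
engine_a = [(0.0, 0.0), (0.1, 0.8), (0.5, 0.95), (1.0, 1.0)]
engine_b = [(0.0, 0.0), (0.1, 0.5), (0.5, 0.8), (1.0, 1.0)]

grid = [i / 20 for i in range(21)]
print(dominates(engine_a, engine_b, grid))  # True
```

When one curve dominates the other over the whole FPR range, the choice of threshold does not change which engine is more accurate.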
Do you need a real-time response? If yes, offline implementations of VAD are not an option, as they need the
entire audio data before they can start processing. Streaming architectures can ingest and process audio data in chunks (frames).
A streaming VAD has a fixed latency, which ranges from milliseconds to seconds.
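The sketch below shows what frame-based streaming can look like: audio arrives in chunks of arbitrary size and is regrouped into fixed-length frames that a VAD could process one at a time. The 512-sample frame and the 16 kHz sample rate are assumed values for illustration. The frame duration is only one contribution to the overall latency, since an engine may add lookahead or smoothing on top of it.

```python
def frames_from_stream(chunks, frame_length):
    """Regroup incoming audio chunks into fixed-length frames.

    Samples that do not yet fill a frame stay buffered until more audio arrives.
    """
    buffer = []
    for chunk in chunks:
        buffer.extend(chunk)
        while len(buffer) >= frame_length:
            yield buffer[:frame_length]
            del buffer[:frame_length]

def frame_duration_ms(frame_length, sample_rate):
    """Time needed to collect one frame, i.e. the buffering delay per decision."""
    return 1000.0 * frame_length / sample_rate

# Four chunks of 300 samples each, as a stand-in for audio callbacks.
chunks = [list(range(i * 300, (i + 1) * 300)) for i in range(4)]

for index, frame in enumerate(frames_from_stream(chunks, 512)):
    print(f"frame {index}: samples {frame[0]}..{frame[-1]}")

print(frame_duration_ms(512, 16000))  # 32.0
```

With these assumed values, a decision can be made every 32 ms without waiting for the end of the recording.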
For battery-powered applications, an efficient VAD extends battery life and hence improves user experience. Alternatively,
an efficient VAD implementation can process massive amounts of data much faster at scale.
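A common way to put a number on efficiency is the real-time factor: processing time divided by the duration of the audio processed. The sketch below measures it for an arbitrary per-frame function. The `toy_detector` stand-in, the frame length, and the sample rate are assumptions made for the example, and the measured value depends on the machine.

```python
import time

def real_time_factor(process_frame, frames, frame_length, sample_rate):
    """Return processing time divided by audio duration.

    A value below 1.0 means the function keeps up with real-time audio;
    the smaller the value, the more audio can be processed per CPU second.
    """
    if not frames:
        raise ValueError("at least one frame is required")
    audio_seconds = len(frames) * frame_length / sample_rate
    start = time.perf_counter()
    for frame in frames:
        process_frame(frame)
    elapsed = time.perf_counter() - start
    return elapsed / audio_seconds

def toy_detector(frame):
    """Hypothetical stand-in for a VAD: flags frames with non-trivial energy."""
    return sum(s * s for s in frame) / len(frame) > 1e-3

# 100 frames of 512 samples at 16 kHz is 3.2 seconds of audio.
frames = [[0.0] * 512 for _ in range(100)]
rtf = real_time_factor(toy_detector, frames, 512, 16000)
print(f"real-time factor: {rtf:.6f}")
```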