Voice Activity Detection (VAD) is a binary classifier that detects the presence of human speech in audio. VAD is one of the main building blocks of many Speech Processing, Speech Recognition, and Speaker Recognition systems.
- In Speech Coding, VAD identifies segments of audio that do not contain voice so that the encoder can compress them more aggressively.
- In Voice User Interfaces, VAD initiates the system after detecting voice activity.
- In Speech-to-Text, VAD is used to mark the end of the utterance.
- In media applications, VAD finds voiced segments within large audio files.
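The end-of-utterance use case above can be sketched as follows. This is a minimal illustration, not how any particular Speech-to-Text engine does it: it assumes a VAD that already emits one voiced/unvoiced flag per frame, and the function name `find_end_of_utterance` and the `hangover_frames` parameter are hypothetical. The utterance is declared over once speech has been heard and is then followed by a run of consecutive unvoiced frames.

```python
from typing import Iterable, Optional


def find_end_of_utterance(
    voice_flags: Iterable[bool],
    hangover_frames: int = 3,
) -> Optional[int]:
    """Return the frame index at which the utterance is declared finished.

    voice_flags: per-frame VAD decisions (True means voice was detected).
    hangover_frames: hypothetical tuning knob; the number of consecutive
        unvoiced frames required after speech before ending the utterance.
    Returns None if no end of utterance is found.
    """
    heard_voice = False
    silent_run = 0
    for index, is_voiced in enumerate(voice_flags):
        if is_voiced:
            heard_voice = True
            silent_run = 0  # a short pause inside the utterance is forgiven
        elif heard_voice:
            silent_run += 1
            if silent_run >= hangover_frames:
                return index
    return None
```

A larger `hangover_frames` tolerates longer pauses between words but delays the moment the system reacts.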
Challenges
The main challenge of VAD is to distinguish noise from the human voice. In a quiet environment, it is trivial to recognize voice activity based on the loudness level. Background noise, such as a fan or a car engine, makes VAD harder. Digital Signal Processing (DSP) algorithms help by either denoising the signal or providing the ability to inspect spectral features. When the nature of the noise gets closer to human speech, e.g. babble noise, DSP algorithms often fail.
A Deep Learning Model can learn the subtle differences between voice-like noises and human voice activity.
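To make the quiet-environment case concrete, here is a minimal sketch of a loudness-based detector. It is an illustration under stated assumptions, not a production VAD: samples are assumed to be floats in the range [-1, 1], and the frame length of 160 samples and the threshold of -35 dB are arbitrary illustrative values. The function names are hypothetical.

```python
import math
from typing import List, Sequence


def frame_level_db(frame: Sequence[float]) -> float:
    """Mean-square level of one frame in dB relative to full scale."""
    mean_square = sum(sample * sample for sample in frame) / len(frame)
    # Clamp to avoid log10(0) on digital silence.
    return 10.0 * math.log10(max(mean_square, 1e-12))


def energy_vad(
    samples: Sequence[float],
    frame_length: int = 160,      # assumed value, e.g. 10 ms at 16 kHz
    threshold_db: float = -35.0,  # assumed value, needs tuning per setup
) -> List[bool]:
    """Flag each complete frame whose level exceeds the threshold."""
    flags = []
    for start in range(0, len(samples) - frame_length + 1, frame_length):
        frame = samples[start:start + frame_length]
        flags.append(frame_level_db(frame) > threshold_db)
    return flags
```

Because the decision depends only on level, a loud fan or engine would be flagged as voice just as readily as speech, which is exactly the limitation described above.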
Performance Metrics
Accuracy
We measure the accuracy of a VAD using two metrics: True Positive Rate (TPR) and False Positive Rate (FPR). A Receiver Operating Characteristic (ROC) curve depicts these two metrics in one graph. The ROC curve is a standard tool for inspecting the performance of binary classifiers across different decision thresholds. It allows the designer to study the trade-off between detection rate (TPR) and false positive rate (FPR).
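As a rough sketch of how the points on such a curve are obtained: TPR = TP / (TP + FN) and FPR = FP / (FP + TN), each evaluated at a given decision threshold. The example below assumes a VAD that outputs a voice probability per frame and a ground-truth label per frame; the function name `roc_points` and the sample data in the usage note are made up for illustration.

```python
from typing import List, Sequence, Tuple


def roc_points(
    scores: Sequence[float],
    labels: Sequence[bool],
    thresholds: Sequence[float],
) -> List[Tuple[float, float, float]]:
    """Return (threshold, FPR, TPR) for each decision threshold.

    scores: per-frame voice probabilities produced by the detector.
    labels: per-frame ground truth (True means the frame contains voice).
    """
    positives = sum(1 for label in labels if label)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one voiced and one unvoiced frame")

    points = []
    for threshold in thresholds:
        # A frame is predicted as voice when its score reaches the threshold.
        true_pos = sum(1 for s, l in zip(scores, labels) if l and s >= threshold)
        false_pos = sum(1 for s, l in zip(scores, labels) if not l and s >= threshold)
        points.append((threshold, false_pos / negatives, true_pos / positives))
    return points
```

For example, calling `roc_points([0.9, 0.8, 0.4, 0.3, 0.2, 0.1], [True, True, False, True, False, False], [0.5, 0.25])` shows the trade-off directly: lowering the threshold from 0.5 to 0.25 catches the remaining voiced frame but also lets one unvoiced frame through.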
The figure below compares the accuracy of the Picovoice Cobra VAD engine against WebRTC VAD. Cobra gives a higher TPR at any given FPR, and is therefore more accurate.
Latency
Do you need a real-time response? If yes, offline implementations of VAD are not an option, because they need the entire audio before they can start processing. Streaming architectures can ingest and process audio data in chunks (frames). Each streaming VAD has a fixed latency, which varies from milliseconds to seconds depending on the implementation.
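The chunked processing described above can be sketched as follows. This is a simplified illustration: the helper names are hypothetical, and the 512-sample frame at 16 kHz used in the usage note is only an example value. A streaming detector cannot respond before it has collected one full frame, so the frame duration is one contribution to the overall latency.

```python
from typing import Iterable, Iterator, List


def frames(stream: Iterable[int], frame_length: int) -> Iterator[List[int]]:
    """Group an incoming sample stream into fixed-size frames.

    Samples that do not fill a complete frame at the end are dropped.
    """
    buffer: List[int] = []
    for sample in stream:
        buffer.append(sample)
        if len(buffer) == frame_length:
            yield buffer
            buffer = []


def frame_duration_ms(frame_length: int, sample_rate: int) -> float:
    """Time needed to collect one frame, in milliseconds."""
    return 1000.0 * frame_length / sample_rate
```

With these assumed values, `frame_duration_ms(512, 16000)` is 32 ms of buffering per frame; any additional look-ahead or smoothing inside the detector adds to that.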
Runtime Efficiency
For battery-powered applications, an efficient VAD extends battery life and hence improves user experience. At scale, an efficient VAD implementation can also process massive amounts of data much faster.