Voice Activity Detection (VAD) is a binary classifier that detects the presence of human speech in audio. VAD is one of the main building blocks of many Speech Processing, Speech Recognition, and Speaker Recognition systems.
- In Speech Coding, VAD flags segments of audio that do not contain voice so that the codec can compress them more aggressively.
- In Voice User Interfaces, VAD initiates the system after detecting voice activity.
- In Speech-to-Text, VAD is used to mark the end of the utterance.
- In media applications, VAD finds voiced segments within large audio files.
Challenges
The main challenge of VAD is distinguishing between noise and the human voice. In quiet environments, VAD has little difficulty: it is easy to tell silence apart from a person speaking. With background noise (such as a fan or a car engine), detecting voice becomes much harder. Digital Signal Processing (DSP) algorithms help by denoising the signal or by exposing spectral features for inspection. However, DSP algorithms fail when the noise itself resembles human speech, e.g. babble noise. A Deep Learning Model can learn the subtle differences between voice-like noise and human voice activity to improve VAD accuracy.
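To make the challenge concrete, here is a minimal sketch of a naive energy-threshold detector. It is not the method used by any particular VAD engine; the function names, the 0.05 threshold, the 512-sample frame, and the synthetic signals (a sine tone standing in for voiced speech, uniform random noise standing in for a fan) are all assumptions chosen for illustration.

```python
import math
import random


def frame_rms(frame):
    """Root-mean-square energy of one frame of samples in [-1.0, 1.0]."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def energy_vad(frame, threshold=0.05):
    """Hypothetical detector: report voice when frame energy exceeds a threshold."""
    return frame_rms(frame) > threshold


def tone(freq_hz, amplitude, n=512, sample_rate=16000):
    """Synthetic sine tone, used here as a crude stand-in for a voiced frame."""
    return [
        amplitude * math.sin(2 * math.pi * freq_hz * i / sample_rate)
        for i in range(n)
    ]


def noise(amplitude, n=512, seed=0):
    """Synthetic uniform noise, used here as a crude stand-in for a loud fan."""
    rng = random.Random(seed)
    return [rng.uniform(-amplitude, amplitude) for _ in range(n)]


silence = [0.0] * 512
speech_like = tone(200.0, 0.3)
fan_like = noise(0.3)

quiet_result = energy_vad(silence)       # no energy, so no detection
speech_result = energy_vad(speech_like)  # energetic, so detected
noise_result = energy_vad(fan_like)      # also energetic, so a false alarm
```

Energy alone separates silence from the tone, but it cannot separate the tone from equally loud noise. That gap is what spectral features, and ultimately learned models, are meant to close.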
Performance Metrics
Accuracy
We measure the accuracy of a VAD using two metrics: True Positive Rate (TPR) and False Positive Rate (FPR). TPR is the fraction of voiced audio that is correctly detected, and FPR is the fraction of non-voiced audio that is incorrectly flagged as voice. A Receiver Operating Characteristic (ROC) curve depicts these two metrics in one graph. The ROC curve is a standard tool for inspecting the performance of binary classifiers across different decision thresholds. It allows the designer to study the trade-off between detection rate and false positive rate.
The figure below compares the accuracy of the Picovoice Cobra VAD engine against WebRTC VAD. Cobra gives a higher TPR at any given FPR and is therefore more accurate.
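The points that make up an ROC curve can be computed directly from per-frame scores and ground-truth labels. The sketch below assumes the VAD outputs a voice probability between 0 and 1 for each frame and that a frame counts as voice when its score is at or above the threshold; `roc_points` and the sample data are hypothetical and are not measurements of any real engine.

```python
def roc_points(scores, labels, thresholds):
    """Return a (threshold, TPR, FPR) tuple for each decision threshold.

    scores: per-frame voice probabilities in [0, 1]
    labels: ground truth per frame, True meaning voice is present
    """
    positives = sum(1 for label in labels if label)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one voiced and one non-voiced frame")

    points = []
    for t in thresholds:
        # True positives: voiced frames scored at or above the threshold.
        tp = sum(1 for s, label in zip(scores, labels) if label and s >= t)
        # False positives: non-voiced frames scored at or above the threshold.
        fp = sum(1 for s, label in zip(scores, labels) if not label and s >= t)
        points.append((t, tp / positives, fp / negatives))
    return points


# Made-up scores for six frames: three voiced, three non-voiced.
scores = [0.9, 0.8, 0.35, 0.6, 0.2, 0.1]
labels = [True, True, True, False, False, False]

points = roc_points(scores, labels, [0.3, 0.5, 0.7])
for threshold, tpr, fpr in points:
    print(f"threshold={threshold:.1f}  TPR={tpr:.2f}  FPR={fpr:.2f}")
```

Raising the threshold lowers the FPR at the cost of TPR. Plotting TPR against FPR over a fine sweep of thresholds yields the ROC curve, and a curve that sits higher at every FPR belongs to the more accurate detector.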
Latency
Do you need a real-time response? If yes, offline implementations of VAD are not an option, as they need the entire audio data before they can start processing. Streaming architectures can ingest and process audio data in chunks (frames). Each streaming VAD has a fixed latency, which varies from milliseconds to seconds depending on the implementation.
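As a rough sketch of the streaming approach, the code below splits audio into fixed-size frames and classifies each one as soon as it is complete. The helper names, the 512-sample frame length, and the 16 kHz sample rate are assumptions for illustration, and `classify` stands for whatever per-frame detector is in use.

```python
def frames(samples, frame_length):
    """Yield consecutive fixed-size frames, dropping a trailing partial frame."""
    for start in range(0, len(samples) - frame_length + 1, frame_length):
        yield samples[start:start + frame_length]


def frame_duration_ms(frame_length, sample_rate):
    """Time needed to buffer one frame, in milliseconds."""
    return 1000.0 * frame_length / sample_rate


def stream_vad(samples, frame_length, classify):
    """Yield one decision per frame, without waiting for the rest of the audio."""
    for frame in frames(samples, frame_length):
        yield classify(frame)


# Assumed values: 16 kHz audio processed in 512-sample frames.
SAMPLE_RATE = 16000
FRAME_LENGTH = 512

# A decision cannot be made before its frame has been fully buffered.
buffering_ms = frame_duration_ms(FRAME_LENGTH, SAMPLE_RATE)

# Placeholder classifier: flags any frame that contains a non-zero sample.
audio = [0.0] * 1024 + [0.1] * 1024
decisions = list(stream_vad(audio, FRAME_LENGTH, lambda f: any(s != 0.0 for s in f)))
```

Under these assumptions each frame takes 32 ms to buffer, which sets a lower bound on latency. Any lookahead or smoothing the detector applies adds to that figure.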
Runtime Efficiency
For battery-powered applications, an efficient VAD extends battery life and hence improves user experience. At scale, an efficient VAD implementation can process massive amounts of data much faster.