What is Voice Activity Detection?

Voice Activity Detection (VAD) is a binary classifier that detects the presence of human speech in audio. VAD is one of the main building blocks of many Speech Processing, Speech Recognition, and Speaker Recognition systems.

In Speech Coding, VAD identifies segments of audio that do not contain voice so that the codec can compress them more aggressively or skip them.
In Voice User Interfaces, VAD activates the system after detecting voice activity.
In Speech-to-Text, VAD is used to mark the end of the utterance.
In media applications, VAD finds voiced segments within large audio files.
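
As a rough illustration of the last two use cases, the sketch below merges per-frame voice decisions into voiced segments with start and end times. It assumes a VAD has already produced one boolean decision per frame; the names `voiced_segments`, `decisions`, and `frame_sec` are hypothetical and not part of any particular engine's API.

```python
from typing import List, Tuple


def voiced_segments(decisions: List[bool], frame_sec: float) -> List[Tuple[float, float]]:
    """Merge runs of consecutive voiced frames into (start_sec, end_sec) segments.

    `decisions` holds one boolean per frame (True = voice detected).
    `frame_sec` is the duration of a single frame in seconds (an assumption).
    """
    segments: List[Tuple[float, float]] = []
    start = None  # index of the first frame of the current voiced run
    for i, voiced in enumerate(decisions):
        if voiced and start is None:
            start = i  # a voiced run begins
        elif not voiced and start is not None:
            # the voiced run ended on the previous frame
            segments.append((start * frame_sec, i * frame_sec))
            start = None
    if start is not None:
        # the audio ended while voice was still active
        segments.append((start * frame_sec, len(decisions) * frame_sec))
    return segments
```

The end time of each segment can serve as a simple end-of-utterance marker. Practical systems usually wait for a short run of unvoiced frames before closing a segment, so that brief pauses do not split an utterance.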

Challenges

The main challenge of VAD is distinguishing the human voice from noise. In quiet environments VAD has little difficulty: it is easy to tell silence from a person speaking. Once background noise is added (such as a fan or a car engine), detecting voice becomes much harder. Digital Signal Processing (DSP) algorithms can help by denoising the signal or by exposing spectral features for inspection. However, DSP algorithms fail when the noise itself resembles human speech, e.g. babble noise. A Deep Learning model can learn the subtle differences between voice-like noise and actual human voice activity, which improves VAD accuracy.
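
To make the difficulty concrete, here is a minimal sketch of a naive energy-threshold detector, one of the simplest possible approaches. It is not how any specific engine works; the function names and the threshold value are illustrative assumptions.

```python
import math
from typing import List, Sequence


def rms(frame: Sequence[float]) -> float:
    """Root-mean-square level of one non-empty frame of samples in [-1.0, 1.0]."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def energy_vad(frames: Sequence[Sequence[float]], threshold: float) -> List[bool]:
    """Flag a frame as voice whenever its RMS level exceeds `threshold`."""
    return [rms(frame) > threshold for frame in frames]


# Toy frames (hypothetical values): silence, speech-level signal, loud noise.
silence = [0.0, 0.0, 0.0, 0.0]
speech = [0.5, -0.5, 0.5, -0.5]
noise = [0.3, -0.3, 0.3, -0.3]

decisions = energy_vad([silence, speech, noise], threshold=0.1)
```

With these toy values, the silent frame is rejected and the speech frame is accepted, but the noise frame is also flagged as voice because its energy exceeds the threshold. Energy alone cannot separate loud noise from speech, which is why spectral features or a learned model are needed.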

Performance Metrics

Accuracy

We measure the accuracy of a VAD using two metrics: True Positive Rate (TPR), the fraction of voiced audio that is correctly detected as voice, and False Positive Rate (FPR), the fraction of non-voice audio that is incorrectly flagged as voice. A Receiver Operating Characteristic (ROC) curve depicts these two metrics in one graph. The ROC curve is a standard tool for inspecting the performance of binary classifiers across different decision thresholds. It allows the designer to study the trade-off between detection rate and false positive rate.
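
The two metrics can be computed with a few lines of code. The sketch below assumes the VAD outputs a voice probability per frame and that ground-truth labels are available; the function names and data layout are hypothetical.

```python
from typing import List, Sequence, Tuple


def tpr_fpr(scores: Sequence[float], labels: Sequence[bool], threshold: float) -> Tuple[float, float]:
    """Return (TPR, FPR) when frames scoring >= `threshold` are classified as voice."""
    tp = fp = fn = tn = 0
    for score, is_voice in zip(scores, labels):
        predicted_voice = score >= threshold
        if predicted_voice and is_voice:
            tp += 1  # voice correctly detected
        elif predicted_voice and not is_voice:
            fp += 1  # non-voice flagged as voice
        elif is_voice:
            fn += 1  # voice missed
        else:
            tn += 1  # non-voice correctly rejected
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return tpr, fpr


def roc_points(scores: Sequence[float], labels: Sequence[bool],
               thresholds: Sequence[float]) -> List[Tuple[float, float]]:
    """Evaluate (TPR, FPR) at each threshold; plotting TPR against FPR gives the ROC curve."""
    return [tpr_fpr(scores, labels, t) for t in thresholds]
```

Lowering the threshold raises the TPR but typically raises the FPR as well. Sweeping the threshold from high to low traces out the curve from the bottom-left to the top-right of the graph.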

The figure below compares the accuracy of the Picovoice Cobra VAD engine against WebRTC VAD. Cobra achieves a higher TPR at any given FPR and is therefore more accurate.

Latency

Do you need a real-time response? If yes, offline implementations of VAD are not an option, because they need the entire audio recording before they can start processing. Streaming architectures ingest and process audio data in chunks (frames). Each streaming VAD has a fixed latency, which ranges from milliseconds to seconds depending on the implementation.
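
A minimal sketch of frame-based streaming follows. The frame length of 512 samples and the 16 kHz sample rate are illustrative assumptions, not values taken from any specific engine, and `frames` and `frame_duration_ms` are hypothetical helper names.

```python
from typing import Iterator, Sequence


def frames(samples: Sequence[float], frame_length: int) -> Iterator[Sequence[float]]:
    """Yield consecutive, non-overlapping frames; a trailing partial frame is dropped."""
    for start in range(0, len(samples) - frame_length + 1, frame_length):
        yield samples[start:start + frame_length]


def frame_duration_ms(frame_length: int, sample_rate: int) -> float:
    """Time needed to buffer one frame, in milliseconds."""
    return 1000.0 * frame_length / sample_rate


# Assumed values: 512-sample frames at 16 kHz take 32 ms each to buffer.
buffering_ms = frame_duration_ms(512, 16000)
```

The frame duration is only a lower bound on latency. Any look-ahead or smoothing that a VAD applies across frames adds to it.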

Runtime Efficiency

For battery-powered applications, an efficient VAD extends battery life and hence improves the user experience. For large-scale processing, an efficient VAD implementation can work through massive amounts of audio data much faster.