Voice Activity Detection (VAD) is a binary classifier that detects the presence of human speech in audio. VAD is one of the main building blocks of many Speech Processing, Speech Recognition, and Speaker Recognition systems.

  • In Speech Coding, VAD identifies segments of audio that do not contain voice so that the codec can compress them more aggressively.
  • In Voice User Interfaces, VAD initiates the system after detecting voice activity.
  • In Speech-to-Text, VAD is used to mark the end of the utterance.
  • In media applications, VAD finds voiced segments within large audio files.

Challenges

The main challenge of VAD is to distinguish the human voice from noise. In a quiet environment, it is trivial to recognize voice activity based on loudness alone. Background noise, e.g. a fan or a car engine, makes VAD harder. Digital Signal Processing (DSP) algorithms help by either denoising the signal or exposing spectral features for inspection. When the noise resembles human speech, e.g. babble noise, DSP algorithms tend to fail. A Deep Learning model can learn the subtle differences between voice-like noises and human voice activity.
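
To make the quiet-environment case concrete, here is a minimal sketch of a loudness-based detector. It is an illustration only, not how Cobra or WebRTC VAD work. The function names, the 160-sample frame length, and the -30 dB threshold are assumptions chosen for the example.

```python
import math

def frame_rms_db(frame):
    """Return the RMS level of a frame of float samples in dB relative to full scale (1.0)."""
    if not frame:
        return -math.inf
    mean_square = sum(s * s for s in frame) / len(frame)
    if mean_square == 0.0:
        return -math.inf  # digital silence
    return 10.0 * math.log10(mean_square)

def energy_vad(samples, frame_length=160, threshold_db=-30.0):
    """Label each full frame True (voice) when its RMS level exceeds threshold_db.

    frame_length and threshold_db are illustrative defaults, not recommended values.
    A trailing partial frame is ignored.
    """
    decisions = []
    for start in range(0, len(samples) - frame_length + 1, frame_length):
        frame = samples[start:start + frame_length]
        decisions.append(frame_rms_db(frame) > threshold_db)
    return decisions
```

A detector like this flags any loud frame as voice, so a fan or an engine that exceeds the threshold produces false positives. That is the failure mode described above.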

Performance Metrics

Accuracy

We measure the accuracy of a VAD using two metrics: True Positive Rate (TPR), the fraction of speech that is detected, and False Positive Rate (FPR), the fraction of non-speech that is wrongly flagged as speech. A Receiver Operating Characteristic (ROC) curve depicts both metrics in one graph. The ROC curve is a standard tool for inspecting the performance of binary classifiers across different decision thresholds. It allows the designer to study the trade-off between detection rate and false positive rate.
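
As a rough sketch of how the points on a ROC curve can be computed, the snippet below sweeps a decision threshold over per-frame voice scores and reports TPR = TP / (TP + FN) and FPR = FP / (FP + TN) at each threshold. The function name and the input layout (one score and one ground-truth label per frame) are assumptions made for illustration.

```python
def roc_points(scores, labels, thresholds):
    """Return a list of (threshold, fpr, tpr) tuples.

    scores: per-frame voice scores, higher means more voice-like (assumed layout).
    labels: ground truth per frame, True for speech.
    A frame is classified as voice when its score is >= the threshold.
    """
    positives = sum(1 for y in labels if y)
    negatives = len(labels) - positives
    points = []
    for t in thresholds:
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and not y)
        tpr = tp / positives if positives else 0.0
        fpr = fp / negatives if negatives else 0.0
        points.append((t, fpr, tpr))
    return points
```

Lowering the threshold moves a point up and to the right on the curve: more speech is detected, at the cost of more false alarms.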

The figure below compares the accuracy of the Picovoice Cobra VAD engine against WebRTC VAD. Cobra achieves a higher TPR at any given FPR and is therefore more accurate.

ROC Curve Comparing Picovoice Cobra VAD with WebRTC Voice Activity Detector

Latency

Do you need a real-time response? If yes, offline (batch) implementations of VAD are not an option, as they need the entire audio before they can start processing. Streaming architectures ingest and process audio in chunks (frames). Each streaming VAD has a fixed latency, which ranges from milliseconds to seconds depending on the implementation.
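
The sketch below illustrates frame-by-frame processing. A streaming engine has to buffer a full frame before it can emit a decision, so the frame duration is a lower bound on its latency. The helper names and the `is_voice` callback are hypothetical, and the 512-sample frame at 16 kHz in the usage note is an assumed configuration, not a property of any particular engine.

```python
def frames(samples, frame_length):
    """Yield consecutive, non-overlapping frames; a trailing partial frame is dropped."""
    for start in range(0, len(samples) - frame_length + 1, frame_length):
        yield samples[start:start + frame_length]

def frame_duration_ms(frame_length, sample_rate):
    """Duration of one frame in milliseconds (the minimum buffering delay)."""
    return 1000.0 * frame_length / sample_rate

def stream_vad(samples, frame_length, is_voice):
    """Yield (frame_index, decision) pairs as each frame becomes available.

    is_voice is any per-frame decision function supplied by the caller.
    """
    for index, frame in enumerate(frames(samples, frame_length)):
        yield index, is_voice(frame)
```

For example, `frame_duration_ms(512, 16000)` gives 32.0, so an engine with that configuration cannot respond in under 32 ms, before counting any internal look-ahead or smoothing.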

Runtime Efficiency

For battery-powered applications, an efficient VAD extends battery life and hence improves user experience. For large-scale processing, an efficient VAD implementation gets through massive amounts of data faster.
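
One common way to quantify runtime efficiency is the real-time factor: processing time divided by audio duration, where lower is better and values below 1 mean faster than real time. The sketch below measures it for an arbitrary per-frame function. `real_time_factor`, `process_frame`, and `frame_list` are hypothetical names, and the result depends on the machine it runs on.

```python
import time

def real_time_factor(process_frame, frame_list, frame_length, sample_rate):
    """Return elapsed processing time divided by the duration of the audio processed.

    process_frame: any callable that takes one frame (stand-in for a VAD engine).
    frame_list: list of frames, each holding frame_length samples.
    """
    if not frame_list:
        raise ValueError("frame_list must contain at least one frame")
    audio_seconds = len(frame_list) * frame_length / sample_rate
    start = time.perf_counter()
    for frame in frame_list:
        process_frame(frame)
    elapsed = time.perf_counter() - start
    return elapsed / audio_seconds
```

Measuring over many frames and repeating the run a few times gives a steadier number than a single short measurement.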