Voice Activity Detection (VAD) is a binary classifier that detects the presence of human speech in audio. VAD is one
of the main building blocks of many Speech Processing, Speech Recognition, and Speaker Recognition systems.
- In Speech Coding, VAD compresses segments of audio that do not contain voice.
- In Voice User Interfaces, VAD initiates the system after detecting voice activity.
- In Speech-to-Text, VAD is used to mark the end of the utterance.
- In media applications, VAD finds voiced segments within large audio files.
Challenges
The main challenge of VAD is distinguishing between noise and the human voice. In quiet environments VAD faces little difficulty: it is easy to tell silence apart from a person speaking. Once background noise is added (such as a fan or car engine), detecting voices becomes challenging. Digital Signal Processing (DSP) algorithms can help by denoising the signal or by exposing spectral features for inspection. However, DSP algorithms fail when the noise itself resembles human speech, e.g. babble noise. A deep learning model can learn the subtle differences between voice-like noise and actual human voice activity, which improves VAD accuracy.
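To make the difficulty concrete, here is a minimal sketch of a naive energy-threshold detector. It is not the method used by any engine named in this article. The function names, the -30 dB threshold, the 16 kHz sample rate, and the synthetic signals (a sine tone standing in for voiced audio, Gaussian noise standing in for a fan) are all assumptions chosen for illustration.

```python
import math
import random


def frame_energy_db(frame):
    """Mean power of one frame in decibels, relative to a full-scale amplitude of 1.0."""
    power = sum(s * s for s in frame) / len(frame)
    return 10.0 * math.log10(power + 1e-12)  # small offset avoids log(0)


def energy_vad(frame, threshold_db=-30.0):
    """Hypothetical baseline: call a frame 'voice' if its energy exceeds a fixed threshold."""
    return frame_energy_db(frame) > threshold_db


rng = random.Random(0)  # fixed seed so the synthetic signals are reproducible
n = 512                 # assumed frame length in samples
sample_rate = 16000     # assumed sample rate in Hz

# Near-silent room: very low-level noise.
silence = [rng.gauss(0.0, 0.001) for _ in range(n)]
# Stand-in for voiced audio: a 200 Hz tone (real speech is far more complex).
tone = [0.3 * math.sin(2 * math.pi * 200 * i / sample_rate) for i in range(n)]
# Stand-in for a loud fan: broadband noise with energy similar to the tone.
fan = [rng.gauss(0.0, 0.2) for _ in range(n)]

for name, frame in [("silence", silence), ("tone", tone), ("fan", fan)]:
    print(f"{name}: {frame_energy_db(frame):.1f} dB, voice={energy_vad(frame)}")
```

The detector separates silence from the tone, but it also flags the fan noise as voice because energy alone carries no information about what produced the sound. Spectral features narrow this gap, and learned models narrow it further for speech-like noise.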
Performance Metrics
Accuracy
We measure the accuracy of a VAD using two metrics: True Positive Rate (TPR), the fraction of speech that is correctly
detected, and False Positive Rate (FPR), the fraction of non-speech that is incorrectly flagged as speech. A Receiver
Operating Characteristic (ROC) curve depicts these two metrics in one graph. The ROC curve is a well-known tool for
inspecting the performance of binary classifiers across different decision thresholds. It allows the designer to study
the trade-off between detection rate and false positive rate.
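As a rough illustration of how ROC points are obtained, the sketch below sweeps a decision threshold over per-frame voice scores and computes FPR and TPR at each setting. The function name `roc_points` is hypothetical, and the scores and labels are toy values made up for the example, not measurements of any engine.

```python
def roc_points(scores, labels, thresholds):
    """Return (threshold, FPR, TPR) tuples for a binary voice/non-voice classifier.

    scores: per-frame voice scores in [0, 1]
    labels: ground truth per frame, 1 for voice and 0 for non-voice
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = []
    for t in thresholds:
        # A frame is predicted as voice when its score reaches the threshold.
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((t, fp / negatives, tp / positives))
    return points


# Toy data: four voiced frames followed by four non-voiced frames.
scores = [0.95, 0.80, 0.60, 0.40, 0.70, 0.30, 0.20, 0.05]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

for t, fpr, tpr in roc_points(scores, labels, [0.25, 0.5, 0.75]):
    print(f"threshold={t:.2f}  FPR={fpr:.2f}  TPR={tpr:.2f}")
# threshold=0.25  FPR=0.50  TPR=1.00
# threshold=0.50  FPR=0.25  TPR=0.75
# threshold=0.75  FPR=0.00  TPR=0.50
```

Raising the threshold lowers both rates, which is the trade-off the ROC curve visualizes. Plotting TPR against FPR for many thresholds traces the full curve.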
The figure below compares the accuracy of the Picovoice Cobra VAD engine against WebRTC VAD and Silero VAD. You can see that Cobra gives
a higher TPR at any given FPR, and is therefore more accurate.
Latency
Do you need a real-time response? If yes, offline implementations of VAD are not an option, as they need the
entire audio before they can start processing. Streaming architectures can ingest and process audio data in chunks (frames).
Each streaming VAD has a fixed latency, which ranges from milliseconds to seconds depending on the implementation.
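The sketch below shows, under stated assumptions, how frame-based streaming relates to latency. A streaming detector cannot decide on a frame before the whole frame has arrived, so the frame duration is a lower bound on its latency. The helper names, the 512-sample frame length, and the 16 kHz sample rate are illustrative assumptions, and `classify` is a placeholder for whatever per-frame detector is in use.

```python
def frame_duration_ms(frame_length, sample_rate):
    """Duration of one frame in milliseconds: a lower bound on streaming latency."""
    return 1000.0 * frame_length / sample_rate


def frames(samples, frame_length):
    """Yield consecutive non-overlapping full frames; a trailing partial frame is dropped."""
    for start in range(0, len(samples) - frame_length + 1, frame_length):
        yield samples[start:start + frame_length]


def stream_vad(samples, frame_length, classify):
    """Run a per-frame classifier over the audio, one decision per frame."""
    return [classify(frame) for frame in frames(samples, frame_length)]


# Assumed values: 512-sample frames at 16 kHz.
print(frame_duration_ms(512, 16000))  # 32.0

# 1600 samples of placeholder audio yield three full frames of 512 samples.
audio = [0.0] * 1600
decisions = stream_vad(audio, 512, classify=lambda frame: max(frame) > 0.1)
print(decisions)  # [False, False, False]
```

Any internal buffering, look-ahead, or smoothing across frames adds to this lower bound, which is one reason latencies differ so widely between implementations.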
Runtime Efficiency
For battery-powered applications, an efficient VAD extends battery life and hence improves user experience. In
large-scale batch processing, an efficient VAD implementation can work through massive amounts of data much faster.
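One common way to quantify runtime efficiency is the real-time factor: processing time divided by the duration of the audio processed. The sketch below is a minimal way to measure it, assuming a per-frame callable. The function name, the placeholder workload, and the frame parameters are assumptions for illustration, and measured values depend on the hardware.

```python
import time


def real_time_factor(process_frame, frame_list, frame_length, sample_rate):
    """Processing time divided by audio duration; below 1.0 means faster than real time."""
    start = time.perf_counter()
    for frame in frame_list:
        process_frame(frame)
    elapsed = time.perf_counter() - start
    audio_seconds = len(frame_list) * frame_length / sample_rate
    return elapsed / audio_seconds


# Placeholder workload: 100 frames of 512 samples at 16 kHz (3.2 seconds of audio).
frame_list = [[0.0] * 512 for _ in range(100)]
rtf = real_time_factor(lambda frame: sum(s * s for s in frame), frame_list, 512, 16000)
print(f"real-time factor: {rtf:.6f}")
```

A lower real-time factor means less CPU time per second of audio, which translates into longer battery life on devices and higher throughput in batch jobs.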







