Speaker Recognition Benchmark
Speaker recognition identifies a person in an audio stream from their voiceprint, determining whether a specific individual is speaking at a given time. This benchmark compares Picovoice Eagle against the well-known open-source speaker recognition engines listed below:
For this benchmark, it is assumed that the enrollment step takes place offline. Subsequently, the speaker recognition engine is used to detect the enrolled speaker within a stream of audio frames. The duration of each audio frame is 96 ms.
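As a rough illustration of the streaming setup, the sketch below splits a buffer of audio samples into consecutive 96 ms frames. The 16 kHz sample rate and the function name `split_into_frames` are assumptions made for this example and are not taken from the benchmark code.

```python
from typing import Iterator, Sequence


def split_into_frames(
    samples: Sequence[int],
    sample_rate: int = 16000,  # assumed sample rate, not stated in the benchmark
    frame_ms: int = 96,        # frame duration stated in the benchmark
) -> Iterator[Sequence[int]]:
    """Yield consecutive, non-overlapping frames of `frame_ms` milliseconds."""
    frame_length = sample_rate * frame_ms // 1000  # samples per frame
    # Trailing samples that do not fill a whole frame are dropped.
    for start in range(0, len(samples) - frame_length + 1, frame_length):
        yield samples[start:start + frame_length]
```

Under these assumptions each frame holds 1536 samples, and each frame would be passed to the recognition engine in turn.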
VoxConverse is a well-known dataset used in speaker identification. It contains conversations in many languages, annotated with the time spans in which each speaker talks.
The Detection Accuracy (DA) metric treats the recognition system as a binary classifier and measures the fraction of the audio that is classified correctly:
$$\mathrm{DA} = \frac{T_{TP} + T_{TN}}{T_{total}}$$
where $T_{TP}$ is the duration of true positives (segments correctly classified as the enrolled speaker), $T_{TN}$ is the duration of true negatives (segments correctly identified as non-enrolled speakers), and $T_{total}$ is the overall duration of the input audio signal.
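A minimal sketch of this computation follows, assuming the reference annotation and the engine output have already been reduced to one boolean label per 96 ms frame (True when the enrolled speaker is talking). The function name and the per-frame representation are illustrative assumptions, not the benchmark's actual implementation.

```python
from typing import Sequence


def detection_accuracy(
    reference: Sequence[bool],
    predicted: Sequence[bool],
    frame_seconds: float = 0.096,
) -> float:
    """Return (true-positive duration + true-negative duration) / total duration."""
    if len(reference) != len(predicted):
        raise ValueError("reference and predicted must have the same length")
    if not reference:
        raise ValueError("at least one frame is required")

    # Count frames where both labels agree, split by class.
    tp_frames = sum(1 for r, p in zip(reference, predicted) if r and p)
    tn_frames = sum(1 for r, p in zip(reference, predicted) if not r and not p)

    tp_seconds = tp_frames * frame_seconds
    tn_seconds = tn_frames * frame_seconds
    total_seconds = len(reference) * frame_seconds
    return (tp_seconds + tn_seconds) / total_seconds
```

For example, with five frames of which two are true positives and one is a true negative, the function returns 0.6.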
Detection error rate
The Detection Error Rate (DER) metric measures the duration of errors relative to the total duration of enrolled speaker segments:
$$\mathrm{DER} = \frac{T_{FA} + T_{miss}}{T_{enrolled}}$$
where $T_{FA}$ and $T_{miss}$ denote the duration of false alarms and missed detections for the enrolled speaker, and $T_{enrolled}$ is the overall duration of enrolled speaker segments in the input audio signal.
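The same idea can be sketched in code, again assuming one boolean label per 96 ms frame for both the reference and the engine output. The names used here are hypothetical and the benchmark repository may compute the metric differently.

```python
from typing import Sequence


def detection_error_rate(
    reference: Sequence[bool],
    predicted: Sequence[bool],
    frame_seconds: float = 0.096,
) -> float:
    """Return (false-alarm duration + missed duration) / enrolled-speaker duration."""
    if len(reference) != len(predicted):
        raise ValueError("reference and predicted must have the same length")

    # False alarm: engine reports the enrolled speaker when they are silent.
    fa_frames = sum(1 for r, p in zip(reference, predicted) if p and not r)
    # Missed detection: enrolled speaker is talking but the engine reports nothing.
    miss_frames = sum(1 for r, p in zip(reference, predicted) if r and not p)
    enrolled_frames = sum(1 for r in reference if r)
    if enrolled_frames == 0:
        raise ValueError("reference contains no enrolled-speaker frames")

    error_seconds = (fa_frames + miss_frames) * frame_seconds
    enrolled_seconds = enrolled_frames * frame_seconds
    return error_seconds / enrolled_seconds
```

Because the denominator covers only enrolled-speaker speech, the value can exceed 1 when false alarms are frequent.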
The Core-Hour metric is used to evaluate the computational efficiency of the speaker recognition engine, indicating the number of hours required to process one hour of audio on a single CPU core.
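One way to estimate this figure is to divide the CPU time consumed while processing a recording by the duration of that recording. The sketch below uses Python's `time.process_time`, which reports CPU time for the current process. The helper names and the `process_fn` callback are assumptions for illustration, not part of the benchmark code.

```python
import time
from typing import Callable


def core_hour(cpu_seconds: float, audio_seconds: float) -> float:
    """Core-hours needed per hour of audio: CPU time divided by audio duration."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return cpu_seconds / audio_seconds


def measure_core_hour(process_fn: Callable[[], None], audio_seconds: float) -> float:
    """Run `process_fn` (a hypothetical callback that processes the whole
    recording) and return the resulting core-hour figure."""
    start = time.process_time()
    process_fn()
    cpu_seconds = time.process_time() - start
    return core_hour(cpu_seconds, audio_seconds)
```

For instance, an engine that uses 36 seconds of CPU time on a one-hour recording would score 0.01 core-hours.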
The figures below show the average performance of each engine, measured as the average Detection Error Rate.
The data and code used to create this benchmark are available on GitHub under the permissive Apache 2.0 license. Detailed instructions for benchmarking individual engines are provided in the following documents: