Speaker Recognition Benchmark

Speaker recognition is the identification of a person in an audio stream based on their voiceprint. It determines whether a specific individual is speaking at a given time. This benchmark compares Picovoice Eagle against the well-known open-source speaker recognition engines listed below:

Methodology

This benchmark assumes that the entire enrollment audio is available during the enrollment step. The speaker recognition engine is then used to detect the enrolled speaker within a stream of audio.

Metrics

Equal Error Rate

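The Equal Error Rate (EER) metric treats the recognition system as a binary classifier and is computed from two error rates, the false acceptance rate (FAR) and the false rejection rate (FRR):

$$\mathrm{FAR} = \frac{\mathrm{FP}}{\mathrm{FP} + \mathrm{TN}}, \qquad \mathrm{FRR} = \frac{\mathrm{FN}}{\mathrm{FN} + \mathrm{TP}}$$

where FP, TN, FN, and TP are the numbers of false positives, true negatives, false negatives, and true positives.

The EER is the operating point at which FAR and FRR are equal. Their common value is termed the equal error rate:

$$\mathrm{EER} = \mathrm{FAR} = \mathrm{FRR}$$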
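As an illustration of the metric, the following is a minimal Python sketch that sweeps a decision threshold over similarity scores and reports the point where FAR and FRR are closest. It is not the benchmark's own code: the function names, the accept rule (`score >= threshold`), and the toy scores are all assumptions made for the example. With a finite number of trials the two rates rarely cross exactly, so the sketch reports the mean of FAR and FRR at the closest threshold.

```python
from typing import List, Sequence, Tuple


def far_frr(scores: Sequence[float], labels: Sequence[bool],
            threshold: float) -> Tuple[float, float]:
    """Return (FAR, FRR) when a trial is accepted if score >= threshold.

    labels[i] is True when trial i really is the enrolled speaker.
    """
    fp = tn = fn = tp = 0
    for score, is_target in zip(scores, labels):
        accepted = score >= threshold
        if is_target:
            if accepted:
                tp += 1
            else:
                fn += 1
        else:
            if accepted:
                fp += 1
            else:
                tn += 1
    far = fp / (fp + tn) if (fp + tn) else 0.0
    frr = fn / (fn + tp) if (fn + tp) else 0.0
    return far, frr


def equal_error_rate(scores: Sequence[float],
                     labels: Sequence[bool]) -> Tuple[float, float]:
    """Return (eer, threshold) at the threshold minimizing |FAR - FRR|."""
    # Every observed score is a candidate threshold, plus one value above
    # the maximum so that "reject everything" is also considered.
    candidates: List[float] = sorted(set(scores)) + [max(scores) + 1.0]
    best_gap = None
    best_eer = 0.0
    best_threshold = candidates[0]
    for threshold in candidates:
        far, frr = far_frr(scores, labels, threshold)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap = gap
            best_eer = (far + frr) / 2.0
            best_threshold = threshold
    return best_eer, best_threshold


# Toy data (made up for illustration): four target and four impostor trials.
scores = [0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1]
labels = [True, True, True, True, False, False, False, False]
eer, threshold = equal_error_rate(scores, labels)
```

On this toy data, one impostor (score 0.6) is accepted and one target (score 0.4) is rejected at a threshold of 0.6, so FAR and FRR are both 0.25 and the EER is 0.25.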

Model Size

The size of the model at initialization is used to evaluate the memory consumption of the speaker recognition engine. It indicates the minimum amount of RAM required to use the engine.
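One way to approximate such a figure is to compare the process's peak resident memory before and after initializing an engine. The sketch below is an assumption about how this could be measured, not a description of the benchmark's actual procedure. `fake_engine_init` is a hypothetical stand-in for a real engine constructor, and the standard-library `resource` module used here is available on Unix-like systems only.

```python
import resource
import sys
from typing import Any, Callable, Tuple


def peak_rss_mib() -> float:
    """Peak resident set size of this process, in MiB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


def memory_added_by_init(init_fn: Callable[[], Any]) -> Tuple[Any, float]:
    """Run init_fn and return (engine, growth of peak RSS in MiB)."""
    before = peak_rss_mib()
    engine = init_fn()
    after = peak_rss_mib()
    return engine, after - before


def fake_engine_init() -> bytearray:
    """Hypothetical engine: allocates 64 MiB to imitate loading a model."""
    return bytearray(64 * 1024 * 1024)


engine, added_mib = memory_added_by_init(fake_engine_init)
print(f"Peak RSS grew by about {added_mib:.1f} MiB during initialization")
```

Because peak RSS never decreases, the difference is a lower bound when the process has previously used more memory than it holds at the time of the call. Running the measurement in a fresh process gives a cleaner number.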

All measurements are carried out on a machine with an AMD CPU (`AMD Ryzen 7 5700X (16) @ 3.400G`), 64 GB of RAM, and NVMe storage.

Results

Accuracy

The figure below shows the average Equal Error Rate of each engine.

Resource Utilization

Usage

The data and code used to create this benchmark are available on GitHub under the permissive Apache 2.0 license. Detailed instructions for benchmarking individual engines are provided in the following documents:
