Speaker Diarization Benchmark

Speaker diarization labels segments of audio with speaker identities. It is often paired with a speech-to-text (STT) engine so that a transcript also carries speaker labels. This benchmark compares Picovoice Falcon with well-known cloud-based STT engines and specialized speaker diarization engines.

Methodology

Speech Corpus

VoxConverse is a widely used diarization dataset containing multi-speaker conversations in multiple languages. Because this benchmark includes cloud-based speech-to-text engines with built-in speaker diarization, we use only the English subset of the dataset's test section.

Metrics

Diarization Error Rate (DER)

The Diarization Error Rate (DER) is the most common metric for evaluating speaker diarization systems. DER sums the durations of three types of error (speaker confusion, false alarms, and missed detections) and divides that sum by the total duration of reference speech.
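The calculation can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's actual scoring code: the function name, argument names, and example durations are hypothetical, and the denominator is assumed to be the total duration of reference speech, as in the conventional DER definition.

```python
def diarization_error_rate(confusion, false_alarm, missed, reference_speech):
    """Return DER as a fraction, given error durations in seconds.

    All arguments are durations in seconds. `reference_speech` is assumed
    to be the total speech time in the ground-truth annotation.
    """
    if reference_speech <= 0:
        raise ValueError("reference_speech must be positive")
    return (confusion + false_alarm + missed) / reference_speech


# Hypothetical example: 10 minutes of reference speech with
# 12 s of speaker confusion, 6 s of false alarms, and 18 s of missed speech.
der = diarization_error_rate(
    confusion=12.0, false_alarm=6.0, missed=18.0, reference_speech=600.0
)
print(f"DER = {der:.1%}")
```

Because false alarms count as errors while the denominator only covers reference speech, DER can exceed 100% for a system that produces a lot of spurious speech.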

Jaccard Error Rate (JER)

The Jaccard Error Rate (JER) is a more recent metric for evaluating speaker diarization, introduced for the DIHARD II challenge. It is based on the Jaccard similarity index, which measures the overlap between two sets of segments. In short, JER gives each speaker equal weight, regardless of how long they speak. For a more in-depth explanation, refer to the DIHARD II paper.
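The equal weighting can be illustrated with a small Python sketch. It makes several simplifying assumptions that are not part of the benchmark: each speaker's speech is represented as a set of fixed-length frame indices, the mapping from reference speakers to system speakers is given (a real scorer would first compute an optimal mapping), and all names and data are hypothetical.

```python
def jaccard_error_rate(reference, system, mapping):
    """Average per-speaker Jaccard error over all reference speakers.

    reference, system: dict of speaker label -> set of frame indices.
    mapping: dict of reference label -> system label (assumed precomputed).
    A reference speaker with no mapped system speaker scores an error of 1.
    """
    errors = []
    for ref_label, ref_frames in reference.items():
        sys_frames = system.get(mapping.get(ref_label), set())
        union = ref_frames | sys_frames
        if not union:
            continue  # speaker has no frames at all; nothing to score
        jaccard_index = len(ref_frames & sys_frames) / len(union)
        errors.append(1.0 - jaccard_index)
    if not errors:
        raise ValueError("no reference speaker with speech to score")
    return sum(errors) / len(errors)


# Hypothetical 100-frame recording: speaker A talks for 90 frames, B for 10.
reference = {"A": set(range(0, 90)), "B": set(range(90, 100))}
# The system attributes the first 5 frames of B's turn to A's cluster.
system = {"s1": set(range(0, 95)), "s2": set(range(95, 100))}
mapping = {"A": "s1", "B": "s2"}

jer = jaccard_error_rate(reference, system, mapping)
print(f"JER = {jer:.1%}")
```

In this example only 5 of 100 frames are mislabeled, so a duration-weighted metric such as DER stays low. JER is much higher because speaker B, who speaks briefly, loses half of their frames and counts as much as speaker A in the average.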

Total Memory Usage

This metric reports how much memory the diarization engine consumes while processing audio files. It is the total memory used, measured in gigabytes (GB).

Core-Hour

The Core-Hour metric is used to evaluate the computational efficiency of the diarization engine. This metric indicates how much audio can be processed in an hour using a single CPU core.

"Total Memory Usage" and "Core-Hour" are not applicable to cloud-based engines. All measurements are carried out on a machine with an AMD CPU (`AMD Ryzen 7 5700X (16) @ 3.400G`), 64 GB of RAM, and NVMe storage.

Results

Accuracy

The figures below show the average DER and JER of each engine.

Resource Utilization

Usage

The data and code used to create this benchmark are available on GitHub under the permissive Apache 2.0 license. Detailed instructions for benchmarking individual engines are provided in the following documents:
