Speech-to-Text Benchmark
Automatic speech recognition (ASR) is the core building block of most voice applications, to the point that practitioners use speech-to-text (STT) and speech recognition interchangeably. ASR systems achieving state-of-the-art accuracy typically run in the cloud. Amazon Transcribe, Azure Speech-to-Text, Google Speech-to-Text, and IBM Watson Speech-to-Text are the current dominant transcription API providers.
STT's reliance on the cloud makes it costly, less reliable, and laggy. On-device ASRs can be orders of magnitude more cost-effective than their API counterparts. Additionally, offline ASRs are inherently more reliable and responsive because they remove the variable delay introduced by network connectivity. Running an ASR engine offline without sacrificing accuracy is challenging: common approaches to audio transcription rely on massive graphs for language modelling and compute-intensive neural networks for acoustic modelling. Picovoice's Leopard speech-to-text engine takes a different approach to achieve cloud-level accuracy while running offline on commodity hardware like a Raspberry Pi.
Below is a series of benchmarks to back our claims. They also empower customers to make data-driven decisions using the datasets that matter to their business.
The real-time transcription benchmark is also available if you’re interested in evaluating the performance of Cheetah Streaming Speech-to-Text.
Speech-to-Text Benchmark Languages
- English Speech-to-Text Benchmark
- French Speech-to-Text Benchmark
- German Speech-to-Text Benchmark
- Spanish Speech-to-Text Benchmark
- Italian Speech-to-Text Benchmark
- Portuguese Speech-to-Text Benchmark
Speech-to-Text Benchmark Metrics
Word Error Rate (WER)
Word error rate is the word-level edit distance between a reference transcript and the output of the speech-to-text engine, divided by the number of words in the reference transcript. In other words, WER is the ratio of transcription errors (substitutions, insertions, and deletions) to the total number of words spoken. Despite its limitations, WER is the most commonly used metric for measuring speech-to-text accuracy. A lower WER (fewer errors) means better recognition accuracy.
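As an illustration (not the exact implementation used in this benchmark), the Python sketch below computes WER as the word-level Levenshtein distance divided by the length of the reference transcript:

```python
# Minimal WER sketch: word-level Levenshtein edit distance divided by
# the number of words in the reference transcript.
def word_error_rate(reference: str, hypothesis: str) -> float:
    ref_words = reference.lower().split()
    hyp_words = hypothesis.lower().split()

    # dp[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words.
    dp = [[0] * (len(hyp_words) + 1) for _ in range(len(ref_words) + 1)]
    for i in range(len(ref_words) + 1):
        dp[i][0] = i
    for j in range(len(hyp_words) + 1):
        dp[0][j] = j
    for i in range(1, len(ref_words) + 1):
        for j in range(1, len(hyp_words) + 1):
            substitution = dp[i - 1][j - 1] + (ref_words[i - 1] != hyp_words[j - 1])
            deletion = dp[i - 1][j] + 1
            insertion = dp[i][j - 1] + 1
            dp[i][j] = min(substitution, deletion, insertion)

    return dp[len(ref_words)][len(hyp_words)] / len(ref_words)


# One substitution and one insertion over a 4-word reference -> WER = 0.5
print(word_error_rate("the quick brown fox", "the quick brown box jumped"))
```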
Core-Hour
The Core-Hour metric is used to evaluate the computational efficiency of the speech-to-text engine, indicating the number of CPU hours required to process one hour of audio. A speech-to-text engine with a lower Core-Hour is more computationally efficient. We omit this metric for cloud-based engines.
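As a hypothetical illustration of the metric (the numbers below are made up, not benchmark results), an engine that keeps 10 cores busy for 0.13 hours of wall-clock time to process 2.6 hours of audio uses (0.13 × 10) / 2.6 = 0.5 Core-Hours per hour of audio:

```python
# Hypothetical numbers for illustration only (not measured results):
audio_hours = 2.6        # total duration of audio processed
wall_clock_hours = 0.13  # elapsed processing time
cores_used = 10          # CPU cores kept busy during processing

core_hour = wall_clock_hours * cores_used / audio_hours
print(core_hour)  # 0.5 Core-Hours per hour of audio
```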
English Speech-to-Text Benchmark
English Speech Corpus
We use the following datasets for benchmarks:
- LibriSpeech test-clean
- LibriSpeech test-other
- Common Voice test
- TED-LIUM test
Results
Accuracy
The figure below shows the accuracy of each engine averaged over all datasets.
Core-Hour
The figure below shows the resource requirement of each engine.
To obtain the results below, we ran the benchmark across the entire TED-LIUM dataset on an Ubuntu 22.04 machine with an AMD Ryzen 9 5900X (12-core) CPU @ 3.70GHz, 64 GB of RAM, and NVMe storage, using 10 cores simultaneously, and recorded the processing time. The dataset and platform affect the absolute Core-Hour figures; however, the ratio among engines should hold if everything else is equal. For example, Whisper Tiny requires roughly 3x more resources, i.e., takes 3x more time, than Picovoice Leopard.
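The sketch below is a hypothetical outline of such a measurement, with a placeholder transcribe() function standing in for the engine under test; it is not the benchmark's actual code.

```python
# Hypothetical outline of the timing setup described above; `transcribe` is a
# placeholder for the engine under test, not the benchmark's real code.
import time
from multiprocessing import Pool

def transcribe(audio_path: str) -> str:
    # Placeholder: call the speech-to-text engine on one file and return the transcript.
    return ""

if __name__ == "__main__":
    audio_files = ["sample-0.wav", "sample-1.wav"]  # in practice, every file in the test set
    total_audio_hours = 2.6                         # placeholder: total duration of those files

    start = time.perf_counter()
    with Pool(processes=10) as pool:  # 10 cores used simultaneously
        transcripts = pool.map(transcribe, audio_files)
    wall_clock_hours = (time.perf_counter() - start) / 3600

    # Core-Hour for this run, assuming all 10 cores stay busy for the whole duration.
    print(wall_clock_hours * 10 / total_audio_hours)
```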
French Speech-to-Text Benchmark
French Speech Corpus
We use the following datasets for benchmarks:
- Multilingual LibriSpeech test
- Common Voice test
- VoxPopuli test
Results
Accuracy
The figure below shows the accuracy of each engine averaged over all datasets.
German Speech-to-Text Benchmark
German Speech Corpus
We use the following datasets for benchmarks:
- Multilingual LibriSpeech test
- Common Voice test
- VoxPopuli test
Results
Accuracy
The figure below shows the accuracy of each engine averaged over all datasets.
Spanish Speech-to-Text Benchmark
Spanish Speech Corpus
We use the following datasets for benchmarks:
- Multilingual LibriSpeech test
- Common Voice test
- VoxPopuli test
Results
Accuracy
The figure below shows the accuracy of each engine averaged over all datasets.
Italian Speech-to-Text Benchmark
Italian Speech Corpus
We use the following datasets for benchmarks:
- Multilingual LibriSpeech test
- Common Voice test
- VoxPopuli test
Results
Accuracy
The figure below shows the accuracy of each engine averaged over all datasets.
Portuguese Speech-to-Text Benchmark
Portuguese Speech Corpus
We use the following datasets for benchmarks:
- Multilingual LibriSpeech test
- Common Voice test
Results
Accuracy
The figure below shows the accuracy of each engine averaged over all datasets.
Usage
The data and code used to create this benchmark are available on GitHub under the permissive Apache 2.0 license. Detailed instructions for benchmarking individual engines are provided in the following documents: