Real-time Transcription Benchmark

Choosing the right real-time transcription engine requires evaluating three criteria together:

Accuracy: how correctly speech is transcribed
Latency: how quickly each word is emitted after it is spoken
Compute efficiency: how well a model suits on-device or edge deployment

This open-source benchmark measures each criterion using five metrics: word error rate (WER) and punctuation error rate (PER) for accuracy, word emission latency for latency, and core-hour and model size for compute efficiency. It compares Amazon Transcribe Streaming, Azure Real-Time Speech-to-Text, Google Streaming Speech-to-Text, Cheetah Streaming Speech-to-Text, Moonshine, Vosk, and Whisper.cpp, popular real-time transcription engines among enterprise developers. Across all engines tested, only Cheetah Streaming Speech-to-Text delivers on all three criteria. Every other engine fails on at least one.

Real-time transcription cloud APIs are accurate but unreliable. Amazon Transcribe Streaming, Azure Real-Time Speech-to-Text, and Google Streaming Speech-to-Text are the dominant cloud STT APIs for real-time transcription. They deliver competitive accuracy but depend on network connectivity, which introduces variable latency and a single point of failure: when the network degrades or a provider has an outage, the application fails. Cloud processing also exposes user data to privacy risk, incurs unbounded costs at scale, and rules out on-device deployment by design.

On-device real-time transcription SDKs are reliable but trade off at least one criterion. On-device processing eliminates inherent cloud limitations. However, most local engines struggle to match cloud accuracy without sacrificing latency or compute efficiency. Moonshine achieves competitive WER but at a high compute cost. Vosk is too slow to emit words. Whisper.cpp is too power-hungry and too slow for real-time use on constrained hardware.

Cheetah Streaming Speech-to-Text delivers on all three criteria. Cheetah is the only on-device real-time transcription engine that outperforms Google Streaming STT on accuracy, approaches Amazon and Azure on WER, and requires less compute than any other local engine tested, with no tradeoff on latency, privacy, or hardware requirements.

Below is a series of benchmarks to back our claims.

Speech-to-Text Benchmark Languages

Speech-to-Text Benchmark Metrics

Word Error Rate (WER)

Word error rate is the ratio of edit distance between words in a reference transcript and the words in the output of the speech-to-text engine to the number of words in the reference transcript. In other words, WER is the ratio of errors in a transcript to the total words spoken. Despite its limitations, WER is the most commonly used metric to measure speech-to-text engine accuracy. A lower WER (lower number of errors) means better accuracy in recognizing speech.
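The definition above can be illustrated with a minimal sketch. It assumes whitespace tokenization and lowercasing as the only text normalization; the benchmark's own normalization rules may differ, and the function name word_error_rate is illustrative rather than taken from the benchmark code.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    # Assumption: lowercase + whitespace split is the only normalization applied.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from output
            insertion = curr[j - 1] + 1  # extra word in the output
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Because insertions count as errors, WER can exceed 1.0 when the engine output is much longer than the reference.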

Punctuation Error Rate (PER)

Punctuation Error Rate is the ratio of punctuation-specific errors between a reference transcript and the output of a speech-to-text engine to the number of punctuation-related operations in the reference transcript (more details in Section 3 of Meister et al.). A lower PER (lower number of errors) means better accuracy in punctuating speech. We report PER results for periods (.) and question marks (?).

Word Emission Latency

Word emission latency is the average delay between the moment a word finishes being spoken and the moment a streaming speech-to-text engine emits its transcription. A lower word emission latency means a more responsive experience, with smaller delays between intermediate transcriptions.
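As a rough illustration of this metric, the sketch below averages per-word delays. It assumes that each word's end time in the audio (for example, from a forced alignment) and the time at which the engine emitted it are already known and paired one-to-one; how the benchmark obtains and aligns these timestamps is not described here, and the function name is hypothetical.

```python
from statistics import mean
from typing import Sequence


def average_word_emission_latency(
    word_end_times: Sequence[float],
    emission_times: Sequence[float],
) -> float:
    """Mean delay, in seconds, between a word ending and its emission."""
    # Assumption: both sequences are in seconds on the same clock and
    # element i of each refers to the same word.
    if len(word_end_times) != len(emission_times):
        raise ValueError("each word needs exactly one emission time")
    if not word_end_times:
        raise ValueError("at least one word is required")
    return mean(
        emitted - ended
        for ended, emitted in zip(word_end_times, emission_times)
    )


if __name__ == "__main__":
    # Two words ending at 1.0 s and 2.0 s, emitted at 1.5 s and 2.7 s.
    print(average_word_emission_latency([1.0, 2.0], [1.5, 2.7]))
```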

Core-hour

The Core-Hour metric is used to evaluate the computational efficiency of the speech-to-text engine, indicating the number of CPU hours required to process one hour of audio. A speech-to-text engine with a lower Core-Hour is more computationally efficient. The open-source real-time transcription benchmark omits this metric for cloud-based engines, as the data is not available.
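One plausible way to compute this figure, sketched under the assumption that the measurement is wall-clock processing time multiplied by the number of cores in use and divided by the audio duration. The function and parameter names are illustrative and not taken from the benchmark code.

```python
def core_hour(processing_seconds: float, num_cores: int, audio_seconds: float) -> float:
    """CPU hours needed to process one hour of audio."""
    # Assumption: all num_cores are fully busy for the whole run, so
    # CPU time = wall-clock time * number of cores.
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    cpu_seconds = processing_seconds * num_cores
    # The ratio is unitless, so seconds/seconds equals hours/hours.
    return cpu_seconds / audio_seconds


if __name__ == "__main__":
    # One hour of audio processed in 6 minutes of wall-clock time on 10 cores.
    print(core_hour(processing_seconds=360, num_cores=10, audio_seconds=3600))
```

Under this definition, a value below 1.0 means a single core can keep up with the audio in real time.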

Model Size

The aggregate size of models (acoustic and language), in MB. The open-source real-time transcription benchmark omits this metric for cloud-based engines, as the data is not available.

English Speech-to-Text Benchmark

English Speech Corpus

We use the following datasets for word accuracy benchmarks:

LibriSpeech test-clean
LibriSpeech test-other
Common Voice test
TED-LIUM test

And we use the following datasets for punctuation accuracy benchmarks:

Common Voice test
VoxPopuli test
Fleurs test

Results

Word Accuracy

The figure below shows the word accuracy of each engine averaged over all datasets.

Punctuation Accuracy

The figure below shows the punctuation accuracy of each engine averaged over all datasets.

Core Hour

The figure below shows the resource requirement of each engine.

Please note that we ran the benchmark across the entire LibriSpeech test-clean dataset on an Ubuntu 22.04 machine with an AMD CPU (AMD Ryzen 9 5900X (12) @ 3.70GHz), 64 GB of RAM, and NVMe storage, using 10 cores simultaneously, and recorded the processing time to obtain these results. Core-Hour varies with the dataset and platform. However, one can expect the same ratio among engines if everything else is held the same.

Word Accuracy vs Core Hour

The figure below shows the comparison between word accuracy and resource requirements of each local engine.

Word Emission Latency

The figure below shows the average word emission latency of each engine. To obtain these results, we used 100 randomly selected files from the LibriSpeech test-clean dataset.

Word Accuracy vs Word Emission Latency

The figure below shows the comparison between word accuracy and average word emission latency of each local engine.

The figure below shows the comparison between Cheetah and each cloud API engine.

Model Size

The figure below shows the model size of each engine.

Word Accuracy vs Model Size

The figure below shows the comparison between word accuracy and model size of Cheetah and each local engine.

French Speech-to-Text Benchmark

French Speech Corpus

We use the following datasets for word accuracy benchmarks:

And we use the following datasets for punctuation accuracy benchmarks:

Common Voice test
VoxPopuli test
Fleurs test

Results

Word Accuracy

The figure below shows the word accuracy of each engine averaged over all datasets.

Punctuation Accuracy

The figure below shows the punctuation accuracy of each engine averaged over all datasets.

German Speech-to-Text Benchmark

German Speech Corpus

We use the following datasets for word accuracy benchmarks:

And we use the following datasets for punctuation accuracy benchmarks:

Common Voice test
VoxPopuli test
Fleurs test

Results

Word Accuracy

The figure below shows the word accuracy of each engine averaged over all datasets.

Punctuation Accuracy

The figure below shows the punctuation accuracy of each engine averaged over all datasets.

Spanish Speech-to-Text Benchmark

Spanish Speech Corpus

We use the following datasets for word accuracy benchmarks:

And we use the following datasets for punctuation accuracy benchmarks:

Common Voice test
VoxPopuli test
Fleurs test

Results

Word Accuracy

The figure below shows the word accuracy of each engine averaged over all datasets.

Punctuation Accuracy

The figure below shows the punctuation accuracy of each engine averaged over all datasets.

Italian Speech-to-Text Benchmark

Italian Speech Corpus

We use the following datasets for word accuracy benchmarks:

And we use the following datasets for punctuation accuracy benchmarks:

Common Voice test
VoxPopuli test
Fleurs test

Results

Word Accuracy

The figure below shows the word accuracy of each engine averaged over all datasets.

Punctuation Accuracy

The figure below shows the punctuation accuracy of each engine averaged over all datasets.

Portuguese Speech-to-Text Benchmark

Portuguese Speech Corpus

We use the following datasets for word accuracy benchmarks:

Multilingual LibriSpeech test
Common Voice test

And we use the following datasets for punctuation accuracy benchmarks:

Common Voice test
Fleurs test

Results

Word Accuracy

The figure below shows the word accuracy of each engine averaged over all datasets.

Punctuation Accuracy

The figure below shows the punctuation accuracy of each engine averaged over all datasets.

Usage

The data and code used to create this benchmark are available on GitHub under the permissive Apache 2.0 license. Detailed instructions for benchmarking individual engines are provided in the following documents:
