Real-time Transcription Benchmark
Choosing the right real-time transcription engine requires evaluating three criteria together:
- Accuracy: how correctly speech is transcribed
- Latency: how quickly each word is emitted after it is spoken
- Compute efficiency: how well a model suits on-device or edge deployment
This open-source benchmark measures each criterion using five metrics: word error rate (WER) and punctuation error rate (PER) for accuracy, word emission latency for latency, and core-hour and model size for compute efficiency. It compares seven real-time transcription engines popular among enterprise developers: Amazon Transcribe Streaming, Azure Real-Time Speech-to-Text, Google Streaming Speech-to-Text, Cheetah Streaming Speech-to-Text, Moonshine, Vosk, and Whisper.cpp. Across all engines tested, only Cheetah Streaming Speech-to-Text delivers on all three criteria; every other engine falls short on at least one.
Real-time transcription cloud APIs are accurate but unreliable. Amazon Transcribe Streaming, Azure Real-Time Speech-to-Text, and Google Streaming Speech-to-Text are the dominant cloud STT APIs for real-time transcription. They deliver competitive accuracy but depend on network connectivity, which introduces variable latency and a single point of failure: when the network degrades or the provider has an outage, the application fails. Cloud processing also introduces data privacy exposure and unbounded costs at scale, and rules out on-device deployment by design.
On-device real-time transcription SDKs are reliable but trade off at least one criterion. On-device processing eliminates inherent cloud limitations. However, most local engines struggle to match cloud accuracy without sacrificing latency or compute efficiency. Moonshine achieves competitive WER but at a high compute cost. Vosk is too slow to emit words. Whisper.cpp is too power-hungry and too slow for real-time use on constrained hardware.
Cheetah Streaming Speech-to-Text delivers on all. Cheetah is the only on-device real-time transcription engine that outperforms Google Streaming STT on accuracy, approaches Amazon and Azure on WER, and requires less compute than any other local engine tested, with no tradeoff on latency, privacy, or hardware requirements.
The benchmarks below substantiate these claims.
Speech-to-Text Benchmark Languages
- English Speech-to-Text Benchmark
- French Speech-to-Text Benchmark
- German Speech-to-Text Benchmark
- Spanish Speech-to-Text Benchmark
- Italian Speech-to-Text Benchmark
- Portuguese Speech-to-Text Benchmark
Speech-to-Text Benchmark Metrics
Word Error Rate (WER)
Word error rate is the word-level edit distance between a reference transcript and the output of the speech-to-text engine, divided by the number of words in the reference transcript. In other words, WER is the ratio of transcription errors (substitutions, insertions, and deletions) to the total words spoken. Despite its limitations, WER is the most commonly used metric for measuring speech-to-text engine accuracy. A lower WER (fewer errors) means better accuracy in recognizing speech.
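As a concrete illustration, WER can be computed with a word-level Levenshtein distance. The sketch below is illustrative, not the benchmark's own implementation:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference word count."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # Single-row dynamic-programming Levenshtein over words.
    row = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, row[0] = row[0], i
        for j, h in enumerate(hyp, 1):
            cur = row[j]
            # deletion, insertion, substitution (free when words match)
            row[j] = min(cur + 1, row[j - 1] + 1, prev + (r != h))
            prev = cur
    return row[-1] / len(ref)
```

For example, `word_error_rate("the cat sat", "the cat sit")` yields one substitution over three reference words, a WER of about 0.33.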
Punctuation Error Rate (PER)
Punctuation Error Rate is the ratio of punctuation-specific errors between a reference transcript and the output of a speech-to-text engine to the number of punctuation-related operations in the reference transcript (more details in Section 3 of Meister et al.). A lower PER (lower number of errors) means better accuracy in punctuating speech. We report PER results for periods (.) and question marks (?).
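As a rough sketch of the idea, the snippet below compares punctuation word by word under the simplifying assumption that the engine transcribed the words themselves correctly; the full definition in Meister et al. also accounts for word errors and the exact counting of punctuation operations:

```python
def punctuation_error_rate(reference: str, hypothesis: str,
                           marks: str = ".?") -> float:
    """Simplified PER: punctuation mismatches divided by the number of
    punctuation marks in the reference. Assumes identical word sequences."""
    def split(text):
        words, punct = [], []
        for token in text.split():
            stripped = token.rstrip(marks)
            words.append(stripped.lower())
            punct.append(token[len(stripped):])
        return words, punct

    ref_words, ref_punct = split(reference)
    hyp_words, hyp_punct = split(hypothesis)
    if ref_words != hyp_words:
        raise ValueError("this sketch assumes matching word sequences")
    errors = sum(r != h for r, h in zip(ref_punct, hyp_punct))
    return errors / max(sum(1 for p in ref_punct if p), 1)
```

For example, transcribing "Hello. How are you?" as "Hello. How are you." misses one of two reference punctuation marks, a PER of 0.5.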
Word Emission Latency
Word emission latency is the average delay from the point a word is finished being spoken to when its transcription is emitted by a streaming speech-to-text engine. A lower word emission latency means a more responsive experience with smaller delays between the intermediate transcriptions.
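Given per-word end times from an alignment of the reference audio and the timestamps at which the streaming engine emitted each word, the metric reduces to a mean delay. This is a sketch with illustrative names, not the benchmark's measurement harness:

```python
def word_emission_latency(word_end_times, emission_times):
    """Mean delay (seconds) between the end of each spoken word and the
    moment the streaming engine emits its transcription."""
    if len(word_end_times) != len(emission_times):
        raise ValueError("expected one emission timestamp per reference word")
    delays = [e - w for w, e in zip(word_end_times, emission_times)]
    return sum(delays) / len(delays)
```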
Core-hour
The Core-Hour metric is used to evaluate the computational efficiency of the speech-to-text engine, indicating the number of CPU core-hours required to process one hour of audio. A speech-to-text engine with a lower Core-Hour is more computationally efficient. The open-source real-time transcription benchmark omits this metric for cloud-based engines, as the data is not available.
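Because the hours in the numerator and denominator cancel, the metric reduces to CPU-seconds consumed per second of audio. A minimal sketch, assuming all cores are fully utilized for the measured wall-clock time:

```python
def core_hours_per_audio_hour(wall_clock_seconds: float, num_cores: int,
                              audio_seconds: float) -> float:
    """CPU core-hours consumed per hour of audio processed."""
    cpu_seconds = wall_clock_seconds * num_cores
    return cpu_seconds / audio_seconds  # the hour units cancel out

# e.g. 10 cores busy for 30 minutes of wall-clock time on 5 hours of audio:
# (1800 s * 10) / 18000 s = 1.0 core-hour per hour of audio
```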
Model Size
The aggregate size of models (acoustic and language), in MB. The open-source real-time transcription benchmark omits this metric for cloud-based engines, as the data is not available.
English Speech-to-Text Benchmark
English Speech Corpus
We use the following datasets for word accuracy benchmarks:
- LibriSpeech test-clean
- LibriSpeech test-other
- Common Voice test
- TED-LIUM test
And we use the following datasets for punctuation accuracy benchmarks:
- Common Voice test
- VoxPopuli test
- Fleurs test
Results
Word Accuracy
The figure below shows the word accuracy of each engine averaged over all datasets.
Punctuation Accuracy
The figure below shows the punctuation accuracy of each engine averaged over all datasets.
Core Hour
The figure below shows the resource requirement of each engine.
Please note that we ran the benchmark across the entire LibriSpeech test-clean dataset on an Ubuntu 22.04 machine with an AMD Ryzen 9 5900X (12 cores) @ 3.70 GHz, 64 GB of RAM, and NVMe storage, using 10 cores simultaneously, and recorded the processing time to obtain the results below. Different datasets and platforms affect the Core-Hour; however, the ratio among engines should hold when everything else is the same.
Word Accuracy vs Core Hour
The figure below shows the comparison between word accuracy and resource requirements of each local engine.
Word Emission Latency
The figure below shows the average word emission latency of each engine. To obtain these results, we used 100 randomly selected files from the LibriSpeech test-clean dataset.
Word Accuracy vs Word Emission Latency
The figure below shows the comparison between word accuracy and average word emission latency of each local engine.
The figure below shows the comparison between Cheetah and each cloud API engine.
Model Size
The figure below shows the model size of each engine.
Word Accuracy vs Model Size
The figure below shows the comparison between word accuracy and model size of Cheetah and each local engine.
French Speech-to-Text Benchmark
French Speech Corpus
We use the following datasets for word accuracy benchmarks:
- Multilingual LibriSpeech test
- Common Voice test
- VoxPopuli test
And we use the following datasets for punctuation accuracy benchmarks:
- Common Voice test
- VoxPopuli test
- Fleurs test
Results
Word Accuracy
The figure below shows the word accuracy of each engine averaged over all datasets.
Punctuation Accuracy
The figure below shows the punctuation accuracy of each engine averaged over all datasets.
German Speech-to-Text Benchmark
German Speech Corpus
We use the following datasets for word accuracy benchmarks:
- Multilingual LibriSpeech test
- Common Voice test
- VoxPopuli test
And we use the following datasets for punctuation accuracy benchmarks:
- Common Voice test
- VoxPopuli test
- Fleurs test
Results
Word Accuracy
The figure below shows the word accuracy of each engine averaged over all datasets.
Punctuation Accuracy
The figure below shows the punctuation accuracy of each engine averaged over all datasets.
Spanish Speech-to-Text Benchmark
Spanish Speech Corpus
We use the following datasets for word accuracy benchmarks:
- Multilingual LibriSpeech test
- Common Voice test
- VoxPopuli test
And we use the following datasets for punctuation accuracy benchmarks:
- Common Voice test
- VoxPopuli test
- Fleurs test
Results
Word Accuracy
The figure below shows the word accuracy of each engine averaged over all datasets.
Punctuation Accuracy
The figure below shows the punctuation accuracy of each engine averaged over all datasets.
Italian Speech-to-Text Benchmark
Italian Speech Corpus
We use the following datasets for word accuracy benchmarks:
- Multilingual LibriSpeech test
- Common Voice test
- VoxPopuli test
And we use the following datasets for punctuation accuracy benchmarks:
- Common Voice test
- VoxPopuli test
- Fleurs test
Results
Word Accuracy
The figure below shows the word accuracy of each engine averaged over all datasets.
Punctuation Accuracy
The figure below shows the punctuation accuracy of each engine averaged over all datasets.
Portuguese Speech-to-Text Benchmark
Portuguese Speech Corpus
We use the following datasets for word accuracy benchmarks:
- Multilingual LibriSpeech test
- Common Voice test
And we use the following datasets for punctuation accuracy benchmarks:
- Common Voice test
- Fleurs test
Results
Word Accuracy
The figure below shows the word accuracy of each engine averaged over all datasets.
Punctuation Accuracy
The figure below shows the punctuation accuracy of each engine averaged over all datasets.
Usage
The data and code used to create this benchmark are available on GitHub under the permissive Apache 2.0 license. Detailed instructions for benchmarking individual engines are provided in the following documents: