Real-time Transcription Benchmark
Real-time transcription is one of the most widely known and used speech AI technologies. It enables applications that require immediate visual feedback, such as dictation (voice typing), voice assistants, and closed captions for virtual events and meetings. Finding the best real-time transcription engine can be challenging: a good engine should convert speech to text accurately and with minimal delay.
Real-time transcription solutions achieving state-of-the-art accuracy often run in the cloud. Currently, Amazon Transcribe, Azure Speech-to-Text, Google Speech-to-Text, and IBM Watson Speech-to-Text are the dominant real-time transcription APIs. Running real-time transcription in the cloud requires data to be processed on remote servers, which introduces network latency and unreliable response times. Hence, cloud dependency can prevent applications from providing immediate feedback, whether the issue is on the users' side, the cloud providers' side, or both.
On-device real-time transcription solutions offer reliable real-time experiences by removing the inherent limitation of cloud computing: variable delay induced by network connectivity. However, running transcription locally with minimal resource requirements and without sacrificing accuracy is challenging, so there are few on-device transcription solutions. Currently, OpenAI Whisper is the most popular one, with smaller model sizes such as Whisper Tiny, Whisper Base, and Whisper Small. Yet running transcription in real time is even more challenging, and OpenAI Whisper does not support it. There is no well-known on-device real-time transcription alternative that is developer-friendly and achieves state-of-the-art accuracy.
Cheetah Streaming Speech-to-Text is an extremely efficient on-device real-time transcription engine that matches the accuracy of file-based cloud transcription APIs while processing voice data locally and in real time. Below is a series of benchmarks to back these claims.
Please note that Cheetah Streaming Speech-to-Text is the only real-time transcription engine in this benchmark. Since Whisper models lack real-time transcription capabilities, we used the file-based transcription engines of the cloud providers for consistency.
Speech-to-Text Benchmark Languages
- English Speech-to-Text Benchmark
- French Speech-to-Text Benchmark
- German Speech-to-Text Benchmark
- Spanish Speech-to-Text Benchmark
- Italian Speech-to-Text Benchmark
- Portuguese Speech-to-Text Benchmark
Speech-to-Text Benchmark Metrics
Word Error Rate (WER)
Word error rate is the ratio of the edit distance between the words in a reference transcript and the words in the speech-to-text engine's output to the number of words in the reference transcript. In other words, WER is the ratio of errors in a transcript to the total number of words spoken. Despite its limitations, WER is the most commonly used metric for measuring speech-to-text accuracy. A lower WER (fewer errors) means better speech recognition accuracy.
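To make the definition concrete, below is a minimal Python sketch of the WER calculation: the word-level edit (Levenshtein) distance between the reference and the engine output, divided by the number of words in the reference. It is an illustrative implementation, not the code used in this benchmark.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()

    # One row of the dynamic-programming table at a time; cell j holds the
    # edit distance between the first i reference words and the first j
    # hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution
            )
        prev = curr

    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") against six
# reference words -> WER = 2 / 6 ~ 0.33
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```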
Core-Hour
The Core-Hour metric is used to evaluate the computational efficiency of the speech-to-text engine, indicating the number of CPU hours required to process one hour of audio. A speech-to-text engine with a lower Core-Hour is more computationally efficient. We omit this metric for cloud-based engines.
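As a concrete illustration, Core-Hour can be derived from the wall-clock processing time, the number of cores used, and the amount of audio processed. The numbers in the sketch below are made up for illustration; they are not benchmark results.

```python
def core_hour(wall_clock_seconds: float, num_cores: int, audio_seconds: float) -> float:
    # CPU hours consumed per hour of audio processed.
    cpu_hours = (wall_clock_seconds / 3600) * num_cores
    audio_hours = audio_seconds / 3600
    return cpu_hours / audio_hours


# e.g., 10 cores busy for 30 minutes to process 5 hours of audio:
# (0.5 hours x 10 cores) / 5 audio hours = 1.0 Core-Hour
print(core_hour(wall_clock_seconds=1800, num_cores=10, audio_seconds=5 * 3600))
```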
English Speech-to-Text Benchmark
English Speech Corpus
We use the following datasets for benchmarks:
- LibriSpeech test-clean
- LibriSpeech test-other
- Common Voice test
- TED-LIUM test
Results
Accuracy
The figure below shows the accuracy of each engine averaged over all datasets.
Core-Hour
The figure below shows the resource requirement of each engine.
Please note that we ran the benchmark across the entire TED-LIUM dataset on an Ubuntu 22.04 machine with an AMD Ryzen 9 5900X (12) @ 3.70GHz CPU, 64 GB of RAM, and NVMe storage, using 10 cores simultaneously, and recorded the processing time to obtain the results below. Different datasets and platforms affect the Core-Hour; however, one can expect the same ratio among engines when everything else is held constant. For example, Whisper Tiny requires 3x the resources, or takes 3x as long, compared to Picovoice Leopard.
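For reference, the measurement loop can be approximated as below. `transcribe_file` and the file list are hypothetical placeholders standing in for the engine under test and the TED-LIUM recordings; the actual benchmark code lives in the GitHub repository linked under Usage.

```python
import time
from multiprocessing import Pool

NUM_WORKERS = 10  # matches the 10 cores used in the benchmark run


def transcribe_file(path: str) -> str:
    # Placeholder: run the engine under test on a single audio file.
    return ""


if __name__ == "__main__":
    # Placeholder: paths to the TED-LIUM test-set recordings.
    audio_files: list[str] = []

    start = time.time()
    with Pool(NUM_WORKERS) as pool:
        pool.map(transcribe_file, audio_files)
    elapsed = time.time() - start

    # Plug `elapsed`, NUM_WORKERS, and the total audio duration into the
    # Core-Hour formula shown earlier.
    print(f"processing took {elapsed:.1f} seconds on {NUM_WORKERS} cores")
```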
French Speech-to-Text Benchmark
French Speech Corpus
We use the following datasets for benchmarks:
- Multilingual LibriSpeech test
- Common Voice test
- VoxPopuli test
Results
Accuracy
The figure below shows the accuracy of each engine averaged over all datasets.
German Speech-to-Text Benchmark
German Speech Corpus
We use the following datasets for benchmarks:
- Multilingual LibriSpeech test
- Common Voice test
- VoxPopuli test
Results
Accuracy
The figure below shows the accuracy of each engine averaged over all datasets.
Spanish Speech-to-Text Benchmark
Spanish Speech Corpus
We use the following datasets for benchmarks:
- Multilingual LibriSpeech test
- Common Voice test
- VoxPopuli test
Results
Accuracy
The figure below shows the accuracy of each engine averaged over all datasets.
Italian Speech-to-Text Benchmark
Italian Speech Corpus
We use the following datasets for benchmarks:
- Multilingual LibriSpeech test
- Common Voice test
- VoxPopuli test
Results
Accuracy
The figure below shows the accuracy of each engine averaged over all datasets.
Portuguese Speech-to-Text Benchmark
Portuguese Speech Corpus
We use the following datasets for benchmarks:
- Multilingual LibriSpeech test
- Common Voice test
Results
Accuracy
The figure below shows the accuracy of each engine averaged over all datasets.
Usage
The data and code used to create this benchmark are available on GitHub under the permissive Apache 2.0 license. Detailed instructions for benchmarking individual engines are provided in the following documents: