Real-time Transcription Benchmark
Real-time transcription is one of the most widely known and used speech AI technologies. It enables applications that require immediate visual feedback, such as dictation (voice typing), voice assistants, and closed captioning for virtual events and meetings. Finding the best real-time transcription engine can be challenging: a good engine should convert speech to text accurately and with minimal delay.
Real-time transcription solutions achieving state-of-the-art accuracy often run in the cloud. Currently, Amazon Transcribe Streaming, Azure Real-Time Speech-to-Text, and Google Streaming Speech-to-Text are the dominant cloud API options for real-time transcription. Running real-time transcription in the cloud requires data to be processed on remote servers, which introduces network latency and unpredictable response times. Hence, cloud dependency can prevent applications from providing immediate feedback, whether the issue is on the users' side, the cloud providers' side, or both.
On-device real-time transcription solutions offer reliable real-time experiences by removing the inherent limitation of cloud computing, i.e., the variable delay induced by network connectivity. However, running transcription locally with minimal resource requirements and without sacrificing accuracy is challenging. Cheetah Streaming Speech-to-Text is a highly efficient on-device real-time transcription engine that matches the accuracy of cloud-based real-time speech-to-text APIs while processing voice data locally. Cheetah Fast offers ultra-low-latency real-time transcription with minimal accuracy trade-offs, making it perfect for live streaming and interactive voice applications. The benchmarks below back these claims.
Speech-to-Text Benchmark Languages
- English Speech-to-Text Benchmark
- French Speech-to-Text Benchmark
- German Speech-to-Text Benchmark
- Spanish Speech-to-Text Benchmark
- Italian Speech-to-Text Benchmark
- Portuguese Speech-to-Text Benchmark
Speech-to-Text Benchmark Metrics
Word Error Rate (WER)
Word error rate is the ratio of the word-level edit distance between a reference transcript and the output of the speech-to-text engine to the number of words in the reference transcript. In other words, WER is the ratio of errors in a transcript to the total number of words spoken. Despite its limitations, WER is the most commonly used metric for measuring speech-to-text engine accuracy. A lower WER (lower number of errors) means better accuracy in recognizing speech.
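For illustration, a minimal WER computation over whitespace-tokenized words looks like the sketch below; the function name, tokenization, and lower-casing are our own assumptions, not the benchmark's exact implementation.

```python
# Minimal WER sketch: word-level Levenshtein (edit) distance divided by the
# number of words in the reference transcript.
def word_error_rate(reference: str, hypothesis: str) -> float:
    ref_words = reference.lower().split()
    hyp_words = hypothesis.lower().split()

    # dp[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words
    dp = [[0] * (len(hyp_words) + 1) for _ in range(len(ref_words) + 1)]
    for i in range(len(ref_words) + 1):
        dp[i][0] = i
    for j in range(len(hyp_words) + 1):
        dp[0][j] = j
    for i in range(1, len(ref_words) + 1):
        for j in range(1, len(hyp_words) + 1):
            substitution = dp[i - 1][j - 1] + (ref_words[i - 1] != hyp_words[j - 1])
            dp[i][j] = min(substitution, dp[i][j - 1] + 1, dp[i - 1][j] + 1)

    return dp[-1][-1] / len(ref_words)


# e.g., one substitution and one deletion over five reference words -> WER = 0.4
print(word_error_rate("the quick brown fox jumps", "the quick down fox"))
```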
Punctuation Error Rate (PER)
Punctuation Error Rate is the ratio of punctuation-specific errors between a reference transcript and the output of a speech-to-text engine to the number of punctuation-related operations in the reference transcript (more details in Section 3 of Meister et al.). A lower PER (lower number of errors) means better accuracy in punctuating speech. We report PER results for periods (.) and question marks (?).
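Using our own notation rather than the exact formulation in Meister et al., PER can be summarized as:

```math
\mathrm{PER} = \frac{S_p + I_p + D_p}{N_p}
```

where S_p, I_p, and D_p count punctuation substitutions, insertions, and deletions against the reference transcript, and N_p is the number of punctuation-related operations in the reference.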
Word Emission Latency
Word emission latency is the average delay from the point a word is finished being spoken to when its transcription is emitted by a streaming speech-to-text engine. A lower word emission latency means a more responsive experience with smaller delays between the intermediate transcriptions.
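For illustration, the sketch below shows how this metric can be computed once per-word end times and emission times are known; the function and its inputs are hypothetical, not part of the benchmark code.

```python
# Hypothetical illustration: assumes we already have, for each word, the time it
# ends in the audio (e.g., from an alignment of the reference) and the time at
# which the streaming engine emitted it. The metric is the mean per-word delay.
def mean_word_emission_latency(word_end_times_sec, emission_times_sec):
    delays = [emit - end for end, emit in zip(word_end_times_sec, emission_times_sec)]
    return sum(delays) / len(delays)


# e.g., three words emitted 0.20 s, 0.30 s, and 0.25 s after being spoken -> 0.25 s
print(mean_word_emission_latency([1.0, 1.8, 2.5], [1.2, 2.1, 2.75]))
```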
Core-Hour
The Core-Hour metric is used to evaluate the computational efficiency of the speech-to-text engine, indicating the number of CPU core-hours required to process one hour of audio. A speech-to-text engine with a lower Core-Hour is more computationally efficient. The open-source real-time transcription benchmark omits this metric for cloud-based engines, as the data is not available, and uses OpenAI Whisper, despite its lack of streaming capability, because there is no other well-known, SOTA on-device streaming speech-to-text alternative to compare with Cheetah.
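As an illustration with made-up numbers (not measurements from this benchmark), Core-Hour can be derived from the wall-clock processing time, the number of cores used, and the amount of audio processed:

```python
# Illustrative numbers only; not results from this benchmark.
cores_used = 10               # cores running the engine simultaneously
processing_time_hours = 0.6   # wall-clock time to process all of the audio
audio_hours = 5.4             # total duration of the processed audio

# CPU core-hours needed to process one hour of audio
core_hour = cores_used * processing_time_hours / audio_hours
print(f"{core_hour:.2f} Core-Hour(s) per hour of audio")  # ~1.11
```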
English Speech-to-Text Benchmark
English Speech Corpus
We use the following datasets for word accuracy benchmarks:
- LibriSpeech test-clean
- LibriSpeech test-other
- Common Voice test
- TED-LIUM test
And we use the following datasets for punctuation accuracy benchmarks:
- Common Voice test
- VoxPopuli test
- Fleurs test
Results
Word Accuracy
The figure below shows the word accuracy of each engine averaged over all datasets.
Punctuation Accuracy
The figure below shows the punctuation accuracy of each engine averaged over all datasets.
Core Hour
The figure below shows the resource requirement of each engine.
Please note that we ran the benchmark across the entire LibriSpeech test-clean dataset on an Ubuntu 22.04 machine with an AMD Ryzen 9 5900X CPU (12 cores @ 3.70GHz), 64 GB of RAM, and NVMe storage, using 10 cores simultaneously, and recorded the processing time to obtain the results below. Different datasets and platforms affect the Core-Hour; however, one can expect the same ratio among engines if everything else is the same. For example, Whisper Tiny requires 3x more resources, i.e., takes 3x more time, compared to Picovoice Leopard.
Word Emission Latency
The figure below shows the average word emission latency of each engine.
To obtain these results, we used 100 randomly selected files from the LibriSpeech test-clean dataset.
Word Accuracy vs Word Emission Latency
The figure below shows the comparison between word accuracy and average word emission latency of each engine.
French Speech-to-Text Benchmark
French Speech Corpus
We use the following datasets for word accuracy benchmarks:
- Multilingual LibriSpeech test
- Common Voice test
- VoxPopuli test
And we use the following datasets for punctuation accuracy benchmarks:
- Common Voice test
- VoxPopuli test
- Fleurs test
Results
Word Accuracy
The figure below shows the word accuracy of each engine averaged over all datasets.
Punctuation Accuracy
The figure below shows the punctuation accuracy of each engine averaged over all datasets.
German Speech-to-Text Benchmark
German Speech Corpus
We use the following datasets for word accuracy benchmarks:
- Multilingual LibriSpeech test
- Common Voice test
- VoxPopuli test
And we use the following datasets for punctuation accuracy benchmarks:
- Common Voice test
- VoxPopuli test
- Fleurs test
Results
Word Accuracy
The figure below shows the word accuracy of each engine averaged over all datasets.
Punctuation Accuracy
The figure below shows the punctuation accuracy of each engine averaged over all datasets.
Spanish Speech-to-Text Benchmark
Spanish Speech Corpus
We use the following datasets for word accuracy benchmarks:
- Multilingual LibriSpeech test
- Common Voice test
- VoxPopuli test
And we use the following datasets for punctuation accuracy benchmarks:
- Common Voice test
- VoxPopuli test
- Fleurs test
Results
Word Accuracy
The figure below shows the word accuracy of each engine averaged over all datasets.
Punctuation Accuracy
The figure below shows the punctuation accuracy of each engine averaged over all datasets.
Italian Speech-to-Text Benchmark
Italian Speech Corpus
We use the following datasets for word accuracy benchmarks:
- Multilingual LibriSpeech test
- Common Voice test
- VoxPopuli test
And we use the following datasets for punctuation accuracy benchmarks:
- Common Voice test
- VoxPopuli test
- Fleurs test
Results
Word Accuracy
The figure below shows the word accuracy of each engine averaged over all datasets.
Punctuation Accuracy
The figure below shows the punctuation accuracy of each engine averaged over all datasets.
Portuguese Speech-to-Text Benchmark
Portuguese Speech Corpus
We use the following datasets for word accuracy benchmarks:
- Multilingual LibriSpeech test
- Common Voice test
And we use the following datasets for punctuation accuracy benchmarks:
- Common Voice test
- Fleurs test
Results
Word Accuracy
The figure below shows the word accuracy of each engine averaged over all datasets.
Punctuation Accuracy
The figure below shows the punctuation accuracy of each engine averaged over all datasets.
Usage
The data and code used to create this benchmark are available on GitHub under the permissive Apache 2.0 license. Detailed instructions for benchmarking individual engines are provided in the following documents: