
Real-time Transcription Benchmark

Choosing the right real-time transcription engine requires evaluating three criteria together:

  • Accuracy: how correctly speech is transcribed
  • Latency: how quickly each word is emitted after it is spoken
  • Compute efficiency: how well a model suits on-device or edge deployment

This open-source benchmark measures each criterion using five metrics: word error rate (WER) and punctuation error rate (PER) for accuracy, word emission latency for latency, and core-hour and model size for compute efficiency. It compares Amazon Transcribe Streaming, Azure Real-Time Speech-to-Text, Google Streaming Speech-to-Text, Cheetah Streaming Speech-to-Text, Moonshine, Vosk, and Whisper.cpp, popular real-time transcription engines among enterprise developers. Across all engines tested, only Cheetah Streaming Speech-to-Text delivers on all three criteria. Every other engine fails on at least one.

Real-time transcription cloud APIs are accurate but unreliable. Amazon Transcribe Streaming, Azure Real-Time Speech-to-Text, and Google Streaming Speech-to-Text are the dominant cloud STT APIs for real-time transcription. They deliver competitive accuracy but depend on network connectivity, which introduces variable latency and a single point of failure: when the network degrades or a provider has an outage, the application fails. Cloud processing also introduces data privacy exposure and unbounded costs at scale, and rules out on-device deployment by design.

On-device real-time transcription SDKs are reliable but trade off at least one criterion. On-device processing eliminates inherent cloud limitations. However, most local engines struggle to match cloud accuracy without sacrificing latency or compute efficiency. Moonshine achieves competitive WER but at a high compute cost. Vosk is too slow to emit words. Whisper.cpp is too power-hungry and too slow for real-time use on constrained hardware.

Cheetah Streaming Speech-to-Text delivers on all. Cheetah is the only on-device real-time transcription engine that outperforms Google Streaming STT on accuracy, approaches Amazon and Azure on WER, and requires less compute than any other local engine tested — with no tradeoff on latency, privacy, or hardware requirements.

Below is a series of benchmarks to back our claims.

Speech-to-Text Benchmark Languages

  • English Speech-to-Text Benchmark
  • French Speech-to-Text Benchmark
  • German Speech-to-Text Benchmark
  • Spanish Speech-to-Text Benchmark
  • Italian Speech-to-Text Benchmark
  • Portuguese Speech-to-Text Benchmark

Speech-to-Text Benchmark Metrics

Word Error Rate (WER)

Word error rate is the word-level edit distance between a reference transcript and the output of the speech-to-text engine, divided by the number of words in the reference transcript. In other words, WER is the ratio of errors in a transcript to the total words spoken. Despite its limitations, WER is the most commonly used metric for measuring speech-to-text accuracy. A lower WER (fewer errors) means better accuracy in recognizing speech.
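As a concrete illustration, WER can be computed with a word-level Levenshtein distance. The following is a minimal sketch, not the benchmark's implementation (which also applies text normalization before scoring):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    # Levenshtein distance over words via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,        # deletion
                d[i][j - 1] + 1,        # insertion
                d[i - 1][j - 1] + sub,  # substitution (or match)
            )
    return d[len(ref)][len(hyp)] / len(ref)
```

For example, comparing "the cat sat on the mat" against "the cat sit on mat" yields one substitution and one deletion, i.e. a WER of 2/6.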

Punctuation Error Rate (PER)

Punctuation Error Rate is the ratio of punctuation-specific errors between a reference transcript and the output of a speech-to-text engine to the number of punctuation-related operations in the reference transcript (more details in Section 3 of Meister et al.). A lower PER (lower number of errors) means better accuracy in punctuating speech. We report PER results for periods (.) and question marks (?).

Word Emission Latency

Word emission latency is the average delay from the point a word is finished being spoken to when its transcription is emitted by a streaming speech-to-text engine. A lower word emission latency means a more responsive experience with smaller delays between the intermediate transcriptions.
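Given per-word spoken end times (e.g. from a forced alignment) and the times at which the streaming engine first emitted each word, the metric reduces to a mean of per-word delays. A minimal sketch, with hypothetical inputs:

```python
from statistics import mean


def word_emission_latency(word_end_times: list[float],
                          emission_times: list[float]) -> float:
    """Average delay (seconds) between the moment each word finishes
    being spoken and the moment the streaming engine first emits it.
    Both lists are aligned by word index."""
    assert len(word_end_times) == len(emission_times)
    return mean(e - w for w, e in zip(word_end_times, emission_times))
```

For instance, words finishing at 1.0 s and 2.0 s that are emitted at 1.5 s and 2.3 s give an average word emission latency of 0.4 s.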

Core-hour

The Core-Hour metric is used to evaluate the computational efficiency of the speech-to-text engine, indicating the number of CPU hours required to process one hour of audio. A speech-to-text engine with a lower Core-Hour is more computationally efficient. The open-source real-time transcription benchmark omits this metric for cloud-based engines, as the data is not available.
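Under this definition, Core-Hour reduces to a simple ratio of CPU time consumed to audio duration processed. A sketch with hypothetical numbers (not the benchmark's measurement code):

```python
def core_hours_per_audio_hour(processing_seconds: float,
                              num_cores: int,
                              audio_seconds: float) -> float:
    """CPU core-hours consumed per hour of audio processed.

    processing_seconds: wall-clock time spent processing
    num_cores: number of CPU cores used simultaneously
    audio_seconds: total duration of the audio processed
    """
    return (processing_seconds * num_cores) / audio_seconds
```

For example, processing one hour of audio in 360 seconds of wall-clock time on 10 cores amounts to 1.0 Core-Hour.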

Model Size

The aggregate size of models (acoustic and language), in MB. The open-source real-time transcription benchmark omits this metric for cloud-based engines, as the data is not available.

English Speech-to-Text Benchmark

English Speech Corpus

We use the following datasets for word accuracy benchmarks:

  • LibriSpeech test-clean
  • LibriSpeech test-other
  • Common Voice test
  • TED-LIUM test

And we use the following datasets for punctuation accuracy benchmarks:

  • Common Voice test
  • VoxPopuli test
  • Fleurs test

Results

Word Accuracy

The figure below shows the word accuracy of each engine averaged over all datasets.

Punctuation Accuracy

The figure below shows the punctuation accuracy of each engine averaged over all datasets.

Core Hour

The figure below shows the resource requirement of each engine.

Please note that we ran the benchmark across the entire LibriSpeech test-clean dataset on an Ubuntu 22.04 machine with an AMD Ryzen 9 5900X (12 cores) @ 3.70GHz, 64 GB of RAM, and NVMe storage, using 10 cores simultaneously, and recorded the processing time to obtain the results below. Different datasets and platforms affect the Core-Hour; however, the ratio among engines should hold if everything else is equal.

Word Accuracy vs Core Hour

The figure below shows the comparison between word accuracy and resource requirements of each local engine.

Word Emission Latency

The figure below shows the average word emission latency of each engine. To obtain these results, we used 100 randomly selected files from the LibriSpeech test-clean dataset.

Word Accuracy vs Word Emission Latency

The figure below shows the comparison between word accuracy and average word emission latency of each local engine.

The figure below shows the comparison between Cheetah and each cloud API engine.

Model Size

The figure below shows the model size of each engine.

Word Accuracy vs Model Size

The figure below shows the comparison between word accuracy and model size of Cheetah and each local engine.

French Speech-to-Text Benchmark

French Speech Corpus

We use the following datasets for word accuracy benchmarks:

  • Multilingual LibriSpeech test
  • Common Voice test
  • VoxPopuli test

And we use the following datasets for punctuation accuracy benchmarks:

  • Common Voice test
  • VoxPopuli test
  • Fleurs test

Results

Word Accuracy

The figure below shows the word accuracy of each engine averaged over all datasets.

Punctuation Accuracy

The figure below shows the punctuation accuracy of each engine averaged over all datasets.

German Speech-to-Text Benchmark

German Speech Corpus

We use the following datasets for word accuracy benchmarks:

  • Multilingual LibriSpeech test
  • Common Voice test
  • VoxPopuli test

And we use the following datasets for punctuation accuracy benchmarks:

  • Common Voice test
  • VoxPopuli test
  • Fleurs test

Results

Word Accuracy

The figure below shows the word accuracy of each engine averaged over all datasets.

Punctuation Accuracy

The figure below shows the punctuation accuracy of each engine averaged over all datasets.

Spanish Speech-to-Text Benchmark

Spanish Speech Corpus

We use the following datasets for word accuracy benchmarks:

  • Multilingual LibriSpeech test
  • Common Voice test
  • VoxPopuli test

And we use the following datasets for punctuation accuracy benchmarks:

  • Common Voice test
  • VoxPopuli test
  • Fleurs test

Results

Word Accuracy

The figure below shows the word accuracy of each engine averaged over all datasets.

Punctuation Accuracy

The figure below shows the punctuation accuracy of each engine averaged over all datasets.

Italian Speech-to-Text Benchmark

Italian Speech Corpus

We use the following datasets for word accuracy benchmarks:

  • Multilingual LibriSpeech test
  • Common Voice test
  • VoxPopuli test

And we use the following datasets for punctuation accuracy benchmarks:

  • Common Voice test
  • VoxPopuli test
  • Fleurs test

Results

Word Accuracy

The figure below shows the word accuracy of each engine averaged over all datasets.

Punctuation Accuracy

The figure below shows the punctuation accuracy of each engine averaged over all datasets.

Portuguese Speech-to-Text Benchmark

Portuguese Speech Corpus

We use the following datasets for word accuracy benchmarks:

  • Multilingual LibriSpeech test
  • Common Voice test

And we use the following datasets for punctuation accuracy benchmarks:

  • Common Voice test
  • Fleurs test

Results

Word Accuracy

The figure below shows the word accuracy of each engine averaged over all datasets.

Punctuation Accuracy

The figure below shows the punctuation accuracy of each engine averaged over all datasets.

Usage

The data and code used to create this benchmark are available on GitHub under the permissive Apache 2.0 license. Detailed instructions for benchmarking individual engines are provided in the following documents:

  • Amazon Transcribe Streaming
  • Azure Real-Time Speech-to-Text
  • Google Streaming Speech-to-Text
  • Picovoice Cheetah Streaming Speech-to-Text
