Real-time Transcription Benchmark

Real-time transcription is one of the best-known and most widely used speech AI technologies. It enables applications that require immediate visual feedback, such as dictation (voice typing), voice assistants, and closed captioning of virtual events and meetings. Finding the best real-time transcription engine can be challenging: a good engine must convert speech to text accurately and with minimal delay.

Real-time transcription solutions achieving state-of-the-art accuracy often run in the cloud. Currently, Amazon Transcribe Streaming, Azure Real-Time Speech-to-Text, and Google Streaming Speech-to-Text are the dominant cloud APIs for real-time transcription. Running real-time transcription in the cloud requires data to be processed on remote servers, which introduces network latency and unreliable response times. Hence, cloud dependency can prevent applications from providing immediate feedback, whether the issue lies on the user's side, the cloud provider's side, or both.

On-device real-time transcription solutions offer reliable real-time experiences by removing the inherent limitation of cloud computing: variable delay induced by network connectivity. However, running transcription locally with minimal resource requirements and without sacrificing accuracy is challenging. Cheetah Streaming Speech-to-Text is a highly efficient on-device real-time transcription engine that matches the accuracy of cloud-based real-time speech-to-text APIs while processing voice data locally. Cheetah Fast offers ultra-low-latency real-time transcription with minimal accuracy trade-offs, making it well suited for live streaming and interactive voice applications. The benchmarks below back these claims.

Speech-to-Text Benchmark Languages

  • English Speech-to-Text Benchmark
  • French Speech-to-Text Benchmark
  • German Speech-to-Text Benchmark
  • Spanish Speech-to-Text Benchmark
  • Italian Speech-to-Text Benchmark
  • Portuguese Speech-to-Text Benchmark

Speech-to-Text Benchmark Metrics

Word Error Rate (WER)

Word error rate is the ratio of the edit distance between the words in a reference transcript and the words in the output of the speech-to-text engine to the number of words in the reference transcript. In other words, WER is the ratio of errors in a transcript to the total number of words spoken. Despite its limitations, WER is the most commonly used metric for measuring speech-to-text accuracy. A lower WER (fewer errors) means better accuracy in recognizing speech.
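The definition above can be sketched in a few lines of Python. This is an illustrative implementation (the function name and normalization are our own, not from the benchmark repository): the word-level Levenshtein distance counts substitutions, insertions, and deletions, and is divided by the reference word count.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # Dynamic-programming Levenshtein distance over word sequences.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # i deletions to reach an empty hypothesis
    for j in range(len(hyp) + 1):
        d[0][j] = j  # j insertions from an empty reference
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution / match
    return d[len(ref)][len(hyp)] / len(ref)
```

For example, one substituted word in a six-word reference yields a WER of 1/6. Real benchmarks also normalize text (casing, digits, abbreviations) before scoring; only lowercasing is shown here.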

Punctuation Error Rate (PER)

Punctuation Error Rate is the ratio of punctuation-specific errors between a reference transcript and the output of a speech-to-text engine to the number of punctuation-related operations in the reference transcript (more details in Section 3 of Meister et al.). A lower PER (lower number of errors) means better accuracy in punctuating speech. We report PER results for periods (.) and question marks (?).
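A simplified sketch of the idea, under our own assumptions (the full metric in Meister et al. aligns punctuation relative to the surrounding words; here we only compare the bare sequences of marks, which conveys the intuition but is not the exact metric):

```python
def punctuation_error_rate(reference: str, hypothesis: str, marks: str = ".?") -> float:
    """Simplified PER: edit distance between the sequences of punctuation
    marks in the reference and the engine output, over the reference count."""
    ref = [c for c in reference if c in marks]
    hyp = [c for c in hypothesis if c in marks]
    if not ref:
        return 0.0
    # Levenshtein distance over the punctuation-mark sequences.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[len(ref)][len(hyp)] / len(ref)
```

For instance, if the reference contains one period and one question mark but the engine emits only the period, this sketch reports a PER of 0.5 (one deletion over two reference marks).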

Word Emission Latency

Word emission latency is the average delay from the point a word is finished being spoken to when its transcription is emitted by a streaming speech-to-text engine. A lower word emission latency means a more responsive experience with smaller delays between the intermediate transcriptions.
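Given per-word spoken end times and the timestamps at which the streaming engine first emits each word (both measured from the start of the audio stream), the metric is a simple average of the per-word delays. The helper below is illustrative; the name and argument shapes are our own:

```python
def average_word_emission_latency(word_end_times, emission_times):
    """Average delay (seconds) from when each word finishes being spoken
    to when the streaming engine first emits its transcription.

    Both lists hold timestamps, in seconds, from the start of the stream,
    with one emission timestamp per reference word.
    """
    if len(word_end_times) != len(emission_times):
        raise ValueError("expected one emission timestamp per word")
    delays = [emit - end for end, emit in zip(word_end_times, emission_times)]
    return sum(delays) / len(delays)
```

For example, words finishing at 1.0 s and 2.0 s that are emitted at 1.5 s and 2.3 s give an average latency of 0.4 s.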

Core-Hour

The Core-Hour metric evaluates the computational efficiency of a speech-to-text engine: the number of CPU core-hours required to process one hour of audio. A speech-to-text engine with a lower Core-Hour is more computationally efficient. The open-source real-time transcription benchmark omits this metric for cloud-based engines, as the data is not available. It includes OpenAI Whisper, despite its lack of streaming capability, because there is no other well-known, state-of-the-art on-device streaming speech-to-text alternative to compare with Cheetah.
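The metric reduces to simple arithmetic once the wall-clock processing time, the number of cores kept busy, and the audio duration are known. A minimal sketch (function name is ours):

```python
def core_hours_per_audio_hour(processing_seconds: float,
                              num_cores: int,
                              audio_seconds: float) -> float:
    """Core-Hour: CPU core-hours consumed per hour of audio transcribed.

    Units cancel, so seconds can be used throughout:
    (processing time x cores) / audio duration.
    """
    return (processing_seconds * num_cores) / audio_seconds
```

For example, keeping 10 cores busy for 360 seconds to process one hour (3600 seconds) of audio costs exactly 1.0 Core-Hour per hour of audio.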

English Speech-to-Text Benchmark

English Speech Corpus

We use the following datasets for word accuracy benchmarks:

  • LibriSpeech test-clean
  • LibriSpeech test-other
  • Common Voice test
  • TED-LIUM test

And we use the following datasets for punctuation accuracy benchmarks:

  • Common Voice test
  • VoxPopuli test
  • Fleurs test

Results

Word Accuracy

The figure below shows the word accuracy of each engine averaged over all datasets.

Punctuation Accuracy

The figure below shows the punctuation accuracy of each engine averaged over all datasets.

Core Hour

The figure below shows the resource requirement of each engine.

Please note that we ran the benchmark across the entire LibriSpeech test-clean dataset on an Ubuntu 22.04 machine with an AMD Ryzen 9 5900X (12-core) CPU @ 3.70 GHz, 64 GB of RAM, and NVMe storage, using 10 cores simultaneously, and recorded the processing time to obtain these results. Different datasets and platforms affect the Core-Hour; however, one can expect the same ratio among engines if everything else is the same. For example, Whisper Tiny requires 3x the resources, i.e., takes 3x more time, compared to Picovoice Leopard.

Word Emission Latency

The figure below shows the average word emission latency of each engine.

To obtain these results, we used 100 randomly selected files from the LibriSpeech test-clean dataset.

Word Accuracy vs Word Emission Latency

The figure below shows the comparison between word accuracy and average word emission latency of each engine.

French Speech-to-Text Benchmark

French Speech Corpus

We use the following datasets for word accuracy benchmarks:

  • Multilingual LibriSpeech test
  • Common Voice test
  • VoxPopuli test

And we use the following datasets for punctuation accuracy benchmarks:

  • Common Voice test
  • VoxPopuli test
  • Fleurs test

Results

Word Accuracy

The figure below shows the word accuracy of each engine averaged over all datasets.

Punctuation Accuracy

The figure below shows the punctuation accuracy of each engine averaged over all datasets.

German Speech-to-Text Benchmark

German Speech Corpus

We use the following datasets for word accuracy benchmarks:

  • Multilingual LibriSpeech test
  • Common Voice test
  • VoxPopuli test

And we use the following datasets for punctuation accuracy benchmarks:

  • Common Voice test
  • VoxPopuli test
  • Fleurs test

Results

Word Accuracy

The figure below shows the word accuracy of each engine averaged over all datasets.

Punctuation Accuracy

The figure below shows the punctuation accuracy of each engine averaged over all datasets.

Spanish Speech-to-Text Benchmark

Spanish Speech Corpus

We use the following datasets for word accuracy benchmarks:

  • Multilingual LibriSpeech test
  • Common Voice test
  • VoxPopuli test

And we use the following datasets for punctuation accuracy benchmarks:

  • Common Voice test
  • VoxPopuli test
  • Fleurs test

Results

Word Accuracy

The figure below shows the word accuracy of each engine averaged over all datasets.

Punctuation Accuracy

The figure below shows the punctuation accuracy of each engine averaged over all datasets.

Italian Speech-to-Text Benchmark

Italian Speech Corpus

We use the following datasets for word accuracy benchmarks:

  • Multilingual LibriSpeech test
  • Common Voice test
  • VoxPopuli test

And we use the following datasets for punctuation accuracy benchmarks:

  • Common Voice test
  • VoxPopuli test
  • Fleurs test

Results

Word Accuracy

The figure below shows the word accuracy of each engine averaged over all datasets.

Punctuation Accuracy

The figure below shows the punctuation accuracy of each engine averaged over all datasets.

Portuguese Speech-to-Text Benchmark

Portuguese Speech Corpus

We use the following datasets for word accuracy benchmarks:

  • Multilingual LibriSpeech test
  • Common Voice test

And we use the following datasets for punctuation accuracy benchmarks:

  • Common Voice test
  • Fleurs test

Results

Word Accuracy

The figure below shows the word accuracy of each engine averaged over all datasets.

Punctuation Accuracy

The figure below shows the punctuation accuracy of each engine averaged over all datasets.

Usage

The data and code used to create this benchmark are available on GitHub under the permissive Apache 2.0 license. Detailed instructions for benchmarking individual engines are provided in the following documents:

  • AWS Transcribe Streaming
  • Azure Real-Time Speech-to-Text
  • Google Streaming Speech-to-Text
  • Picovoice Cheetah Streaming Speech-to-Text
