Cheetah Streaming Speech-to-Text

Low-latency, highly accurate real-time transcription SDK

Build on-device voice AI agents, dictation apps, and conversational interfaces that run even on embedded devices. Customize with domain vocabulary for higher accuracy.

10.1% WER (English) vs. 11.9% Google and 10.6% Moonshine Medium
0.08x core-hour ratio vs. 3.36x Moonshine Medium (40× less compute)
8.6% WER (Spanish) vs. 11.6% Google and 9.4% Azure
What is Cheetah Streaming Speech-to-Text?

Streaming ASR, built for apps that can't wait or be wrong

Cheetah is an enterprise-ready on-device streaming speech-to-text engine built for guaranteed low latency, domain-specific accuracy, and cross-platform production deployment. It transcribes speech in real time as it is spoken, runs entirely offline across platforms, and is private by architecture.

Cloud-based real-time transcription introduces network latency that makes response times unpredictable, creates data-privacy exposure, and incurs unbounded costs at scale. On-device processing eliminates these inherent cloud limitations. However, most local engines struggle to match cloud accuracy without sacrificing latency or compute efficiency.

Cheetah is the only on-device real-time transcription engine that matches cloud WER (beating Google in every benchmark and Azure in some), even before it is customized for the use case, while requiring less compute than any other local engine tested. No tradeoff on accuracy, latency, or privacy, and no minimum hardware requirements.

Developer Experience

Custom real-time transcription in under 3 lines

A single API handles audio streaming, endpoint detection, and transcript emission. Partial transcripts arrive word by word as they're uttered. Use Cheetah Streaming STT with its native SDKs for Python, NodeJS, Android, iOS, Java, .NET, React, Flutter, React Native, C, and Web.
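As an illustration of that loop, here is a sketch, not the SDK itself: `StreamingSTT` and `transcribe_stream` are stand-in names that mirror the `process()`/`flush()` shape of the `pvcheetah` Python package.

```python
from typing import Iterable, List, Protocol, Tuple

class StreamingSTT(Protocol):
    # Mirrors the shape of the pvcheetah API: process() takes one PCM
    # frame and returns (new partial text, endpoint-detected flag);
    # flush() returns whatever transcript is still buffered.
    def process(self, frame: List[int]) -> Tuple[str, bool]: ...
    def flush(self) -> str: ...

def transcribe_stream(engine: StreamingSTT, frames: Iterable[List[int]]) -> str:
    """Feed audio frames to a streaming engine and collect the transcript."""
    parts = []
    for frame in frames:
        partial, is_endpoint = engine.process(frame)
        parts.append(partial)
        if is_endpoint:
            # End of utterance detected: flush the remaining buffered audio.
            parts.append(engine.flush())
    return "".join(parts)
```

With the real SDK, `engine` would come from `pvcheetah.create(access_key=...)`, and `frames` would be microphone audio sampled at `engine.sample_rate`, chunked into `engine.frame_length` samples.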

Open-Source Real-Time Transcription Benchmark

Proven accuracy vs. Amazon, Azure, and Google Streaming STTs — across languages

Cloud transcription APIs have one advantage over on-device alternatives: accuracy. Cheetah Streaming Speech-to-Text closes that gap despite its minimal size, beating Google Streaming STT across all six benchmarked languages, and both Azure and Google in French, Spanish, and Italian.

English Word Error Rate (lower is better)
Amazon Streaming: 5.6%
Azure Real-time: 8.2%
Cheetah Streaming: 10.1%
Moonshine Streaming Medium: 10.6%
Vosk Streaming Large: 11.5%
Google Streaming: 11.9%
Whisper.cpp Streaming Base: 19.8%

English Punctuation Error Rate (lower is better)
Cheetah Streaming: 16.1%
Azure Real-time: 16.4%
Amazon Streaming: 24.4%
Google Streaming: 36.0%
Moonshine Streaming Medium: 45.1%
Whisper.cpp Streaming Base: 54.1%

Word Emission Latency (lower is better)
Azure Real-time: 530 ms
Cheetah Streaming: 590 ms
Moonshine Streaming Medium: 640 ms
Google Streaming: 830 ms
Amazon Streaming: 920 ms
Whisper.cpp Streaming Base: 1,240 ms
Vosk Streaming Large: 2,000 ms

Core-Hour Ratio (lower is better)
Cheetah Streaming: 0.083x
Vosk Streaming Large: 0.12x
Whisper.cpp Streaming Base: 1.67x
Moonshine Streaming Medium: 3.36x

French Word Error Rate (lower is better)
Amazon Streaming: 9.3%
Cheetah Streaming: 13.6%
Azure Streaming: 15.8%
Google Streaming: 18.5%

Spanish Word Error Rate (lower is better)
Amazon Streaming: 6.4%
Cheetah Streaming: 8.6%
Azure Streaming: 9.4%
Google Streaming: 11.6%

Italian Word Error Rate (lower is better)
Amazon Streaming: 11.5%
Cheetah Streaming: 14.3%
Azure Streaming: 18.0%
Google Streaming: 18.5%

German Word Error Rate (lower is better)
Amazon Streaming: 8.4%
Azure Streaming: 10.0%
Cheetah Streaming: 11.9%
Google Streaming: 16.1%
Ready to integrate? Check our docs to start building or talk to the sales team about enterprise deployment.
Capabilities

Why enterprises choose Cheetah Streaming Speech-to-Text

Cheetah is an enterprise-ready on-device streaming speech-to-text engine built for guaranteed low latency, domain-specific accuracy, and cross-platform production deployment. It transcribes in real time, processes audio data entirely offline, supports custom vocabulary and keyword boosting, and is private by design.

01Real-time transcription with guaranteed response timeCloud APIs introduce unpredictable latency: network transmission, server load, and geographic distance from the inference cluster. Cheetah's response time is not limited by connectivity, giving a guaranteed latency whether it's used in a hospital corridor, a factory floor, or a subway tunnel.
02Custom vocabularyA generic STT model cannot know your upcoming product or customer-specific jargon. Cheetah lets product teams add domain-specific vocabulary directly on the Picovoice Console, such as medical terms, industry-specific phrases, and proper nouns, so the transcription model knows the domain before it hears a single word. No ML expertise, no labelled audio, no retraining pipeline required. Learn how to add custom vocabulary.
03Keyword boostingBeyond adding vocabulary, product teams can increase the probability weight of specific keywords that appear frequently in conversations. A sales-call analytics platform hears "renewal", "proposal", and "churn" frequently and cannot afford to miss them. Boosting them reduces misrecognition for exactly the terms that matter most. Learn how to boost keywords.
04Programmatic custom vocabulary APIThe Cheetah Streaming Speech-to-Text Custom Model API lets both developers and end users add custom vocabulary and boost keywords from any device via cloud API, without visiting Picovoice Console.
05Dedicated Model TrainingPicovoice researchers optimize real-time transcription models further for any domain, hardware, acoustic environment, speaker accent, or noise condition through Non-Recurring Engineering (NRE) engagements.
06SOTA accuracy in French, Spanish, Italian, German, and PortugueseMost ASR providers have invested disproportionately in English. Cheetah's dedicated language models outperform both Azure Real-Time Speech-to-Text and Google Streaming Speech-to-Text on WER in French, Spanish, and Italian, without a cloud dependency. Product teams building for European markets choose Cheetah Streaming Speech-to-Text for a reliable user experience in both English and local languages.
07Tunable endpointingEndpointing determines when an ASR considers a sentence complete to emit the final transcript. Cheetah Streaming Speech-to-Text gives product teams direct control over endpointing duration: shorter for fast-response applications, longer for users who speak slowly or pause mid-sentence.
08Automatic Punctuation and TruecasingCheetah Streaming Speech-to-Text automatically applies commas, periods, question marks, and proper capitalisation based on prosody and sentence structure without requiring post-processing or a separate punctuation model. Cheetah returns transcripts that are immediately readable and ingestible by downstream NLP and LLM pipelines, and suitable for human-facing displays. Cheetah achieves 16.1% PER in English — the best result of any engine in the benchmark, ahead of Amazon (24.4%), Azure (16.4%), and Google (36.0%).
09NormalizationInverse Text Normalization (ITN), or Normalization for short, is a speech-to-text process that converts verbalized forms into their corresponding symbolic written forms. While end users speak naturally, Cheetah Streaming Speech-to-Text formats cardinal numbers, decimal numbers, and math symbols as expected:
"i'll have two coffees please" → "I'll have 2 coffees please."
"we need twelve point three five grams of coffee" → "We need 12.35 grams of coffee."
"let me do the math two hundred divided by sixty five thousand equals forty percent" → "Let me do the math. 200 / 65000 = 40%."
10Private by DesignCheetah Streaming Speech-to-Text processes audio entirely on-device. No audio data is transmitted to any server, no cloud logs are created, and no third-party data retention occurs, making Cheetah Streaming Speech-to-Text GDPR, HIPAA, and CCPA compliant by architecture — not policy.
11Cross-PlatformCheetah Streaming Speech-to-Text runs on every platform your product ships — Android, Chrome, Edge, Firefox, iOS, Linux, macOS, Raspberry Pi, Safari, and Windows — across AMD, Intel, NVIDIA, and Qualcomm hardware.
12Enterprise ReadyCheetah Streaming Speech-to-Text is production-grade and enterprise-ready. Picovoice offers flexible licensing, dedicated engineering support, NDA-protected custom model training, and SLA-backed response times for teams shipping at scale.
13Only feasible on-device streaming ASR for productionEvery on-device alternative trades something. Moonshine Streaming achieves competitive accuracy but at 40× Cheetah's compute cost and with a model too large for OTA delivery. Vosk Streaming keeps compute reasonable but emits words too slowly for responsive applications and ships a 2.7GB model. Whisper.cpp Streaming is in the middle across every dimension: mediocre accuracy, speed, and model size. Cheetah is the only on-device streaming ASR that delivers cloud-competitive accuracy with a small model size (34MB), high compute efficiency (0.083x core-hour ratio), and low word emission latency (590 ms), all simultaneously.
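As a sketch of how the endpointing and punctuation controls above might surface in application code: the helper below is hypothetical, while `endpoint_duration_sec` and `enable_automatic_punctuation` are keyword arguments of `pvcheetah.create()` in the Python SDK.

```python
def cheetah_options(fast_response: bool) -> dict:
    """Hypothetical helper mapping a product requirement onto
    pvcheetah.create() keyword arguments."""
    return {
        # Shorter pause ends the utterance sooner: snappier voice commands.
        # Longer pause tolerates speakers who pause mid-sentence.
        "endpoint_duration_sec": 0.5 if fast_response else 2.0,
        # Readable transcripts without a separate post-processing step.
        "enable_automatic_punctuation": True,
    }
```

With the real SDK this would be used as `pvcheetah.create(access_key=..., **cheetah_options(fast_response=True))`.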

Ship it.
On device.

Fast, accurate, and lightweight real-time transcription

FAQ

Common questions about real-time transcription

What is a real-time transcription engine?

A real-time transcription engine, also called a streaming speech-to-text engine, transcribes audio in real time as speech is spoken, emitting partial and final transcripts word by word rather than waiting for a complete utterance or audio file. Unlike batch transcription, which processes complete recordings after the fact, streaming ASR operates on continuous audio streams, making it suitable for live captioning, voice assistants, call analytics, and any application where immediate transcript feedback matters.

What are the use cases and applications of Streaming Speech-to-Text?

Streaming Speech-to-Text is used in real-time captioning and meeting transcription, voice assistants and conversational AI agents, IVR and call centre analytics, healthcare documentation with hands-free clinical workflows, legal transcription, e-learning platforms, and any application where transcripts need to be available as speech is happening rather than after the fact.

How does on-device real-time transcription differ from cloud-based real-time transcription APIs?

Cloud-based real-time transcription APIs record voice data and send it to vendor servers, where the transcription engine converts voice into text. On-device real-time transcription brings the transcription engine to where the voice data is, offering a guaranteed real-time experience by eliminating unpredictable network delays.

Can I use Cheetah Streaming Speech-to-Text in the cloud?

Yes. You can run Cheetah Streaming Speech-to-Text in the cloud, whether private, public, or hybrid. Picovoice on-device voice recognition technology allows enterprises to decide where to run the transcription engine instead of making the Picovoice cloud mandatory for voice processing.

What are the key metrics for evaluating real-time transcription engines?

The key metrics for real-time transcription are word error rate (WER), punctuation error rate (PER), word emission latency, core-hour ratio (compute efficiency), and model size.

Cheetah is the only engine that leads all five categories among on-device streaming alternatives, as proven by the open-source real-time transcription benchmark.

What is word error rate (WER)?

Word error rate is the ratio of transcription errors (substitutions, insertions, and deletions) to the total number of words spoken, and is the standard metric for measuring speech-to-text accuracy. A WER of 10% means roughly 10 errors for every 100 words spoken; lower WER means better accuracy. However, there are nuances in comparing WER figures: WER treats all errors equally, and the choice of test data set affects the scores. Picovoice's open-source real-time transcription benchmark measures Cheetah's WER against Amazon Transcribe, Azure, and Google across English, French, German, Spanish, Italian, and Portuguese using publicly available datasets, making results reproducible and independently verifiable.
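Conceptually, WER is word-level edit (Levenshtein) distance divided by the reference length. A minimal, illustrative Python implementation (not the benchmark's code) looks like this:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + insertions + deletions) / reference length,
    computed with word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[-1][-1] / len(ref)
```

For example, `word_error_rate("the cat sat on the mat", "the cat sat on a mat")` returns 1/6 ≈ 0.167. Note that insertions can push WER above 100%, one more reason published figures need a shared benchmark to be comparable.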

What is the punctuation error rate (PER)?

Punctuation error rate (PER) measures how accurately a transcription engine places punctuation — periods and question marks — relative to a reference transcript. PER matters for applications where transcripts are read by humans or processed downstream: meeting notes, call summaries, legal transcription, and closed captions all depend on accurate punctuation to be usable. Cheetah achieves 16.1% PER in English — the best result of any engine in the benchmark, ahead of Amazon (24.4%), Azure (16.4%), and Google (36.0%).

What is word emission latency?

Word emission latency is the average delay from when a word finishes being spoken to when an ASR engine emits it in a transcript. It is the key responsiveness metric for streaming ASR — lower means better synchronisation between speech and text output. Cheetah achieves 590 ms word emission latency in English — the lowest of any on-device engine tested, and faster than Amazon (920 ms) and Google (830 ms).

How does Cheetah Streaming STT compare to Amazon Transcribe Streaming?

In the open-source real-time transcription benchmark, Amazon Transcribe Streaming leads on WER in English and all non-English languages tested, whereas Cheetah Streaming STT beats Amazon Transcribe Streaming on word emission latency (590 ms vs 920 ms) and punctuation accuracy in English (16.1% vs. 24.4% PER). Cheetah Streaming STT's advantage over Amazon Transcribe Streaming is architectural: Cheetah Streaming STT runs entirely on-device, eliminating network latency. For applications where privacy, offline capability, or predictable cost at scale are requirements, Cheetah is the stronger choice regardless of Amazon's WER advantage.

You can reproduce the open-source benchmark to measure Amazon Transcribe Streaming WER, PER, and latency figures to compare to Cheetah Streaming STT.

How does Cheetah compare to Azure Real-Time Speech-to-Text?

Cheetah outperforms Azure Real-Time Speech-to-Text on WER in French (13.6% vs 15.8%), Spanish (8.6% vs 9.4%), and Italian (14.3% vs 18.0%), whereas Azure Real-time STT beats Cheetah in English (8.2% vs. 10.1%), German (10.0% vs. 11.9%), and Portuguese (9.7% vs. 12.3%) in the open-source real-time transcription benchmark. Where Cheetah wins decisively over Azure Real-time STT is offline capability: for applications where privacy, offline operation, or predictable cost at scale are requirements, Cheetah is the stronger choice.

You can reproduce the open-source benchmark to measure Azure Real-time STT WER, PER, and latency figures to compare to Cheetah Streaming STT.

How does Cheetah compare to Google Streaming Speech-to-Text?

As proven in the open-source real-time transcription benchmark, Cheetah outperforms Google Streaming Speech-to-Text on WER across all six languages in the benchmark — English (10.1% vs 11.9%), French (13.6% vs 18.5%), German (11.9% vs 16.1%), Spanish (8.6% vs 11.6%), Italian (14.3% vs 18.0%), and Portuguese (12.3% vs 12.8%). The punctuation accuracy gap is even wider — Cheetah's 16.1% PER versus Google's 36.0% in English, and 20.3% versus Google's 48.5% in Spanish. Cheetah also runs entirely on-device, replacing Google's per-minute API cost and cloud dependency with offline processing.

You can reproduce the open-source benchmark to measure Google STT Streaming WER, PER, and latency figures to compare to Cheetah Streaming STT.

How does Cheetah compare to Whisper?

OpenAI Whisper does not support real-time streaming. It processes audio in 30-second segments and cannot emit partial transcripts as speech occurs, making it unsuitable for live captioning, voice assistants, or any application requiring immediate feedback. Cheetah is purpose-built for streaming, delivers lower word emission latency than any Whisper variant tested, and at 34MB is significantly smaller than any Whisper model worth deploying for accuracy-sensitive applications.

How does Cheetah compare to Whisper.cpp Streaming?

OpenAI Whisper does not support real-time streaming, but there are streaming ASRs derived from Whisper, such as Whisper.cpp Streaming. As proven in the open-source real-time transcription benchmark, Whisper.cpp Streaming sits in the middle of the benchmark across every dimension: not accurate enough to lead, not fast enough to compete on latency, not small enough to justify the accuracy tradeoff. In English, Whisper.cpp Streaming Base achieves 19.8% WER versus Cheetah's 10.1%, with a 1,240 ms word emission latency versus Cheetah's 590 ms, and a 139MB model versus Cheetah's 34MB. Whisper.cpp Streaming Tiny reduces the model size to 73MB, but word emission latency remains at 1,240 ms and WER worsens to 22.4%. Neither variant offers a meaningful advantage over Cheetah on any single metric.

You can reproduce the open-source benchmark to measure Whisper.cpp Streaming WER, PER, and latency figures to compare to Cheetah Streaming STT.

How does Cheetah compare to Moonshine Streaming?

As proven in the open-source real-time transcription benchmark, Moonshine Streaming achieves competitive word accuracy: Moonshine Medium reaches 10.6% WER in English, close to Cheetah's 10.1%, but at an enormous compute and storage cost. Moonshine Medium's 3.36x core-hour ratio versus Cheetah's 0.08x represents a 40× efficiency gap. Furthermore, Moonshine Medium ships a 290MB model versus Cheetah's 34MB. Moonshine Tiny reduces resource requirements, but WER degrades to 23.9%. For applications where accuracy matters and compute budget or OTA delivery constraints exist, Moonshine is not a viable production option; Cheetah delivers better accuracy at a fraction of the resource cost.

You can reproduce the open-source benchmark to measure Moonshine Streaming WER, PER, and latency figures to compare to Cheetah Streaming STT.

How does Cheetah compare to Vosk Streaming?

As proven in the open-source real-time transcription benchmark, Vosk Streaming Large achieves 11.5% WER in English — comparable to Cheetah's 10.1% — but with a 2,000 ms word emission latency, more than 3× slower than Cheetah's 590 ms, and a 2,733MB model that rules it out for any mobile, embedded, or OTA deployment. Vosk Streaming Small reduces the model to 66MB and compute to a reasonable 0.11x core-hour ratio, but WER degrades to 18.4%, and word emission latency remains at 920 ms. Vosk offers no configuration where it simultaneously matches Cheetah Streaming Speech-to-Text on accuracy, latency, model size, and compute efficiency. For production applications where any of these dimensions matter, Cheetah is the stronger choice.

You can reproduce the open-source benchmark to measure Vosk Streaming WER, PER, and latency figures to compare to Cheetah Streaming STT.

Which platforms does Cheetah Streaming Speech-to-Text support?

Cheetah Streaming Speech-to-Text runs on Android, Chrome, Edge, Firefox, iOS, Linux, macOS, Raspberry Pi, Safari, and Windows, across AMD, Intel, NVIDIA, and Qualcomm hardware.

Can Cheetah handle noisy environments and accents?

Yes. Cheetah Streaming Speech-to-Text is trained on diverse audio conditions including background noise, multiple speakers, and various accents. For specialized environments or specific accent patterns, contact sales for custom training options.

How does deployment work for high-availability applications?

Cheetah Streaming Speech-to-Text runs entirely within your infrastructure, eliminating external dependencies that could cause outages. You can deploy across multiple instances, regions, or availability zones using standard load balancing and failover strategies.

Can I customize Cheetah Streaming Speech-to-Text for domain-specific vocabulary?

Yes, you can train custom speech-to-text models on Picovoice Console to optimize Cheetah Streaming Speech-to-Text for specific industries, terminologies, or use cases. This includes medical terminology, legal language, technical jargon, or company-specific vocabulary.

What happens if I need to process multiple languages in the same application?

Cheetah Streaming Speech-to-Text can be configured to handle multiple languages through separate instances or language-specific models. Contact sales to determine whether a multilingual model or spoken language identification is a better fit for your use case.

Which languages does Cheetah Streaming Speech-to-Text support?

Cheetah Streaming Speech-to-Text currently supports English, French, German, Italian, Portuguese, and Spanish.

What should I do to request Cheetah Streaming Speech-to-Text to support other languages?

Contact sales to tell us about your commercial endeavor and ask for speech-to-text language support.

How do I get technical support for Cheetah Streaming Speech-to-Text?

Picovoice docs, blog, Medium posts, and GitHub are great resources to learn about voice AI, Picovoice technology, and how to start building transcription products. Enterprise customers get dedicated support specific to their applications from Picovoice Product & Engineering teams. Reach out to your Picovoice contact or talk to sales to discuss support options.

How can I get informed about updates and upgrades?

Version changes are announced on LinkedIn, and subscribing to the GitHub repository is the best way to get notified of patch releases. If you enjoy building with Cheetah Streaming Speech-to-Text, show it by giving a GitHub star!