Speech-to-Speech Translation

Build on-device speech-to-speech translation for mobile and embedded devices

Build a speech-to-speech translation pipeline that automatically detects the spoken language and runs locally on Android, iOS, macOS, Windows, Linux, embedded devices, and the web. No cloud, no latency.

Platforms supported
Android · iOS · Linux · macOS · Windows · Chrome · Edge · Firefox · Safari · Raspberry Pi
How speech-to-speech translation is built

On-device Voice AI and Language SDKs in a single pipeline

On-device speech-to-speech translation combines Bat Spoken Language Identification, Cheetah Streaming Speech-to-Text, Zebra Translate, and Orca Streaming Text-to-Speech in a single local pipeline. Most implementations require the user to select a source language upfront or still route at least one stage through a cloud API. This pipeline eliminates both constraints: language is detected automatically, and every stage runs on the device.

Speaker (any language: "Bonjour, est-ce que cette place...") → Bat Spoken Language ID (detected: French) → Cheetah Streaming Speech-to-Text (source transcript) → Zebra Translation (translated text) → Orca Text-to-Speech ("Hello, is this seat free?"). Repeat for each utterance.
Why Bat Spoken Language Identification?

Automatic source-language detection. No user input required.

93% accuracy - 2x fewer errors than SpeechBrain LangID
5 MB peak memory - 62x less than SpeechBrain LangID
0.004x core-hour ratio - 9x less than SpeechBrain LangID

Bat identifies the spoken language automatically from the audio stream and routes it to the correct recognition model with no user input required. A French speaker and a German speaker can both address the same device without any configuration change. Runs on-device with no audio uploaded.

Accuracy - higher is better
Bat Spoken LID: 93%
SpeechBrain LID: 85%
Miss Rate - lower is better
Bat Spoken LID: 7%
SpeechBrain LID: 15%
Why Cheetah Streaming Speech-to-Text?

Lowest latency. Lowest compute. No accuracy tradeoff.

10.1% WER (English) vs. 11.9% Google and 10.6% Moonshine Medium
0.08 CPU core-hours vs. 3.36 Moonshine Medium - 40x less
8.6% WER (Spanish) vs. 11.6% Google and 9.4% Azure

Cheetah Streaming Speech-to-Text beats Google Cloud STT in word error rate and word-emission latency across all tested languages, and outperforms Azure STT in several benchmarks, per the open-source real-time transcription benchmark, even before it is customized for the use case. It emits words at 590 ms median latency, typically one word behind the speaker, and requires less compute than any other local engine tested. The result: no tradeoff on accuracy, latency, or privacy, and no minimum hardware requirements.

English Word Error Rate - lower is better
Amazon Streaming: 5.6%
Azure Real-time: 8.2%
Cheetah Streaming: 10.1%
Moonshine Streaming Medium: 10.6%
Vosk Streaming Large: 11.5%
Google Streaming: 11.9%
Whisper.cpp Streaming Base: 19.8%
English Punctuation Error Rate - lower is better
Cheetah Streaming: 16.1%
Azure Real-time: 16.4%
Amazon Streaming: 24.4%
Google Streaming: 36%
Moonshine Streaming Medium: 45.1%
Whisper.cpp Streaming Base: 54.1%
Why Zebra Translate?

120 words per second. Opus-level accuracy. Zero network latency.

<100 MB peak memory usage
<80 MB model size per language pair
1:1 accuracy match with Opus-MT by Helsinki-NLP

Zebra, Picovoice's on-device translation SDK, returns up to 120 words per second, far faster than the roughly 2 words per second a person speaks or the 5 words per second a person reads. The speed does not come at the cost of accuracy: Zebra matches Opus, one of the best-known open translation models. On-device processing adds a further edge: zero network latency, which cloud translation APIs such as Google Translate and DeepL can never match.

Translation accuracy (BLEU) - higher is better
Zebra (DE → EN): 51
Opus (DE → EN): 51
Zebra (EN → FR): 55
Opus (EN → FR): 55
Zebra (ES → IT): 58
Opus (ES → IT): 57
Translation speed (words/sec) - higher is better
Zebra (DE → EN): 112
Opus (DE → EN): 45
Zebra (EN → FR): 105
Opus (EN → FR): 41
Zebra (ES → IT): 90
Opus (ES → IT): 36
Why Orca Streaming Text-to-Speech?

29 MB peak memory. Natural-sounding TTS in any environment.

29 MB peak memory usage
130 ms first-token-to-speech latency
7 MB model size

Most high-quality TTS solutions require hundreds of megabytes of RAM. Orca TTS uses 29 MB peak memory, 10–50x less than any other on-device alternative except for eSpeak. This makes Orca the only natural-sounding TTS deployable in any environment, including browser tabs, mobile apps with strict out-of-memory limits, and embedded devices.

TTS VART - lower is better
Picovoice Orca: 204 ms
ElevenLabs Streaming: 504 ms
eSpeak NG: 1,504 ms
ElevenLabs TTS: 1,548 ms
TTS Latency - lower is better
Orca TTS Streaming: 128 ms
ElevenLabs TTS Streaming: 335 ms
eSpeak TTS: 1,430 ms
ElevenLabs TTS: 1,470 ms
Where speech-to-speech translation ships

Accurate speech translation from consumer devices to enterprise fieldwork

Field interviews and body cameras

On-device voice translation for field interviews and body cameras

Officers conducting interviews, taking statements, and communicating with witnesses or suspects who speak different languages in the field have no time to wait for a human interpreter. Running a voice translator on-device matters for two reasons: body cameras and patrol devices frequently operate in areas with poor connectivity, and audio of police interactions cannot be routed through a third-party cloud service.

Paramedics, firefighters and dispatchers

Real-time speech translation for first responders

Paramedics, firefighters, and emergency dispatchers face language barriers in situations where seconds matter. Someone in distress who cannot describe symptoms, or a bystander who cannot explain what happened, can delay critical decisions. On-device translation works even when connectivity is poor, adds no perceptible latency, and keeps sensitive audio and personally identifiable information off external servers.

Travel and tourism

Offline voice translation for travel and tourism apps

Tourists and expats navigate markets, hotels, restaurants, and transport in countries where they do not speak the language. The on-device advantage: translation works on a plane, underground, in rural areas, and anywhere roaming data is expensive or unavailable. No cloud dependency means no degraded experience when the signal drops mid-conversation, regardless of where they are.

Border interviews

Private offline translation for border interviews

Border agents interview travelers who may or may not know English or the local language in noisy, time-pressured environments. The audio is sensitive and cannot be sent to a commercial cloud API. Connectivity at land border crossings and remote processing centers is often unreliable. On-device speech translation addresses these challenges by keeping audio off any external server and eliminating network latency.

Conferences and live events

On-device speech translation for conferences, trade shows, and live events

Conferences, summits, trade shows, and sporting events with multilingual attendees need real-time translation that does not depend on venue wifi. Convention center networks are notoriously overloaded during large events. On-device translation runs on the attendee's own device or the event hardware with no dependency on the local network, no per-attendee cloud cost, and no audio passing through an event organizer's infrastructure.

Multilingual call centers

On-device speech translation for multilingual call centers

Agents may need to handle calls from customers who speak different languages, or simply identify the caller's language to route the call to the right person. With on-device translation, agents can follow any conversation in real time and respond through synthesized speech in the caller's language, eliminating the latency that cloud translation APIs add on every utterance, with no audio sent to a third-party service.

Get started

Build an on-device speech-to-speech translation app in 3 steps

A complete working recipe in Python. Open-source on GitHub. Runs 100% on-device.

recipe · speech-to-speech-translation
Difficulty
Beginner
Runtime
100% on-device
Language
Python
Platforms supported
Android · iOS · Linux · macOS · Windows · Chrome · Edge · Firefox · Safari · Raspberry Pi

Prerequisites

A Picovoice AccessKey from Picovoice Console and a clone of the GitHub repo.

Usage

These instructions assume your current working directory is recipes/speech-to-speech-translation/python.
1. Create a virtual environment

Isolate the recipe's dependencies from your system Python.
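A standard invocation, assuming Python 3 is available as `python3` (see the recipe's README for the exact command):

```shell
# create an isolated environment in .venv inside the recipe directory
python3 -m venv .venv
```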
2. Activate the virtual environment

Activation makes pip install into .venv instead of system Python.
The activation command differs between Linux/macOS/Raspberry Pi and Windows.
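A typical activation, assuming the environment was created as `.venv` in the previous step:

```shell
# Linux, macOS, or Raspberry Pi
python3 -m venv .venv          # from step 1; skip if already created
source .venv/bin/activate

# Windows (Command Prompt):
#   .venv\Scripts\activate.bat
```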
3. Install dependencies

Pulls in the Bat, Cheetah, Zebra, and Orca Python SDKs along with audio I/O.
4. Download the required models

Run the setup script to download the models for Cheetah Streaming Speech-to-Text, Zebra Translation, and Orca Streaming Text-to-Speech.
5. Run the speech-to-speech translation pipeline

Bat Spoken Language Identification detects the spoken language and assigns the correct automatic speech recognition model. Cheetah Streaming Speech-to-Text transcribes the audio. Zebra Translate translates the transcript into the target language. Orca synthesizes the translated text into speech. All four run locally in the same process and on the same machine.
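The orchestration of those four stages can be sketched as below. Every function here is a stand-in stub, not the real Picovoice SDK API; the actual calls and signatures are in the recipe on GitHub.

```python
# Illustrative pipeline orchestration with stubbed stages.
# None of these functions are real Picovoice SDK calls.

def detect_language(pcm):
    """Stub for Bat Spoken Language Identification."""
    return "fr"  # pretend the speaker is French

def transcribe(pcm, language):
    """Stub for Cheetah Streaming Speech-to-Text with a per-language model."""
    return "Bonjour, est-ce que cette place est libre ?"

def translate(text, source, target):
    """Stub for Zebra Translate."""
    return "Hello, is this seat free?"

def synthesize(text):
    """Stub for Orca Streaming Text-to-Speech; returns PCM samples."""
    return [0] * 16000

def translate_utterance(pcm, target="en"):
    source = detect_language(pcm)                       # 1. Bat: detect spoken language
    transcript = transcribe(pcm, source)                # 2. Cheetah: speech-to-text
    translated = translate(transcript, source, target)  # 3. Zebra: translate
    return synthesize(translated)                       # 4. Orca: text-to-speech

# one utterance of audio in, translated speech out; repeat per utterance
out_pcm = translate_utterance([0] * 512)
```

The structure mirrors the description above: detection picks the recognition model, and each stage feeds the next within a single local process.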
Have questions or looking for implementations in other languages? Visit the Speech-to-Speech Translation recipe in the pico-cookbook repo on GitHub, where you can find the code and open an issue for demo-related technical questions.
Frequently asked questions

FAQ

What is speech-to-speech translation?
Speech-to-speech translation is the automatic conversion of spoken audio in one language into spoken audio in another language, in real time. The speaker's words are detected, translated, and synthesized into the target language without any manual steps. On-device speech-to-speech translation runs this entire pipeline locally, with no audio sent to a cloud service, meaning it works fully offline.
Do I need to select the languages before using the translation app?
Bat Spoken Language Identification detects the spoken language automatically from the audio stream. You only need to specify the target language, the language you want to hear the output in. The pipeline handles source language detection and model routing without any user input.
Does speech-to-speech translation work offline?
Yes. Bat Spoken Language Identification, Cheetah Streaming Speech-to-Text, Zebra Translate, and Orca Streaming Text-to-Speech all run 100% on-device. Audio and text data are processed locally and never sent to any server. This makes the pipeline suitable for travel, fieldwork, vehicles, and any environment with unreliable or unavailable connectivity.
How is this different from Google Translate or DeepL?
Google Translate and DeepL are cloud services: they transmit audio or text to remote servers, introducing network round-trip latency. Picovoice's pipeline runs entirely on the device, so no audio is transmitted, there is zero network latency at the translation stage, and privacy is fully preserved.
How is this different from Microsoft Translator or Apple Translate?
Microsoft Translator and Apple Translate are not available as on-device, embeddable SDKs for arbitrary third-party products. Picovoice's pipeline is a cross-platform SDK that any developer or OEM can integrate — on Android, iOS, Linux, macOS, Windows, embedded hardware, or in a web browser.
What languages are supported?
Bat Spoken Language Identification supports detection across a wide set of languages. Cheetah Streaming Speech-to-Text supports English, French, German, Spanish, Italian, and Portuguese. Zebra Translate supports language pairs across English, French, German, Spanish, Italian, Portuguese, Japanese, and Korean. Orca Text-to-Speech generates natural speech in English, French, German, Spanish, Italian, Portuguese, Japanese, and Korean. For detailed language support, check the product pages or documentation of each product.
Can I customize the translated voice?
Yes. Orca supports custom pronunciation and speech speed control. Check Orca Text-to-Speech documentation for details.
How can I get technical support for the speech-to-speech translation recipe?