Speech-to-Speech Translation

Build on-device speech-to-speech translation for mobile and embedded devices

Build a speech-to-speech translation pipeline that automatically detects the spoken language and runs locally on Android, iOS, macOS, Windows, Linux, embedded devices, and the web. No cloud, no latency.

Platforms supported
Android · iOS · Linux · macOS · Windows · Chrome · Edge · Firefox · Safari · Raspberry Pi
How speech-to-speech translation is built

On-device Voice AI and Language SDKs in a single pipeline

On-device speech-to-speech translation combines Bat Spoken Language Identification, Cheetah Streaming Speech-to-Text, Zebra Translate, and Orca Streaming Text-to-Speech in a single local pipeline. Most implementations require the user to select a source language upfront or still route at least one stage through a cloud API. This pipeline eliminates both constraints: language is detected automatically, and every stage runs on the device.

Speaker (any language: "Bonjour, est-ce que cette place...") → Bat Spoken Language ID (detected: French) → Cheetah Streaming Speech-to-Text (source transcript) → Zebra Translation (translated text) → Orca Text-to-Speech ("Hello, is this seat free?"). Repeat for each utterance.
Why Bat Spoken Language Identification?

Automatic source-language detection. No user input required.

93% accuracy - 2x fewer errors than SpeechBrain LangID
5 MB peak memory - 62x less than SpeechBrain LangID
0.004x core-hour ratio - 9x less than SpeechBrain LangID

Bat identifies the spoken language automatically from the audio stream and routes it to the correct recognition model with no user input required. A French speaker and a German speaker can both address the same device without any configuration change. Runs on-device with no audio uploaded.

Accuracy - higher is better
Bat Spoken LID: 93%
SpeechBrain LID: 85%
Miss Rate - lower is better
Bat Spoken LID: 7%
SpeechBrain LID: 15%
Why Cheetah Streaming Speech-to-Text?

Lowest latency. Lowest compute. No accuracy tradeoff.

10.1% WER (English) vs. 11.9% Google and 10.6% Moonshine Medium
0.08 CPU core-hours vs. 3.36 Moonshine Medium - 40x less
8.6% WER (Spanish) vs. 11.6% Google and 9.4% Azure

Cheetah Streaming Speech-to-Text beats Google Cloud STT in word error rate and word-emission latency across all tested languages, and outperforms Azure STT in several benchmarks, per the open-source real-time transcription benchmark, even before it is customized for the use case. It emits words at 590 ms median latency, typically one word behind the speaker, and requires less compute than any other local engine tested. The result: no tradeoff on accuracy, latency, or privacy, and no minimum hardware requirements.

English Word Error Rate - lower is better
Amazon Streaming: 5.6%
Azure Real-time: 8.2%
Cheetah Streaming: 10.1%
Moonshine Streaming Medium: 10.6%
Vosk Streaming Large: 11.5%
Google Streaming: 11.9%
Whisper.cpp Streaming Base: 19.8%
English Punctuation Error Rate - lower is better
Cheetah Streaming: 16.1%
Azure Real-time: 16.4%
Amazon Streaming: 24.4%
Google Streaming: 36%
Moonshine Streaming Medium: 45.1%
Whisper.cpp Streaming Base: 54.1%
Why Zebra Translate?

120 words per second. Opus-level accuracy. Zero network latency.

<100 MB peak memory usage
<80 MB model size per language pair
1:1 accuracy match with Opus-MT by Helsinki-NLP

Zebra, Picovoice's on-device translation SDK, returns up to 120 words per second, far faster than the roughly 2 words per second a person speaks or the 5 words per second a person reads. The speed does not come at the cost of accuracy: Zebra matches Opus, one of the best-known open translation models. On-device processing adds a further edge: zero network latency, which cloud translation APIs such as Google Translate and DeepL can never match.

Translation accuracy (BLEU) - higher is better
Zebra (DE → EN): 51
Opus (DE → EN): 51
Zebra (EN → FR): 55
Opus (EN → FR): 55
Zebra (ES → IT): 58
Opus (ES → IT): 57
Translation speed (words/sec) - higher is better
Zebra (DE → EN): 112
Opus (DE → EN): 45
Zebra (EN → FR): 105
Opus (EN → FR): 41
Zebra (ES → IT): 90
Opus (ES → IT): 36
Why Orca Streaming Text-to-Speech?

29 MB peak memory. Natural-sounding TTS in any environment.

29 MB peak memory usage
130 ms first-token-to-speech latency
7 MB model size

Most high-quality TTS solutions require hundreds of megabytes of RAM. Orca TTS uses 29 MB peak memory, 10–50x less than any other on-device alternative except for eSpeak. This makes Orca the only natural-sounding TTS deployable in any environment, including browser tabs, mobile apps with strict out-of-memory limits, and embedded devices.

TTS VART - lower is better
Picovoice Orca: 204 ms
ElevenLabs Streaming: 504 ms
eSpeak NG: 1,504 ms
ElevenLabs TTS: 1,548 ms
TTS Latency - lower is better
Orca TTS Streaming: 128 ms
ElevenLabs TTS Streaming: 335 ms
eSpeak TTS: 1,430 ms
ElevenLabs TTS: 1,470 ms
Where speech-to-speech translation ships

Accurate speech translation from consumer devices to enterprise fieldwork

Field interviews and body cameras

On-device voice translation for field interviews and body cameras

Officers conducting interviews, taking statements, and communicating with witnesses or suspects who speak different languages in the field have no time to wait for a human interpreter. Running a voice translator on-device matters for two reasons: body cameras and patrol devices frequently operate in areas with poor connectivity, and audio of police interactions cannot be routed through a third-party cloud service.

Paramedics, firefighters and dispatchers

Real-time speech translation for first responders

Paramedics, firefighters, and emergency dispatchers face language barriers in situations where seconds matter. Someone in distress who cannot describe symptoms, or a bystander who cannot explain what happened, can delay critical decisions. On-device translation works even when connectivity is poor, adds no perceptible latency, and keeps sensitive audio and personally identifiable information off external servers.

Travel and tourism

Offline voice translation for travel and tourism apps

Tourists and expats navigate markets, hotels, restaurants, and transport in countries where they do not speak the language. The on-device advantage: translation works on a plane, underground, in rural areas, and anywhere roaming data is expensive or unavailable. No cloud dependency means no degraded experience when the signal drops mid-conversation, regardless of where they are.

Border interviews

Private offline translation for border interviews

Border agents interview travelers who may or may not know English or the local language in noisy, time-pressured environments. The audio is sensitive and cannot be sent to a commercial cloud API. Connectivity at land border crossings and remote processing centers is often unreliable. On-device speech translation addresses these challenges by keeping audio off any external server and eliminating network latency.

Conferences and live events

On-device speech translation for conferences, trade shows, and live events

Conferences, summits, trade shows, and sporting events with multilingual attendees need real-time translation that does not depend on venue wifi. Convention center networks are notoriously overloaded during large events. On-device translation runs on the attendee's own device or the event hardware with no dependency on the local network, no per-attendee cloud cost, and no audio passing through an event organizer's infrastructure.

Multilingual call centers

On-device speech translation for multilingual call centers

Agents may need to handle calls from customers who speak different languages, or simply identify the caller's language to route the call to the right person. With on-device translation, agents can follow any conversation in real time and respond through synthesized speech in the caller's language, eliminating the latency that cloud translation APIs add on every utterance, with no audio sent to a third-party service.

Get started

Build an on-device speech-to-speech translation app in 3 steps

A complete working recipe in Python. Open-source on GitHub. Runs 100% on-device.

recipe · speech-to-speech-translation
Difficulty
Beginner
Runtime
100% on-device
Language
Python
Platforms supported
Android · iOS · Linux · macOS · Windows · Chrome · Edge · Firefox · Safari · Raspberry Pi

Prerequisites

A Picovoice AccessKey from Picovoice Console and a clone of the GitHub repo.

Usage

These instructions assume your current working directory is recipes/speech-to-speech-translation/python.
1. Create a virtual environment

Isolate the recipe's dependencies from your system Python.
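A standard invocation, assuming Python 3 is available as `python3` (see the recipe's README for the exact command):

```shell
# create an isolated environment in .venv inside the recipe directory
python3 -m venv .venv
```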
2. Activate the virtual environment

Activation makes pip install into .venv instead of system Python.
The activation command differs between Linux/macOS/Raspberry Pi and Windows.
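A typical activation, assuming the environment was created as `.venv` in the previous step:

```shell
# Linux, macOS, or Raspberry Pi
python3 -m venv .venv          # from step 1; skip if already created
source .venv/bin/activate

# Windows (Command Prompt):
#   .venv\Scripts\activate.bat
```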
3. Install dependencies

Pulls in the Bat, Cheetah, Zebra, and Orca Python SDKs along with audio I/O.
4. Download the required models

Run the setup script to download the models for Cheetah Streaming Speech-to-Text, Zebra Translation, and Orca Streaming Text-to-Speech.
5. Run the speech-to-speech translation pipeline

Bat Spoken Language Identification detects the spoken language and assigns the correct automatic speech recognition model. Cheetah Streaming Speech-to-Text transcribes the audio. Zebra Translate translates the transcript into the target language. Orca synthesizes the translated text into speech. All four run locally in the same process and on the same machine.
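The orchestration of those four stages can be sketched as below. Every function here is a stand-in stub, not the real Picovoice SDK API; the actual calls and signatures are in the recipe on GitHub.

```python
# Illustrative pipeline orchestration with stubbed stages.
# None of these functions are real Picovoice SDK calls.

def detect_language(pcm):
    """Stub for Bat Spoken Language Identification."""
    return "fr"  # pretend the speaker is French

def transcribe(pcm, language):
    """Stub for Cheetah Streaming Speech-to-Text with a per-language model."""
    return "Bonjour, est-ce que cette place est libre ?"

def translate(text, source, target):
    """Stub for Zebra Translate."""
    return "Hello, is this seat free?"

def synthesize(text):
    """Stub for Orca Streaming Text-to-Speech; returns PCM samples."""
    return [0] * 16000

def translate_utterance(pcm, target="en"):
    source = detect_language(pcm)                       # 1. Bat: detect spoken language
    transcript = transcribe(pcm, source)                # 2. Cheetah: speech-to-text
    translated = translate(transcript, source, target)  # 3. Zebra: translate
    return synthesize(translated)                       # 4. Orca: text-to-speech

# one utterance of audio in, translated speech out; repeat per utterance
out_pcm = translate_utterance([0] * 512)
```

The structure mirrors the description above: detection picks the recognition model, and each stage feeds the next within a single local process.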
Have questions or looking for implementations in other languages? Visit the Speech-to-Speech Translation recipe in the pico-cookbook repo on GitHub, where you can find the code and open an issue for demo-related technical questions.
Frequently asked questions

FAQ

What is speech-to-speech translation?
Speech-to-speech translation is the automatic conversion of spoken audio in one language into spoken audio in another language, in real time. The speaker's words are detected, translated, and synthesized into the target language without any manual steps. On-device speech-to-speech translation runs this entire pipeline locally, with no audio sent to a cloud service, meaning it works fully offline.
Do I need to select the languages before using the translation app?
Bat Spoken Language Identification detects the spoken language automatically from the audio stream. You only need to specify the target language, the language you want to hear the output in. The pipeline handles source language detection and model routing without any user input.
Does speech-to-speech translation work offline?
Yes. Bat Spoken Language Identification, Cheetah Streaming Speech-to-Text, Zebra Translate, and Orca Streaming Text-to-Speech all run 100% on-device. Audio and text data are processed locally and never sent to any server. This makes the pipeline suitable for travel, fieldwork, vehicles, and any environment with unreliable or unavailable connectivity.
How is this different from Google Translate or DeepL?
Google Translate and DeepL are cloud services: they transmit audio or text to remote servers, introducing network round-trip latency. Picovoice's pipeline runs entirely on the device, so no audio is transmitted, there is zero network latency at the translation stage, and privacy is fully preserved.
How is this different from Microsoft Translator or Apple Translate?
Microsoft Translator and Apple Translate are not available as on-device, embeddable SDKs for arbitrary third-party products. Picovoice's pipeline is a cross-platform SDK that any developer or OEM can integrate — on Android, iOS, Linux, macOS, Windows, embedded hardware, or in a web browser.
What languages are supported?
Bat Spoken Language Identification supports detection across a wide set of languages. Cheetah Streaming Speech-to-Text supports English, French, German, Spanish, Italian, and Portuguese. Zebra Translate supports language pairs across English, French, German, Spanish, Italian, Portuguese, Japanese, and Korean. Orca Text-to-Speech generates natural speech in English, French, German, Spanish, Italian, Portuguese, Japanese, and Korean. For detailed language support, check the product pages or documentation of each product.
Can I customize the translated voice?
Yes. Orca supports custom pronunciation and speech speed control. Check Orca Text-to-Speech documentation for details.
How can I get technical support for the speech-to-speech translation recipe?