Orca Streaming Text-to-Speech

Low-latency text-to-speech for real-time LLM agents

On-device text-to-speech that starts synthesizing as LLMs generate tokens. No cloud required: just 29 MB of peak RAM and a 7 MB model, achieving 130 ms first-token-to-speech latency, 2.6× faster than ElevenLabs Streaming.

29 MB
Peak Memory Usage
130 ms
First-token-to-speech latency
7 MB
Model Size
What is Orca Streaming Text-to-Speech?

The TTS engine that voice AI agents don't have to wait for

Orca Streaming Text-to-Speech is an enterprise-ready on-device streaming text-to-speech engine built for LLM-powered voice AI applications. It synthesizes speech concurrently as LLMs generate tokens, runs entirely offline across platforms, and is private by architecture.

Most text-to-speech engines work in a single mode: wait for the full text, then synthesize. That made sense when text inputs were static. It doesn't work when you're building a voice AI agent backed by an LLM that produces output token by token. Orca instead synthesizes concurrently with token generation, achieving 130 ms first-token-to-speech latency, 2.6× faster than the closest competitor, ElevenLabs. In many interactions, Orca finishes speaking before competing TTS engines even begin.

Developer Experience

Streaming speech synthesis in a few lines of code

Orca's streaming mode accepts text chunks one at a time and returns audio as soon as there is enough context to synthesize. Use the Orca Streaming TTS native SDKs for Python, Node.js, Android, iOS, .NET, C, and Web to feed LLM tokens directly, without any additional buffering logic.
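The pattern can be sketched in a few lines. The class below is a stand-in for an Orca stream, not the actual SDK API (consult the Orca docs for real method names); it only illustrates the contract described above: feed chunks as the LLM emits them, and audio comes back as soon as there is enough context.

```python
# Conceptual sketch of the streaming contract (stub, not the real Orca SDK):
# the application feeds text chunks and receives audio incrementally,
# with no extra buffering logic of its own.

class FakeOrcaStream:
    """Stand-in for an Orca stream: buffers text, emits 'audio' per complete word."""

    def __init__(self):
        self._buffer = ""

    def synthesize(self, chunk):
        # Return synthesized audio for any complete words, else None.
        self._buffer += chunk
        if " " not in self._buffer:
            return None
        ready, self._buffer = self._buffer.rsplit(" ", 1)
        return f"<pcm:{ready}>"  # placeholder for real PCM samples

    def flush(self):
        # Synthesize whatever text remains once the LLM response ends.
        remaining, self._buffer = self._buffer, ""
        return f"<pcm:{remaining}>" if remaining else None


def llm_tokens():
    # Stand-in for an LLM producing output token by token.
    yield from ["Hel", "lo ", "wor", "ld ", "agai", "n"]


stream = FakeOrcaStream()
audio = [pcm for tok in llm_tokens() if (pcm := stream.synthesize(tok))]
final = stream.flush()
if final:
    audio.append(final)
print(audio)
```

The key property is that audio chunks become available while the token generator is still running; a real integration would play each chunk as it arrives instead of collecting them in a list.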

Open-Source Text-to-Speech Benchmark

Lowest latency — proven by a reproducible benchmark

Orca Streaming Text-to-Speech is benchmarked against 15 popular cloud and on-device TTS alternatives using ~200 simulated voice AI agent interactions. Results show Orca is the only TTS engine that responds in less than 200 ms and achieves natural-sounding voices at under 30 MB peak memory.

TTS Latency
Lower is better
  • Orca TTS Streaming: 128 ms
  • ElevenLabs TTS Streaming: 335 ms
  • ESpeak TTS: 1,430 ms
  • ElevenLabs TTS: 1,470 ms
Audio Quality
Listen and compare — grouped by peak memory usage.
Peak Memory Usage < 30 MB: ESpeak, Orca
TTS CPU Utilization
Lower is better
  • Orca Streaming TTS: 0.16×
  • Pocket TTS: 0.37×
  • Piper TTS: 0.54×
TTS Memory Usage
Lower is better
  • Orca Streaming TTS: 29 MB
  • Kitten TTS Nano: 320 MB
  • Supertonic TTS 2: 520 MB
Ready to integrate? Check our docs to start building or talk to the sales team about enterprise deployment.
Capabilities

Why enterprises choose Orca Streaming Text-to-Speech

Orca is an enterprise-ready on-device streaming text-to-speech engine built for low-latency, natural-sounding human-like voice AI interactions with a minimal resource footprint.

01Streaming synthesisTraditional TTS waits for the complete text before synthesizing. Orca processes raw LLM tokens as they arrive and begins generating audio in parallel. This architectural difference enables 130 ms first-token-to-speech, making Orca the only TTS engine below the 200 ms threshold.
0229 MB peak memoryMost high-quality TTS solutions require hundreds of megabytes of RAM. Orca Streaming TTS uses 29 MB peak memory, 10–50× less than any other on-device alternative, except for ESpeak. This makes Orca the only natural-sounding TTS deployable in any environment, including browser tabs, mobile apps with strict out-of-memory limits, and embedded devices.
037 MB model sizeOrca's 7 MB model size is small enough for over-the-air updates and storage-constrained hardware such as embedded and low-tier mobile devices.
04Private by architectureAudio is synthesized entirely on-device. Unlike cloud TTS APIs, no text is sent to remote servers. Orca Streaming TTS is compliant with GDPR, HIPAA, CCPA, and other data residency regulations by architecture, since Picovoice (or any 3rd party on its behalf) cannot access end-user audio or text.
05Custom voices and emotional stylesCustom Orca TTS models support synthesis with emotions and styles, including joy, anger, whispering, and shouting. Voice cloning and brand voice development are available for enterprise customers through Non-Recurring Engineering (NRE) engagements.
06Tunable speech rateSpeech rate is adjustable between 0.7× and 1.3× without retraining, allowing product teams to adjust synthesis speed based on content complexity and user preferences. Faster for simple confirmations, slower for detailed instructions.
07Cross-PlatformOrca Streaming Text-to-Speech runs on every platform your product ships — Android, Chrome, Edge, Firefox, iOS, Linux, macOS, Raspberry Pi, Safari, and Windows — across AMD, Intel, NVIDIA, and Qualcomm hardware.
08Enterprise ReadyOrca Streaming Text-to-Speech is production-grade and enterprise-ready. Picovoice offers flexible licensing, dedicated engineering support, NDA-protected custom voice model training, and SLA-backed response times for teams shipping at scale.

Ship it.
On device.

Low-latency and lightweight streaming text-to-speech for voice AI agents

FAQ

Common questions about streaming text-to-speech

+
What's Streaming Text-to-Speech?

Streaming text-to-speech synthesizes audio from partial, incomplete text input — token by token — rather than waiting for the full text before generating speech. This matters for LLM-powered voice applications because LLMs produce output incrementally. A standard TTS solution waits until the LLM finishes before synthesizing; streaming TTS like Orca begins synthesizing as the first tokens arrive and generates audio in parallel with the LLM. The result is dramatically lower latency — Orca achieves 130 ms first-token-to-speech versus 1,400 ms or more for other alternatives.
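The latency difference comes from simple arithmetic: a sequential pipeline pays the full LLM generation time before synthesis starts, while a streaming pipeline only waits for the first few tokens. The numbers below are illustrative stand-ins, not benchmark results.

```python
# Illustrative latency arithmetic for sequential vs. streaming TTS pipelines.
# All figures are made-up examples, not measurements.

llm_time_per_token_ms = 30    # assumed LLM decoding speed
response_tokens = 60          # assumed length of the LLM response
tts_first_chunk_ms = 100      # assumed time for TTS to emit its first audio
tokens_needed_to_start = 3    # assumed context a streaming TTS needs to begin

# Sequential: wait for the entire response, then synthesize.
sequential_ms = response_tokens * llm_time_per_token_ms + tts_first_chunk_ms

# Streaming: synthesis starts once the first few tokens exist,
# then proceeds in parallel with the rest of the LLM generation.
streaming_ms = tokens_needed_to_start * llm_time_per_token_ms + tts_first_chunk_ms

print(sequential_ms)  # 1900
print(streaming_ms)   # 190
```

Under these assumptions the streaming pipeline speaks 10× sooner; the gap grows with longer LLM responses, since only the sequential pipeline pays for every additional token before speech begins.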

+
What are the use cases and applications of Streaming Text-to-Speech?

Streaming text-to-speech is used in LLM-powered voice AI agents and assistants, IVR systems, healthcare applications with hands-free patient communication, e-learning platforms with AI tutors, customer service bots with real-time voice responses, automotive in-cabin assistants, and any application where a voice AI agent needs to respond without perceptible delay.

+
How does Orca Streaming Text-to-Speech differ from other Text-to-Speech solutions offered for real-time interactions?

The terms "streaming" and "real-time" Text-to-Speech (TTS) have been used loosely and often inaccurately, leading to confusion about what they actually mean. Orca Streaming Text-to-Speech, like a human continuously processing text, reads streaming text inputs out loud as they appear.

Most current "real-time" TTS solutions can stream audio as it is generated, before the full audio file has been created. They were designed for legacy Natural Language Processing (NLP) techniques that generate output all at once. Today's Large Language Models (LLMs) work differently: they produce text incrementally, token by token. Orca Streaming Text-to-Speech streams audio output like traditional TTS solutions, but can also process streaming text input generated on a token-by-token basis.

+
Does Orca Streaming Text-to-Speech work offline?

Yes. Orca synthesizes audio entirely on-device. No text is sent to a cloud TTS API, no audio is transmitted to any server. It operates in air-gapped environments, remote deployments, and bandwidth-constrained hardware.

+
How does Orca Streaming Text-to-Speech compare to ElevenLabs Streaming?

Both Orca and ElevenLabs offer streaming TTS, but they differ architecturally. ElevenLabs Streaming TTS processes streaming input by chunking text at punctuation marks and sending those segments to a cloud API via WebSocket. Orca processes raw LLM tokens directly on-device without requiring punctuation boundaries or a network connection.
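A minimal sketch of punctuation-based chunking makes the difference concrete. This is a simplified illustration of the pattern described above, not ElevenLabs' actual implementation: text is held back until a sentence boundary, so no synthesis request can be sent before the first `.`, `!`, or `?` arrives.

```python
# Sketch of punctuation-boundary chunking, the strategy used by cloud
# streaming TTS pipelines (simplified illustration, not a real API).

def punctuation_chunks(tokens):
    """Yield text segments only when a sentence-ending mark is seen."""
    buffer = ""
    for tok in tokens:
        buffer += tok
        if buffer and buffer[-1] in ".!?":
            yield buffer  # only now could a synthesis request be sent
            buffer = ""
    if buffer:
        yield buffer      # trailing text without terminal punctuation

tokens = ["Sure", ",", " I", " can", " help", ".", " What", " next", "?"]
chunks = list(punctuation_chunks(tokens))
print(chunks)  # ['Sure, I can help.', ' What next?']
```

Every token before the first period is buffered, which is why punctuation-gated pipelines add latency that a raw-token consumer like Orca avoids.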

This architectural difference allows Orca to achieve 130 ms first-token-to-speech versus ElevenLabs Streaming's 340 ms — 2.6× faster. Orca also runs entirely offline, eliminating cloud latency and third-party audio data processing.

You can reproduce the open-source benchmark to measure ElevenLabs Streaming TTS latency.

+
How does Orca Streaming Text-to-Speech compare to ElevenLabs TTS (non-streaming)?

ElevenLabs TTS (non-streaming) requires the full text input before beginning synthesis. In the open-source TTS benchmark, ElevenLabs TTS achieves 1,470 ms first-token-to-speech, which is 11× slower than Orca's 130 ms. ElevenLabs is a cloud-only TTS solution, so independent efficiency measurements of memory, CPU, and model size are not possible.

You can reproduce the open-source benchmark to measure ElevenLabs TTS latency.

+
How does Orca Streaming Text-to-Speech compare to OpenAI TTS?

OpenAI TTS is a cloud API that does not support streaming input. It requires the complete text before synthesizing, making it ill-suited for LLM agents. In the open-source TTS benchmark, OpenAI TTS achieves 2,850 ms first-token-to-speech latency, which is 22× slower than Orca's 130 ms. OpenAI TTS is a cloud-only solution, so independent efficiency measurements of memory, CPU, and model size are not possible.

You can reproduce the open-source benchmark to measure OpenAI TTS latency.

+
How does Orca Streaming Text-to-Speech compare to Amazon Polly?

Amazon Polly is the second-best cloud performer after ElevenLabs in the open-source TTS benchmark, with 1,540 ms latency. As a cloud TTS API, it does not support streaming input from LLM token streams, making it 12× slower than Orca, which eliminates the cloud round-trip entirely by running offline and starts synthesizing speech from LLM tokens as soon as they provide enough context.

You can reproduce the open-source benchmark to measure Amazon Polly TTS latency.

+
How does Orca Streaming Text-to-Speech compare to Azure Text-to-Speech?

Azure Text-to-Speech is a cloud API that, like Amazon Polly and OpenAI TTS, does not support raw LLM token streaming. Azure TTS requires a network connection, and its compute runs on Microsoft's infrastructure, preventing independent efficiency measurement. In the same open-source TTS benchmark, Azure TTS latency is measured at 1,580 ms; it is subject to network conditions and cannot synthesize concurrently with LLM token generation. For voice AI agent applications requiring low latency, on-device processing, or offline capability, Orca, which achieves 130 ms first-token-to-speech without a network dependency, is the more suitable choice.

You can reproduce the open-source benchmark to measure Azure TTS latency.

+
How does Orca Streaming Text-to-Speech compare to ESpeak-NG?

ESpeak-NG is the second-fastest on-device alternative after Orca, with a first-token-to-speech latency of 1,431 ms, approximately 11× slower than Orca.

ESpeak TTS checks many boxes: model size, memory usage, and even latency for non-LLM applications. The primary tradeoff is audio quality: ESpeak TTS produces robotic, synthetic speech, while Orca produces natural, human-like speech. ESpeak TTS is suitable only for highly constrained environments where user experience is not a concern; for applications prioritizing user experience (audio quality and latency), Orca is the better choice, even in highly constrained environments.

You can reproduce the open-source benchmark to compare Espeak TTS with other alternatives.

+
How does Orca Streaming Text-to-Speech compare to Chatterbox TTS Turbo?

Chatterbox TTS Turbo produces high-quality voices, but it is the least efficient TTS in the open-source TTS benchmark by every measure. It does not support streaming audio output: it must fully synthesize all audio before playback begins. That's why Chatterbox needs 44,708 ms (~45 seconds) from first token to speech. Chatterbox TTS also requires 7,500 MB (7.5 GB) peak memory and 19 CPU core-hours to synthesize one hour of audio, and its model size is 2,980 MB. Orca achieves 130 ms first-token-to-speech, almost 350× faster than Chatterbox, using 29 MB peak memory (260× less) with a 7 MB model.

You can reproduce the open-source benchmark to see Chatterbox TTS compute requirements and latency.

+
How does Orca Streaming Text-to-Speech compare to Kitten TTS Nano?

Kitten TTS Nano does not support streaming audio output. It must fully synthesize all audio before playback begins, producing significant latency (~11 seconds) for voice AI applications, as shown in the open-source TTS benchmark. Kitten TTS Nano requires 320 MB peak memory, placing it in the mid-range memory group, unsuitable for low-end mobile or embedded deployment. Orca uses 29 MB peak memory, approximately 11× less than Kitten TTS Nano, and achieves 130 ms first-token-to-speech, approximately 85× faster than Kitten TTS Nano.

You can reproduce the open-source benchmark to compare Kitten TTS Nano against other alternatives.

+
How does Orca Streaming Text-to-Speech compare to Kokoro TTS?

Kokoro TTS is an on-device open-source TTS that does not support streaming input. In the open-source TTS benchmark, Kokoro TTS requires over 1 GB peak memory (vs. Orca's 29 MB), placing it in the group unsuitable for mobile and embedded deployment. It cannot synthesize concurrently with LLM token generation, adding the full synthesis delay on top of LLM latency, resulting in 2,920 ms latency (vs. Orca's 130 ms). For LLM-powered voice AI agents, or any application that runs on mobile, Raspberry Pi, or any memory-constrained environment, Orca is much better suited than Kokoro.

You can reproduce the open-source benchmark to compare Kokoro TTS against other alternatives.

+
How does Orca Streaming Text-to-Speech compare to Neu TTS Nano?

Neu TTS Nano (Q4 GGUF), despite being a quantized version, requires over 1 GB peak memory, making it one of the most resource-intensive TTS alternatives. It does not support streaming input and starts playing audio after more than a 2,551 ms delay, as shown in the open-source TTS benchmark. For voice AI agents and applications on consumer mobile devices or embedded hardware, Neu TTS Nano is not a deployable option next to Orca's 130 ms first-token-to-speech, 29 MB peak memory, and 7 MB model size. Orca operates in a completely different resource class while delivering speech with much lower latency.

You can reproduce the open-source benchmark to see the Neu TTS Nano compute requirements and latency.

+
How does Orca Streaming Text-to-Speech compare to Piper TTS?

Piper TTS is an on-device open-source TTS with over 1 GB peak memory and no streaming input support. It falls in the benchmark's highest memory group, making it a poor fit for low-to-mid-range mobile devices or embedded systems. Orca, with its 29 MB peak memory and streaming input support, achieves 130 ms first-token-to-speech. For production voice AI agents, Piper TTS introduces latency and memory constraints that Orca does not.

You can reproduce the open-source benchmark to compare Piper TTS against other alternatives.

+
How does Orca Streaming Text-to-Speech compare to Pocket TTS?

Pocket TTS is an on-device open-source TTS requiring 620 MB peak memory, suitable only for high-end mobile and desktop environments. It does not support streaming input, so unlike ElevenLabs Streaming and Orca, it cannot begin synthesizing until the LLM finishes, resulting in 1,670 ms latency: slightly slower than Amazon Polly and Azure TTS but faster than OpenAI TTS, as shown in the open-source TTS benchmark. Pocket TTS is also one of the most compute-efficient options, with a CPU core-hour ratio of 0.38×, yet that is still 2.3× more compute than Orca's 0.16×. For voice AI agents where latency and resource efficiency both matter, Orca is the stronger choice across all measured dimensions.

You can reproduce the open-source benchmark to compare Pocket TTS vs other alternatives.

+
How does Orca Streaming Text-to-Speech compare to Soprano TTS?

Soprano TTS needs 710 MB of RAM to synthesize speech, limiting it to high-end mobile and desktop deployments. Its architecture does not support streaming input, and it requires 5.7× CPU core-hours, hindering user experience in real-time applications. For any latency-sensitive application or deployment targeting mid-range or lower devices (Android, iOS, Raspberry Pi, web browsers), Orca is the better-suited choice: it supports both streaming input and output, operates within the sub-30 MB memory threshold, and requires 0.16× CPU core-hours.

You can reproduce the open-source benchmark to compare Soprano TTS vs other alternatives.

+
How does Orca Streaming Text-to-Speech compare to Supertonic TTS 2?

Supertonic TTS 2 requires 520 MB peak memory, making it viable for mid-range and higher mobile devices but not embedded or web deployments. It does not support streaming input, resulting in 2,450 ms latency. Orca uses 29 MB peak memory, almost 20× less than Supertonic, and supports concurrent synthesis with LLM token generation, beginning audio playback within 130 ms, almost 20× faster than Supertonic.

You can reproduce the open-source benchmark to compare Supertonic TTS vs other alternatives.

+
Which LLMs does Orca Streaming Text-to-Speech support?

Orca Streaming Text-to-Speech works with all large language models, both closed-source and open.

Some examples of closed-source large language models Orca Streaming TTS supports:

  • OpenAI GPT-4, OpenAI GPT-3.5 Turbo, OpenAI GPT-3.5, OpenAI GPT-3
  • Anthropic Claude, Anthropic Claude 2, Anthropic Claude 3 Sonnet, Anthropic Claude 3 Opus, Anthropic Claude 3 Haiku
  • Cohere Coral

Some examples of open large language models Orca Streaming TTS supports:

  • Llama
  • Falcon
  • Gemma
  • Grok
  • Mistral
  • Mixtral
  • Phi
  • DBRX

+
Does Orca Streaming Text-to-Speech support async processing?

Yes. In addition to streaming input, Orca can process predefined (i.e., static) text and convert it into audio streams or audio recordings. Developers can convert predefined text into audio as a complete file or by streaming audio incrementally. Please visit our docs for more information.

+
Which platforms does Orca Streaming Text-to-Speech support?

Orca Streaming Text-to-Speech runs on Android, Chrome, Edge, Firefox, iOS, Linux, macOS, Raspberry Pi, Safari, and Windows, across AMD, Intel, NVIDIA, and Qualcomm hardware.

+
Which languages does Orca Streaming Text-to-Speech support?

Orca Streaming Text-to-Speech offers English, French, German, Italian, Japanese, Korean, Portuguese, and Spanish voice models. Contact sales with the details of your project if you have an immediate need.

+
How can I generate custom voices using Orca Streaming Text-to-Speech?

Contact sales to explore our white-glove support services, including customization of Orca Streaming Text-to-Speech for brands that want to represent their "voice" via unique, custom voices.

+
Does Orca Streaming Text-to-Speech allow voice tuning?

Orca Streaming Text-to-Speech base model allows developers to adjust the speed of the selected voice. Custom Orca Streaming Text-to-Speech models can be leveraged for further voice tuning. Contact Sales with your project requirements and get a custom Text-to-Speech model that fits your needs.

+
Can I add custom industry jargon and terminology to Orca Streaming Text-to-Speech?

Orca Streaming Text-to-Speech can be application, company, domain, or industry-specific with custom vocabulary.

+
Does Orca Streaming Text-to-Speech generate voices with different emotions?

Custom Orca Streaming Text-to-Speech models generate voices with emotions and styles, including joy, anger, whispering, and shouting. Contact sales with your project requirements to get your custom text-to-speech model trained.

+
How do I get technical support for Orca Streaming Text-to-Speech?

Picovoice docs, blog, Medium posts, and GitHub are great resources to learn about voice AI, Picovoice technology, and how to add AI-generated voice to your product. Enterprise customers get dedicated support specific to their applications from Picovoice Product & Engineering teams. Reach out to your Picovoice contact or contact sales to discuss support options.

+
How can I get informed about updates and upgrades?

Version changes are announced on LinkedIn. Subscribing to the GitHub repository is the best way to get notified of patch releases. If you enjoy building with Orca Streaming Text-to-Speech, show it by giving a GitHub star!