TLDR: The most effective way to implement real-time text-to-speech (TTS) in Python is to use an on-device dual-streaming TTS SDK (streaming TTS for short). Running TTS locally eliminates network latency, and streaming allows speech to start immediately without waiting for the full text, delivering a clear performance edge over cloud-based, non-streaming solutions like Amazon Polly, ElevenLabs, or the OpenAI TTS API for real-time applications.
TTS Approaches: Non-Streaming vs Output Streaming vs Dual Streaming
Non-Streaming TTS (Single Synthesis):
- Waits for complete text input before audio synthesis begins
- Generates entire audio file in one pass, then plays it
- Ideal for batch audio file generation or pre-written content
- Higher latency for interactive applications
Output Streaming TTS:
- Waits for complete text input before audio synthesis begins
- Generates and plays audio in chunks simultaneously
- Reduces playback latency but still requires full text upfront
- Limited benefit for real-time applications with incremental text
Streaming TTS (Dual-Streaming):
- Processes text incrementally as it arrives (input streaming)
- Generates and plays audio in chunks without waiting for full text (output streaming)
- Ideal for LLM-based voice assistants where text arrives token-by-token
Streaming TTS (Dual-Streaming) is the only approach that offers full real-time capability, handling both input and output incrementally, making it true streaming text-to-speech. This is the streaming TTS we build in this tutorial.
Why Streaming TTS is the Best for Real-Time Apps
Streaming TTS or dual-streaming TTS processes text incrementally as it arrives and generates audio in real-time chunks, enabling speech to start without waiting for complete text input. This approach is essential for voice assistants and conversational AI agents where text streams token-by-token from LLMs.
This tutorial shows you how to implement this streaming TTS in Python using Picovoice's Orca Streaming Text-to-Speech, an on-device voice model that runs locally with no cloud dependency and delivers first audio in ~100ms. You'll learn how to stream text input to synthesized speech output with complete control over voice characteristics, speech rate, and audio formatting for building responsive real-time applications.
On-Device Streaming TTS Advantages over Cloud APIs
Before writing the code, let's understand why Orca Streaming Text-to-Speech delivers superior real-time TTS performance compared to cloud-based TTS. It provides:
- Ultra-low latency streaming: Orca Streaming Text-to-Speech takes ~100 ms from first token to speech, compared to ~1470–2850 ms for many cloud TTS services. This makes Orca Streaming Text-to-Speech suitable for real-time voice assistants and conversational agents.
- Faster end-to-end voice assistant responses: In full voice-assistant pipelines with an LLM, Orca Streaming Text-to-Speech achieved ~170 ms response time versus ~1550–2930 ms for typical cloud pipelines.
- Streaming and batch synthesis modes: Generate audio incrementally for real-time TTS playback or synthesize full files in a single call.
- Multiple voice models: Switch voices by selecting different model files at initialization, enabling different speaking styles and characteristics.
- Speech rate control: Adjust speaking speed to match conversational pacing or accessibility requirements.
- On-device processing: All synthesis runs locally, with no cloud dependency, predictable latency, and no audio leaving the device.
- Flexible audio output: Stream directly to speakers, save to WAV, or process audio chunks programmatically in real time.
View available Orca Streaming Text-to-Speech voices and models in the Orca GitHub documentation.
What You'll Build
In this tutorial, you'll implement streaming TTS that:
- Tokenizes text to simulate LLM-style incremental generation
- Synthesizes speech as text arrives token-by-token
- Delivers first audio in ~100ms with continuous playback
- Runs entirely on-device without cloud dependencies
For a deeper look at how real-time TTS fits into agentic systems, see our guide on streaming text-to-speech for AI agents.
Prerequisites
- Python 3.9 or later
- A laptop or desktop with speakers for testing
- A Picovoice AccessKey from the Picovoice Console
Installing Python Dependencies for Streaming Text-to-Speech
Install the following Python packages using pip:
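Assuming the package names used throughout this tutorial, the install step looks like this (pvorca is the Orca SDK, pvspeaker handles audio output, and tiktoken is only needed to simulate LLM tokenization):

```shell
# pvorca: Orca Streaming Text-to-Speech SDK
# pvspeaker: cross-platform audio output
# tiktoken: tokenizer used to simulate token-by-token LLM text
pip install pvorca pvspeaker tiktoken
```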
Building a Real-Time Text-to-Speech App in Python
Understanding the Streaming Process
The code implements real-time text-to-speech in three steps:
- Initialize Orca: Creates the TTS engine with your Picovoice AccessKey
- Stream synthesis: Text tokens are queued and orca_stream.synthesize() returns PCM chunks as they're produced
- Audio playback: Each chunk plays immediately via play_audio_callback while synthesis continues for remaining text
This streaming approach reduces perceived latency compared to waiting for complete audio generation.
Initializing the Streaming Engine
Create an Orca instance and open a streaming synthesizer. The streaming interface is what enables incremental synthesis: you can pass partial text as it becomes available, and Orca Streaming Text-to-Speech will return PCM audio chunks when it has enough context to generate speech.
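A minimal initialization sketch; the create and stream_open calls follow the pvorca SDK, and the helper name init_orca is our own:

```python
def init_orca(access_key, model_path=None):
    """Create the Orca engine and open its streaming synthesizer."""
    import pvorca  # the Orca SDK installed in the dependencies step

    # `model_path` selects an alternative voice model; None uses the default.
    orca = pvorca.create(access_key=access_key, model_path=model_path)
    # The stream accepts partial text and returns PCM once it has enough context.
    orca_stream = orca.stream_open()
    return orca, orca_stream

# Example: orca, orca_stream = init_orca("${ACCESS_KEY}")
```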
Preparing Audio Playback for Real-Time TTS
Real-time text-to-speech requires audio playback to run alongside synthesis so that speech can be played as soon as it is generated. The example below sets up an audio output stream and forwards PCM chunks from Orca Streaming Text-to-Speech directly to the speaker as they arrive.
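One way to set this up, sketched with the pvspeaker package; the constructor arguments shown are assumptions based on its published API, and build_playback is a helper name of our own:

```python
def build_playback(sample_rate, buffer_size_secs=20, device_index=-1):
    """Create the play/flush callbacks that forward PCM to the speaker."""
    from pvspeaker import PvSpeaker  # installed in the dependencies step

    speaker = PvSpeaker(
        sample_rate=sample_rate,      # must match Orca's output sample rate
        bits_per_sample=16,           # Orca produces 16-bit PCM
        buffer_size_secs=buffer_size_secs,
        device_index=device_index,
    )
    speaker.start()

    def play_audio_callback(pcm):
        # Forward each PCM chunk to the speaker as soon as it is synthesized.
        speaker.write(pcm)

    def flush_audio_callback():
        # Drain any buffered audio, then release the device.
        speaker.flush()
        speaker.stop()
        speaker.delete()

    return play_audio_callback, flush_audio_callback
```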
The play_audio_callback writes each PCM chunk to the speaker as it arrives. The flush_audio_callback handles draining any remaining audio once synthesis is complete. These callbacks are passed into the synthesis thread, keeping audio playback decoupled from text processing.
Streaming Text into the TTS Engine
To simulate how an LLM emits tokens, we need to break input text into small chunks that arrive incrementally. The tokenize_text function splits text into subword tokens using tiktoken (OpenAI's tokenizer), which mirrors realistic token-by-token delivery from language models.
Why tokenization matters: LLMs don't generate complete sentences at once—they produce tokens sequentially. Our tokenizer replicates this behavior so Orca receives text the same way it would from a real LLM.
If tiktoken is unavailable, a character-level fallback tokenizer is used. In production, replace this with your LLM’s actual output stream.
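A sketch of tokenize_text along those lines; "gpt2" is one of tiktoken's built-in encoding names, and any failure (missing package, no network for the encoding files) triggers the character-level fallback:

```python
def tokenize_text(text):
    """Split text into small tokens to simulate LLM-style incremental output."""
    try:
        import tiktoken

        encoder = tiktoken.get_encoding("gpt2")
        # Decode each token id individually to get incremental text fragments.
        return [encoder.decode([token_id]) for token_id in encoder.encode(text)]
    except Exception:
        # Character-level fallback when tiktoken is unavailable.
        return list(text)
```

Because BPE tokenization is lossless, joining the fragments reconstructs the original text regardless of which path is taken.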
Now that we can tokenize text incrementally, we need a way to feed these tokens into Orca Streaming Text-to-Speech while simultaneously playing audio. This requires running synthesis and playback concurrently in separate threads.
Generating and Playing Audio Incrementally
The OrcaThread class runs Orca's streaming synthesizer in a background thread. Text tokens are pushed into a queue, synthesized into PCM audio, and buffered before being sent to the audio output. This decouples text input from audio playback, allowing both to run concurrently. The audio_wait_chunks parameter controls how many PCM chunks to buffer before playback begins, which can help smooth audio output on slower devices.
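A stdlib-only sketch of that pattern; the real demo differs in details, and orca_stream here is anything whose synthesize(text) returns a PCM chunk or None:

```python
import queue
import threading

class OrcaThread:
    """Run streaming synthesis in a background thread (sketch)."""

    def __init__(self, orca_stream, play_audio_callback, audio_wait_chunks=1):
        self._orca_stream = orca_stream
        self._play = play_audio_callback
        self._wait_chunks = audio_wait_chunks
        self._queue = queue.Queue()
        self._thread = threading.Thread(target=self._run)

    def start(self):
        self._thread.start()

    def synthesize(self, text):
        # Push a text token; the worker thread picks it up asynchronously.
        self._queue.put(text)

    def stop(self):
        # Signal end of input and wait for queued tokens to finish.
        self._queue.put(None)
        self._thread.join()

    def _run(self):
        pcm_buffer = []
        playing = False
        while True:
            text = self._queue.get()
            if text is None:
                break
            pcm = self._orca_stream.synthesize(text)
            if pcm is not None:
                pcm_buffer.append(pcm)
            # Hold back `audio_wait_chunks` chunks before starting playback,
            # which smooths output on slower devices.
            if playing or len(pcm_buffer) >= self._wait_chunks:
                playing = True
                for chunk in pcm_buffer:
                    self._play(chunk)
                pcm_buffer = []
        for chunk in pcm_buffer:  # play anything still buffered at shutdown
            self._play(chunk)
```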
Finalizing Streaming and Cleaning Up
When the text stream ends, call flush() to synthesize any remaining buffered context. Then cleanly close the stream and release resources. This matters for long-running apps and repeated sessions.
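A cleanup sketch under the same assumptions; the stream's flush/close and the engine's delete follow the pvorca API, while finalize and the callback names are ours:

```python
def finalize(orca, orca_stream, play_audio_callback, flush_audio_callback):
    """Flush remaining audio, then release the stream and the engine."""
    pcm = orca_stream.flush()      # synthesize any text still buffered inside the stream
    if pcm is not None:
        play_audio_callback(pcm)
    flush_audio_callback()         # drain the speaker
    orca_stream.close()            # release the streaming synthesizer
    orca.delete()                  # release the engine itself
```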
Full Real-Time Text-to-Speech Code in Python
This single script combines all aspects of real-time text-to-speech:
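The full flow can be condensed as follows. It assumes pvorca and pvspeaker are installed and uses the tokenize_text and OrcaThread helpers described above; tokens_per_second is a pacing parameter we introduce to simulate LLM output, so treat this as a sketch rather than the canonical demo:

```python
import time

def main(access_key, text, tokens_per_second=15.0):
    import pvorca
    from pvspeaker import PvSpeaker

    # Engine and streaming synthesizer.
    orca = pvorca.create(access_key=access_key)
    orca_stream = orca.stream_open()

    # Speaker output matching Orca's sample rate and 16-bit PCM.
    speaker = PvSpeaker(sample_rate=orca.sample_rate, bits_per_sample=16)
    speaker.start()

    # Background synthesis thread (OrcaThread as described above).
    worker = OrcaThread(orca_stream, speaker.write, audio_wait_chunks=5)
    worker.start()

    # Feed tokens at an LLM-like pace; synthesis and playback run concurrently.
    for token in tokenize_text(text):
        worker.synthesize(token)
        time.sleep(1.0 / tokens_per_second)

    worker.stop()
    pcm = orca_stream.flush()      # synthesize remaining buffered context
    if pcm is not None:
        speaker.write(pcm)
    speaker.flush()                # drain audio, then release everything
    speaker.stop()
    speaker.delete()
    orca_stream.close()
    orca.delete()

# Example: main("${ACCESS_KEY}", "Streaming TTS starts speaking immediately.")
```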
Before running the code, replace ${ACCESS_KEY} with your Picovoice AccessKey from the Picovoice Console. You now have a working implementation of dual-streaming text-to-speech in Python using Orca Streaming Text-to-Speech.
Common Issues and Solutions
Choppy or stuttering audio?
- Increase audio_wait_chunks to buffer more audio before playback
- Reduce tokens_per_second to give Orca Streaming Text-to-Speech more time to synthesize
Audio delayed or not playing?
- Verify your audio device index with speaker.get_available_devices()
- Check that buffer_size_secs is large enough for your use case
Out of memory errors?
- Reduce buffer_size_secs if processing very long text streams
- Ensure you're calling flush() and delete() after each session
What to Build Next
- Integrate with Picovoice's streaming Speech-to-Text (Cheetah Streaming Speech-to-Text) and on-device LLM (picoLLM) for real-time voice responses
- Add wake word detection with Porcupine Wake Word to create hands-free assistants
Related Resources
To build complete voice applications, explore these guides:
- Streaming Text-to-Speech for AI Agents - Learn how streaming TTS fits into agentic systems
- LLM Voice Assistant Example - Voice pipeline combining speech recognition, LLMs, and streaming TTS
- Orca Streaming Text-to-Speech Python Quick Start
- Orca Streaming Text-to-Speech Python API
Conclusion and Key Takeaways
In this tutorial, you learned how to:
- Stream incremental text input into a TTS engine
- Generate audio in real time
- Play speech as text is produced without waiting for full synthesis to complete
This on-device, streaming approach is ideal for LLM-driven voice assistants, conversational agents, and any application where low-latency audio output is critical. By synthesizing speech as text becomes available, Python applications can deliver responsive, natural voice interactions instead of delayed playback.