
TLDR: The most effective way to implement real-time text-to-speech (TTS) in Python is to use an on-device dual-streaming SDK (streaming TTS for short). Running TTS locally eliminates network latency, and streaming lets speech start immediately without waiting for the full text, delivering a clear performance edge over cloud-based, non-streaming solutions such as Amazon Polly, ElevenLabs, or the OpenAI TTS API for real-time applications.

TTS Approaches: Non-Streaming vs Output Streaming vs Dual Streaming

Non-Streaming TTS (Single Synthesis):

  • Waits for complete text input before audio synthesis begins
  • Generates entire audio file in one pass, then plays it
  • Ideal for batch audio file generation or pre-written content
  • Higher latency for interactive applications

Output Streaming TTS:

  • Waits for complete text input before audio synthesis begins
  • Generates and plays audio in chunks simultaneously
  • Reduces playback latency but still requires full text upfront
  • Limited benefit for real-time applications with incremental text

Streaming TTS (Dual-Streaming):

  • Processes text incrementally as it arrives (input streaming)
  • Generates and plays audio in chunks without waiting for full text (output streaming)
  • Ideal for LLM-based voice assistants where text arrives token-by-token

Dual-streaming TTS is the only approach with full real-time capability: it handles both input and output incrementally, making it the definitive form of streaming text-to-speech. This is the streaming TTS we build in this tutorial.

Why Streaming TTS is the Best for Real-Time Apps

Streaming TTS, also called dual-streaming TTS, processes text incrementally as it arrives and generates audio in real-time chunks, so speech can start before the complete text input is available. This approach is essential for voice assistants and conversational AI agents where text streams token-by-token from LLMs.

This tutorial shows you how to implement this streaming TTS in Python using Picovoice's Orca Streaming Text-to-Speech, an on-device voice model that runs locally with no cloud dependency and delivers first audio in ~100ms. You'll learn how to stream text input to synthesized speech output with complete control over voice characteristics, speech rate, and audio formatting for building responsive real-time applications.

On-Device Streaming TTS Advantages over Cloud APIs

Before writing the code, let's understand why Orca Streaming Text-to-Speech delivers superior real-time TTS performance compared to cloud-based TTS. It provides:

  • Ultra-low latency streaming: Orca Streaming Text-to-Speech takes ~100 ms for first-token-to-speech, compared to ~1470–2850 ms for many cloud TTS services. This makes Orca Streaming Text-to-Speech suitable for real-time voice assistants and conversational agents.
  • Faster end-to-end voice assistant responses: In full voice-assistant pipelines with an LLM, Orca Streaming Text-to-Speech achieved ~170 ms response time versus ~1550–2930 ms for typical cloud pipelines.
  • Streaming and batch synthesis modes: Generate audio incrementally for real-time TTS playback or synthesize full files in a single call.
  • Multiple voice models: Switch voices by selecting different model files at initialization, enabling different speaking styles and characteristics.
  • Speech rate control: Adjust speaking speed to match conversational pacing or accessibility requirements.
  • On-device processing: All synthesis runs locally, with no cloud dependency, predictable latency, and no audio leaving the device.
  • Flexible audio output: Stream directly to speakers, save to WAV, or process audio chunks programmatically in real time.

View available Orca Streaming Text-to-Speech voices and models in the Orca GitHub documentation.

What You'll Build

In this tutorial, you'll implement streaming TTS that:

  • Tokenizes text to simulate LLM-style incremental generation
  • Synthesizes speech as text arrives token-by-token
  • Delivers first audio in ~100ms with continuous playback
  • Runs entirely on-device without cloud dependencies

For a deeper look at how real-time TTS fits into agentic systems, see our guide on streaming text-to-speech for AI agents.

Prerequisites

  • Python 3.9 or later
  • A laptop or desktop with speakers for testing
  • A Picovoice AccessKey from the Picovoice Console

Installing Python Dependencies for Streaming Text-to-Speech

Install the following Python packages using pip:
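Based on the libraries used later in this tutorial (the Orca engine, pvspeaker for audio playback, and tiktoken for LLM-style tokenization), the dependencies would be:

```shell
pip3 install pvorca pvspeaker tiktoken
```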

Building a Real-Time Text-to-Speech App in Python

Understanding the Streaming Process

The code implements real-time text-to-speech in three steps:

  1. Initialize Orca: Creates the TTS engine with your Picovoice AccessKey
  2. Stream synthesis: Text tokens are queued and orca_stream.synthesize() returns PCM chunks as they're produced
  3. Audio playback: Each chunk plays immediately via play_audio_callback while synthesis continues for remaining text

This streaming approach reduces perceived latency compared to waiting for complete audio generation.

Initializing the Streaming Engine

Create an Orca instance and open a streaming synthesizer. The streaming interface is what enables incremental synthesis: you can pass partial text as it becomes available, and Orca Streaming Text-to-Speech will return PCM audio chunks when it has enough context to generate speech.
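A minimal initialization sketch, assuming the pvorca package and a valid AccessKey. The `create_engine` helper name is ours; `model_path` is optional and selects the voice model file:

```python
def create_engine(access_key, model_path=None):
    # Imported here so this sketch can be loaded without the SDK installed.
    import pvorca

    # Create the Orca engine; pass model_path to choose a different voice.
    orca = pvorca.create(access_key=access_key, model_path=model_path)

    # Open the incremental (dual-streaming) synthesizer. stream_open() also
    # accepts a speech_rate argument to adjust speaking pace.
    stream = orca.stream_open()
    return orca, stream
```

From here on, `stream.synthesize(text)` accepts partial text and returns a PCM chunk once Orca has enough context.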

Preparing Audio Playback for Real-Time TTS

Real-time text-to-speech requires audio playback to run alongside synthesis so that speech can be played as soon as it is generated. The example below sets up an audio output stream and forwards PCM chunks from Orca Streaming Text-to-Speech directly to the speaker as they arrive.
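A sketch of that playback setup, assuming the pvspeaker package; the `create_playback` helper name is ours, while the callback names match the ones used below:

```python
def create_playback(sample_rate, buffer_size_secs=20):
    # Imported here so this sketch can be loaded without pvspeaker installed.
    from pvspeaker import PvSpeaker

    # Orca produces 16-bit PCM; match the speaker to the engine's
    # sample rate (exposed as orca.sample_rate).
    speaker = PvSpeaker(
        sample_rate=sample_rate,
        bits_per_sample=16,
        buffer_size_secs=buffer_size_secs)
    speaker.start()

    def play_audio_callback(pcm):
        # Forward each PCM chunk to the speaker as soon as it arrives.
        speaker.write(pcm)

    def flush_audio_callback():
        # Drain whatever audio is still queued once synthesis is complete.
        speaker.flush()
        speaker.stop()

    return speaker, play_audio_callback, flush_audio_callback
```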

The play_audio_callback writes each PCM chunk to the speaker as it arrives. The flush_audio_callback handles draining any remaining audio once synthesis is complete. These callbacks are passed into the synthesis thread, keeping audio playback decoupled from text processing.

Streaming Text into the TTS Engine

To simulate how an LLM emits tokens, we need to break input text into small chunks that arrive incrementally. The tokenize_text function splits text into subword tokens using tiktoken (OpenAI's tokenizer), which mirrors realistic token-by-token delivery from language models.

Why tokenization matters: LLMs don't generate complete sentences at once—they produce tokens sequentially. Our tokenizer replicates this behavior so Orca receives text the same way it would from a real LLM.

If tiktoken is unavailable, a character-level fallback tokenizer is used. In production, replace this with your LLM’s actual output stream.
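A sketch of such a tokenizer, using tiktoken when available and falling back to characters otherwise:

```python
def tokenize_text(text):
    """Split text into LLM-style subword tokens, as an LLM would emit them."""
    try:
        import tiktoken  # OpenAI's BPE tokenizer
        encoder = tiktoken.get_encoding("cl100k_base")
    except Exception:
        # Character-level fallback when tiktoken (or its BPE data) is unavailable.
        return list(text)
    # Decode each token id individually so we get back text fragments
    # in the same order a language model would produce them.
    return [encoder.decode([t]) for t in encoder.encode(text)]
```

Joining the returned fragments reconstructs the original text, so downstream synthesis sees exactly the input string, just delivered piecewise.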

Now that we can tokenize text incrementally, we need a way to feed these tokens into Orca Streaming Text-to-Speech while simultaneously playing audio. This requires running synthesis and playback concurrently in separate threads.

Generating and Playing Audio Incrementally

The OrcaThread class runs Orca's streaming synthesizer in a background thread. Text tokens are pushed into a queue, synthesized into PCM audio, and buffered before being sent to the audio output. This decouples text input from audio playback, allowing both to run concurrently. The audio_wait_chunks parameter controls how many PCM chunks to buffer before playback begins, which can help smooth audio output on slower devices.
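A minimal, duck-typed sketch of such a worker: `stream` can be the handle returned by `orca.stream_open()` (anything with `synthesize(text)` and `flush()` returning a PCM chunk or `None`), and the parameter names follow the tutorial:

```python
import queue
import threading


class OrcaThread:
    """Background synthesis worker that decouples text input from playback."""

    def __init__(self, stream, play_audio_callback, flush_audio_callback,
                 audio_wait_chunks=1):
        self._stream = stream
        self._play = play_audio_callback
        self._flush_audio = flush_audio_callback
        self._wait_chunks = audio_wait_chunks
        self._queue = queue.Queue()
        self._buffer = []       # PCM chunks held back until playback starts
        self._started = False
        self._thread = threading.Thread(target=self._run)
        self._thread.start()

    def _emit_ready(self):
        # Hold audio until enough chunks are buffered to avoid underruns.
        if not self._started and len(self._buffer) < self._wait_chunks:
            return
        self._started = True
        for chunk in self._buffer:
            self._play(chunk)
        self._buffer.clear()

    def _run(self):
        while True:
            text = self._queue.get()
            if text is None:  # sentinel pushed by stop(): drain and exit
                pcm = self._stream.flush()
                if pcm is not None and len(pcm) > 0:
                    self._buffer.append(pcm)
                self._started = True  # force out whatever is still buffered
                self._emit_ready()
                self._flush_audio()
                return
            pcm = self._stream.synthesize(text)
            if pcm is not None and len(pcm) > 0:
                self._buffer.append(pcm)
                self._emit_ready()

    def synthesize(self, text):
        # Called from the producer side (e.g. the LLM token loop).
        self._queue.put(text)

    def stop(self):
        self._queue.put(None)
        self._thread.join()
```

The producer simply calls `synthesize(token)` for each incoming token and `stop()` when the text stream ends; everything else happens on the background thread.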

Finalizing Streaming and Cleaning Up

When the text stream ends, call flush() to synthesize any remaining buffered context. Then cleanly close the stream and release resources. This matters for long-running apps and repeated sessions.
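A sketch of that shutdown sequence, assuming the pvorca and pvspeaker handles from the earlier steps (the `finalize` helper name is ours):

```python
def finalize(orca, stream, speaker):
    # Synthesize whatever text is still buffered inside the stream.
    pcm = stream.flush()
    if pcm is not None and len(pcm) > 0:
        speaker.write(pcm)
    speaker.flush()   # wait for queued audio to finish playing
    speaker.stop()
    stream.close()    # close the streaming synthesizer
    orca.delete()     # release the engine's resources
    speaker.delete()
```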

Full Real-Time Text-to-Speech Code in Python

This single script combines all aspects of real-time text-to-speech:
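The original script is not reproduced on this page; the sketch below wires the preceding sections together under the same assumptions (pvorca and pvspeaker installed, tiktoken optional; `tokens_per_second` and `audio_wait_chunks` are the tutorial's parameters):

```python
import time


def tokenize_text(text):
    """LLM-style subword tokens via tiktoken, with a character fallback."""
    try:
        import tiktoken
        enc = tiktoken.get_encoding("cl100k_base")
    except Exception:  # tiktoken missing or its BPE data unavailable
        return list(text)
    return [enc.decode([t]) for t in enc.encode(text)]


def main(access_key, text, tokens_per_second=15, audio_wait_chunks=2):
    # Deferred imports keep the sketch importable without the SDKs installed.
    import pvorca
    from pvspeaker import PvSpeaker

    orca = pvorca.create(access_key=access_key)
    stream = orca.stream_open()
    speaker = PvSpeaker(sample_rate=orca.sample_rate, bits_per_sample=16)
    speaker.start()

    try:
        buffered = []      # hold early chunks to smooth the playback start
        started = False
        for token in tokenize_text(text):
            pcm = stream.synthesize(token)
            if pcm is not None and len(pcm) > 0:
                buffered.append(pcm)
            if buffered and (started or len(buffered) >= audio_wait_chunks):
                started = True
                for chunk in buffered:
                    speaker.write(chunk)
                buffered.clear()
            time.sleep(1.0 / tokens_per_second)  # simulate LLM token pacing

        pcm = stream.flush()  # synthesize any remaining buffered context
        if pcm is not None and len(pcm) > 0:
            buffered.append(pcm)
        for chunk in buffered:
            speaker.write(chunk)
        speaker.flush()  # let queued audio finish playing
    finally:
        speaker.stop()
        stream.close()
        orca.delete()
        speaker.delete()


if __name__ == "__main__":
    main("${ACCESS_KEY}",
         "Streaming TTS lets speech start before the full text is known.")
```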

Before running the code, replace ${ACCESS_KEY} with your Picovoice AccessKey from the Picovoice Console. You now have a working implementation of streaming text-to-speech (dual-streaming) in Python using Orca Streaming Text-to-Speech.

Common Issues and Solutions

Choppy or stuttering audio?

  • Increase audio_wait_chunks to buffer more audio before playback
  • Reduce tokens_per_second to give Orca Streaming Text-to-Speech more time to synthesize

Audio delayed or not playing?

  • Verify your audio device index with speaker.get_available_devices()
  • Check that buffer_size_secs is large enough for your use case

Out of memory errors?

  • Reduce buffer_size_secs if processing very long text streams
  • Ensure you're calling flush() and delete() after each session

What to Build Next

To build complete voice applications, explore Picovoice's related guides.

Conclusion and Key Takeaways

In this tutorial, you learned how to:

  • Stream incremental text input into a TTS engine
  • Generate audio in real time
  • Play speech as text is produced without waiting for full synthesis to complete

This on-device, streaming approach is ideal for LLM-driven voice assistants, conversational agents, and any application where low-latency audio output is critical. By synthesizing speech as text becomes available, Python applications can deliver responsive, natural voice interactions instead of delayed playback.

Start Building

Frequently Asked Questions

What is streaming TTS?
Streaming TTS refers to text-to-speech that generates audio continuously rather than waiting to create a complete audio file. However, there are two types: output streaming (which still requires complete text input) and dual streaming (which processes text incrementally as it arrives). For real-time applications like voice assistants, dual-streaming TTS enables speech to start immediately as text arrives token-by-token.
How do I implement real-time TTS in Python?
For the best performance in real-time applications, use a dual-streaming TTS SDK that runs locally. Orca Streaming Text-to-Speech is designed for this use case. It begins synthesizing audio as soon as text arrives and delivers first speech output in approximately 100ms.
How to enable TTS on stream?
To enable TTS on stream, integrate a text-to-speech SDK that processes text in real time into your streaming application. Orca Streaming Text-to-Speech, a dual-streaming TTS model, is available across multiple platforms (Python, Web, iOS, Android, etc.) and delivers audio within ~100 ms of receiving text. The key is local processing, which avoids the cloud API latency that would create noticeable delays between text input and voice output during live streams or interactive applications.
What are the most common TTS technologies?
Common TTS technologies include non-streaming (single synthesis), output streaming, and dual-streaming approaches, each suited to different application needs. Non-streaming TTS generates complete audio files from text input, output streaming plays audio progressively, and dual-streaming processes text incrementally for real-time speech. Orca Streaming Text-to-Speech supports non-streaming, output streaming, and dual-streaming modes across multiple platforms (Python, JavaScript, Android, iOS, etc.), giving developers flexibility to choose the right approach for their specific use case.