Text-to-Speech (TTS) technology has come a long way. In the era of Large Language Models (LLMs), TTS systems must be able to handle an input stream of text and convert it to consistent audio in real time. This streaming TTS functionality is essential for building responsive voice applications, particularly in the context of LLM-based voice assistants that require minimal latency (see our previous article).
A typical voice assistant system integrates three key components:
- A Streaming Speech-to-Text system (such as Picovoice's Cheetah Streaming Speech-to-Text) for user input.
- A text generator (such as Picovoice's picoLLM Inference) for processing and responding.
- A TTS engine that converts the generated text to audio, ideally supporting streaming input and output, such as Picovoice's Orca Streaming Text-to-Speech.
In this blog post, we'll focus on the third component and implement a streaming TTS system in Python using Orca Streaming Text-to-Speech.
Dependencies for a Streaming TTS System
Let's cover the necessary dependencies and setup:
- Install Python: We use version 3.8 or higher. Test whether the installation was successful (see the commands after this list).
- Install the required Python packages using pip (also shown below).
- Sign up for Picovoice Console: Create a Picovoice Console account and copy your AccessKey from the dashboard. Creating an account is free, and no credit card is required.
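As a sketch of these setup steps (the package names are our assumption: pvorca for the Orca SDK, plus numpy and sounddevice, which the playback code later in this post relies on):

```bash
# Verify the Python installation (3.8 or higher is required).
python3 --version

# Install the Orca SDK and the audio playback dependencies.
pip3 install pvorca numpy sounddevice
```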
Building a Simple Streaming TTS App
Defining a streaming TTS worker
First, we define an orca_worker function. This function sets up the TTS engine and manages the audio stream. It processes text chunks as they arrive and plays back the generated audio in real time. We will run the worker function in a separate process to avoid blocking the main application.
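Here is a minimal sketch of such a worker. It assumes the pvorca package's stream_open/synthesize/flush interface for synthesis; the sounddevice library for playback and the command-message protocol on the queue are our own conventions for this example:

```python
from multiprocessing import Queue

import numpy as np
import pvorca
import sounddevice


def orca_worker(access_key: str, input_queue: Queue) -> None:
    # Create the Orca engine and open a streaming synthesis session.
    orca = pvorca.create(access_key=access_key)
    orca_stream = orca.stream_open()

    # Open an audio output stream matching Orca's sample rate (16-bit mono PCM).
    speaker = sounddevice.OutputStream(
        samplerate=orca.sample_rate, channels=1, dtype='int16')
    speaker.start()

    while True:
        message = input_queue.get()
        pcm = None
        if message['command'] == 'synthesize':
            # Orca returns audio as soon as enough context is available;
            # until then, synthesize() returns None.
            pcm = orca_stream.synthesize(message['text'])
        elif message['command'] == 'flush':
            # Generate audio for any text still buffered in the stream.
            pcm = orca_stream.flush()
        elif message['command'] == 'close':
            break
        if pcm is not None and len(pcm) > 0:
            speaker.write(np.array(pcm, dtype=np.int16))

    # stop() waits until all pending audio has been played.
    speaker.stop()
    speaker.close()
    orca_stream.close()
    orca.delete()
```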
Note that the Orca stream object processes text chunks one-by-one and returns audio chunks as soon as enough context is available. At the end of the text stream, Orca generates the audio for the remaining buffered text via a flush command.
Setting up the main process
In the main process, we set up the communication with the Orca worker:
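A sketch of that setup, reusing the queue-based protocol assumed above (the __main__ guard lets multiprocessing spawn the worker process safely):

```python
from multiprocessing import Process, Queue

if __name__ == '__main__':
    text_queue = Queue()

    # Run the Orca worker in its own process so synthesis and playback
    # do not block the main application.
    orca_process = Process(target=orca_worker, args=('${ACCESS_KEY}', text_queue))
    orca_process.start()
```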
Replace ${ACCESS_KEY} with your Picovoice Console AccessKey.
Streaming text input
Here, we simulate asynchronous text generation with a generator function. This part can be replaced with any LLM API call or local model inference.
We then send the text chunks generated by the text stream to the Orca worker.
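A sketch, using a hypothetical text_stream() generator as a stand-in for the LLM; these lines continue inside the main block above:

```python
import time


def text_stream():
    # Simulated LLM output: yield the response word-by-word, with a short
    # pause mimicking the pacing of token generation.
    response = "Streaming TTS converts partial text into audio as soon as it arrives."
    for word in response.split():
        yield f'{word} '
        time.sleep(0.1)


# Forward each text chunk to the Orca worker as soon as it is generated.
for text_chunk in text_stream():
    text_queue.put({'command': 'synthesize', 'text': text_chunk})
```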
Flush and wait for completion
After sending all the text, we flush the Orca engine and wait until the audio finishes playing.
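Continuing the sketch, a flush message tells the worker to synthesize whatever text is still buffered; the actual wait happens when we join the worker process in the next step:

```python
# Generate and play the audio for any text still buffered by Orca.
text_queue.put({'command': 'flush'})
```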
Cleaning Up
When we're done, we close the Orca worker:
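In this sketch, closing means sending the close message and joining the worker process, which finishes playback and releases the Orca and audio resources on its way out:

```python
# Signal the worker to shut down, then wait until it has finished
# playing the remaining audio and released its resources.
text_queue.put({'command': 'close'})
orca_process.join()
```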
Time to Start Building
With just a few lines of Python code, we've implemented a streaming TTS system using the Orca Streaming Text-to-Speech library. This system can process text in real time, generating audio as the text is being streamed. This approach is essential for applications that require low-latency audio generation, such as real-time voice assistants or live captioning systems.
You can check out LLM Voice Assistant for a complete working project.
For more information on the Orca library and its capabilities, view the official documentation and start building.