
Text-to-Speech (TTS) technology has come a long way. In the era of Large Language Models (LLMs), TTS systems must be able to handle an input stream of text and convert it to consistent audio in real time. This streaming TTS functionality is essential for building responsive voice applications, particularly in the context of LLM-based voice assistants that require minimal latency (see our previous article). A typical voice assistant system integrates three key components:

  1. A Streaming Speech-to-Text system (such as Picovoice's Cheetah Streaming Speech-to-Text) for user input.
  2. A text generator (such as Picovoice's picoLLM Inference) for processing and responding.
  3. A TTS engine that converts the generated text to audio, ideally with support for streaming input and output, such as Picovoice's Orca Streaming Text-to-Speech.

In this blog post, we'll focus on the third component and implement a streaming TTS system in Python using Orca Streaming Text-to-Speech.

Dependencies for a Streaming TTS System

Let's cover the necessary dependencies and setup:

  1. Install Python: We use version 3.8 or higher. Test whether the installation was successful:
  2. Install the required Python packages using pip:
  3. Sign up for Picovoice Console: Create a Picovoice Console account and copy your AccessKey from the dashboard. Creating an account is free, and no credit card is required.
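The setup steps above can be sketched as follows. The package names are assumptions: `pvorca` for the Orca engine, plus `sounddevice` and `numpy` for raw audio playback.

```shell
# Verify the Python installation (version 3.8 or higher is required)
python3 --version

# Install the Orca SDK and an audio playback stack
# (package names assumed: pvorca, sounddevice, numpy)
pip3 install pvorca sounddevice numpy
```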

Building a Simple Streaming TTS App

Defining a streaming TTS worker

First, we define an orca_worker function. This function sets up the TTS engine and manages the audio stream. It processes text chunks as they arrive and plays back the generated audio in real-time. We will run the worker function in a separate process to avoid blocking the main application.

Note that the Orca stream object processes text chunks one by one and returns audio chunks as soon as enough context is available. At the end of the text stream, a flush command makes Orca synthesize any remaining buffered text.
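A minimal sketch of such a worker is shown below. It assumes the `pvorca` package and a `sounddevice`-based playback path; the exact engine calls (`create`, `stream_open`, `synthesize`, `flush`, `sample_rate`, `delete`) are assumptions based on the pvorca Python SDK, and `None` on the queue serves as the end-of-stream sentinel.

```python
import multiprocessing


def orca_worker(access_key: str, text_queue: "multiprocessing.Queue") -> None:
    # Imports live inside the worker so the child process owns its own
    # engine and audio-device handles.
    import numpy as np
    import pvorca
    import sounddevice as sd

    orca = pvorca.create(access_key=access_key)
    stream = orca.stream_open()

    # Raw single-channel PCM playback at Orca's native sample rate
    speaker = sd.OutputStream(
        samplerate=orca.sample_rate, channels=1, dtype="int16")
    speaker.start()

    while True:
        text = text_queue.get()
        if text is None:
            # End of text stream: flush synthesizes any remaining buffered text
            pcm = stream.flush()
            if pcm:
                speaker.write(np.array(pcm, dtype=np.int16))
            break
        # Returns audio as soon as enough context is available, else None
        pcm = stream.synthesize(text)
        if pcm:
            speaker.write(np.array(pcm, dtype=np.int16))

    speaker.stop()
    speaker.close()
    stream.close()
    orca.delete()
```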

Setting up the main process

In the main process, we set up the communication with the Orca worker:

Replace ${ACCESS_KEY} with your Picovoice Console AccessKey.
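The wiring can be sketched as below; `start_tts` is a hypothetical helper, and the worker function passed to it is the `orca_worker` described earlier.

```python
import multiprocessing


def start_tts(worker_fn, access_key: str):
    # The queue carries text chunks from the main process to the Orca worker
    text_queue = multiprocessing.Queue()
    worker = multiprocessing.Process(
        target=worker_fn, args=(access_key, text_queue))
    worker.start()
    return text_queue, worker


# Usage (replace the placeholder with your Picovoice Console AccessKey):
# text_queue, worker = start_tts(orca_worker, "${ACCESS_KEY}")
```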

Streaming text input

Here, we simulate asynchronous text generation with a generator function. This part can be replaced with any LLM API call or local model inference.

We then send the text chunks generated by the text stream to the Orca worker.
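A simple stand-in for the LLM output can look like this; the sentence, chunk size, and delay are illustrative, and `text_queue` is the queue shared with the Orca worker.

```python
import time
from typing import Iterator


def text_stream() -> Iterator[str]:
    # Illustrative stand-in for tokens arriving from an LLM API
    sentence = ("Streaming text-to-speech lets us start speaking "
                "before the full response has been generated.")
    for word in sentence.split():
        time.sleep(0.05)  # mimic per-token generation latency
        yield word + " "


def stream_to_worker(text_queue) -> None:
    # Forward each chunk to the Orca worker as soon as it is produced
    for chunk in text_stream():
        text_queue.put(chunk)
```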

Flush and wait for completion

After sending all the text, we flush the Orca engine and wait until the audio finishes playing.
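With a worker that treats `None` on the queue as its end-of-stream sentinel, this step reduces to two calls; the `finish` helper below is a hypothetical name for that pair.

```python
def finish(text_queue, worker) -> None:
    text_queue.put(None)  # sentinel: tells the worker to flush Orca and exit
    worker.join()         # blocks until the worker has played all audio
```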

Cleaning Up

When we're done, we close the Orca worker:
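On the main-process side, cleanup might look like the sketch below (the `close_worker` name is hypothetical); the worker itself already releases its Orca and audio-device handles before exiting.

```python
import multiprocessing


def close_worker(text_queue: "multiprocessing.Queue",
                 worker: multiprocessing.Process) -> None:
    text_queue.close()        # no more text chunks will be sent
    text_queue.join_thread()  # wait for the queue's feeder thread to drain
    if worker.is_alive():
        worker.terminate()    # defensive: stop the worker if it is stuck
```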

Time to Start Building

With just a few lines of Python code, we've implemented a streaming TTS system using the Orca Streaming Text-to-Speech library. This system generates audio in real time as the text is being streamed in. This approach is essential for applications that require low-latency audio generation, such as real-time voice assistants or live captioning systems. You can check out LLM Voice Assistant for a complete working project.

For more information on the Orca library and its capabilities, view the official documentation and start building.
