
TLDR: The most effective way to implement real-time text-to-speech (TTS) in Python is to use an on-device dual-streaming SDK (streaming TTS for short). Running TTS locally eliminates network latency, and streaming lets speech start immediately without waiting for the full text, delivering a clear performance edge over cloud-based, non-streaming solutions such as Amazon Polly, ElevenLabs, or the OpenAI TTS API for real-time applications.

TTS Approaches: Non-Streaming vs Output Streaming vs Dual Streaming

Non-Streaming TTS (Single Synthesis):

  • Waits for complete text input before audio synthesis begins
  • Generates entire audio file in one pass, then plays it
  • Ideal for batch audio file generation or pre-written content
  • Higher latency for interactive applications

Output Streaming TTS:

  • Waits for complete text input before audio synthesis begins
  • Generates and plays audio in chunks simultaneously
  • Reduces playback latency but still requires full text upfront
  • Limited benefit for real-time applications with incremental text

Streaming TTS (Dual-Streaming):

  • Processes text incrementally as it arrives (input streaming)
  • Generates and plays audio in chunks without waiting for full text (output streaming)
  • Ideal for LLM-based voice assistants where text arrives token-by-token

Dual-streaming TTS is the only approach with full real-time capability: it handles both input and output incrementally, making it the definitive form of streaming text-to-speech. This is the streaming TTS we build in this tutorial.

Why Streaming TTS is the Best for Real-Time Apps

Streaming TTS, also called dual-streaming TTS, processes text incrementally as it arrives and generates audio in real-time chunks, so speech can start before the complete text input is available. This approach is essential for voice assistants and conversational AI agents where text streams token-by-token from LLMs.

This tutorial shows you how to implement this streaming TTS in Python using Picovoice's Orca Streaming Text-to-Speech, an on-device voice model that runs locally with no cloud dependency and delivers first audio in ~100ms. You'll learn how to stream text input to synthesized speech output with complete control over voice characteristics, speech rate, and audio formatting for building responsive real-time applications.

On-Device Streaming TTS Advantages over Cloud APIs

Before writing the code, let's understand why Orca Streaming Text-to-Speech delivers superior real-time TTS performance compared to cloud-based TTS. It provides:

  • Ultra-low latency streaming: Orca Streaming Text-to-Speech takes ~100 ms for first-token-to-speech, compared to ~1470–2850 ms for many cloud TTS services. This makes Orca Streaming Text-to-Speech suitable for real-time voice assistants and conversational agents.
  • Faster end-to-end voice assistant responses: In full voice-assistant pipelines with an LLM, Orca Streaming Text-to-Speech achieved ~170 ms response time versus ~1550–2930 ms for typical cloud pipelines.
  • Streaming and batch synthesis modes: Generate audio incrementally for real-time TTS playback or synthesize full files in a single call.
  • Multiple voice models: Switch voices by selecting different model files at initialization, enabling different speaking styles and characteristics.
  • Speech rate control: Adjust speaking speed to match conversational pacing or accessibility requirements.
  • On-device processing: All synthesis runs locally, with no cloud dependency, predictable latency, and no audio leaving the device.
  • Flexible audio output: Stream directly to speakers, save to WAV, or process audio chunks programmatically in real time.

View available Orca Streaming Text-to-Speech voices and models in the Orca GitHub documentation.

What You'll Build

In this tutorial, you'll implement streaming TTS that:

  • Tokenizes text to simulate LLM-style incremental generation
  • Synthesizes speech as text arrives token-by-token
  • Delivers first audio in ~100ms with continuous playback
  • Runs entirely on-device without cloud dependencies

For a deeper look at how real-time TTS fits into agentic systems, see our guide on streaming text-to-speech for AI agents.

Prerequisites

  • Python 3.9 or later
  • A laptop or desktop with speakers for testing
  • A Picovoice AccessKey from the Picovoice Console

Installing Python Dependencies for Streaming Text-to-Speech

Install the following Python packages using pip:
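Based on the libraries used later in this tutorial (the Orca engine, pvspeaker for audio playback, and tiktoken for LLM-style tokenization), the dependencies would be:

```shell
pip3 install pvorca pvspeaker tiktoken
```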

Building a Real-Time Text-to-Speech App in Python

Understanding the Streaming Process

The code implements real-time text-to-speech in three steps:

  1. Initialize Orca: Creates the TTS engine with your Picovoice AccessKey
  2. Stream synthesis: Text tokens are queued and orca_stream.synthesize() returns PCM chunks as they're produced
  3. Audio playback: Each chunk plays immediately via play_audio_callback while synthesis continues for remaining text

This streaming approach reduces perceived latency compared to waiting for complete audio generation.

Initializing the Streaming Engine

Create an Orca instance and open a streaming synthesizer. The streaming interface is what enables incremental synthesis: you can pass partial text as it becomes available, and Orca Streaming Text-to-Speech will return PCM audio chunks when it has enough context to generate speech.
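A minimal initialization sketch, assuming the pvorca package and a valid AccessKey. The `create_engine` helper name is ours; `model_path` is optional and selects the voice model file:

```python
def create_engine(access_key, model_path=None):
    # Imported here so this sketch can be loaded without the SDK installed.
    import pvorca

    # Create the Orca engine; pass model_path to choose a different voice.
    orca = pvorca.create(access_key=access_key, model_path=model_path)

    # Open the incremental (dual-streaming) synthesizer. stream_open() also
    # accepts a speech_rate argument to adjust speaking pace.
    stream = orca.stream_open()
    return orca, stream
```

From here on, `stream.synthesize(text)` accepts partial text and returns a PCM chunk once Orca has enough context.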

Preparing Audio Playback for Real-Time TTS

Real-time text-to-speech requires audio playback to run alongside synthesis so that speech can be played as soon as it is generated. The example below sets up an audio output stream and forwards PCM chunks from Orca Streaming Text-to-Speech directly to the speaker as they arrive.
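A sketch of that playback setup, assuming the pvspeaker package; the `create_playback` helper name is ours, while the callback names match the ones used below:

```python
def create_playback(sample_rate, buffer_size_secs=20):
    # Imported here so this sketch can be loaded without pvspeaker installed.
    from pvspeaker import PvSpeaker

    # Orca produces 16-bit PCM; match the speaker to the engine's
    # sample rate (exposed as orca.sample_rate).
    speaker = PvSpeaker(
        sample_rate=sample_rate,
        bits_per_sample=16,
        buffer_size_secs=buffer_size_secs)
    speaker.start()

    def play_audio_callback(pcm):
        # Forward each PCM chunk to the speaker as soon as it arrives.
        speaker.write(pcm)

    def flush_audio_callback():
        # Drain whatever audio is still queued once synthesis is complete.
        speaker.flush()
        speaker.stop()

    return speaker, play_audio_callback, flush_audio_callback
```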

The play_audio_callback writes each PCM chunk to the speaker as it arrives. The flush_audio_callback handles draining any remaining audio once synthesis is complete. These callbacks are passed into the synthesis thread, keeping audio playback decoupled from text processing.

Streaming Text into the TTS Engine

To simulate how an LLM emits tokens, we need to break input text into small chunks that arrive incrementally. The tokenize_text function splits text into subword tokens using tiktoken (OpenAI's tokenizer), which mirrors realistic token-by-token delivery from language models.

Why tokenization matters: LLMs don't generate complete sentences at once—they produce tokens sequentially. Our tokenizer replicates this behavior so Orca receives text the same way it would from a real LLM.

If tiktoken is unavailable, a character-level fallback tokenizer is used. In production, replace this with your LLM’s actual output stream.
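A sketch of such a tokenizer, using tiktoken when available and falling back to characters otherwise:

```python
def tokenize_text(text):
    """Split text into LLM-style subword tokens, as an LLM would emit them."""
    try:
        import tiktoken  # OpenAI's BPE tokenizer
        encoder = tiktoken.get_encoding("cl100k_base")
    except Exception:
        # Character-level fallback when tiktoken (or its BPE data) is unavailable.
        return list(text)
    # Decode each token id individually so we get back text fragments
    # in the same order a language model would produce them.
    return [encoder.decode([t]) for t in encoder.encode(text)]
```

Joining the returned fragments reconstructs the original text, so downstream synthesis sees exactly the input string, just delivered piecewise.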

Now that we can tokenize text incrementally, we need a way to feed these tokens into Orca Streaming Text-to-Speech while simultaneously playing audio. This requires running synthesis and playback concurrently in separate threads.

Generating and Playing Audio Incrementally

The OrcaThread class runs Orca's streaming synthesizer in a background thread. Text tokens are pushed into a queue, synthesized into PCM audio, and buffered before being sent to the audio output. This decouples text input from audio playback, allowing both to run concurrently. The audio_wait_chunks parameter controls how many PCM chunks to buffer before playback begins, which can help smooth audio output on slower devices.
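A minimal, duck-typed sketch of such a worker: `stream` can be the handle returned by `orca.stream_open()` (anything with `synthesize(text)` and `flush()` returning a PCM chunk or `None`), and the parameter names follow the tutorial:

```python
import queue
import threading


class OrcaThread:
    """Background synthesis worker that decouples text input from playback."""

    def __init__(self, stream, play_audio_callback, flush_audio_callback,
                 audio_wait_chunks=1):
        self._stream = stream
        self._play = play_audio_callback
        self._flush_audio = flush_audio_callback
        self._wait_chunks = audio_wait_chunks
        self._queue = queue.Queue()
        self._buffer = []       # PCM chunks held back until playback starts
        self._started = False
        self._thread = threading.Thread(target=self._run)
        self._thread.start()

    def _emit_ready(self):
        # Hold audio until enough chunks are buffered to avoid underruns.
        if not self._started and len(self._buffer) < self._wait_chunks:
            return
        self._started = True
        for chunk in self._buffer:
            self._play(chunk)
        self._buffer.clear()

    def _run(self):
        while True:
            text = self._queue.get()
            if text is None:  # sentinel pushed by stop(): drain and exit
                pcm = self._stream.flush()
                if pcm is not None and len(pcm) > 0:
                    self._buffer.append(pcm)
                self._started = True  # force out whatever is still buffered
                self._emit_ready()
                self._flush_audio()
                return
            pcm = self._stream.synthesize(text)
            if pcm is not None and len(pcm) > 0:
                self._buffer.append(pcm)
                self._emit_ready()

    def synthesize(self, text):
        # Called from the producer side (e.g. the LLM token loop).
        self._queue.put(text)

    def stop(self):
        self._queue.put(None)
        self._thread.join()
```

The producer simply calls `synthesize(token)` for each incoming token and `stop()` when the text stream ends; everything else happens on the background thread.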

Finalizing Streaming and Cleaning Up

When the text stream ends, call flush() to synthesize any remaining buffered context. Then cleanly close the stream and release resources. This matters for long-running apps and repeated sessions.
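A sketch of that shutdown sequence, assuming the pvorca and pvspeaker handles from the earlier steps (the `finalize` helper name is ours):

```python
def finalize(orca, stream, speaker):
    # Synthesize whatever text is still buffered inside the stream.
    pcm = stream.flush()
    if pcm is not None and len(pcm) > 0:
        speaker.write(pcm)
    speaker.flush()   # wait for queued audio to finish playing
    speaker.stop()
    stream.close()    # close the streaming synthesizer
    orca.delete()     # release the engine's resources
    speaker.delete()
```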

Full Real-Time Text-to-Speech Code in Python

This single script combines all aspects of real-time text-to-speech:
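The original script is not reproduced on this page; the sketch below wires the preceding sections together under the same assumptions (pvorca and pvspeaker installed, tiktoken optional; `tokens_per_second` and `audio_wait_chunks` are the tutorial's parameters):

```python
import time


def tokenize_text(text):
    """LLM-style subword tokens via tiktoken, with a character fallback."""
    try:
        import tiktoken
        enc = tiktoken.get_encoding("cl100k_base")
    except Exception:  # tiktoken missing or its BPE data unavailable
        return list(text)
    return [enc.decode([t]) for t in enc.encode(text)]


def main(access_key, text, tokens_per_second=15, audio_wait_chunks=2):
    # Deferred imports keep the sketch importable without the SDKs installed.
    import pvorca
    from pvspeaker import PvSpeaker

    orca = pvorca.create(access_key=access_key)
    stream = orca.stream_open()
    speaker = PvSpeaker(sample_rate=orca.sample_rate, bits_per_sample=16)
    speaker.start()

    try:
        buffered = []      # hold early chunks to smooth the playback start
        started = False
        for token in tokenize_text(text):
            pcm = stream.synthesize(token)
            if pcm is not None and len(pcm) > 0:
                buffered.append(pcm)
            if buffered and (started or len(buffered) >= audio_wait_chunks):
                started = True
                for chunk in buffered:
                    speaker.write(chunk)
                buffered.clear()
            time.sleep(1.0 / tokens_per_second)  # simulate LLM token pacing

        pcm = stream.flush()  # synthesize any remaining buffered context
        if pcm is not None and len(pcm) > 0:
            buffered.append(pcm)
        for chunk in buffered:
            speaker.write(chunk)
        speaker.flush()  # let queued audio finish playing
    finally:
        speaker.stop()
        stream.close()
        orca.delete()
        speaker.delete()


if __name__ == "__main__":
    main("${ACCESS_KEY}",
         "Streaming TTS lets speech start before the full text is known.")
```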

Before running the code, replace ${ACCESS_KEY} with your Picovoice AccessKey from the Picovoice Console. You now have a working implementation of streaming text-to-speech (dual-streaming) in Python using Orca Streaming Text-to-Speech.

Common Issues and Solutions

Choppy or stuttering audio?

  • Increase audio_wait_chunks to buffer more audio before playback
  • Reduce tokens_per_second to give Orca Streaming Text-to-Speech more time to synthesize

Audio delayed or not playing?

  • Verify your audio device index with speaker.get_available_devices()
  • Check that buffer_size_secs is large enough for your use case

Out of memory errors?

  • Reduce buffer_size_secs if processing very long text streams
  • Ensure you're calling flush() and delete() after each session

What to Build Next

To build complete voice applications, explore Picovoice's related guides.

Conclusion and Key Takeaways

In this tutorial, you learned how to:

  • Stream incremental text input into a TTS engine
  • Generate audio in real time
  • Play speech as text is produced without waiting for full synthesis to complete

This on-device, streaming approach is ideal for LLM-driven voice assistants, conversational agents, and any application where low-latency audio output is critical. By synthesizing speech as text becomes available, Python applications can deliver responsive, natural voice interactions instead of delayed playback.

Start Building

Frequently Asked Questions

What is streaming TTS?
Streaming TTS refers to text-to-speech that generates audio continuously rather than waiting to create a complete audio file. However, there are two types: output streaming (which still requires complete text input) and dual streaming (which processes text incrementally as it arrives). For real-time applications like voice assistants, dual-streaming TTS enables speech to start immediately as text arrives token-by-token.
How do I implement real-time TTS in Python?
For the best performance in real-time applications, use a dual-streaming TTS SDK that runs locally. Orca Streaming Text-to-Speech is designed for this use case. It begins synthesizing audio as soon as text arrives and delivers first speech output in approximately 100ms.
How to enable TTS on stream?
To enable TTS on stream, integrate a text-to-speech SDK that processes text in real time into your streaming application. Orca Streaming Text-to-Speech, a dual-streaming TTS model, is available across multiple platforms (Python, Web, iOS, Android, etc.) and delivers audio within ~100 ms of receiving text. The key is local processing, which avoids the cloud API latency that would create noticeable delays between text input and voice output during live streams or interactive applications.
What are the most common TTS technologies?
Common TTS technologies include non-streaming (single synthesis), output streaming, and dual-streaming approaches, each suited to different application needs. Non-streaming TTS generates complete audio files from text input, output streaming plays audio progressively, and dual-streaming processes text incrementally for real-time speech. Orca Streaming Text-to-Speech supports non-streaming, output streaming, and dual-streaming modes across multiple platforms (Python, JavaScript, Android, iOS, etc.), giving developers flexibility to choose the right approach for their specific use case.