Text-to-Speech in Python: On-Device Solutions

🚀 Best-in-class Voice AI!

Build compliant and low-latency AI apps using Python without sending user data to 3rd party servers.

Text-to-Speech (TTS) technology, also known as speech synthesis, converts text into human-like speech. Deep learning has brought major gains in TTS quality and naturalness, but at the cost of higher computational requirements. Most big tech companies offer cloud-based TTS APIs, such as Google Text-to-Speech, Amazon Polly, and Microsoft Text-to-Speech, and newer companies such as ElevenLabs and Coqui Studio have emerged with similar offerings. While convenient, these services require an internet connection, raise privacy concerns, and are vulnerable to network outages. On-device solutions offer more flexibility and privacy by synthesizing speech directly on the user's device. However, few options exist for on-device TTS. This article explores three open-source Python libraries and Picovoice Orca Text-to-Speech.


PyTTSx3

PyTTSx3 (installed and imported as pyttsx3) is a Python library that uses the popular eSpeak speech synthesis engine on Linux, NSSpeechSynthesizer on macOS, and SAPI5 on Windows. Getting started is straightforward:

Install pyttsx3:

pip install pyttsx3

Save synthesized speech to a file in Python:

import pyttsx3

engine = pyttsx3.init()
engine.save_to_file(text='Hello World', filename='PATH/TO/OUTPUT.wav')
engine.runAndWait()

While pyttsx3 is simple to use, eSpeak's voice sounds robotic compared to more modern TTS systems.
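Beyond the default settings, pyttsx3 lets you adjust properties such as the speaking rate before synthesis. The sketch below wraps the calls used in this section in a small helper. The helper name `speak_to_file` and the default rate of 150 words per minute are illustrative assumptions, not part of the library; the engine is passed in so the helper works with any object that behaves like the one returned by `pyttsx3.init()`.

```python
def speak_to_file(engine, text, filename, rate=150):
    """Save `text` as speech in `filename` using a pyttsx3-style engine.

    `engine` is assumed to expose setProperty, save_to_file, and runAndWait,
    as the object returned by pyttsx3.init() does. The helper itself is a
    hypothetical convenience wrapper, not a pyttsx3 function.
    """
    # 'rate' is the speaking rate in words per minute.
    engine.setProperty("rate", rate)
    # Queue the synthesis job; nothing is written until runAndWait() runs.
    engine.save_to_file(text, filename)
    engine.runAndWait()
    return filename


# Usage (requires pyttsx3 and a working speech engine on the machine):
#   import pyttsx3
#   speak_to_file(pyttsx3.init(), "Hello World", "PATH/TO/OUTPUT.wav", rate=120)
```

Lowering the rate will not make the voice sound more natural, but it can make the output easier to follow.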

Coqui TTS

Coqui TTS is the open-source repository of Coqui Studio. Developers can use Coqui's pretrained models or train custom voices. To synthesize speech, follow these steps:

Install Coqui TTS:

pip install TTS

List available models in Python:

from TTS.api import TTS

TTS().list_models()

Choose a model name and save synthesized speech to a file:

tts = TTS("CHOSEN/MODEL/NAME")
tts.tts_to_file(text="Hello World", output_path="PATH/TO/OUTPUT.wav")

Coqui offers high-quality voices with natural prosody, at the cost of larger model sizes and longer processing times.
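Since processing time depends heavily on the chosen model and the hardware, it can help to measure it on the target device. The following is a minimal sketch of a timing helper; `time_synthesis` is a hypothetical name introduced here and works with any callable that takes the text to synthesize.

```python
import time


def time_synthesis(synthesize, text):
    """Return (result, seconds) for one call to `synthesize(text)`."""
    start = time.perf_counter()
    result = synthesize(text)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Usage with the `tts` object created above (requires Coqui TTS):
#   _, seconds = time_synthesis(
#       lambda t: tts.tts_to_file(text=t, output_path="PATH/TO/OUTPUT.wav"),
#       "Hello World",
#   )
#   print(f"Synthesis took {seconds:.2f} s")
```

A single measurement is noisy, and the first call may include one-time setup costs, so repeating the call a few times gives a more representative figure.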

Mimic3 from Mycroft

Mycroft is a free and open-source virtual assistant that offers a TTS system called Mimic3. This framework currently lacks a pure Python API, so we will call its command-line tool through Python's subprocess module:

Install Mimic3:

pip install mycroft-mimic3-tts

Synthesize speech and save the audio file to the directory OUTPUT/DIR:

import subprocess

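# Pass each argument as a separate list item; no shell quoting is needed,
# and extra quotes would become part of the spoken text.
args = [
    "mimic3",
    "Hello World",
    "--output-dir", "OUTPUT/DIR",
]
try:
    subprocess.check_call(args)
except subprocess.CalledProcessError as e:
    # mimic3 exited with a non-zero status
    print(f"mimic3 failed with exit code {e.returncode}")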

For prototyping on-device TTS, Mimic3 from Mycroft provides a balance of quality and performance.
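Because Mimic3 is driven through its command-line tool, it is worth checking that the executable is available before calling it. The sketch below is one way to do that, using only the `mimic3` command and `--output-dir` option from this section. The function names `build_mimic3_command` and `synthesize_with_mimic3` are hypothetical helpers introduced for illustration.

```python
import shutil
import subprocess


def build_mimic3_command(text, output_dir):
    """Build the argument list for the mimic3 CLI call used in this section."""
    # Each list item reaches the program as one argument, so the text needs
    # no extra quoting even when it contains spaces.
    return ["mimic3", text, "--output-dir", output_dir]


def synthesize_with_mimic3(text, output_dir):
    """Run mimic3 if it is installed; return True on success, False otherwise."""
    if shutil.which("mimic3") is None:
        # The executable is not on PATH, e.g. the package is not installed.
        return False
    try:
        subprocess.check_call(build_mimic3_command(text, output_dir))
    except subprocess.CalledProcessError:
        # mimic3 ran but exited with a non-zero status.
        return False
    return True
```

Keeping the command construction in its own function makes it easy to inspect or log the exact arguments before anything is executed.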

Orca Text-to-Speech

Picovoice Orca Text-to-Speech leverages state-of-the-art Text-to-Speech (TTS) models to provide high-quality voices, while still being small and efficient.

Install Orca Text-to-Speech Python SDK.

pip install pvorca

Import Orca and create an Orca instance.

import pvorca

orca = pvorca.create(access_key="${ACCESS_KEY}")

Sign up or log in to Picovoice Console to copy your access key, then replace ${ACCESS_KEY} with it.

Synthesize your desired text.

orca.synthesize(text="${TEXT}")
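To match the other examples, which save audio to a file, the synthesized samples can be written to a WAV file with Python's standard `wave` module. This is a sketch under assumptions: it treats the synthesized audio as a sequence of 16-bit mono PCM integers and takes the sample rate as an argument. Depending on the SDK version, `synthesize` may return additional values alongside the samples, so check the SDK documentation before relying on the usage shown in the comments. `save_pcm_as_wav` is a hypothetical helper, not part of the Orca SDK.

```python
import struct
import wave


def save_pcm_as_wav(pcm, sample_rate, path):
    """Write 16-bit mono PCM samples (a sequence of ints) to a WAV file."""
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)   # mono
        wav_file.setsampwidth(2)   # 2 bytes per sample = 16-bit audio
        wav_file.setframerate(sample_rate)
        # Pack the integers as little-endian signed 16-bit values.
        wav_file.writeframes(struct.pack(f"<{len(pcm)}h", *pcm))


# Usage with the `orca` instance created above (assumes the call returns
# only the PCM samples and that the instance exposes its sample rate):
#   pcm = orca.synthesize(text="${TEXT}")
#   save_pcm_as_wav(pcm, orca.sample_rate, "PATH/TO/OUTPUT.wav")
```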

For more information refer to the Orca Text-to-Speech Python SDK Documentation.

Conclusion

On-device TTS addresses privacy concerns, removes the need for an internet connection, and minimizes latency. With Python solutions like PyTTSx3, Coqui TTS, and Mimic3, developers have several options for synthesizing speech directly on devices, depending on their needs. However, each solution comes with drawbacks such as poor voice quality, large resource requirements, or the lack of a flexible API. Another alternative is Orca Text-to-Speech, which combines state-of-the-art neural TTS with efficiency, making it possible to synthesize high-quality speech even on a Raspberry Pi.