Text-to-Speech in Python: Cloud Solutions

Text-to-speech (TTS) converts written text into synthesized speech, enabling voice interfaces and applications such as virtual assistants, audiobooks, and accessibility tools. In our previous post, we explored on-device TTS solutions that synthesize speech directly on a user's device. This post compares top cloud-based TTS systems, which process text input in the cloud and send the audio output back to users' devices. For use cases where internet connectivity and privacy are not a concern, cloud-based TTS systems offer higher-quality voices across many languages and accents.

In this article, we explore the Python TTS APIs of three major cloud providers: Google Cloud Text-to-Speech, Amazon Polly, and Microsoft Azure Text-to-Speech.

Google Cloud Text-to-Speech

Google Cloud Text-to-Speech uses neural network models created by DeepMind and supports hundreds of voices across languages, dialects, and accents. To get started, sign up for a Google Cloud Platform account, create a new project, and set up your credentials. Refer to the documentation for more details. To synthesize speech, follow these steps:

Install the Google TTS Python library:

pip install google-cloud-texttospeech

Import the library and create a client:

import google.cloud.texttospeech as tts

client = tts.TextToSpeechClient()

Use the following function to synthesize speech from text:

def google_tts(text: str, output_path: str, voice_name: str) -> None:
    language_code = "-".join(voice_name.split("-")[:2])
    text_input = tts.SynthesisInput(text=text)
    voice_params = tts.VoiceSelectionParams(language_code=language_code, name=voice_name)
    # LINEAR16 returns uncompressed 16-bit PCM with a WAV header, so use a .wav output path
    audio_config = tts.AudioConfig(audio_encoding=tts.AudioEncoding.LINEAR16)

    response = client.synthesize_speech(
        input=text_input,
        voice=voice_params,
        audio_config=audio_config)

    try:
        with open(output_path, "wb") as out:
            out.write(response.audio_content)
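    except OSError as error:
        # Surface file-system failures (bad path, missing permissions) instead of swallowing them
        raise RuntimeError(f"Could not write audio to {output_path}") from error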

To learn about the available voices and languages, visit Google's documentation.
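
The google_tts function derives the language code from the voice name by keeping its first two hyphen-separated parts. Here is a minimal sketch of that step in isolation, assuming voice names follow a language-region-model-variant pattern such as en-US-Wavenet-D. The helper name language_code_from_voice is hypothetical and not part of Google's library; voice names with a different shape would need the language code passed explicitly.

```python
def language_code_from_voice(voice_name: str) -> str:
    """Keep the first two hyphen-separated parts, e.g. the 'en-US' in 'en-US-Wavenet-D'."""
    parts = voice_name.split("-")
    if len(parts) < 3:
        # Assumption: a usable voice name has at least language, region, and model parts
        raise ValueError(f"Unexpected voice name format: {voice_name!r}")
    return "-".join(parts[:2])


print(language_code_from_voice("en-US-Wavenet-D"))  # en-US
```

Failing early on an unexpected name avoids sending a malformed language code to the API.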

Amazon Polly

Amazon Polly Text-to-Speech has two offerings: Standard TTS and Neural TTS. Polly Standard TTS leverages concatenative synthesis, whereas Neural TTS leverages neural networks, resulting in more natural and human-like voices.

To get started, create an AWS account and set up your credentials. Then, follow these steps:

Install boto3, the AWS SDK for Python, which includes the Polly client:

pip install boto3

Import the library and create a client in Python. Replace YourProfileName with the name of your AWS credentials profile:

import boto3

session = boto3.Session(profile_name="YourProfileName")
polly_client = session.client("polly")

Synthesize speech with the following function:

def aws_tts(text: str, output_path: str, voice_name: str) -> None:
    response = polly_client.synthesize_speech(Text=text, OutputFormat="mp3", VoiceId=voice_name)

    try:
        with open(output_path, "wb") as file:
            file.write(response["AudioStream"].read())
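    except OSError as error:
        # Surface file-system failures (bad path, missing permissions) instead of swallowing them
        raise RuntimeError(f"Could not write audio to {output_path}") from error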

Check Amazon's documentation for the available voices.
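
The aws_tts function above does not set an engine, which leaves the choice to Polly's default. To pick between the Standard and Neural offerings explicitly, synthesize_speech accepts an Engine parameter. The sketch below only builds the keyword arguments, so it runs without AWS credentials. The helper build_polly_request is hypothetical, the voice Joanna is used purely as an example, and not every voice is available with every engine.

```python
def build_polly_request(text: str, voice_name: str, engine: str = "neural") -> dict:
    """Build keyword arguments for polly_client.synthesize_speech (hypothetical helper)."""
    # Only the two offerings discussed in this post are accepted here
    if engine not in ("standard", "neural"):
        raise ValueError(f"Unsupported engine: {engine!r}")
    return {
        "Text": text,
        "OutputFormat": "mp3",
        "VoiceId": voice_name,
        "Engine": engine,
    }


request = build_polly_request("Hello from Polly.", "Joanna")
# With a configured client: response = polly_client.synthesize_speech(**request)
```

Keeping the request as a plain dictionary makes it easy to log or unit-test the parameters before any billable call is made.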

Microsoft Azure TTS

Microsoft offers Text-to-Speech as part of its Azure AI Speech services. Like Google and Amazon, it synthesizes speech in a variety of languages, voices, and dialects. Microsoft also focuses on training custom voice models.

To synthesize speech, you first need to sign up for an Azure account and create a speech resource in the Azure portal. Then, follow these steps:

Install the Microsoft Azure Python library:

pip install azure-cognitiveservices-speech

Set the environment variables SPEECH_KEY and SPEECH_REGION to the key and region of the speech resource you created in the Azure portal.

Import the library in Python:

import os
import azure.cognitiveservices.speech as speechsdk

Use the following function to synthesize speech from text:

def azure_tts(text: str, output_path: str, voice_name: str) -> None:
    speech_config = speechsdk.SpeechConfig(
        subscription=os.environ.get('SPEECH_KEY'),
        region=os.environ.get('SPEECH_REGION'))
    speech_config.speech_synthesis_voice_name = voice_name
    # audio_config=None keeps the audio in memory instead of playing it on the default speaker
    speech_synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config, audio_config=None)

    speech_synthesis_result = speech_synthesizer.speak_text_async(text).get()

    if speech_synthesis_result.reason == speechsdk.ResultReason.SynthesizingAudioCompleted:
        with open(output_path, "wb") as file:
            file.write(speech_synthesis_result.audio_data)
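    else:
        # Synthesis did not complete, e.g. it was canceled because of bad credentials or a network error
        raise RuntimeError(f"Speech synthesis failed: {speech_synthesis_result.reason}")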

For available languages and voices, check Microsoft's documentation.
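
The azure_tts function reads its credentials with os.environ.get, which returns None when a variable is unset rather than saying which one is missing. Below is a small sketch of an up-front check. The helper load_speech_settings is hypothetical and not part of the Speech SDK.

```python
import os
from typing import Tuple


def load_speech_settings() -> Tuple[str, str]:
    """Return (key, region) from the environment, failing early if either is missing."""
    required = ("SPEECH_KEY", "SPEECH_REGION")
    # Treat unset and empty variables the same way
    missing = [name for name in required if not os.environ.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return os.environ["SPEECH_KEY"], os.environ["SPEECH_REGION"]
```

The returned pair could then be passed to speechsdk.SpeechConfig as subscription and region in place of the two os.environ.get calls.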

Conclusion

In summary, the top cloud providers offer high-quality TTS services accessible from Python. Each provides standard voices and more natural-sounding neural voices at different price points. Beyond the examples above, the APIs allow adjusting speech parameters such as rate, pitch, and speaking style, and support Speech Synthesis Markup Language (SSML) for fine-grained control over synthesis. They also support creating custom voices by engaging with the respective sales teams.
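
As an illustration of the SSML support mentioned above, the sketch below builds a small SSML string that wraps text in a prosody element to adjust rate and pitch. The helper build_ssml is hypothetical. The accepted attribute values and any required wrapper elements differ between providers (Azure, for instance, expects additional attributes on speak and a voice element), so treat the output as a starting point and check each provider's SSML reference.

```python
from xml.sax.saxutils import escape, quoteattr


def build_ssml(text: str, rate: str = "medium", pitch: str = "medium") -> str:
    """Wrap text in <speak><prosody> with XML-escaped content (hypothetical helper)."""
    return (
        f"<speak><prosody rate={quoteattr(rate)} pitch={quoteattr(pitch)}>"
        f"{escape(text)}</prosody></speak>"
    )


print(build_ssml("Terms & conditions apply.", rate="slow"))
# <speak><prosody rate="slow" pitch="medium">Terms &amp; conditions apply.</prosody></speak>
```

With Google, such a string would be passed as tts.SynthesisInput(ssml=...) instead of text=...; with Polly, it would go in Text together with TextType="ssml".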
