Text-to-speech (TTS) converts written text to synthesized speech, enabling voice interfaces and applications such as virtual assistants, audiobooks, and accessibility tools. In our previous post, we explored on-device TTS solutions that synthesize speech directly on a user's device. This post compares top cloud-based TTS systems, which process text in the cloud and send the resulting audio back to users' devices. For use cases where internet connectivity and privacy are not concerns, cloud-based TTS systems offer higher-quality voices across many languages and accents.

In this article, we will explore the TTS Python APIs of major tech companies, including Google Cloud Text-to-Speech, Amazon Polly, and Microsoft Azure Text-to-Speech.

Google Cloud Text-to-Speech

Google Cloud Text-to-Speech uses neural network models, including DeepMind's WaveNet, and supports hundreds of voices across languages, dialects, and accents. To get started, you will need to sign up for a Google Cloud Platform account, create a new project, and set up your credentials. Refer to the documentation for more details. To synthesize speech, follow these steps:

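  1. Install the Google Cloud Text-to-Speech Python library (pip install google-cloud-texttospeech).
  2. Import the library and create a client.
  3. Synthesize speech by sending the text, a voice selection, and an audio configuration to the client, then save the returned audio.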
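The three steps above can be sketched as follows, assuming the library is installed and Application Default Credentials are configured (for example via GOOGLE_APPLICATION_CREDENTIALS). The names synthesize_speech and encoding_name, and the mapping from file extensions to audio encodings, are illustrative choices rather than part of the library. The import sits inside the function so the module loads even where the SDK is not installed; a top-level import works just as well.

```python
import os

# Hypothetical mapping from output file extension to a Cloud TTS AudioEncoding name.
# LINEAR16 responses include a WAV header, so ".wav" files play directly.
_ENCODINGS = {".mp3": "MP3", ".wav": "LINEAR16", ".ogg": "OGG_OPUS"}


def encoding_name(output_path):
    """Return the AudioEncoding member name for a file extension (default MP3)."""
    ext = os.path.splitext(output_path)[1].lower()
    return _ENCODINGS.get(ext, "MP3")


def synthesize_speech(text, output_path="output.mp3", language_code="en-US", voice_name=None):
    """Synthesize `text` with Google Cloud TTS and save the audio to `output_path`."""
    # Step 2: import the library and create a client; credentials come from
    # Application Default Credentials.
    from google.cloud import texttospeech

    client = texttospeech.TextToSpeechClient()

    # Step 3: describe the input text, the voice, and the audio format.
    synthesis_input = texttospeech.SynthesisInput(text=text)
    voice_kwargs = {"language_code": language_code}
    if voice_name:  # optionally pin a specific voice returned by list_voices()
        voice_kwargs["name"] = voice_name
    voice = texttospeech.VoiceSelectionParams(**voice_kwargs)
    audio_config = texttospeech.AudioConfig(
        audio_encoding=getattr(texttospeech.AudioEncoding, encoding_name(output_path))
    )

    response = client.synthesize_speech(
        input=synthesis_input, voice=voice, audio_config=audio_config
    )
    # response.audio_content holds the encoded audio bytes.
    with open(output_path, "wb") as out:
        out.write(response.audio_content)
    return output_path
```

For example, synthesize_speech("Hello, world!", "hello.wav") would request LINEAR16 audio from the default en-US voice and write it to hello.wav.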

To learn about the available voices and languages, visit Google's documentation.
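Voices can also be listed programmatically. The sketch below uses the client's list_voices method with a language_code filter; the helper names and the substring filter (for example "Neural2" or "Wavenet", which appear in Google's voice names) are illustrative assumptions.

```python
def filter_names(names, contains=None):
    """Return sorted names, keeping only those containing `contains` if given."""
    return sorted(n for n in names if contains is None or contains in n)


def list_voice_names(language_code="en-US", contains=None):
    """List Google Cloud TTS voice names for a language, optionally filtered."""
    from google.cloud import texttospeech

    client = texttospeech.TextToSpeechClient()
    response = client.list_voices(language_code=language_code)
    # Each voice carries a name, language codes, SSML gender, and sample rate.
    return filter_names((voice.name for voice in response.voices), contains)
```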

Amazon Polly

Amazon Polly offers Standard TTS and Neural TTS voices. Polly Standard TTS uses concatenative synthesis, whereas Neural TTS uses neural networks to produce more natural, human-like voices.

To get started, create an AWS account and set up your credentials. Then follow these steps:

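  1. Install the AWS SDK for Python, Boto3, which includes the Amazon Polly client (pip install boto3).
  2. Import the library and create a Polly client in Python from a Boto3 session, replacing YourProfileName with the name of your AWS credentials profile.
  3. Synthesize speech by calling the client's synthesize_speech method with the text, voice, engine, and output format, then save the returned audio stream.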
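A minimal sketch of these steps, assuming Boto3 is installed and a credentials profile named YourProfileName exists. The default voice Joanna and the helper names polly_request and synthesize_speech are illustrative choices; the Engine parameter is where the Standard and Neural offerings described above are selected.

```python
def polly_request(text, voice_id="Joanna", engine="neural", output_format="mp3"):
    """Build keyword arguments for Polly's synthesize_speech call.

    engine="standard" requests a concatenative voice and engine="neural" a
    neural one; the chosen VoiceId must support the requested engine.
    """
    return {
        "Text": text,
        "VoiceId": voice_id,
        "Engine": engine,
        "OutputFormat": output_format,  # e.g. "mp3", "ogg_vorbis", or "pcm"
    }


def synthesize_speech(text, output_path="speech.mp3", profile_name="YourProfileName",
                      region_name=None, **request_options):
    """Synthesize `text` with Amazon Polly and save the audio to `output_path`."""
    # Imported here so polly_request can be used without boto3 installed.
    import boto3

    # Step 2: create a Polly client from the named credentials profile.
    # region_name=None falls back to the profile's configured region.
    session = boto3.Session(profile_name=profile_name)
    polly = session.client("polly", region_name=region_name)

    # Step 3: request speech and write the streamed audio bytes to disk.
    response = polly.synthesize_speech(**polly_request(text, **request_options))
    with open(output_path, "wb") as out:
        out.write(response["AudioStream"].read())
    return output_path
```

For example, synthesize_speech("Hello!", engine="standard") would request the Standard version of the default voice instead of the neural one.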

Check Amazon's documentation for the available voices.
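Polly's describe_voices call returns the same information programmatically, including each voice's SupportedEngines. The sketch below, with hypothetical helper names, follows NextToken in case results are paginated and keeps the Ids of voices that support the requested engine.

```python
def voice_ids(voices, engine="neural"):
    """Return sorted Ids of voices whose SupportedEngines include `engine`."""
    return sorted(v["Id"] for v in voices if engine in v.get("SupportedEngines", []))


def list_polly_voices(language_code="en-US", engine="neural", profile_name="YourProfileName"):
    """List Amazon Polly voice Ids for a language that support `engine`."""
    import boto3

    polly = boto3.Session(profile_name=profile_name).client("polly")
    request = {"LanguageCode": language_code, "Engine": engine}
    voices = []
    while True:  # follow NextToken until every page has been read
        response = polly.describe_voices(**request)
        voices.extend(response["Voices"])
        token = response.get("NextToken")
        if not token:
            break
        request["NextToken"] = token
    return voice_ids(voices, engine)
```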

Microsoft Azure TTS

Microsoft offers Text-to-Speech as part of its Azure AI Speech service, with capabilities similar to Google's and Amazon's for synthesizing speech in a variety of languages, voices, and dialects. Microsoft also puts particular emphasis on training custom voice models.

To synthesize speech, you first need to sign up for an Azure account and create a Speech resource in the Azure portal. Then follow these steps:

  1. Install the Azure Speech SDK for Python (pip install azure-cognitiveservices-speech).
  2. Set the environment variables SPEECH_KEY and SPEECH_REGION to the key and region of the Speech resource you created in the Azure portal.
  3. Import the library in Python.
  4. Synthesize speech from text with a speech synthesizer configured with your key, region, and a voice.
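The steps above might look like the sketch below, assuming the SDK is installed and the two environment variables are set. The voice en-US-JennyNeural, the helper names, and the choice to write a WAV file rather than play through the default speaker are illustrative assumptions.

```python
import os


def speech_settings(env=None):
    """Read SPEECH_KEY and SPEECH_REGION, failing early with a clear message."""
    env = os.environ if env is None else env
    key, region = env.get("SPEECH_KEY"), env.get("SPEECH_REGION")
    if not key or not region:
        raise RuntimeError("Set SPEECH_KEY and SPEECH_REGION from your Azure Speech resource")
    return key, region


def synthesize_speech(text, output_path="output.wav", voice_name="en-US-JennyNeural"):
    """Synthesize `text` with Azure AI Speech and save the audio to `output_path`."""
    # Step 3: import the SDK (done here so speech_settings works without it).
    import azure.cognitiveservices.speech as speechsdk

    key, region = speech_settings()
    speech_config = speechsdk.SpeechConfig(subscription=key, region=region)
    speech_config.speech_synthesis_voice_name = voice_name
    # Write to a file; AudioOutputConfig(use_default_speaker=True) would play aloud.
    audio_config = speechsdk.audio.AudioOutputConfig(filename=output_path)

    # Step 4: synthesize and block until the result is ready.
    synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config, audio_config=audio_config)
    result = synthesizer.speak_text_async(text).get()

    if result.reason == speechsdk.ResultReason.SynthesizingAudioCompleted:
        return output_path
    if result.reason == speechsdk.ResultReason.Canceled:
        details = result.cancellation_details
        raise RuntimeError(f"Synthesis canceled: {details.reason} {details.error_details}")
    raise RuntimeError(f"Unexpected synthesis result: {result.reason}")
```

Reading the key and region through one helper keeps the missing-variable error explicit instead of surfacing later as an authentication failure.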

For available languages and voices, check Microsoft's documentation.
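The Speech SDK can also list voices for a locale. This sketch assumes an SDK version that provides SpeechSynthesizer.get_voices_async and voice objects with a short_name attribute; the helper names and substring filter are hypothetical.

```python
import os


def matching_names(names, contains=None):
    """Return sorted names, keeping only those containing `contains` if given."""
    return sorted(n for n in names if contains is None or contains in n)


def list_azure_voices(locale="en-US", contains=None):
    """List Azure TTS voice short names (e.g. en-US-JennyNeural) for a locale."""
    import azure.cognitiveservices.speech as speechsdk

    speech_config = speechsdk.SpeechConfig(
        subscription=os.environ["SPEECH_KEY"], region=os.environ["SPEECH_REGION"]
    )
    # audio_config=None: no audio output is needed just to list voices.
    synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config, audio_config=None)
    result = synthesizer.get_voices_async(locale).get()
    if result.reason != speechsdk.ResultReason.VoicesListRetrieved:
        raise RuntimeError(f"Could not list voices: {result.error_details}")
    return matching_names((voice.short_name for voice in result.voices), contains)
```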

Conclusion

In summary, the top cloud providers offer high-quality TTS services accessible via Python. Each offers standard voices and higher-quality neural voices at different price points. Beyond the examples above, the APIs let you adjust speech parameters such as rate, pitch, and speaking style, and they support Speech Synthesis Markup Language (SSML) for finer control over synthesis. Custom voices are also available, typically by engaging with each provider's sales team.
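To illustrate the SSML point, the sketch below wraps plain text in a minimal SSML document with an optional prosody element for rate and pitch. The function name build_ssml is hypothetical. The result can be passed as SynthesisInput(ssml=...) for Google or with TextType="ssml" for Polly; Azure's speak_ssml_async additionally expects version, xmlns, and xml:lang attributes plus a voice element, and supported prosody values differ by provider and voice.

```python
from xml.sax.saxutils import escape, quoteattr


def build_ssml(text, rate=None, pitch=None):
    """Wrap plain text in <speak>, optionally inside <prosody rate/pitch>.

    The text is XML-escaped so characters like '&' and '<' don't break the
    markup; attribute values are quoted with quoteattr for the same reason.
    """
    body = escape(text)
    attrs = []
    if rate is not None:
        attrs.append(f"rate={quoteattr(rate)}")    # e.g. "slow" or "90%"
    if pitch is not None:
        attrs.append(f"pitch={quoteattr(pitch)}")  # e.g. "+2st" (semitones)
    if attrs:
        body = f"<prosody {' '.join(attrs)}>{body}</prosody>"
    return f"<speak>{body}</speak>"
```

For example, build_ssml("Tom & Jerry", rate="slow") produces <speak><prosody rate="slow">Tom &amp; Jerry</prosody></speak>.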