Text-to-Speech APIs and SDKs

Text-to-Speech (TTS), also called Speech Synthesis or Voice Generation, is the technology that converts written text into spoken words. Text-to-Speech enables various applications, from voice assistants to language learning. It’s a popular technology with several options on the market, so we gathered the most popular ones.

Open-source Text-to-Speech Software

Bark by Suno:

Bark is a text-to-audio model that leverages a transformer architecture. Bark can generate realistic, multilingual speech as well as other audio types, including background noise, music, and nonverbal communication such as laughing, sighing, and crying. Bark is a generative AI model developed for research purposes. Hence, Suno warns its users that output may deviate unexpectedly from the prompt.

eSpeak:

eSpeak is an open-source Speech Synthesis software started by Jonathan Duddington in 1995. It runs on Windows and Linux and has been ported to macOS. eSpeak supports over 100 languages, dialects, and accents.

MaryTTS:

MaryTTS is a Java-based open-source Text-to-Speech software with a customizable HTML interface. MaryTTS started as a joint project of DFKI (Deutsches Forschungszentrum für Künstliche Intelligenz - German Research Center for Artificial Intelligence) and the Institute of Phonetics at Saarland University. The Multimodal Speech Processing Group in the Cluster of Excellence MMCI and DFKI continue to maintain it.

Mozilla TTS:

Mozilla TTS comes with pre-trained TTS models and voice samples. It has become popular among developers since its initial release. It’s built with Python. However, since Mozilla stopped maintaining it, Mozilla TTS only supports Python 3.6 and later versions older than 3.9.
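The version window described above can be checked before attempting an installation. The following is a minimal sketch that assumes the supported range is Python 3.6 up to, but not including, 3.9. The function name `is_supported_python` and the constants are illustrative and are not part of Mozilla TTS.

```python
import sys

# Assumed window taken from the text above: 3.6 <= version < 3.9.
MIN_VERSION = (3, 6)
MAX_VERSION_EXCLUSIVE = (3, 9)


def is_supported_python(version=None):
    """Return True if the (major, minor) version is inside the assumed window."""
    if version is None:
        version = sys.version_info
    major_minor = tuple(version[:2])
    return MIN_VERSION <= major_minor < MAX_VERSION_EXCLUSIVE


if __name__ == "__main__":
    current = "{}.{}".format(*sys.version_info[:2])
    if is_supported_python():
        print("Python " + current + " is inside the assumed window.")
    else:
        print("Python " + current + " is outside the assumed window; "
              "a separate virtual environment may be needed.")
```

Running the script on a newer interpreter reports that the version is outside the window, which is a cue to create a dedicated environment with an older Python.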

Pico-TTS:

Pico-TTS is a Text-to-Speech voice synthesizer in the Android Open Source Project (AOSP). Developers and enterprises choose Pico-TTS over alternatives due to its small size and on-device deployment option, although it has started to show its age.

TorToiSe TTS:

TorToiSe TTS is a Voice Cloning and Generation software developed by James Betker. It leverages autoregressive and diffusion decoders, generating high-quality results. It can adapt to any speaking style provided after listening to a few minutes of audio. However, the same decoder architecture makes TorToiSe TTS slow for many applications, such as streaming ones and ones that don’t run on specialized hardware such as GPUs.
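One common way to judge whether a synthesizer is fast enough for streaming is the real-time factor: the time spent synthesizing divided by the duration of the audio produced. The sketch below is illustrative only. The function names are hypothetical, the timings in the example are made-up placeholders rather than measurements of any engine, and a factor below 1 is treated as a necessary, not sufficient, condition for streaming.

```python
def real_time_factor(synthesis_seconds, audio_seconds):
    """Return synthesis time divided by the duration of the generated audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def keeps_up_with_playback(synthesis_seconds, audio_seconds):
    """True when audio is produced faster than it would be played back."""
    return real_time_factor(synthesis_seconds, audio_seconds) < 1.0


if __name__ == "__main__":
    # Hypothetical timings: 30 s of compute for 10 s of speech.
    rtf = real_time_factor(30.0, 10.0)
    print("real-time factor:", rtf)
    print("keeps up with playback:", keeps_up_with_playback(30.0, 10.0))
```

A streaming application also cares about the delay before the first audio chunk arrives, which this ratio does not capture.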

Commercial Text-to-Speech Software

Amazon Polly:

Amazon Polly Text-to-Speech allows developers to generate voice from text in different languages and customize it by adjusting the speaking style, speech rate, or pitch. Amazon Polly Text-to-Speech is a cloud API that processes text input in the cloud and transmits audio output to users’ devices. It has two offerings: Standard TTS and Neural TTS. Polly Standard TTS leverages concatenative synthesis, whereas Neural TTS leverages neural networks, resulting in more natural and human-like voices.
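Adjustments such as speech rate and pitch are commonly expressed with SSML, the W3C Speech Synthesis Markup Language, through its `prosody` element. The sketch below only builds the markup string and does not call any cloud API. The helper name `build_prosody_ssml` is hypothetical, and which attributes and values a given engine or voice honors is an assumption to verify against the provider's SSML documentation.

```python
from xml.sax.saxutils import escape, quoteattr


def build_prosody_ssml(text, rate="medium", pitch="medium"):
    """Wrap text in an SSML <prosody> element with rate and pitch attributes."""
    return (
        "<speak>"
        "<prosody rate={rate} pitch={pitch}>{body}</prosody>"
        "</speak>"
    ).format(
        rate=quoteattr(rate),    # adds the surrounding quotes and escapes the value
        pitch=quoteattr(pitch),
        body=escape(text),       # escapes &, < and > in the spoken text
    )


if __name__ == "__main__":
    print(build_prosody_ssml("Tom & Jerry are back.", rate="slow", pitch="low"))
```

The resulting string would typically be passed as the SSML input of a synthesis request instead of plain text.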

Google Cloud Text-to-Speech:

Google Cloud Text-to-Speech leverages DeepMind's speech synthesis expertise and offers developers hundreds of voices across languages, dialects, and accents. It offers extra features, such as custom voices, to customers who engage with the sales team. Like Amazon Polly, Google Cloud Text-to-Speech is a cloud API with standard and premium offerings at different price points.

Descript Overdub:

Descript, a video and audio editing platform, acquired Montreal-based speech synthesis startup Lyrebird in 2019. Overdub leverages Lyrebird’s technology, allowing users to clone their own voice or use pre-recorded stock voices. Descript focuses on a complete solution, allowing users to write, record, edit, and share content. Overdub is a part of this solution.

Eleven Labs:

Eleven Labs was founded by former Google and Palantir employees in 2022. Despite being relatively new, it grabbed the attention of the media and entertainment industry. Eleven Labs released real-time Text-to-Speech in August 2023 and currently supports 28 languages across various accents. Eleven Labs offers a free plan, but it doesn’t include a commercial license.

IBM Watson Text-to-Speech:

IBM Watson offers speech synthesis in multiple languages, with the option to create a branded voice at an additional cost, similar to other cloud providers such as Google Cloud Text-to-Speech. The default usage model for IBM Watson Text-to-Speech is through an API call. On-prem deployment and data protection require an engagement with the enterprise sales team.

Murf AI:

Murf offers a text-to-speech API similar to the alternatives named here. However, like Descript, Murf offers extra features such as adding images, videos, presentations, and audio files, making its solution appealing to content creators.

Microsoft Azure Text-to-Speech:

Microsoft offers Text-to-Speech under its Azure AI Speech services, allowing developers to build applications with lifelike synthesized speech with intonation and emotion. Azure Text-to-Speech has two offerings: Neural Text-to-Speech and Custom Neural Text-to-Speech. Microsoft charges extra for training custom voice models and for synthesizing speech with them. Microsoft offers embedded text-to-speech deployment for mobile and desktop applications to selected customers.

Check out our article that reviews the TTS Python APIs of Amazon Polly, Google Cloud Text-to-Speech, and Microsoft Azure Text-to-Speech to decide which one is easier to use!

ReadSpeaker:

ReadSpeaker has offered Text-to-Speech software to enterprises for over two decades. It trains custom voices and supports 35 languages across various accents. ReadSpeaker allows local deployment. However, it doesn’t offer a free plan or trial for its SDKs or APIs.

Resemble AI:

Resemble AI creates custom AI voices leveraging Text-to-Speech and speech-to-speech technologies. Resemble AI offers over 200,00 unique voices and premium features such as emotion control, localization, real-time generation, on-prem, and mobile deployment options. Some of these features are only available for enterprise customers.

Picovoice Orca Text-to-Speech:

Picovoice’s Orca Text-to-Speech, similar to the alternatives, is a Voice Generator that converts written text into spoken audio output. Orca Text-to-Speech offers flexible deployment options, including on-prem and on-device. Enterprises can access premium features such as voice tuning, custom voices, custom vocabulary, and emotion control through white-glove services.
