Text-to-Speech (TTS), also called Speech Synthesis or Voice Generation, is the technology that converts written text into spoken words.
Text-to-Speech enables various applications, from voice assistants to language learning. It’s a popular technology with several options in the market. We gathered the most popular ones.
Open-source Text-to-Speech Software
Bark is a text-to-audio model from Suno that leverages a transformer architecture. Bark can generate realistic, multilingual speech, as well as other types of audio such as background noise, music, and nonverbal communication like laughing, sighing, and crying. Bark is a generative AI model developed for research purposes. Hence, Suno warns its users that output may deviate unexpectedly depending on the prompt.
Coqui was founded by former Mozilla engineers who used to work on MozillaTTS. It offers open-source libraries and production-grade TTS models, with premium features and API access under paid plans. Coqui has recently created its own license, the Coqui Public Model License (CPML), which allows only non-commercial use. Coqui open-sources the Mozilla data set, but not its proprietary models.
eSpeak is open-source Speech Synthesis software started by Jonathan Duddington in 1995. It is compatible with Windows and Linux and has been ported to macOS. eSpeak supports over 100 languages, dialects, and accents.
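As a rough illustration of how eSpeak is commonly driven from a script, the sketch below builds a command line for the `espeak` executable. It assumes the binary is installed and on the PATH; the helper names and default values are hypothetical. The `-v`, `-s`, and `-w` options select the voice, the speed in words per minute, and a WAV output file.

```python
import shutil
import subprocess


def build_espeak_command(text, voice="en", words_per_minute=160, wav_path=None):
    """Return the argument list for an eSpeak call (hypothetical helper)."""
    command = ["espeak", "-v", voice, "-s", str(words_per_minute)]
    if wav_path is not None:
        # -w writes the audio to a WAV file instead of playing it.
        command += ["-w", wav_path]
    command.append(text)
    return command


def speak(text, **options):
    """Run eSpeak if it is installed; return False when the binary is missing."""
    if shutil.which("espeak") is None:
        return False
    subprocess.run(build_espeak_command(text, **options), check=True)
    return True
```

Keeping the command construction separate from the call makes it possible to inspect or log the exact invocation on machines where eSpeak is not installed.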
MaryTTS is a Java-based open-source Text-to-Speech software with a customizable HTML interface. MaryTTS started as a joint project of DFKI (Deutsches Forschungszentrum für Künstliche Intelligenz, the German Research Center for Artificial Intelligence) and the Institute of Phonetics at Saarland University. The Multimodal Speech Processing Group in the Cluster of Excellence MMCI and DFKI continue to maintain it.
Mozilla TTS comes with pre-trained TTS models and voice samples. It has become popular among developers since its initial release. It’s built with Python. However, since Mozilla stopped maintaining it, MozillaTTS only supports Python 3.6 and later versions older than 3.9.
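Before installing MozillaTTS, it can help to confirm that the interpreter falls inside the supported range. The following is a minimal sketch; the function name is hypothetical, and it assumes the lower bound (3.6) is inclusive and the upper bound (3.9) is exclusive.

```python
import sys

# Assumed bounds, based on the range described above:
# 3.6 is treated as supported, 3.9 as the first unsupported version.
MIN_SUPPORTED = (3, 6)
FIRST_UNSUPPORTED = (3, 9)


def is_supported_python(version=None):
    """Return True if the (major, minor) version is inside the assumed range."""
    major_minor = tuple((version or sys.version_info)[:2])
    return MIN_SUPPORTED <= major_minor < FIRST_UNSUPPORTED


if __name__ == "__main__":
    if not is_supported_python():
        print("This Python version is outside the range assumed for MozillaTTS.")
```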
Pico-TTS is a Text-to-Speech voice synthesizer in the Android Open Source Project (AOSP). Developers and enterprises choose Pico-TTS over alternatives due to its small size and on-device deployment option, although it has started showing its age.
Tortoise TTS is a Voice Cloning and Generation software developed by James Betker. It leverages autoregressive and diffusion decoders, generating high-quality results. It can adapt to a given speaking style after listening to a few minutes of audio. However, the same architecture makes Tortoise TTS too slow for many applications, such as streaming, and for those that don’t run on specialized hardware such as GPUs.
Commercial Text-to-Speech Software
Amazon Polly Text-to-Speech allows developers to generate voice from text in different languages and customize it by adjusting the speaking style, speech rate, or pitch. Amazon Polly Text-to-Speech is a cloud API that processes text input in the cloud and transmits audio output to users’ devices. It has two offerings: Standard TTS and Neural TTS. Polly Standard TTS leverages concatenative synthesis, whereas Neural TTS leverages neural networks, resulting in more natural and human-like voices.
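Speech rate and pitch adjustments of this kind are usually expressed in SSML, the W3C markup for speech synthesis. The sketch below builds an SSML string with a `<prosody>` element and the parameters for the `synthesize_speech` call in `boto3`. The helper names, the default voice, and the default engine are illustrative assumptions, and which prosody attributes are honored may differ between the Standard and Neural engines.

```python
from xml.sax.saxutils import escape, quoteattr


def build_ssml(text, rate=None, pitch=None):
    """Wrap text in SSML, adding a <prosody> element when rate or pitch is set."""
    body = escape(text)  # escape &, <, > so the markup stays well-formed
    attributes = []
    if rate is not None:
        attributes.append("rate=" + quoteattr(rate))
    if pitch is not None:
        attributes.append("pitch=" + quoteattr(pitch))
    if attributes:
        body = "<prosody " + " ".join(attributes) + ">" + body + "</prosody>"
    return "<speak>" + body + "</speak>"


def build_polly_request(text, voice_id="Joanna", engine="neural", rate=None, pitch=None):
    """Assemble keyword arguments for synthesize_speech (hypothetical helper)."""
    return {
        "Text": build_ssml(text, rate, pitch),
        "TextType": "ssml",
        "OutputFormat": "mp3",
        "VoiceId": voice_id,
        "Engine": engine,  # "standard" or "neural"
    }


def synthesize(text, **options):
    """Send the request to Amazon Polly; needs boto3 and AWS credentials."""
    import boto3  # imported lazily so the helpers above work without it

    client = boto3.client("polly")
    response = client.synthesize_speech(**build_polly_request(text, **options))
    return response["AudioStream"].read()
```

For example, `build_polly_request("Hello", engine="standard", rate="slow")` prepares a request for the Standard engine with a slower speaking rate.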
Google Cloud leverages DeepMind's speech synthesis expertise and allows developers to create hundreds of voices across languages, dialects, and accents. Google Cloud Text-to-Speech offers extra features, such as custom voices, to customers who engage with its sales team. Like Amazon Polly, Google Text-to-Speech is a cloud API with standard and premium offerings at different price points.
Descript, a video and audio editing platform, acquired Montreal-based speech synthesis startup Lyrebird in 2019. Descript’s Overdub feature leverages Lyrebird’s technology, allowing users to clone their own voice or use pre-recorded stock voices. Descript focuses on a complete solution, allowing users to write, record, edit, and share content. Overdub is a part of this solution.
Eleven Labs was founded by former Google and Palantir employees in 2022. Despite being relatively new, it grabbed the attention of the media and entertainment industry. Eleven Labs released real-time Text-to-Speech in August 2023 and currently supports 28 languages across various accents. Eleven Labs offers a free plan, but it doesn’t include a commercial license.
IBM Watson offers speech synthesis in multiple languages, with the option to create a branded voice at an additional cost, similar to other cloud providers such as Google Text-to-Speech. The default usage model for IBM Watson Text-to-Speech is through an API call. On-prem deployment and data protection require an engagement with the enterprise sales team.
Murf AI offers a text-to-speech API similar to the alternatives named here. However, like Descript, Murf offers extra features such as adding images, videos, presentations, and audio files, making its solution appealing for content creators.
Microsoft offers Text-to-Speech under its Azure AI Speech services, allowing developers to build applications with lifelike synthesized speech with intonation and emotion. Azure Text-to-Speech has two offerings: Neural Text-to-Speech and Custom Neural Text-to-Speech. It charges extra for training custom voice models and for synthesis with them. Microsoft offers embedded text-to-speech deployment for mobile and desktop applications to selected customers.
Check out our article that reviews the TTS Python APIs of Amazon Polly, Google Text-to-Speech, and Microsoft Azure Text-to-Speech to decide which one is easier to use!
ReadSpeaker has offered Text-to-Speech software to enterprises for over two decades. It trains custom voices and supports 35 languages across various accents. ReadSpeaker allows local deployment. However, it doesn’t offer a free plan or trial for its SDKs or APIs.
Resemble AI creates custom AI voices leveraging Text-to-Speech and speech-to-speech technologies. Resemble AI offers over 200,00 unique voices and premium features such as emotion control, localization, real-time generation, and on-prem and mobile deployment options. Some of these features are only available for enterprise customers.
Picovoice’s Orca Text-to-Speech, similar to the alternatives, is a Voice Generator that converts written text into spoken audio output. Orca Text-to-Speech offers flexible deployment options, including on-prem and on-device, for Forever-Free plan users. Enterprise Plan users can enjoy premium features such as voice tuning, custom voices, custom vocabulary, and emotion control by engaging with Picovoice Consulting.