Choosing the best audio transcription software is challenging. Although all audio transcription software fundamentally converts speech to text, it serves different use cases and needs. The factors to consider when selecting a transcription engine change from one use case to another. Enterprises face an additional challenge: deciding whether to buy, build, or adopt open-source software. Although every voice AI vendor claims to have “the best” software, no single product suits every need.

This article covers DeepSpeech, Whisper, Kaldi, SpeechBrain, Vosk, Coqui, Amazon Transcribe, Google Speech-to-Text, Microsoft Azure Speech-to-Text, Nuance, IBM Watson Speech-to-Text and Picovoice’s own Leopard Speech-to-Text.

Free & open-source Speech-to-Text software:

DeepSpeech: Although Mozilla stopped maintaining DeepSpeech, it is still one of the most popular free and open-source Speech-to-Text engines. It is based on Baidu DeepSpeech and implemented in TensorFlow. DeepSpeech offers reasonably high accuracy and is easy to train on your own data.

Whisper: In September 2022, OpenAI announced Whisper, an open-source Speech-to-Text engine that processes voice data on-device. The number of languages it supports and its additional translation capabilities grabbed developers’ attention. Whisper comes in five models of varying sizes and capabilities. It offers on-device speech recognition but lacks real-time transcription and speaker diarization. Fine-tuning and maintaining Whisper at scale for enterprise applications requires an in-house team. Given that its primary intended users are AI researchers, enterprises should determine the cost of using Whisper before starting to build.

Did you know that you can use Falcon with Whisper to make your transcriptions readable with speaker diarization, or use Cheetah Streaming Speech-to-Text for real-time transcription?

Kaldi: Kaldi is one of the oldest free and open-source speech recognition engines and remains popular, especially among researchers and scientists. Although Kaldi is not leveraging the latest deep learning advances, like DeepSpeech, some enterprises still use it for its relatively good out-of-the-box accuracy and strong community.

Did you know that Alexa also uses Kaldi? But not for wake word detection. Learn why one shouldn’t use an ASR for wake word detection even if it runs on-device.

SpeechBrain: SpeechBrain is a PyTorch-based transcription toolkit that offers tight integration with HuggingFace. The platform is currently in Beta and sponsored by large companies such as Nuance, NVIDIA, and Samsung.

Vosk: Vosk is a free and open-source offline speech recognition API for mobile devices, Raspberry Pi, and servers. It works with Python, Java, C# and Node, supports 20+ languages, and offers models as small as 50 MB.

Coqui: Coqui was founded by former Mozilla DeepSpeech engineers. Coqui’s deep learning-based Speech-to-Text (STT) engine supports various pre-trained language models contributed with the help of its community. A few months after Coqui stopped maintaining the speech-to-text library, it ended its operations fully in the first days of 2024. Going forward, enterprises may struggle to get support or benefit from the latest advances in voice AI.

Note: Free and open-source engines are ranked based on their GitHub stars.

Enterprises choose open-source models because they are free to acquire and can run on-prem without sending voice data to third-party servers. Yet they have to invest in training and maintaining the models, which can be more expensive than buying. That makes buy vs. open-source vs. build a critical decision.

Speech-to-Text APIs

Amazon Transcribe: Amazon Transcribe is a popular Speech-to-Text engine with high accuracy and a wide range of features. Developers with limited AWS (Amazon Web Services) experience may not be able to get started immediately. Amazon Transcribe offers a custom Speech-to-Text API for the healthcare industry. The first hour of transcription is free every month for the first year of use; after that, Amazon charges $1.44 per hour.

Did you know that you can build a serverless Speech-to-Text Engine with AWS Lambda and Picovoice Leopard?

Google Speech-to-Text: Google Speech-to-Text is another popular audio transcription engine with multi-language support and a wide range of features. As with AWS, getting started with Google Speech-to-Text without an existing GCP (Google Cloud Platform) account might be complex. GCP offers $300 in credit for the first six months. Once the credit runs out, transcription alone may cost up to $2.16 per hour if one opts out of data logging. In other words, Google charges extra when it is not allowed to use the audio sent to its cloud.
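As a rough illustration of how the per-hour rates quoted above translate into a monthly bill, here is a minimal Python sketch. The function name `monthly_cost` and the 100-hour workload are hypothetical. The rates are the ones cited in this article, the Google figure is an upper bound, and the sketch ignores the $300 GCP credit and any volume tiers.

```python
def monthly_cost(hours, rate_per_hour, free_hours=0.0):
    """Estimate a monthly transcription bill in USD (hypothetical helper)."""
    # Only hours beyond the free allowance are billed.
    billable_hours = max(hours - free_hours, 0.0)
    return round(billable_hours * rate_per_hour, 2)


# Rates as quoted in this article; actual pricing may differ or change.
AMAZON_RATE = 1.44  # USD per hour; first hour free each month in year one
GOOGLE_RATE = 2.16  # USD per hour; upper bound with data logging opted out

HOURS_PER_MONTH = 100  # assumed workload, for illustration only

amazon = monthly_cost(HOURS_PER_MONTH, AMAZON_RATE, free_hours=1)
google = monthly_cost(HOURS_PER_MONTH, GOOGLE_RATE)

print(f"Amazon Transcribe: ${amazon:.2f}")
print(f"Google Speech-to-Text (upper bound): ${google:.2f}")
```

Under these assumptions, 100 hours a month comes to $142.56 on Amazon Transcribe (99 billable hours) and up to $216.00 on Google Speech-to-Text.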

You can also build a transcription microservice with gRPC and Picovoice Leopard.

Microsoft Azure Speech-to-Text: Microsoft Azure is a very accurate Speech-to-Text engine with customizable models, multi-language support, and a wide range of features. It offers five hours of free transcription per month. However, getting started with Microsoft Azure might be even more difficult than with GCP and AWS if one is not familiar with the platform.

Nuance: Nuance is now a Microsoft company. It is one of the oldest and best-known vendors in the speech recognition industry. Its Dragon dictation software has been widely used by individuals and enterprises. Nuance does not offer a free trial or a free tier for its Speech-to-Text API. Nuance Dragon Anywhere, the end-user application, can be tried for free for a week.

IBM Watson Speech-to-Text: Like other cloud providers, IBM Watson Speech-to-Text comes with multi-language support and various features. Despite its low accuracy, IBM Watson Speech-to-Text is one of the most flexible solutions for adapting generic speech-to-text models. It is also easy to use and integrate if one already uses IBM Cloud services.

Note: Enterprise-grade audio transcription software is listed in alphabetical order.

Most enterprises prefer Speech-to-Text APIs because vendors maintain the models and offer enterprise support. Despite significant drawbacks, such as latency and privacy concerns, enterprises choose them because the models are also easier to fine-tune and use. Picovoice Leopard Speech-to-Text combines the best of both worlds. Leopard converts speech to text locally on the preferred platform without sending data to a third-party cloud. This platform could be a web browser, mobile application, single-board computer (such as Raspberry Pi) or server. As a result, Leopard offers accurate, reliable, and private transcription while cutting voice AI costs by 10 to 100x.

Compare the accuracy of Amazon Transcribe, Google Speech-to-Text, IBM Watson, Microsoft Azure, and Picovoice Leopard, along with the engine of your choice, against open data sets (LibriSpeech, TED-LIUM, and Common Voice) in minutes!
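Accuracy in comparisons like this is commonly reported as word error rate (WER): the word-level edit distance between a reference transcript and the engine's output, divided by the number of reference words. The sketch below is a minimal, self-contained illustration and not necessarily how any particular benchmark computes it. The function name `word_error_rate` and the sample sentences are hypothetical, and real benchmarks usually apply more careful text normalization than lowercasing.

```python
def word_error_rate(reference, hypothesis):
    """Return WER: word-level edit distance / number of reference words."""
    ref_words = reference.lower().split()
    hyp_words = hypothesis.lower().split()
    if not ref_words:
        raise ValueError("reference transcript must contain at least one word")

    # previous_row[j] = edit distance between the reference words seen so far
    # and the first j hypothesis words.
    previous_row = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current_row = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            current_row.append(min(
                previous_row[j] + 1,                      # deletion
                current_row[j - 1] + 1,                   # insertion
                previous_row[j - 1] + substitution_cost,  # match or substitution
            ))
        previous_row = current_row

    return previous_row[-1] / len(ref_words)


# Hypothetical transcripts: one substituted word out of four.
print(word_error_rate("the quick brown fox", "the quick brown box"))  # 0.25
```

A lower WER means a more accurate engine. Running the same function over every engine's output for the same audio gives directly comparable numbers.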

If you’re also looking for NLU engines, check out our article on the top NLU engines.