Choosing the best audio transcription software is difficult, first and foremost because it is used for many different use cases and needs. Although every voice vendor claims to have “the best” software, no single product suits every need, so there is currently no single “best” speech-to-text software.
After the positive reaction that Picovoice’s open-source speech-to-text benchmark received, we first listed the top seven factors to consider when selecting speech-to-text software. Now, we have evaluated the top FOSS (free and open-source software) and enterprise-grade audio transcription engines transparently. This article covers DeepSpeech, Kaldi, SpeechBrain, Vosk, Coqui, Amazon Transcribe, Google Speech-to-Text, Microsoft Azure Speech-to-Text, Nuance, IBM Watson Speech-to-Text and Picovoice’s own Leopard Speech-to-Text.
Top free & open-source speech-to-text software:
DeepSpeech: Although Mozilla has stopped maintaining DeepSpeech, it is still one of the most popular free and open-source Speech-to-Text engines. It is based on Baidu's Deep Speech and implemented in TensorFlow. DeepSpeech offers reasonably high accuracy and is easy to train on your own data.
Kaldi: Kaldi is one of the oldest free and open-source speech recognition engines and is popular especially among researchers and scientists. Although Kaldi does not leverage the latest deep learning advances the way DeepSpeech does, its relatively good out-of-the-box accuracy and strong community have led various enterprises to use it as well.
SpeechBrain: SpeechBrain is a PyTorch-based transcription toolkit that offers tight integration with HuggingFace. The platform is currently in Beta and sponsored by large companies such as Nuance, NVIDIA and Samsung.
Vosk: Vosk is a free and open-source offline speech recognition API for mobile devices, Raspberry Pi and servers. It works with Python, Java, C# and Node, supports 20+ languages and achieves model sizes as small as 50 MB.
Coqui: Coqui was founded by former Mozilla DeepSpeech engineers. Coqui’s deep learning-based Speech-to-Text (STT) engine supports various pre-trained language models with the help of its community. With TensorFlow Lite, Coqui reduced the English model size to 47 MB to make it mobile and embedded friendly.
Top enterprise-grade audio transcription software:
Amazon Transcribe: Amazon Transcribe is a popular Speech-to-Text engine with high accuracy and a wide range of features. However, developers without an existing AWS (Amazon Web Services) account may struggle to get started. Amazon Transcribe offers a custom Speech-to-Text API for the healthcare industry. The first hour of transcription each month is free for the first year of use; beyond the free allowance, it charges $1.44 per hour.
Google Speech-to-Text: Google Speech-to-Text is another popular audio transcription engine with multi-language support and a variety of features. As with AWS, getting started with Google Speech-to-Text without an existing GCP (Google Cloud Platform) account might be complex. GCP offers a $300 credit for the first 6 months. Once the credit is used, the cost of transcription alone may go up to $2.16 per hour if one opts out of data logging (i.e., doesn’t allow Google to record audio data sent to Speech-to-Text).
Microsoft Azure Speech-to-Text: Microsoft Azure is a very accurate Speech-to-Text engine with customizable models, multi-language support and a variety of features. It offers five hours of free transcription per month. However, getting started with Microsoft Azure might be even more difficult than with GCP or AWS for those who are not familiar with the platform.
Nuance: Nuance is one of the oldest and best-known vendors in the speech recognition industry. Its Dragon dictation software has been widely used by individuals and enterprises. Nuance does not offer a free trial or a free tier for its Speech-to-Text API. Nuance Dragon Anywhere, an application for end-users, can be tried for free for a week. On this list, Nuance is the only vendor that doesn’t share its technology with developers.
IBM Watson Speech-to-Text: Like other cloud providers, IBM Watson Speech-to-Text comes with multi-language support and various features. Despite its lower accuracy, IBM Watson Speech-to-Text is one of the most flexible solutions for adapting generic speech-to-text models. Plus, it is easy to use and integrate if one is already using IBM Cloud services.
Picovoice Leopard Speech-to-Text: Leopard converts speech to text locally on the platform of choice, without sending data to a third-party cloud. This platform could be a web browser, mobile application, single-board computer (e.g. Raspberry Pi) or a server. As a result, Leopard offers accurate, reliable and private transcription experiences while cutting voice AI costs by 10 to 100x.
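To put the per-hour prices quoted above in perspective, here is a minimal Python sketch that estimates a monthly bill from a flat hourly rate and a free monthly allowance. The function name `monthly_cost` and the 100-hour workload are hypothetical and chosen only for illustration. The rates are the ones stated in this article for Amazon Transcribe and Google Speech-to-Text; actual vendor pricing may be tiered or may have changed, so treat the result as a rough estimate.

```python
def monthly_cost(hours: float, rate_per_hour: float, free_hours: float = 0.0) -> float:
    """Estimate a monthly transcription bill in USD.

    Assumes a flat per-hour rate and a free allowance that resets monthly.
    This is a simplification, not any vendor's actual billing logic.
    """
    if hours < 0 or rate_per_hour < 0 or free_hours < 0:
        raise ValueError("hours, rate_per_hour and free_hours must be non-negative")
    billable_hours = max(hours - free_hours, 0.0)
    return round(billable_hours * rate_per_hour, 2)


# Hypothetical workload: 100 hours of audio per month.
HOURS_PER_MONTH = 100

# Rates as quoted in this article (USD per hour of audio).
amazon = monthly_cost(HOURS_PER_MONTH, rate_per_hour=1.44, free_hours=1)  # first-year free hour
google = monthly_cost(HOURS_PER_MONTH, rate_per_hour=2.16)  # data logging opted out, credit used up

print(f"Amazon Transcribe:     ${amazon:.2f} per month")
print(f"Google Speech-to-Text: ${google:.2f} per month")
```

The same function can be reused for any engine with a known hourly rate and free allowance by changing the two arguments.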
Compare the accuracy of Amazon Transcribe, Google Speech-to-Text, IBM Watson, Microsoft Azure and Picovoice Leopard with the engine of your choice against open data sets: LibriSpeech, TED-LIUM and Common Voice in minutes!
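For readers who want to sanity-check accuracy numbers on their own recordings, the sketch below computes word error rate (WER), a metric commonly used to compare speech-to-text engines. This article does not state which metric the benchmark uses, so WER is an assumption here, and the function name `word_error_rate` and the simple lowercase-and-split normalization are illustrative choices rather than the benchmark's actual implementation.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return the word-level edit distance divided by the reference length.

    Normalization is deliberately simple (lowercase, whitespace split);
    real benchmarks usually also strip punctuation and normalize numbers.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Illustrative transcripts: one substituted word out of three.
print(word_error_rate("the cat sat", "the cat sit"))
```

A lower WER means a more accurate transcript. Because insertions count as errors, the value can exceed 1.0 when the hypothesis is much longer than the reference.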