Choosing the best audio transcription software is challenging. First and foremost, audio transcription software serves many different use cases and needs. Several factors matter when selecting a transcription engine, and they change from one use case to another. Another challenge for enterprises is deciding whether to buy, build or use open-source software. Although every voice vendor claims to have “the best” software, unfortunately, no single product suits every need.
The article covers DeepSpeech, Whisper, Kaldi, SpeechBrain, Vosk, Coqui, Amazon Transcribe, Google Speech-to-Text, Microsoft Azure Speech-to-Text, Nuance, IBM Watson Speech-to-Text and Picovoice’s own Leopard Speech-to-Text.
Top free & open-source speech-to-text software:
DeepSpeech: Although Mozilla stopped maintaining DeepSpeech, it is still one of the most popular free and open-source Speech-to-Text engines. It is based on Baidu's DeepSpeech and implemented using TensorFlow. DeepSpeech offers reasonably high accuracy and is easy to train with your own data.
Whisper: In September 2022, OpenAI announced Whisper, an open-source Speech-to-Text engine that processes voice data on-device. The number of languages it supports and its additional translation capabilities grabbed developers’ attention. Whisper has five different models of varying sizes and capabilities. While Whisper offers on-device speech recognition, running it at a large scale for enterprise applications requires a fast GPU and an in-house team to maintain, scale, update, and monitor the model. Given that its primary intended users are AI researchers, determining the cost of using Whisper for enterprise applications is crucial before starting to build.
Kaldi: Kaldi is one of the oldest free and open-source speech recognition toolkits and a popular engine, especially among researchers and scientists. Although Kaldi does not leverage the latest deep learning advances the way DeepSpeech does, some enterprises still use it, given its relatively good out-of-the-box accuracy and strong community.
SpeechBrain: SpeechBrain is a PyTorch-based transcription toolkit that offers tight integration with HuggingFace. The platform is currently in Beta and sponsored by large companies such as Nuance, NVIDIA and Samsung.
Vosk: Vosk is a free and open-source offline speech recognition API for mobile devices, Raspberry Pi and servers. It provides bindings for Python, Java, C# and Node, supports 20+ languages, and achieves model sizes as small as 50 MB.
Coqui: Coqui was founded by former Mozilla DeepSpeech engineers. Coqui’s deep learning-based Speech-to-Text (STT) engine supports various pre-trained language models with the help of its community. With TensorFlow Lite, Coqui reduced the English model size to 47 MB to make it mobile- and embedded-friendly.
Top enterprise-grade audio transcription software:
Amazon Transcribe: Amazon Transcribe is a popular Speech-to-Text engine with high accuracy and a wide range of features. Developers with limited AWS (Amazon Web Services) experience may not be able to get started immediately. Amazon Transcribe offers a custom Speech-to-Text API for the healthcare industry. The first hour of transcription is free every month for the first year of use; after that, it costs $1.44 per hour.
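To make the pricing above concrete, here is a minimal sketch of the arithmetic. It assumes a flat $1.44 per hour and one free hour per month during the first twelve months, exactly as quoted in this article; actual AWS billing may differ (for example, tiered or per-second pricing), and the function names are hypothetical.

```python
def transcribe_monthly_cost(hours, month, rate_per_hour=1.44,
                            free_hours=1.0, free_months=12):
    """Estimate the cost in USD of one month of transcription.

    `month` is 1-based and counted from the first month of use.
    The defaults are the figures quoted in the article, not official
    AWS billing rules.
    """
    if hours < 0 or month < 1:
        raise ValueError("hours must be >= 0 and month must be >= 1")
    billable = hours
    if month <= free_months:
        # The free allowance only applies during the first year.
        billable = max(0.0, hours - free_hours)
    return round(billable * rate_per_hour, 2)


def first_year_cost(hours_per_month):
    """Sum the estimated cost over the first twelve months of use."""
    total = sum(transcribe_monthly_cost(hours_per_month, month)
                for month in range(1, 13))
    return round(total, 2)


if __name__ == "__main__":
    print(transcribe_monthly_cost(10, month=1))   # month inside the free period
    print(transcribe_monthly_cost(10, month=13))  # month after the free period
    print(first_year_cost(10))
```

Under these assumptions, the free hour matters only for very light usage; at ten hours per month it removes a tenth of the bill during the first year.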
Did you know that you can build a serverless Speech-to-Text engine with AWS Lambda and Picovoice Leopard?
Google Speech-to-Text: Google Speech-to-Text is another popular audio transcription engine with multi-language support and a rich feature set. As with AWS, getting started with Google Speech-to-Text without an existing GCP (Google Cloud Platform) account might be complex. GCP offers a $300 credit for the first six months. Once the credit is used up, the cost of transcription alone may go up to $2.16 per hour if one opts out of data logging. In other words, Google charges extra for not using the audio sent to its cloud.
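As a rough illustration of how far the credit stretches, the sketch below estimates how many transcription hours and months it covers. It assumes the entire $300 is spent on transcription at the $2.16 per hour upper bound and expires after six months, as quoted above; the function name and parameters are hypothetical.

```python
def credit_runway(hours_per_month, credit_usd=300.0, rate_per_hour=2.16,
                  credit_valid_months=6):
    """Return (hours_covered, months_covered) for a promotional credit.

    Defaults are the figures quoted in the article. The credit is
    assumed to be spent on transcription only.
    """
    if hours_per_month <= 0:
        raise ValueError("hours_per_month must be positive")
    # Hours the credit could buy if it never expired.
    affordable_hours = credit_usd / rate_per_hour
    # Hours that can actually be used before the credit expires.
    usable_hours = min(affordable_hours,
                       hours_per_month * credit_valid_months)
    months_covered = usable_hours / hours_per_month
    return round(usable_hours, 2), round(months_covered, 2)


if __name__ == "__main__":
    print(credit_runway(50))  # heavy usage: the credit runs out first
    print(credit_runway(10))  # light usage: the credit expires first
```

At the quoted rate, the credit buys roughly 139 hours in total, so a team transcribing 50 hours per month would exhaust it in under three months.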
You can also build a transcription microservice with gRPC and Picovoice Leopard.
Microsoft Azure Speech-to-Text: Microsoft Azure is a very accurate Speech-to-Text engine with the flexibility to customize models, multi-language support and a wide range of features. It offers five hours of free transcription per month. However, getting started with Microsoft Azure might be even more difficult than with GCP and AWS for those unfamiliar with these platforms.
Nuance: Nuance is now a Microsoft company. It is one of the oldest and best-known vendors in the speech recognition industry. Its Dragon dictation software has been widely used by individuals and enterprises. Nuance does not offer a free trial or a free tier for its Speech-to-Text API. Nuance Dragon Anywhere, the end-user application, can be tried for free for a week.
IBM Watson Speech-to-Text: Like other cloud providers, IBM Watson Speech-to-Text comes with multi-language support and various features. Despite its low accuracy, IBM Watson Speech-to-Text is one of the most flexible solutions for adapting generic speech-to-text models. Plus, it is easy to use and integrate for those already on IBM Cloud services.
Picovoice Leopard Speech-to-Text: Leopard converts speech to text locally on the preferred platform without sending data to a 3rd party cloud. This platform could be a web browser, mobile application, single-board computer (Raspberry Pi) or server. As a result, Leopard offers accurate, reliable and private transcription experiences while cutting voice AI costs by 10 to 100x.
Compare the accuracy of Amazon Transcribe, Google Speech-to-Text, IBM Watson, Microsoft Azure and Picovoice Leopard with the engine of your choice against open data sets (LibriSpeech, TED-LIUM and Common Voice) in minutes!
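Accuracy comparisons of this kind are commonly reported as word error rate (WER): the number of word substitutions, deletions and insertions needed to turn the engine's transcript into the reference transcript, divided by the number of words in the reference. The article does not say which metric or tooling its comparison uses, so the following is only a minimal, self-contained sketch of the standard definition, with a hypothetical function name and simplified text normalization (lowercasing and whitespace splitting).

```python
def word_error_rate(reference, hypothesis):
    """Compute WER as the word-level edit distance divided by reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    reference = "the cat sat on the mat"
    hypothesis = "the cat sat on mat"
    print(f"WER: {word_error_rate(reference, hypothesis):.3f}")
```

A lower WER means a more accurate engine. Real benchmarks usually also normalize punctuation and numbers before scoring, which this sketch leaves out.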
If you’re also looking for NLU engines, check out our article on the top NLU engines.