Choosing the best audio transcription software is challenging. First and foremost, transcription software serves many different use cases and needs. There are several factors to consider when selecting a transcription engine, and they vary from one use case to another. Another challenge for enterprises is deciding whether to buy a commercial product, build one in-house or adopt an open-source engine. Although every voice vendor claims to have “the best” software, no single product suits every need.
The article covers free and open-source engines as well as enterprise-grade options such as Microsoft Azure Speech-to-Text, IBM Watson Speech-to-Text and Picovoice’s own Leopard Speech-to-Text.
Top free & open-source speech-to-text software:
DeepSpeech: Although Mozilla stopped maintaining DeepSpeech, it is still one of the most popular free and open-source Speech-to-Text engines. It is based on Baidu’s Deep Speech research and implemented using TensorFlow. DeepSpeech offers reasonably high accuracy and is easy to train with your own data.
Whisper: In September 2022, OpenAI released Whisper, an open-source Speech-to-Text model that processes voice data on-device. The number of languages it supports and its additional translation capabilities grabbed developers’ attention. Whisper comes in five models of varying sizes and capabilities. While Whisper offers on-device speech recognition, running it at a large scale for enterprise applications requires a fast GPU and an in-house team to maintain, scale, update and monitor the model. Given that its primary intended users are AI researchers, determining the cost of using Whisper for enterprise applications is crucial before starting to build.
Kaldi is one of the oldest free and open-source speech recognition toolkits and remains popular, especially among researchers and scientists. Although Kaldi does not leverage the latest deep learning advances the way DeepSpeech does, some enterprises still use it for its relatively good out-of-the-box accuracy and strong community.
SpeechBrain is a PyTorch-based transcription toolkit that offers tight integration with HuggingFace. The platform is currently in Beta and sponsored by large companies such as Nuance, NVIDIA and Samsung.
Vosk is a free and open-source offline speech recognition API for mobile devices, Raspberry Pi and servers. It offers bindings for Python, Java, C# and Node, supports 20+ languages and has models as small as 50 MB.
Coqui was founded by former Mozilla DeepSpeech engineers. Coqui’s deep learning-based Speech-to-Text (STT) engine offers various pre-trained language models with the support of its community. With TensorFlow Lite, Coqui reduced the English model size to 47 MB to make it mobile and embedded friendly.
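To make the Whisper sizing point above more concrete, here is a minimal sketch that picks the largest of the five Whisper models that fits a given GPU memory budget. The parameter counts and VRAM figures are approximate values as listed in the Whisper README around its release and should be treated as assumptions; `largest_model_for` is a hypothetical helper written for this illustration, not part of Whisper.

```python
from typing import Optional

# (model name, parameters in millions, approximate VRAM needed in GB).
# Assumption: approximate figures from the Whisper README at release time.
WHISPER_MODELS = [
    ("tiny", 39, 1),
    ("base", 74, 1),
    ("small", 244, 2),
    ("medium", 769, 5),
    ("large", 1550, 10),
]


def largest_model_for(vram_gb: float) -> Optional[str]:
    """Return the largest model whose approximate VRAM need fits the budget.

    Returns None when even the smallest model does not fit.
    """
    fitting = [name for name, _params, need_gb in WHISPER_MODELS if need_gb <= vram_gb]
    # The table is ordered from smallest to largest, so the last fit is the largest.
    return fitting[-1] if fitting else None


if __name__ == "__main__":
    for budget in (0.5, 4, 16):
        print(f"{budget} GB -> {largest_model_for(budget)}")
```

A check like this only covers whether a model fits in memory. Throughput, latency and the engineering time needed to operate the model at scale still have to be estimated separately.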
Top enterprise-grade audio transcription software:
Amazon Transcribe: Amazon Transcribe is a popular Speech-to-Text engine with high accuracy and a wide range of features. Developers with limited AWS (Amazon Web Services) experience may not be able to get started immediately. Amazon Transcribe offers a custom Speech-to-Text API for the healthcare industry. The first hour of transcription is free every month for the first year of use; after that, it costs $1.44 per hour.
Did you know you can build a serverless Speech-to-Text engine with AWS Lambda and Picovoice Leopard?
Google Speech-to-Text: Google Speech-to-Text is another popular audio transcription engine with multi-language support and a variety of features. As with AWS, getting started with Google Speech-to-Text without an existing GCP (Google Cloud Platform) account might be complex. GCP offers $300 worth of credit for the first six months. After the credit runs out, the cost of transcription alone may go up to $2.16 per hour if one opts out of data logging. In other words, Google charges extra for not using the audio sent to its cloud.
You can also build a transcription microservice with gRPC and Picovoice Leopard.
Microsoft Azure Speech-to-Text: Microsoft Azure is a very accurate Speech-to-Text engine that offers customizable models, multi-language support and various features. It offers five hours of free transcription per month. However, getting started with Microsoft Azure might be even more difficult than with GCP or AWS for those not familiar with it.
Nuance is now a Microsoft company. It is one of the oldest and best-known vendors in the speech recognition industry. Its Dragon dictation software has been widely used by individuals and enterprises. Nuance does not offer a free trial or a free tier for its Speech-to-Text API. Nuance Dragon Anywhere, the end-user application, can be tried for free for a week.
IBM Watson Speech-to-Text: Like other cloud providers, IBM Watson Speech-to-Text comes with multi-language support and various features. Despite its lower accuracy, IBM Watson Speech-to-Text is one of the most flexible solutions for adapting generic speech-to-text models. Plus, it is easy to use and integrate if one already uses IBM Cloud services.
Picovoice Leopard Speech-to-Text: Leopard converts speech to text locally on the preferred platform without sending data to a 3rd party cloud. This platform could be a web browser, mobile application, single-board computer (e.g., Raspberry Pi) or server. As a result, Leopard offers accurate, reliable and private transcription while cutting voice AI costs by 10 to 100x.
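The per-hour prices quoted above for Amazon Transcribe ($1.44) and Google Speech-to-Text (up to $2.16) can be turned into a rough monthly estimate. The sketch below is only back-of-the-envelope arithmetic under stated assumptions: `monthly_cost` is a hypothetical helper, the 100-hour volume is made up for illustration, and one-off credits such as GCP’s $300 are not modelled.

```python
def monthly_cost(hours: float, rate_per_hour: float, free_hours: float = 0.0) -> float:
    """Estimate one month's transcription bill in USD.

    hours          total audio hours transcribed in the month
    rate_per_hour  list price per audio hour
    free_hours     monthly free allowance, if any
    """
    billable_hours = max(0.0, hours - free_hours)
    return round(billable_hours * rate_per_hour, 2)


if __name__ == "__main__":
    hours = 100  # hypothetical monthly volume
    # Amazon Transcribe: first hour free each month during the first year.
    print("Amazon Transcribe:", monthly_cost(hours, 1.44, free_hours=1))
    # Google Speech-to-Text: upper-bound rate with data logging opted out.
    print("Google Speech-to-Text:", monthly_cost(hours, 2.16))
```

Actual bills depend on each vendor’s current price list, billing increments and volume tiers, so the numbers are best used to compare orders of magnitude rather than to budget precisely.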
Compare the accuracy of Amazon Transcribe, Google Speech-to-Text, IBM Watson, Microsoft Azure and Picovoice Leopard with the engine of your choice against open data sets (LibriSpeech, TED-LIUM and Common Voice) in minutes!
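Accuracy comparisons like this are commonly reported as word error rate (WER): the number of word substitutions, deletions and insertions needed to turn the engine’s transcript into the reference, divided by the number of reference words. The article does not say which metric is used, so treat WER here as an assumption. The sketch below is a minimal, dependency-free implementation that only lowercases and splits on whitespace; real benchmarks usually normalize punctuation and numbers as well.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Compute WER as the word-level edit distance divided by reference length."""
    ref_words = reference.lower().split()
    hyp_words = hypothesis.lower().split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr

    return prev[-1] / len(ref_words)


if __name__ == "__main__":
    reference = "the cat sat on the mat"
    hypothesis = "the cat sat on mat"
    print(f"WER: {word_error_rate(reference, hypothesis):.3f}")
```

A lower WER means a more accurate transcript. Because insertions are counted, WER can exceed 1.0 when the hypothesis is much longer than the reference.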
If you’re also looking for NLU engines, check out our article on the top NLU engines.