Evaluating audio transcription engines

June 17, 2022
To learn more about which voice recognition technology to use, read this strategy guide for voice applications

Recent advances in deep learning have made voice technology more accurate, accessible, and affordable. However, this abundance of options has also made choosing the best voice recognition solution more difficult, especially for Speech-to-Text (STT). Speech-to-Text offers tremendous benefits to enterprises across several use cases: transcription, dictation (voice typing), closed captions, subtitles, and analytics solutions (keyword detection, topic detection, automatic summarization, content moderation, sentiment analysis, etc.). Even well-known voice assistants such as Alexa and Siri use Speech-to-Text engines downstream of a wake word engine.

Measuring the accuracy of Speech-to-Text Engines:

Build your own Automatic Speech Recognition (ASR) comparison tool with Picovoice’s open-source benchmark

Word Error Rate (WER) is the most commonly used metric for the accuracy of Automatic Speech Recognition software. WER is the ratio of the word-level edit distance between a reference transcript and the output of the Speech-to-Text engine to the number of words in the reference transcript. In other words, it is the percentage of errors an Automatic Speech Recognition engine makes relative to an error-free human transcription.
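As a concrete illustration, WER can be computed as the word-level Levenshtein distance between the reference and the hypothesis, divided by the reference length. Below is a minimal sketch in Python; production benchmarks typically also normalize punctuation and casing before comparing:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER: word-level edit distance divided by the reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()

    # Dynamic-programming table for the Levenshtein distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,               # deletion
                d[i][j - 1] + 1,               # insertion
                d[i - 1][j - 1] + substitution # substitution or match
            )
    return d[len(ref)][len(hyp)] / len(ref)


# Using the "arthritis" mis-transcription example from this article:
print(word_error_rate("the patient has arthritis",
                      "the patient has off right his"))  # → 0.75
```

A single mis-heard word can thus inflate WER by several error operations at once, which is why jargon-heavy audio benefits so much from customization.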

Accuracy can be boosted even further with customization. Out-of-the-box Speech-to-Text engines mostly struggle with industry-specific jargon, special names or homophones. For example, “arthritis” can be transcribed as “off right his” if the Speech-to-Text model is not customized accordingly.

The best approach to evaluating Speech-to-Text software is to test with real users and datasets in a real environment. Although almost every vendor claims to offer “the best” and “the most accurate” Speech-to-Text software, the accuracy of the software depends on the data. To give enterprises a head start, Picovoice shares an open-source, open-data, reproducible benchmark for Amazon Transcribe, Google Speech-to-Text, IBM Watson, Microsoft Azure and Mozilla DeepSpeech. Anyone can use the existing dataset or replace it with their own data to see how the various engines perform.

Choosing Speech-to-Text (STT) Engine features:

Automatic Speech Recognition (ASR) software enables machines to understand what humans say. All Speech-to-Text (STT) engines convert voice to text, but supported input languages, audio and video formats, sample rates, and whether transcription is batch or real-time vary across engines. Some Speech-to-Text engines also offer enhanced output, such as speaker diarization, automatic punctuation and truecasing, filler-word removal, and profanity filtering, to make transcripts easier to read. Additionally, Speech-to-Text engines provide word-level confidence scores and alternatives that enable further analysis. Certain use cases, such as contact centers, may require dual-channel transcription or custom spelling. Not all features are required for all use cases: choosing the best Speech-to-Text engine starts with understanding business requirements, then working with a vendor that lets you add voice features on your terms.
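Word-level output like this can be put to work directly. The sketch below uses a hypothetical result structure (each real engine returns word timings and confidences in its own shape) to flag low-confidence words for human review:

```python
from dataclasses import dataclass


@dataclass
class Word:
    """Illustrative word-level result; field names are assumptions, not a real engine's schema."""
    text: str
    confidence: float  # 0.0–1.0, as most engines report it
    start_sec: float
    end_sec: float


def low_confidence_words(words, threshold=0.8):
    """Return words below the confidence threshold, e.g. to route to a human reviewer."""
    return [w for w in words if w.confidence < threshold]


# Hypothetical engine output for "the patient has arthritis".
words = [
    Word("the", 0.99, 0.0, 0.2),
    Word("patient", 0.97, 0.2, 0.8),
    Word("has", 0.95, 0.8, 1.0),
    Word("arthritis", 0.52, 1.0, 1.9),
]

flagged = low_confidence_words(words)
print([w.text for w in flagged])  # → ['arthritis']
```

The same per-word timestamps also power features such as subtitle alignment and keyword detection.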

Why technical support matters for Speech-to-Text Engines:

Free Tier users get GitHub community support from Picovoice

Humans hope for the best and prepare for the worst. When the worst happens, technical support makes a difference. Most Automatic Speech Recognition (ASR) vendors provide technical support as part of the user license agreement, while some charge an additional service fee as a fraction of the contract size. Availability of support becomes an important criterion, especially for free and open-source Speech-to-Text software. Depending on the use case, the opportunity cost of a disrupted service due to lacking or limited support can be much higher than the cost of a software license.

Building with developer-friendly Automatic Speech Recognition (ASR) documentation:

As in any software development process, developers are the most important and most expensive resource when integrating Speech-to-Text (STT) engines. Selecting a developer-friendly Speech-to-Text (STT) engine with easy-to-follow docs improves developer productivity and shortens time-to-market. On the other hand, for developers with expertise in speech recognition and familiarity with the various tools, ease of use might not be a consideration.

Considering the impact of the cloud on reliability:

Besides the performance of the engine itself, service availability (or downtime) and latency significantly affect the reliability of any voice product built on cloud-dependent Speech-to-Text engines. While availability might not be a problem for batch transcription of archived audio, it can significantly degrade analytics solutions that make recommendations based on real-time transcription. Consumers may tolerate waiting for Alexa to play the next song, but enterprises with mission-critical applications are far less likely to tolerate AWS outages, which can occur several times a month.
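One common mitigation for transient cloud hiccups is retrying the request with exponential backoff before failing over or surfacing an error. A minimal sketch, with a simulated flaky endpoint standing in for a real cloud STT call:

```python
import time


def transcribe_with_retry(transcribe, audio, max_attempts=3, base_delay=0.01):
    """Call a (possibly flaky) STT function, retrying with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return transcribe(audio)
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of attempts; let the caller fail over
            time.sleep(base_delay * (2 ** attempt))  # back off: 0.01s, 0.02s, ...


# Simulated flaky cloud endpoint: fails twice, then succeeds.
calls = {"n": 0}

def flaky_cloud_stt(audio):
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("service unavailable")
    return "hello world"


print(transcribe_with_retry(flaky_cloud_stt, b"..."))  # → hello world
```

Retries help with brief glitches, but they add latency on every failure, which is exactly why real-time use cases are more exposed to outages than batch jobs.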

How to build private transcription products:

On-device voice processing has inherent advantages over cloud-dependent voice recognition. Learn more about Edge Voice AI

The recent Otter.ai incident proved that cloud-dependent audio transcription is not completely private. Cloud dependency becomes especially risky when transcribing audio that contains personal information or trade secrets, or when building applications in highly regulated verticals such as healthcare (HIPAA) or regions such as Europe (GDPR). To ensure data privacy, on-device Speech-to-Text engines should be preferred over cloud-based ASRs.

How to evaluate the cost of a Speech-to-Text Engine:

Check out Picovoice’s Speech-to-Text engines, which cost $0.1/hour with the Starter Tier

Out-of-the-box Automatic Speech Recognition (ASR) engines are generally more accurate and easier to customize and integrate, and they come with technical support. Compared to free and open-source speech recognition alternatives, these benefits seem favourable, but they come at a cost.

Open-source Speech-to-Text engines such as Mozilla DeepSpeech are free to use, while the cost of transcription with cloud providers such as Amazon Transcribe or Google Speech-to-Text starts at $1.44/hour. Depending on the volume, the cost of enterprise-grade solutions can be significant. Moreover, cloud-related costs such as storage should be included when calculating the total cost of ownership. Open-source ASR software, on the other hand, does not come with enterprise-grade quality or support. The engineering cost of customizing open-source models and the opportunity cost of downtime can far exceed the purchase price of a Speech-to-Text engine.
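The arithmetic is simple but worth making explicit. A back-of-the-envelope comparison using the list prices mentioned in this article (actual bills also include storage, data transfer, and free-tier nuances, so treat these as illustrative):

```python
def monthly_cost(hours, rate_per_hour, free_hours=0.0):
    """Transcription cost for one month, net of any free-tier hours."""
    return max(hours - free_hours, 0.0) * rate_per_hour


hours = 1_000  # audio hours transcribed per month (assumed workload)

cloud = monthly_cost(hours, 1.44)      # cloud STT at $1.44/hour list price
picovoice = monthly_cost(hours, 0.10)  # Picovoice Starter Tier at $0.1/hour

print(cloud, picovoice)  # → 1440.0 100.0
```

At volume, the per-hour rate dominates the total cost of ownership, which is why a realistic monthly usage estimate should come before any vendor comparison.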

Evaluating top speech recognition engines

Top free & open-source Speech-to-Text Engines:

Compare DeepSpeech and Leopard Model Sizes and RTF

Coqui: Coqui was founded by former Mozilla DeepSpeech engineers. Coqui’s deep learning-based Speech-to-Text (STT) engines support various pre-trained language models with the help of its community. Using TensorFlow Lite, Coqui reduced the English model size to 47 MB to make it mobile- and embedded-friendly.

DeepSpeech: Although Mozilla stopped maintaining DeepSpeech, it remains one of the most popular free and open-source Speech-to-Text engines. It is based on Baidu’s Deep Speech research and implemented using TensorFlow. DeepSpeech offers reasonably high accuracy and is easy to train with your own data.

Kaldi: Kaldi is one of the oldest free and open-source speech recognition engines and remains popular, especially among researchers and scientists. Although Kaldi does not leverage the latest deep learning advances the way DeepSpeech does, its relatively good out-of-the-box accuracy and strong community have led various enterprises to adopt it as well.

SpeechBrain: SpeechBrain is a PyTorch-based transcription toolkit that offers tight integration with HuggingFace. The platform is currently in Beta and sponsored by large companies such as Nuance, NVIDIA and Samsung.

Vosk: Vosk is a free and open-source offline speech recognition API for mobile devices, Raspberry Pi, and servers. It offers bindings for Python, Java, C#, and Node, supports 20+ languages, and achieves model sizes as small as 50 MB.

Top enterprise-grade Speech-to-Text Engines:

Amazon Transcribe: Amazon Transcribe is a popular Speech-to-Text engine with high accuracy and a wide range of features. However, developers without an existing AWS (Amazon Web Services) account may struggle to get started. Amazon Transcribe offers a custom Speech-to-Text API for the healthcare industry, includes the first hour of transcription free every month for the first year of use, and then charges $1.44 per hour.

Google Speech-to-Text: Google Speech-to-Text is another popular audio transcription engine with multi-language support and rich features. As with AWS, getting started with Google Speech-to-Text without an existing GCP (Google Cloud Platform) account might be complex. GCP offers $300 in credit for the first 6 months. After the credit is exhausted, the cost of transcription alone can go up to $2.16 per hour if one opts out of data logging (i.e., doesn’t allow Google to keep the audio data sent to Speech-to-Text).

Microsoft Azure: Microsoft Azure offers a very accurate Speech-to-Text engine with the flexibility to customize models, multi-language support, and various features. It offers five hours of free transcription per month. However, getting started with Microsoft Azure might be even more difficult than with GCP and AWS if one is not familiar with it.

Nuance: Nuance is one of the oldest and best-known vendors in the speech recognition industry. Its Dragon dictation software is widely used by individuals and enterprises. Nuance does not offer a free trial or free tier for its Speech-to-Text API. Nuance Dragon Anywhere, an application for end-users, can be tried free for a week.

IBM Watson Speech-to-Text: Like other cloud providers, IBM Watson Speech-to-Text also comes with multi-language support and various features. It offers ease of use and integration if one is already using IBM Cloud services.

Compare the accuracy of AWS Transcribe, Google Speech-to-Text, Microsoft Azure, IBM Watson, Mozilla DeepSpeech and Picovoice Leopard with the engine of your choice against open datasets: LibriSpeech, TED-LIUM and Common Voice in minutes!