Recent advances in deep learning have made voice technology more accurate, accessible, and affordable. However, with so many options now available, choosing the best voice recognition solution has also become more difficult, especially for Speech-to-Text (STT). Speech-to-Text offers tremendous benefits to enterprises across several use cases: transcription, dictation (voice typing), closed captions, subtitles, or building analytics solutions (keyword detection, topic detection, auto summarization, content moderation, sentiment analysis and more). Even well-known voice assistants such as Alexa and Siri use Speech-to-Text engines following a wake word engine.

The variety of use cases requires enterprises to evaluate Speech-to-Text (STT) solutions in line with their needs: accuracy, features, support, documentation, reliability, privacy & security, volume and cost. A Speech-to-Text engine that works very well for one company may not be a fit for another. Going back to the title of this article, “the best” speech-to-text is the one that responds to most, ideally all, of your needs.

We noted a need for fact-based and transparent tools to enable data-driven decision-making.

1. Accuracy

Word Error Rate (WER) is the most commonly used metric for the accuracy of Automatic Speech Recognition software. WER is the ratio of the word-level edit distance between a reference transcript and the output of the Speech-to-Text engine to the number of words in the reference transcript, where the edit distance counts the word substitutions, deletions and insertions needed to turn one into the other. It shows the percentage of errors in the transcript produced by the Automatic Speech Recognition software compared to an error-free human transcription.
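
The definition above can be sketched in a few lines of Python. This is a minimal illustration, not the scoring code of any particular benchmark: the function name `word_error_rate` is our own, and it assumes simple lowercasing and whitespace tokenization, whereas real evaluations usually also normalize punctuation and numbers.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    # Assumption: lowercase + whitespace split is enough normalization here.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words (classic dynamic programming).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from output
            insertion = curr[j - 1] + 1  # extra word in the output
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# 1 substitution + 2 insertions = 3 errors over 4 reference words = 0.75
print(word_error_rate("the patient has arthritis",
                      "the patient has off right his"))
```

Because insertions count as errors, WER can exceed 100% when the engine outputs many more words than the reference contains.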

Accuracy can be boosted even further with customization. Out-of-the-box Speech-to-Text engines often struggle with industry-specific jargon, proper names or homophones. For example, “arthritis” can be transcribed as “off right his” if the Speech-to-Text model is not customized accordingly.

The best approach to evaluating Speech-to-Text software is to test it with real users and datasets in a real environment. Although almost every vendor claims to offer “the best” and “the most accurate” Speech-to-Text software, the accuracy of the software depends on the data. To give enterprises a head start, Picovoice shares an open-source and open-data reproducible benchmark for Amazon Transcribe, Google Speech-to-Text, IBM Watson, Microsoft Azure and Mozilla DeepSpeech. Anyone can use the existing dataset or replace it with their own data to see how various engines perform.

2. Features

Automatic Speech Recognition (ASR) software enables machines to understand what humans say. All Speech-to-Text (STT) engines convert voice to text by transcribing it. However, supported input languages, audio or video formats, sample rates, and batch versus real-time transcription vary across engines. Some Speech-to-Text engines also offer enhanced, easier-to-read output via speaker diarization, automatic punctuation and truecasing, filler word removal and profanity filtering. Additionally, some engines provide word-level confidence and alternatives that can enable further analysis. Certain use cases, such as contact centers, might require dual-channel transcription or custom spelling. Not all features will be required for all use cases. Choosing the best Speech-to-Text engine starts with understanding the business requirements, and then working with a vendor that allows adding voice features on your terms.
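
As one example of the "further analysis" that word-level confidence enables, the sketch below flags words that a human reviewer may want to double-check. The `Word` record, the `flag_low_confidence` helper and the 0.6 threshold are all hypothetical; each engine returns its results in its own format, so the fields would need to be mapped from the vendor's actual response.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    """Hypothetical per-word result; real engines use their own schemas."""
    text: str
    confidence: float  # assumed to be a score between 0.0 and 1.0


def flag_low_confidence(words: List[Word], threshold: float = 0.6) -> List[str]:
    """Return the words whose confidence falls below the threshold."""
    return [w.text for w in words if w.confidence < threshold]


# Made-up scores for illustration only.
transcript = [
    Word("the", 0.98),
    Word("patient", 0.95),
    Word("has", 0.97),
    Word("off", 0.41),
    Word("right", 0.38),
    Word("his", 0.52),
]
print(flag_low_confidence(transcript))
```

The right threshold depends on the engine and the data, so it is worth tuning it against transcripts that have already been reviewed.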

3. Technical Support

Humans hope for the best and prepare for the worst. When it comes to the worst, technical support makes a difference. Most Automatic Speech Recognition (ASR) vendors provide technical support as part of user license agreements, while some charge an additional service fee as a fraction of the contract size. Availability of support becomes an important criterion, especially for free and open-source Speech-to-Text software solutions. Depending on the use case, the opportunity cost of disrupted service due to missing or limited support might be much higher than the cost of a software license.

4. Documentation and Ease of Use

As in any software development process, developers are the most important and expensive resource for integrating Speech-to-Text (STT) engines. Selecting a developer-friendly Speech-to-Text (STT) engine with easy-to-follow docs will improve developer productivity and shorten time-to-market. On the other hand, for developers who have expertise in Speech Recognition and familiarity with various tools, ease of use might not be a consideration.

5. Connectivity and Latency

Besides the performance of the engine, service availability (or downtime) and latency significantly affect the reliability of any voice product built with cloud-dependent Speech-to-Text engines. While service availability might not be a problem for batch transcription that converts archived voice to text, it can significantly affect the effectiveness of analytics solutions that offer custom recommendations based on real-time transcription. While consumers may tolerate waiting for Alexa to play the next song, enterprises with mission-critical applications will be less likely to tolerate significant AWS outages that could happen a few times a month.
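
To make service availability concrete, the sketch below converts an availability percentage into expected downtime per month. The function name and the percentages are illustrative assumptions, not any vendor's published SLA.

```python
def monthly_downtime_minutes(availability_percent: float,
                             days_in_month: int = 30) -> float:
    """Expected minutes of downtime per month at a given availability."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability must be between 0 and 100")
    total_minutes = days_in_month * 24 * 60
    return total_minutes * (1 - availability_percent / 100)


# Illustrative availability levels, not tied to any specific provider.
for availability in (99.0, 99.9, 99.99):
    minutes = monthly_downtime_minutes(availability)
    print(f"{availability}% available: about {minutes:.1f} minutes down per month")
```

Whether that downtime matters depends on the use case: a batch job can simply retry later, while a real-time product is unavailable to its users for the whole window.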

6. Privacy

The recent Otter.ai incident has shown that cloud-dependent audio transcription is not completely private. The cloud dependency of Speech-to-Text becomes especially risky when transcribing audio recordings that contain personal information or trade secrets, and for applications in highly regulated verticals such as healthcare (HIPAA) or regions such as Europe (GDPR). To ensure data privacy, on-device Speech-to-Text engines should be preferred over cloud-based ASRs.

7. Total Cost of Ownership

Out-of-the-box Automatic Speech Recognition (ASR) engines are mostly more accurate, easier to customize and integrate, and come with technical support. These benefits look favourable compared to free and open-source speech recognition alternatives, but they come at a cost.

Open-source Speech-to-Text engines such as Mozilla DeepSpeech are free to use, whereas the cost of transcription with cloud providers such as Amazon Transcribe or Google Speech-to-Text starts at $1.44/hour. Depending on the volume, the cost of using enterprise-grade solutions can be significant. Moreover, cloud-related costs, such as storage, should be included when calculating the total cost of ownership.

Open-source STT, on the other hand, does not come with enterprise-grade quality or support. The engineering cost to customize and maintain open-source models should be included for an accurate comparison. In some use cases, the opportunity cost of downtime could be a lot more than the purchase price of a Speech-to-Text engine.
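
A back-of-the-envelope comparison of the two cost structures might look like the sketch below. Only the $1.44/hour starting price comes from this article; the function names, the audio volume, the storage, engineering and infrastructure figures are hypothetical placeholders to be replaced with your own numbers.

```python
CLOUD_PRICE_PER_AUDIO_HOUR = 1.44  # starting price quoted in this article


def annual_cloud_cost(audio_hours_per_month: float,
                      price_per_audio_hour: float = CLOUD_PRICE_PER_AUDIO_HOUR,
                      storage_cost_per_month: float = 0.0) -> float:
    """Yearly cost of a pay-per-use cloud API plus related cloud costs."""
    monthly = audio_hours_per_month * price_per_audio_hour + storage_cost_per_month
    return 12 * monthly


def annual_open_source_cost(engineering_hours_per_month: float,
                            engineering_hourly_rate: float,
                            infrastructure_cost_per_month: float = 0.0) -> float:
    """Yearly cost of running a free engine: engineering time plus hosting."""
    monthly = (engineering_hours_per_month * engineering_hourly_rate
               + infrastructure_cost_per_month)
    return 12 * monthly


# Hypothetical inputs for illustration only.
cloud = annual_cloud_cost(audio_hours_per_month=1000, storage_cost_per_month=50)
open_source = annual_open_source_cost(engineering_hours_per_month=20,
                                      engineering_hourly_rate=100,
                                      infrastructure_cost_per_month=300)
print(f"Cloud API:   ${cloud:,.2f} per year")
print(f"Open source: ${open_source:,.2f} per year")
```

At 1,000 audio hours a month, the $1.44/hour starting price alone comes to about $17,280 a year before storage. The sketch leaves out the opportunity cost of downtime, which can dominate in mission-critical use cases.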