Selecting the Best Speech-to-Text Engine

Recent advances in deep learning have made voice technology more accurate, accessible, and affordable. With these advances comes a growing number of options, which makes it difficult to choose the best voice AI engine, especially for Speech-to-Text (STT), also known as Automatic Speech Recognition (ASR). Speech-to-Text offers enterprises tremendous benefits across several use cases: transcription, dictation (voice typing), closed captions, subtitles, and analytics solutions (such as keyword spotting, topic detection, auto summarization, content moderation, or sentiment analysis). Another popular use case is voice assistants such as Alexa and Siri, which use Speech-to-Text engines in combination with wake word, natural language understanding, and text-to-speech engines.

As the title of this blog suggests, "the best" Speech-to-Text engine differs depending on the enterprise's needs. The ideal Speech-to-Text engine for one company may be an awful fit for another. Enterprises must evaluate Speech-to-Text solutions with their own needs in mind, weighing criteria such as accuracy, features, support, documentation, reliability, privacy and security, and volume and cost. Below is a fact-based and transparent guide to support data-driven decision-making.

1. Accuracy

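Word Error Rate (WER) is the most commonly used metric for the accuracy of Automatic Speech Recognition software. WER compares the transcript produced by the software against a reference transcript of what was actually said: it is the number of word-level errors (substitutions, deletions, and insertions) divided by the number of words in the reference. The lower the WER, the more accurate the engine.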
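WER is commonly computed as the word-level edit distance between the engine's output and a reference transcript, divided by the number of reference words. The Python sketch below illustrates that calculation. It assumes a very simple normalization (lowercasing and splitting on whitespace); real benchmarks usually also normalize punctuation and numbers. The function name `word_error_rate` is illustrative and not part of any vendor's tooling.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return (substitutions + deletions + insertions) / reference word count."""
    # Assumption: lowercasing and whitespace splitting is enough for this sketch.
    ref_words = reference.lower().split()
    hyp_words = hypothesis.lower().split()
    if not ref_words:
        raise ValueError("reference transcript must contain at least one word")

    # previous[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words (Levenshtein distance, row by row).
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i] + [0] * len(hyp_words)
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1      # reference word missing from the hypothesis
            insertion = current[j - 1] + 1  # extra word in the hypothesis
            current[j] = min(substitution, deletion, insertion)
        previous = current
    return previous[-1] / len(ref_words)
```

For instance, if the reference is "the patient has arthritis" and the engine returns "the patient has off right his", the sketch counts one substitution and two insertions against four reference words, for a WER of 0.75. Because insertions are counted, WER can exceed 100% on very poor transcripts.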

Customization can further boost accuracy. Out-of-the-box Speech-to-Text engines often struggle with industry-specific jargon, proper names, and homophones. Customizing the Speech-to-Text engine allows it to recognize phrases specific to your enterprise, ultimately boosting its accuracy. For example, a Speech-to-Text model without customization can incorrectly transcribe "arthritis" as "off right his".

The best way to evaluate Speech-to-Text software is to test it with real users and datasets in its actual environment. The accuracy of a Speech-to-Text engine depends on the data, regardless of the vendor chosen. To compare the performance of various Speech-to-Text engines, Picovoice maintains an open-source, open-data, reproducible benchmark covering Amazon Transcribe, Google Speech-to-Text, IBM Watson, and Microsoft Azure. Running it on the existing dataset, or replacing that dataset with your own, lets you compare the accuracy of the available engines.

Adding custom words, boosting phrases, and adapting language or acoustic models are ways to improve ASR accuracy.

2. Features

Automatic Speech Recognition software enables machines to understand what humans say. In other words, all Speech-to-Text engines convert voice to text by transcribing it. However, each engine offers different features and has different requirements, including the supported languages, the input format (audio or video), the sample rate, and the type of transcription (batch or real-time).

Some Speech-to-Text engines offer enhanced output to ease readability: speaker diarization, automatic punctuation and truecasing, filler-word removal, and profanity filtering. Some engines also provide word-level confidence scores and alternative transcriptions that enable further analysis.
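As one illustration of how word-level confidence can feed further analysis, the Python sketch below flags words that fall under a confidence threshold so a human reviewer can check them. The `Word` structure, the 0-to-1 confidence scale, and the 0.6 threshold are all assumptions made for this example; each engine exposes this metadata in its own format.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    """Hypothetical per-word result; real engines use their own schemas."""
    text: str
    confidence: float  # assumed to range from 0.0 (unsure) to 1.0 (certain)


def flag_low_confidence(words: List[Word], threshold: float = 0.6) -> List[str]:
    """Return the text of every word whose confidence is below the threshold."""
    return [word.text for word in words if word.confidence < threshold]


# Made-up transcript, for illustration only.
transcript = [
    Word("the", 0.98),
    Word("patient", 0.95),
    Word("has", 0.97),
    Word("arthritis", 0.41),
]
words_to_review = flag_low_confidence(transcript)
```

A threshold like this trades reviewer workload against missed errors, so it is worth tuning on your own data rather than treating 0.6 as a recommended value.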

Not all Speech-to-Text features are required in all use cases. For example, contact centers may require dual-channel transcription or custom spelling, while other enterprises may not. Choosing the best Speech-to-Text engine begins with understanding your business requirements, and then working with a vendor that lets you add voice features on your terms.

3. Technical Support

The tradeoff between cost and technical support should be a vital consideration when choosing a Speech-to-Text engine. Most Automatic Speech Recognition vendors provide technical support under user license agreements or charge an additional service fee calculated as a fraction of the contract size. Technical support is often overlooked when evaluating free and open-source Speech-to-Text software. In an ideal, disruption-free world, open-source software is appealing. However, that world does not exist, and disruptions are bound to occur. Depending on the use case, the cost of disrupted service with limited or minimal support might exceed the software license cost. It is important to factor in these potential costs and disruptions when choosing a Speech-to-Text engine.

4. Documentation and Ease of Use

As with any software development process, developers are the most important and expensive resource for integrating Speech-to-Text engines. Selecting a developer-friendly Speech-to-Text engine with easy-to-follow docs will improve developer productivity and shorten time-to-market. On the other hand, ease of use may matter less for developers with expertise in speech recognition and familiarity with various tools.

5. Connectivity and Latency

Beyond the accuracy of the engine, service availability (downtime) and latency significantly affect the reliability of any voice product built with a cloud-dependent Speech-to-Text engine. While latency might not be an issue for batch audio transcription, it can undermine analytics solutions that offer custom recommendations based on real-time transcription. Tolerance for downtime also varies by use case: consumers may tolerate waiting for Alexa to play the next song, but enterprises with mission-critical applications will be less tolerant of frequent AWS outages.

6. Privacy

The recent Otter.ai incident has shown that cloud-dependent audio transcription is not completely private. The cloud dependency of Speech-to-Text becomes especially important when transcribing audio that contains personal information or trade secrets. This includes applications in highly regulated verticals such as healthcare (HIPAA) and in regions with strict data protection laws, such as the European Union (GDPR). To ensure data privacy, on-device Speech-to-Text engines should be preferred over cloud-based ASRs.

Edge Voice AI has inherent advantages over cloud-dependent voice recognition. Since voice data doesn’t leave the device, it’s private by design.

7. Total Cost of Ownership

Out-of-the-box Automatic Speech Recognition engines are generally more accurate, easier to customize and integrate, and come with more technical support than free, open-source speech recognition alternatives. These benefits make out-of-the-box ASR engines look favorable, but they come at a cost.

For example, open-source Speech-to-Text engines (e.g., Mozilla DeepSpeech) are free to use, while cloud services such as Amazon Transcribe and Google Speech-to-Text cost $1.44 per hour of audio. Depending on the volume, the cost of using enterprise-grade solutions can be significant. Moreover, cloud-related costs, such as storage, should also be included when calculating the total cost of ownership.

Open-source Speech-to-Text does not come with enterprise-grade quality or support. To make an accurate comparison, the engineering cost of customizing and maintaining open-source models should be included in the calculation. In some use cases, the opportunity cost of downtime could greatly exceed the purchase price of a Speech-to-Text engine.
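The comparison described above can be sketched as simple arithmetic. The Python snippet below uses the $1.44/hour cloud rate quoted earlier; every other figure (audio volume, storage cost, engineering hours, hourly engineering cost, infrastructure cost) is a hypothetical placeholder that you would replace with your own numbers.

```python
CLOUD_RATE_USD_PER_AUDIO_HOUR = 1.44  # rate quoted in this article


def cloud_cost(audio_hours: float,
               rate: float = CLOUD_RATE_USD_PER_AUDIO_HOUR,
               storage_usd: float = 0.0) -> float:
    """Usage-based cost: per-hour transcription fee plus related cloud costs."""
    return audio_hours * rate + storage_usd


def open_source_cost(engineering_hours: float,
                     engineering_usd_per_hour: float,
                     infrastructure_usd: float = 0.0) -> float:
    """Cost of a free engine: engineering time to customize and maintain it,
    plus the infrastructure it runs on."""
    return engineering_hours * engineering_usd_per_hour + infrastructure_usd


def break_even_audio_hours(fixed_cost_usd: float,
                           rate: float = CLOUD_RATE_USD_PER_AUDIO_HOUR) -> float:
    """Audio volume at which cloud fees equal a given fixed cost."""
    return fixed_cost_usd / rate


# Hypothetical scenario, for illustration only.
yearly_cloud = cloud_cost(audio_hours=10_000, storage_usd=500.0)
yearly_open_source = open_source_cost(engineering_hours=400,
                                      engineering_usd_per_hour=80.0,
                                      infrastructure_usd=3_000.0)
break_even = break_even_audio_hours(yearly_open_source)
```

With these made-up inputs, the cloud option costs about $14,900 per year and the open-source option about $35,000, so the cloud option stays cheaper until roughly 24,000 hours of audio per year. The sketch leaves out the opportunity cost of downtime, which would need to be added as a separate line item.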

Picovoice’s homegrown Speech-to-Text engines offer the best of open-source and cloud-dependent Speech-to-Text APIs. They’re as convenient as cloud APIs when it comes to accuracy, customization, feature availability, ease of use, and platform support. Like open-source alternatives, they also offer privacy and no network latency, because voice data is processed locally without being sent to third-party remote servers.

