Recent advances in deep learning have made voice technology more accurate, accessible, and affordable. With these developments comes a growing number of options, making it difficult to choose the best voice AI engine, especially for Speech-to-Text (STT), also known as Automatic Speech Recognition (ASR). Speech-to-Text offers enterprises tremendous benefits across several use cases: transcription, dictation (voice typing), closed captions, subtitles, and analytics solutions (such as keyword spotting, topic detection, auto-summarization, content moderation, or sentiment analysis). A popular example is the well-known voice assistants, Alexa and Siri, which use Speech-to-Text engines in combination with wake word, natural language understanding, and text-to-speech engines.
As the title of this blog suggests, "the best" Speech-to-Text engine differs depending on the enterprise's needs. The ideal Speech-to-Text engine for one company may be an awful fit for another. Enterprises must evaluate Speech-to-Text solutions with their own requirements in mind, such as accuracy, features, support, documentation, reliability, privacy and security, or volume and cost. Below is a fact-based and transparent guide to enable data-driven decision-making.
1. Accuracy
Word Error Rate (WER) is the most commonly used metric for measuring the accuracy of Automatic Speech Recognition software. WER measures the percentage of errors in a transcript produced by an Automatic Speech Recognition engine compared to a reference transcript of the original human speech.
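Concretely, WER is the word-level edit distance (substitutions, insertions, and deletions) divided by the number of words in the reference transcript. Below is a minimal, illustrative Python sketch; the `wer` function name and sample sentences are ours, not from any particular vendor's toolkit:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: (substitutions + insertions + deletions) / reference word count."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # Word-level Levenshtein distance via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
    return d[len(ref)][len(hyp)] / len(ref)

# "arthritis" misheard as "off right his": 1 substitution + 2 insertions
# against a 4-word reference.
print(wer("my arthritis is acting", "my off right his is acting"))  # → 0.75
```

Note that WER can exceed 100% when the engine inserts many spurious words, which is why it should always be read alongside the reference transcript length.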
Customization can further boost accuracy. Out-of-the-box Speech-to-Text engines often struggle with industry-specific jargon, such as special names or homophones. Customizing the Speech-to-Text engine allows it to identify common phrases specific to your enterprise, ultimately boosting its accuracy. For example, a Speech-to-Text model without customization can incorrectly transcribe "arthritis" as "off right his".
The best way to evaluate Speech-to-Text software is to test it with real users and datasets in its actual environment. The accuracy of a Speech-to-Text engine depends on the data, regardless of the vendor chosen. To compare the performance of various Speech-to-Text engines, Picovoice provides an open-source and open-data reproducible benchmark for Amazon Transcribe, Google Speech-to-Text, IBM Watson, and Microsoft Azure. Analyzing the existing dataset, or replacing it with your own, allows you to visualize the accuracy of the available software.
Adding custom words, boosting phrases, and adapting language or acoustic models are alternative ways to improve ASR accuracy.
2. Features
Automatic Speech Recognition software enables machines to understand what humans say. In other words, all Speech-to-Text engines convert voice to text by transcribing it. However, each engine may offer different features and impose different requirements, including the supported languages, the input format (audio or video), the sample rate, and the type of transcription (batch or real-time).
Some Speech-to-Text engines offer enhanced output, such as speaker diarization, automatic punctuation and truecasing, filler-word removal, and profanity filtering, to improve readability. Additionally, some Speech-to-Text engines provide word-level confidence scores and alternative transcriptions that enable further analysis.
Not all Speech-to-Text features are required in all use cases. For example, dual-channel transcription and custom spelling are required by contact centers but rarely by other enterprises. Choosing the best Speech-to-Text engine begins with understanding your business requirements, and then working with a vendor that enables adding voice features on your terms.
3. Technical Support
The tradeoff between cost and technical support should be a vital consideration when choosing the optimal Speech-to-Text engine. Most Automatic Speech Recognition vendors provide technical support under user license agreements or charge an additional service fee amounting to a fraction of the contract size. Technical support is often overlooked with free and open-source Speech-to-Text software. In an ideal world with no disruptions, open-source software is appealing. However, the ideal world does not exist, and disruptions are bound to occur. Depending on the use case, the cost of disrupted service with limited or no support might exceed the software license cost. It is important to consider these potential costs and disruptions when choosing a Speech-to-Text engine.
4. Documentation and Ease of Use
As with any software development process, developers are the most important and expensive resource for integrating Speech-to-Text engines. Selecting a developer-friendly Speech-to-Text engine with easy-to-follow documentation will improve developer productivity and shorten time-to-market. That said, ease of use may matter less to developers with expertise in speech recognition and familiarity with various tools.
5. Connectivity and Latency
Beyond the accuracy of the engine itself, service availability (downtime) and latency significantly affect the reliability of any voice product built with a cloud-dependent Speech-to-Text engine. While service availability might not be an issue for batch audio transcription, it can undermine analytics solutions that offer recommendations based on real-time transcription. In certain use cases, this may be problematic: consumers may tolerate waiting for Alexa to play the next song, but enterprises with mission-critical applications will be far less tolerant of frequent AWS outages.
6. Privacy
The recent Otter.ai incident has shown that cloud-dependent audio transcription is not completely private. The cloud dependency of Speech-to-Text becomes especially important when transcribing audio containing personal information or trade secrets, such as applications in highly regulated verticals like healthcare (HIPAA) or in European regions (GDPR). To ensure data privacy, on-device Speech-to-Text engines should be preferred over cloud-based ASRs.
Edge voice AI has inherent advantages over cloud-dependent voice recognition. Since voice data doesn’t leave the device, it’s private by design.
7. Total Cost of Ownership
Out-of-the-box Automatic Speech Recognition engines are generally more accurate, easier to customize and integrate, and come with more technical support than free, open-source speech recognition alternatives. All of these benefits make out-of-the-box ASR engines seem favorable; however, they come at a cost.
For example, open-source Speech-to-Text engines (e.g., Mozilla DeepSpeech) are free to use, while cloud providers, such as Amazon Transcribe and Google Speech-to-Text, charge around $1.44 per hour. Depending on the volume, the cost of using enterprise-grade solutions can be significant. Moreover, cloud-related costs, such as storage, should also be considered when calculating the total cost of ownership.
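A quick back-of-envelope calculation makes the volume sensitivity concrete. The $1.44-per-hour cloud rate is from the figures above; the monthly audio volume and the engineering cost of maintaining an open-source engine are hypothetical placeholders you would replace with your own numbers:

```python
# Illustrative monthly TCO comparison; all inputs except the cloud rate
# are hypothetical and should be replaced with your own estimates.
hours_per_month = 10_000     # audio transcribed per month (assumed)
cloud_rate = 1.44            # USD per audio hour (cloud STT pricing cited above)
cloud_cost = hours_per_month * cloud_rate

oss_license = 0.0            # open-source engines are free to use
oss_engineering = 12_000.0   # hypothetical monthly cost to customize and maintain
oss_cost = oss_license + oss_engineering

print(f"cloud: ${cloud_cost:,.0f}/month vs open source: ${oss_cost:,.0f}/month")
```

At low volumes the cloud API wins on this arithmetic alone; past a break-even volume, the fixed engineering cost of open source can become the cheaper line item, which is exactly why the comparison must be redone per use case.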
Open-source Speech-to-Text does not come with enterprise-grade quality or support. The engineering cost to customize and maintain open-source models should be factored in to make accurate comparisons. In some use cases, the opportunity cost of downtime could greatly exceed the purchase price of a Speech-to-Text engine.
Picovoice’s homegrown Speech-to-Text engines offer the best of open-source and cloud-dependent Speech-to-Text APIs. They’re as convenient as cloud APIs when it comes to accuracy, customization, availability of features, ease of use, and platform support, and, like open-source alternatives, they offer privacy and minimal latency by processing voice data locally without sending it to third-party remote servers.