Automatic speech recognition (ASR) is the core building block of most voice applications, to the point that practitioners use speech-to-text (STT) and speech recognition interchangeably. ASR systems achieving state-of-the-art accuracy often run in the cloud. Amazon Transcribe, Azure Speech-to-Text, Google Speech-to-Text, and IBM Watson Speech-to-Text are the current dominant transcription API providers.
STT’s reliance on the cloud makes it costly, less reliable, and laggy. On-device ASRs can be orders of magnitude more cost-effective than their API counterparts. Additionally, offline ASRs are inherently reliable and real-time because they remove the variable delay induced by network connectivity. However, running an ASR engine offline without sacrificing accuracy is challenging. Common approaches to audio transcription involve massive graphs for language modelling and compute-intensive neural networks for acoustic modelling. Picovoice’s Leopard speech-to-text engine takes a different approach to achieve cloud-level accuracy while running offline on commodity hardware like a Raspberry Pi.
Below is a series of benchmarks to back our claims. They also empower customers to make data-driven decisions using the datasets that matter to their business.
We use the following datasets for benchmarks:
- Common Voice
Word Error Rate (WER)
Word error rate is the ratio of edit distance between words in a reference transcript and the words in the output of the speech-to-text engine to the number of words in the reference transcript.
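The definition above can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's own scoring code: the function name `word_error_rate` is hypothetical, and the sketch assumes words are split on whitespace with no further normalization. Real evaluations typically normalize case and punctuation first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Assumption: words are separated by whitespace and compared exactly.
    """
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] is the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]  # removing all i reference words yields an empty hypothesis
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from the output
            insertion = curr[j - 1] + 1  # extra word in the output
            curr.append(min(substitution, deletion, insertion))
        prev = curr

    return prev[-1] / len(ref_words)
```

For example, if the reference is "the cat sat on the mat" and the engine outputs "the cat sat on mat", one word is missing out of six, so the WER is 1/6. Because insertions count as errors, the WER can exceed 1 when the output is much longer than the reference.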
The cost of an ASR engine is a function of usage (i.e., hours of audio processed). API-based speech-to-text offerings are only feasible if the economics of the use case can support about $1 per hour of audio processed. This price point can hinder STT's adoption in many high-volume applications such as content moderation, audio analytics, advertisement, and search.
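To get a feel for how usage-based pricing scales, here is a rough sketch. It assumes a flat rate with no volume tiers or free quota, uses the approximate $1 per hour figure from the text as the default, and the function name `api_cost_usd` and the 10,000-hour workload are hypothetical.

```python
def api_cost_usd(hours_of_audio: float, price_per_hour_usd: float = 1.0) -> float:
    """Estimate API spend as hours processed times a flat hourly rate.

    Assumption: linear pricing. Actual provider pricing may include
    tiers, minimum charges, or free quotas.
    """
    if hours_of_audio < 0:
        raise ValueError("hours_of_audio must be non-negative")
    return hours_of_audio * price_per_hour_usd


# Hypothetical workload: 10,000 hours of audio per month.
monthly_cost = api_cost_usd(10_000)
```

Under these assumptions, the hypothetical workload costs about $10,000 per month, which illustrates why high-volume use cases are sensitive to the per-hour rate.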
The figure below shows the accuracy of each engine averaged over all datasets.
The figure below compares the operational cost of voice recognition engines.
The data and code used to create this benchmark are available on GitHub under the permissive Apache 2.0 license. Detailed instructions for benchmarking individual engines are provided in the following documents:
- AWS Transcribe accuracy
- Azure Speech-to-Text accuracy
- Google Speech-to-Text accuracy
- Google Speech-to-Text (Enhanced) accuracy
- IBM Watson Speech-to-Text accuracy
- Picovoice Leopard accuracy
Pricing data for speech recognition and NLU APIs are available on the providers’ websites: