Speech-to-Text Benchmark

Automatic speech recognition (ASR) is the core building block of most voice applications, to the point that practitioners use speech-to-text (STT) and speech recognition interchangeably. ASR systems achieving state-of-the-art accuracy typically run in the cloud. Amazon Transcribe, Azure Speech-to-Text, Google Speech-to-Text, and IBM Watson Speech-to-Text are the currently dominant transcription API providers.

STT’s reliance on the cloud makes it costly, less reliable, and laggy. On-device ASRs can be orders of magnitude more cost-effective than their API counterparts. Offline ASRs are also inherently reliable and real-time because they remove the variable delay introduced by network connectivity. Running an ASR engine offline without sacrificing accuracy is challenging: common approaches to audio transcription rely on massive graphs for language modelling and compute-intensive neural networks for acoustic modelling. Picovoice’s Leopard speech-to-text engine takes a different approach, achieving cloud-level accuracy while running offline on commodity hardware such as a Raspberry Pi.
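
As a rough illustration of what offline transcription looks like, the sketch below uses the pvleopard Python package. The AccessKey placeholder and the audio path are assumptions, and the exact return value of process_file (a bare transcript vs. a transcript plus word metadata) varies across SDK versions, so treat this as a sketch rather than the canonical API.

```python
import pvleopard

# AccessKey comes from Picovoice Console; '${ACCESS_KEY}' and the audio path
# below are placeholders, not real values.
leopard = pvleopard.create(access_key='${ACCESS_KEY}')

try:
    # Transcription runs entirely on-device; no audio leaves the machine.
    # Depending on the SDK version, process_file may also return word metadata.
    transcript = leopard.process_file('/path/to/speech.wav')
    print(transcript)
finally:
    leopard.delete()
```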

The benchmarks below back these claims. They also empower customers to make data-driven decisions using the datasets that matter to their business.

Methodology

Speech Corpus

We use the following datasets for benchmarks:

  • LibriSpeech test-clean
  • LibriSpeech test-other
  • Common Voice test
  • TED-LIUM test

Metrics

Word Error Rate (WER)

Word error rate (WER) is the word-level edit distance between a reference transcript and the output of the speech-to-text engine, divided by the number of words in the reference transcript.
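
The minimal sketch below illustrates the metric: a plain dynamic-programming edit distance over words, divided by the reference length. It is an illustration, not code taken from the benchmark repository.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER: word-level edit distance divided by the reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()

    # Dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )

    return d[len(ref)][len(hyp)] / len(ref)


print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))  # ~0.167
```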

Price

The cost of an ASR engine is a function of usage (i.e. hours of audio processed). API-based speech-to-text offerings are only feasible if the economics of the use case can support roughly $1 per hour of audio processed. This cost can hinder STT’s adoption in many high-volume applications such as content moderation, audio analytics, advertisement, and search.
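
As a back-of-the-envelope illustration of why this matters at scale, the snippet below multiplies a hypothetical monthly volume by the approximate $1/hour rate cited above; the volume is illustrative and the rate is not a quoted price for any vendor.

```python
# Hypothetical high-volume workload (e.g. a content-moderation pipeline).
hours_per_month = 100_000
cloud_rate_per_hour = 1.00  # approximate cloud API rate cited above

print(f"Cloud API cost: ${hours_per_month * cloud_rate_per_hour:,.0f}/month")  # $100,000/month
```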

Results

Accuracy

The figure below shows the accuracy of each engine averaged over all datasets.

Figure: Speech-to-Text Accuracy Comparison

Usage

The data and code used to create this benchmark are available on GitHub under the permissive Apache 2.0 license. Detailed instructions for benchmarking individual engines are provided in the following documents:

  • AWS Transcribe accuracy
  • Azure Speech-to-Text accuracy
  • Google Speech-to-Text accuracy
  • Google Speech-to-Text (Enhanced) accuracy
  • IBM Watson Speech-to-Text accuracy
  • Picovoice Leopard accuracy
