Local Speech-to-Text with Cloud-Level Accuracy
September 17, 2019
Voice recognition has flourished with the growth of cloud-based speech services. Despite the ubiquity of voice-enabled products, processing speech in the cloud raises major privacy concerns around the uploading and handling of personal voice data. Cloud speech recognition also has fundamental limitations in cost-effectiveness, latency, and reliability.
Offline speech recognition has the potential to address cloud service drawbacks by eliminating the need for connectivity and tapping into readily available compute resources on billions of devices. Alas, the computational cost of speech recognition algorithms to date has made it impossible to get comparable accuracy on commodity edge devices.
Picovoice has developed deep learning technology that is specifically designed to perform large-vocabulary speech-to-text efficiently on the edge. Picovoice software runs on commodity hardware with constrained compute resources. This bespoke voice AI technology allows speech-to-text on even a $5 Raspberry Pi Zero, recognizing more than 200,000 words in real time. The Picovoice offline option lowers cost and latency while matching the accuracy of cloud voice services.
Picovoice has benchmarked the accuracy of its speech-to-text engine against four widely used engines: Google Speech-to-Text, Amazon Transcribe, Mozilla DeepSpeech, and CMU PocketSphinx. The data, code, and test setup for the benchmark are open source. The figure below shows that Picovoice achieves accuracy comparable to cloud-based services.
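This post does not say which accuracy metric the benchmark reports. Word error rate (WER) is a common choice for comparing speech-to-text engines, so the sketch below assumes it. The function name and the sample transcripts are illustrative and are not taken from the benchmark code.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Assumption: plain whitespace tokenization and lowercasing. A real
    benchmark may normalize punctuation and numerals differently.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] = edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical transcripts: one substituted word and one dropped word
# against a six-word reference.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
```

Lower is better under this metric, and an engine that reproduces the reference transcript exactly scores 0.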
Additionally, runtime metrics of the offline engines are compared in the figure below. Picovoice achieves an accuracy comparable to Mozilla DeepSpeech while being 23x faster and consuming 57x less memory.
At the time of writing, cloud-based speech-to-text services charge well above $1 per hour. For enterprises that need to process millions of hours of audio, this unbounded cost can be prohibitive. Picovoice can offer alternative licensing models which save more than 10x at scale.
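To put those figures in perspective, here is a back-of-the-envelope sketch. It assumes a workload of one million hours and uses $1 per hour as a floor for the cloud rate, since the actual rates are described only as "well above" that. The function and variable names are illustrative, and no actual Picovoice price is implied.

```python
def cloud_cost_usd(audio_hours: float, price_per_hour_usd: float) -> float:
    """Usage-based cost: it grows linearly with the hours of audio processed."""
    return audio_hours * price_per_hour_usd


# Assumed workload; the post only says "millions of hours".
audio_hours = 1_000_000

# $1/hour is a lower bound on the cloud rate, so this is a lower bound on cost.
cloud_floor_usd = cloud_cost_usd(audio_hours, 1.0)

# A saving of "more than 10x" means the alternative costs less than one tenth.
licensing_ceiling_usd = cloud_floor_usd / 10

print(f"Cloud cost, at least: ${cloud_floor_usd:,.0f}")
print(f"10x saving means paying under: ${licensing_ceiling_usd:,.0f}")
```

Because the cloud cost scales linearly with usage, doubling the audio volume doubles the bill, which is what makes it unbounded for large workloads.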