Leopard Speech-to-Text: Local Transcription FAQ
What is WER? How do you calculate the word error rate?
WER stands for Word Error Rate. The word error rate is calculated by dividing the sum of the errors by the total number of reference words. There are three types of errors included in the calculation: substitutions (a reference word is replaced by another), insertions (a word is hypothesized that is not in the reference) and deletions (a word in the reference transcript is missed).
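The calculation above can be sketched in a few lines. This is a generic WER implementation using a standard word-level Levenshtein alignment, not Picovoice-specific code:

```python
# Generic WER computation via word-level edit distance:
# WER = (substitutions + insertions + deletions) / number of reference words.
def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = minimum edits to turn the first i reference words
    # into the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # all deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # all insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1  # substitution cost
            dp[i][j] = min(
                dp[i - 1][j] + 1,        # deletion
                dp[i][j - 1] + 1,        # insertion
                dp[i - 1][j - 1] + cost  # match or substitution
            )
    return dp[len(ref)][len(hyp)] / len(ref)

# One deletion ("the" before "mat") out of six reference words → WER ≈ 0.167
print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Note that WER can exceed 100% when the hypothesis contains many insertions, which is why a percentage alone says little without knowing the evaluation set.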
What is a good word error rate?
A good word error rate depends on your use case. Major cloud providers achieve a WER below 25%, and 15% or lower is now generally considered acceptable. However, when evaluating WER, be careful with the data set: engine performance varies across datasets, and some vendors may cherry-pick data sets for marketing purposes. Also, if the errors fall on the words that matter most for your use case, even a 1% WER may not be good enough for you.
How do I test Leopard Speech-to-Text?
To test how Leopard Speech-to-Text transcribes audio files to text or converts voice recordings to text, check out the Leopard web demo. The demo processes data locally within your web browser.
How do I measure the performance of Leopard Speech-to-Text?
Picovoice publishes open-source, open-data benchmarks to showcase its voice recognition capabilities. Here’s the link for the speech-to-text benchmark.
How do I measure Leopard Speech-to-Text accuracy?
By using open data sets such as LibriSpeech test-clean, LibriSpeech test-other, Common Voice test and TED-LIUM test. Leopard achieves 11% WER on average across these four data sets. Check out the open-source speech-to-text benchmark for details.
How do I improve audio transcription accuracy?
It’s a known phenomenon that speech-to-text transcriptions are not good at capturing proper names and homophones. Models should be adapted to the use case to improve transcription accuracy. Modern audio transcription software offers two options to achieve this: 1) adding custom vocabulary and 2) boosting word accuracy. If a use case makes heavy use of industry terms, technical jargon or proper names such as personal names or street names, the flexibility to adapt the base model is essential. Transcribing “Bob Loblaw” as “blah blah blah” may not matter if you’re transcribing lecture notes where “Bob Loblaw” appears as just a small example. However, it does matter if you’re building an application for Fox and Netflix, which own the rights to Arrested Development, or for Loblaw, the Canadian retailer.
How do I develop a custom speech-to-text model?
Before starting to develop a custom speech-to-text model, the first question to answer is whether you need a custom model at all or only need to adapt an existing one. The general, i.e. base, models of modern transcription APIs and engines are trained on large amounts of voice data to ensure accuracy across various use cases. In most cases, training a custom model from scratch is not needed; fine-tuning the base model for the use case, as explained in the question above on improving transcription accuracy, is enough. However, in specific cases, for example those requiring recognition of children’s speech, local dialects or extinct languages, developing a custom speech-to-text model is needed. Such cases should be treated as developing a new language model.
How long does it take to transcribe 1-hour audio?
It depends on the method and tools used for transcription. It takes human transcribers approximately 4 hours to transcribe an hour-long audio file. For ASR Cloud APIs, it depends on various factors, including connectivity (how fast audio files are sent for processing), latency and the performance of the engine. The first two are not managed by the ASR API providers unless you deploy the ASR on-prem. If you have ever waited almost a minute for Alexa to answer what time it is, then you have already experienced the impact of connectivity and latency on response time. Leopard, by processing voice data on the device, offers reliable response times and zero network latency. The speed, i.e. the duration of processing data to convert voice to text, is measured by the Real Time Factor (RTF). RTF is the ratio of the time taken to transcribe an audio file to the duration of the audio, so the lower the RTF, the better. Although there is no standard, an RTF of at most 0.5 is generally preferred. In other words, transcribing hour-long audio should take 30 minutes or less. Leopard achieves a 0.05 RTF on an Ubuntu machine with an Intel(R) Core(TM) i5-6500 CPU @ 3.20GHz, which means transcribing hour-long audio takes Leopard 3 minutes.
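The RTF arithmetic above is simple enough to check directly; this is plain arithmetic, not Picovoice code:

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    # RTF = time taken to transcribe the audio / duration of the audio.
    # Lower is better; an RTF below 1.0 means faster than real time.
    return processing_seconds / audio_seconds

audio = 60 * 60  # an hour-long recording, in seconds

# Taking 30 minutes to transcribe one hour gives the commonly preferred 0.5 RTF.
print(real_time_factor(30 * 60, audio))  # 0.5

# Taking 3 minutes gives 0.05, the RTF cited for Leopard above.
print(real_time_factor(3 * 60, audio))  # 0.05
```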
Does Leopard Speech-to-Text run on Linux?
Yes, it does. Check out Leopard Speech-to-Text documentation to get started.
Does Leopard Speech-to-Text run on macOS?
Yes, Leopard Speech-to-Text supports macOS. Check out Leopard Speech-to-Text documentation to get started.
Does Leopard Speech-to-Text run on Windows?
Yes, Windows is supported. Check out Leopard Speech-to-Text documentation to get started.
Does Leopard Speech-to-Text run within web browsers such as Chrome, Safari and Firefox?
Yes, Leopard Speech-to-Text supports modern web browsers.
Does Leopard Speech-to-Text run on Android?
Yes, you can build a local transcription product for Android. Check out Leopard Speech-to-Text documentation to get started.
Does Leopard Speech-to-Text run on iOS?
Yes, on-device speech-to-text is also available for iOS. Check out Leopard Speech-to-Text documentation to get started.
Does Leopard Speech-to-Text run on Raspberry Pi?
Yes, Raspberry Pi 3 and 4 are supported for on-device voice-to-text conversion. Check out Leopard Speech-to-Text documentation to get started.
Is Leopard Speech-to-Text free?
Leopard Speech-to-Text offers free on-device voice recognition for both commercial and non-commercial projects under the Free Plan.
Does Leopard Speech-to-Text convert voice recording to text?
Yes, you can use Leopard Speech-to-Text to convert voice to text; it takes only a few lines of code to get started. Check out the Leopard docs and start now.
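As a minimal sketch, transcribing a file with the `pvleopard` Python SDK (`pip3 install pvleopard`) might look like the following. The AccessKey and file path are placeholders, and the exact return values may differ between SDK versions, so treat this as illustrative and consult the Leopard docs for the authoritative API:

```python
def summarize(transcript: str, num_words: int) -> str:
    # Small pure helper: one-line summary of a transcription result.
    return f"{num_words} words: {transcript}"

if __name__ == "__main__":
    # Requires the pvleopard SDK and a valid Picovoice AccessKey.
    import pvleopard

    leopard = pvleopard.create(access_key="${YOUR_ACCESS_KEY}")  # placeholder key
    # process_file transcribes a local audio file on-device; in recent SDK
    # versions it returns the transcript along with word-level metadata.
    transcript, words = leopard.process_file("recording.wav")  # placeholder path
    print(summarize(transcript, len(words)))
    leopard.delete()  # release native resources
```

Because processing happens on-device, no audio leaves the machine during the `process_file` call.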
What’s the benefit of Leopard Speech-to-Text over speech-to-text cloud APIs?
Leopard Speech-to-Text brings control over data and costs back to enterprises. Using transcription APIs means relying on a 3rd-party cloud for voice processing unless you can get it deployed on your premises. The cloud comes with costs: hefty bills at scale, privacy and environmental costs, and productivity losses due to network outages or delays. Even on-prem deployment does not help with cost at scale, and audio transcription API providers offer on-prem deployment only to large-volume customers, who suffer the cost at scale the most.
How do I choose the best speech-to-text for my project?
The best speech-to-text depends on business requirements and resources. While free and open-source audio transcription software converts files to text at no charge, it requires additional resources, and the total cost of ownership may be higher than that of paid options. Also, every use case requires different features and different levels of support. Check out our blog post on how to evaluate audio transcription engines.
Which languages does Leopard Speech-to-Text support?
Leopard Speech-to-Text only supports English for now; more languages are coming soon. Reach out to Picovoice Consulting to tell us about your commercial endeavour if you require support for additional languages. Don’t forget to include the use case, business requirements and project details. The Picovoice team will respond to you.
Can I use Leopard Speech-to-Text in the field of telephony such as call centers?
Yes. Leopard can be used to transcribe any audio recording, including call center conversations. Picovoice software supports 16 kHz audio; for 8 kHz audio, please contact Picovoice Consulting.
Does Leopard Speech-to-Text support real-time transcription?
Leopard doesn’t, but Cheetah does. Learn more about Cheetah Speech-to-Text.