Question 1

What is WER? How do you calculate the word error rate?

Accepted Answer

WER stands for Word Error Rate. It is calculated by dividing the total number of errors by the number of words in the reference transcript: WER = (S + D + I) / N, where S, D, and I are the substitution, deletion, and insertion counts and N is the number of reference words. The three error types are substitution (a reference word is replaced by another word), insertion (a word is hypothesized that was not in the reference), and deletion (a word in the reference transcription is missed).
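The calculation above can be sketched in a few lines of Python. This is a minimal illustration, not the code used in any particular benchmark: the function name `word_error_rate` is made up for this example, and it assumes lowercasing and whitespace splitting are enough text normalization, which real evaluations usually extend (punctuation, numbers, and so on). The minimum number of substitutions, deletions, and insertions is found with a word-level edit distance.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return (substitutions + deletions + insertions) / reference word count."""
    # Assumption: lowercase + whitespace split is sufficient normalization.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") over 6 reference words.
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.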

Despite its limitations, WER is the most commonly used metric to measure speech-to-text engine accuracy. A lower WER (lower number of errors) means better accuracy in recognizing speech. You can check Picovoice’s open-source speech-to-text benchmark to see an application of WER to compare the accuracy of speech-to-text engines.

Question 2

What is a good word error rate?

Accepted Answer

A good word error rate depends on your use case. Major cloud providers achieve a WER below 25%, although 15% or less is now the more commonly accepted bar. However, be careful with the data set when evaluating WER. Engine performance varies across data sets, and some vendors may pick data sets that favor their engine for marketing purposes. Also, if the errors fall on the words that matter most for your use case, even a WER of 1% may not be good enough for you.
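To make the last point concrete, one option is to track errors on the words you care about separately from overall WER. The sketch below is an illustrative check, not a standard metric or anything Picovoice publishes: `keyword_miss_rate` and the sample sentences are hypothetical, and it compares simple word counts rather than aligning the two transcripts.

```python
from collections import Counter


def keyword_miss_rate(reference: str, hypothesis: str, keywords: list[str]) -> float:
    """Fraction of keyword occurrences in the reference that the hypothesis lacks."""
    keys = {k.lower() for k in keywords}
    ref_counts = Counter(w for w in reference.lower().split() if w in keys)
    hyp_counts = Counter(w for w in hypothesis.lower().split() if w in keys)

    total = sum(ref_counts.values())
    if total == 0:
        raise ValueError("reference contains none of the keywords")

    # A keyword occurrence counts as missed when the hypothesis has fewer
    # copies of that word than the reference does.
    missed = sum(max(n - hyp_counts[w], 0) for w, n in ref_counts.items())
    return missed / total


reference = "please call bob loblaw about the loblaw order"
hypothesis = "please call blah blah about the loblaw order"
# "bob" and one of the two "loblaw" occurrences are missing: 2 of 3 keywords missed.
print(keyword_miss_rate(reference, hypothesis, ["bob", "loblaw"]))
```

A transcript can score well on overall WER and still do poorly on a check like this, which is why the important words deserve their own look.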

Question 3

How do I test Leopard Speech-to-Text?

Accepted Answer

To test how Leopard Speech-to-Text transcribes audio files to text or converts voice recordings to text, check out the Leopard web demo. The demo processes data locally within your web browser.

Question 4

How do I measure the performance of Leopard Speech-to-Text?

Accepted Answer

Picovoice publishes open-source, open-data benchmarks to showcase its voice recognition capabilities, including a speech-to-text benchmark.

Question 5

How do I measure Leopard Speech-to-Text accuracy?

Accepted Answer

Picovoice has built and open-sourced a speech-to-text benchmark to enable developers to measure the accuracy of Leopard Speech-to-Text and its alternatives. You can reproduce it with your own data or with open-source speech-to-text data sets to evaluate accuracy.

Question 6

How do I improve audio transcription accuracy?

Accepted Answer

It’s a known phenomenon that general-purpose speech-to-text engines are not good at capturing certain things, such as proper names and homophones. Models should be adapted to the use case to improve transcription accuracy.

Transcribing “Bob Loblaw” as “blah blah blah” may not matter if you’re transcribing lecture notes where “Bob Loblaw” is used as just a small example. However, it matters if you’re building an application for Fox and Netflix, which own the rights to Arrested Development, or for Loblaw, the Canadian retailer.

There are multiple ways to adapt Automatic Speech Recognition (ASR) to your domain and application, from Adding Custom Words, Boosting Phrases, and Language Model Adaptation to Acoustic Model Adaptation. The level of investment depends on the strategy and requirements. Adding Custom Words and Boosting Phrases on the self-service Picovoice Console does not require any coding experience.

Question 7

How do I develop a custom speech-to-text model?

Accepted Answer

Developing a custom speech-to-text model from scratch is one of the ways to improve the accuracy of Automatic Speech Recognition (ASR). Training a custom speech-to-text model that performs well is a challenging and expensive task. However, in certain cases, such as speech-to-text for kids, it’s required.

You can engage with Picovoice Consulting experts when you become an Enterprise Plan customer to get a custom speech-to-text model trained.

Question 8

How long does it take to transcribe 1-hour audio?

Accepted Answer

It depends on the method and tools used for transcription. It takes human transcribers approximately 4 hours to transcribe hour-long audio. For ASR Cloud APIs, it depends on various factors, including connectivity (how fast audio files are sent for processing), latency, and performance of the engine. The first two are not managed by the ASR API providers unless you deploy the ASR on-prem. If you have ever waited almost a minute for Alexa to tell you what time it is, you have already experienced the impact of connectivity and latency on response time.

By processing voice data on the device, Leopard Speech-to-Text offers a guaranteed response time with zero network latency, which lets developers control the first two factors. Hence, the speed, i.e., the time it takes to convert voice to text, depends only on the Real Time Factor (RTF). RTF is the ratio of the time taken to transcribe an audio file to the duration of the audio, so the lower the RTF, the better. Picovoice's open-source speech-to-text benchmark shows that transcribing an hour-long audio file takes Leopard 1.56 minutes, whereas Whisper Base, which has matching accuracy, takes 19.4 minutes.
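The RTF definition above is a simple ratio, so the benchmark figures quoted in this answer can be converted into RTF values directly. The snippet below is only a sketch of that arithmetic: the function name `real_time_factor` is made up for this example, and the inputs are the 1.56-minute and 19.4-minute figures from this answer, not new measurements.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = time taken to transcribe / duration of the audio."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return processing_seconds / audio_seconds


ONE_HOUR = 60 * 60  # duration of the audio in seconds

# Processing times for one hour of audio, as quoted in the answer above.
leopard_rtf = real_time_factor(1.56 * 60, ONE_HOUR)
whisper_base_rtf = real_time_factor(19.4 * 60, ONE_HOUR)

print(f"Leopard RTF:      {leopard_rtf:.3f}")       # about 0.026
print(f"Whisper Base RTF: {whisper_base_rtf:.3f}")  # about 0.323
```

An RTF below 1 means the engine transcribes faster than the audio plays. To measure RTF on your own hardware, time the transcription call (for example with `time.perf_counter()`) and divide by the audio duration.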

Question 9

Does Leopard Speech-to-Text run on Linux?

Accepted Answer

Yes, it does. Picovoice Speech-to-Text engines support Linux-based computers. Check out the Leopard Speech-to-Text documentation, as it’s available through multiple SDKs, including .NET, C, and Python, or read the Linux Speech-to-Text blog post.

Question 10

Does Leopard Speech-to-Text run on macOS?

Accepted Answer

Yes, Leopard Speech-to-Text supports macOS through multiple SDKs, including .NET, C, and Python.

Question 11

Does Leopard Speech-to-Text run on Windows?

Accepted Answer

Yes, Leopard enables on-device speech-to-text for any Windows device or application. Check out the Leopard Speech-to-Text documentation to get started with your favorite SDK, including .NET, C, and Python.

Question 12

Does Leopard Speech-to-Text run within web browsers such as Chrome, Safari and Firefox?

Accepted Answer

Yes, Leopard enables local speech-to-text for any modern web browser, including Chrome & Chromium-based browsers, Edge, Firefox, and Safari. Check out Leopard Speech-to-Text Web SDK and get started for free.

Question 13

Does Leopard Speech-to-Text run on Android?

Accepted Answer

Yes, Android On-device Speech-to-Text Transcription is available through Leopard Speech-to-Text Android SDK and Leopard Speech-to-Text Flutter SDK.

Question 14

Does Leopard Speech-to-Text run on iOS?

Accepted Answer

Yes, iOS On-device Speech-to-Text Transcription is available through Leopard Speech-to-Text iOS SDK and Leopard Speech-to-Text Flutter SDK.

Question 15

Does Leopard Speech-to-Text run on Raspberry Pi?

Accepted Answer

Yes, Leopard Speech-to-Text can run locally on a Raspberry Pi. Developers can use multiple SDKs, including .NET, C, and Python.

Question 16

Does Leopard Speech-to-Text convert voice recording to text?

Accepted Answer

Yes, you can use Leopard Speech-to-Text to convert voice to text. It just takes a few lines of code to get started. Check out Leopard Speech-to-Text docs and start converting voice recordings to text locally on the platform of your choice.

Question 17

What’s the benefit of Leopard Speech-to-Text over speech-to-text cloud APIs?

Accepted Answer

Leopard offers local speech-to-text transcription with cloud-level accuracy, while allowing enterprises to have full control over data.

Using transcription APIs means relying on a third-party cloud for voice processing, unless you can deploy the engine on your premises. The cloud comes with costs: hefty bills at scale, privacy and environmental costs, and productivity losses due to network outages or delays.

Question 18

How do I choose the best speech-to-text for my project?

Accepted Answer

There are several things to consider when selecting a transcription engine, and some matter more for certain use cases than for others. Although all audio transcription software fundamentally converts speech to text, each transcription engine has certain competitive advantages over the others. For example, on-device transcription is a better fit than cloud-dependent APIs for applications that are concerned about privacy, security, and compliance.

Question 19

Which languages does Leopard Speech-to-Text support?

Accepted Answer

Leopard Speech-to-Text currently supports English, French, German, Italian, Portuguese, and Spanish. If you require support for additional languages, contact sales and tell us about your commercial endeavour. Don’t forget to include the use case, business requirements, and project details. The Picovoice team will respond to you.

Question 20

Can I use Leopard Speech-to-Text in the field of telephony such as call centers?

Accepted Answer

Yes. Leopard can be used to transcribe any audio recording, including call center conversations. Picovoice software supports 16 kHz audio. For 8 kHz audio, please contact sales to get access to a model that processes 8 kHz audio.

Question 21

Does Leopard Speech-to-Text support real-time transcription?

Accepted Answer

Leopard Speech-to-Text doesn’t, but Cheetah Streaming Speech-to-Text does. Learn more about Cheetah Streaming Speech-to-Text.

Leopard Speech-to-Text: Local Transcription FAQ