Leopard Speech-to-Text: Local Transcription FAQ
What is WER? How do you calculate the word error rate?
WER stands for Word Error Rate. The word error rate is calculated by dividing the sum of the errors by the total number of reference words. There are three types of errors included in the calculation: substitutions (a reference word is replaced by another), insertions (a word is hypothesized that is not in the reference) and deletions (a word in the reference transcript is missed).
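The calculation above can be sketched in a few lines. This is a generic WER implementation using a standard word-level Levenshtein alignment, not Picovoice-specific code:

```python
# Generic WER computation via word-level edit distance:
# WER = (substitutions + insertions + deletions) / number of reference words.
def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = minimum edits to turn the first i reference words
    # into the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # all deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # all insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1  # substitution cost
            dp[i][j] = min(
                dp[i - 1][j] + 1,        # deletion
                dp[i][j - 1] + 1,        # insertion
                dp[i - 1][j - 1] + cost  # match or substitution
            )
    return dp[len(ref)][len(hyp)] / len(ref)

# One deletion ("the" before "mat") out of six reference words → WER ≈ 0.167
print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Note that WER can exceed 100% when the hypothesis contains many insertions, which is why a percentage alone says little without knowing the evaluation set.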
What is a good word error rate?
A good word error rate depends on your use case. Major cloud providers achieve a WER below 25%, and 15% or lower is now generally considered acceptable. However, when evaluating WER, be careful with the data set: engine performance varies across datasets, and some vendors may cherry-pick data sets for marketing purposes. Also, if the errors fall on the words that matter most for your use case, even a 1% WER may not be good enough for you.
How do I test Leopard Speech-to-Text?
To test how Leopard Speech-to-Text transcribes audio files to text or converts voice recordings to text, check out the Leopard web demo. The demo processes data locally within your web browser.
How do I measure the performance of Leopard Speech-to-Text?
Picovoice publishes open-source, open-data benchmarks to showcase its voice recognition capabilities. Here’s the link for the speech-to-text benchmark.
How do I measure Leopard Speech-to-Text accuracy?
By using open data sets such as LibriSpeech test-clean, LibriSpeech test-other, Common Voice test and TED-LIUM test. Leopard achieves 11% WER on average across these four data sets. Check out the open-source speech-to-text benchmark for details.
How do I improve audio transcription accuracy?
It’s a known phenomenon that speech-to-text transcriptions are not good at capturing proper names and homophones. Models should be adapted to the use case to improve transcription accuracy. Modern audio transcription software offers two options to achieve this: 1) adding custom vocabulary and 2) boosting word accuracy. If a use case makes heavy use of industry terms, technical jargon or proper names such as personal names or street names, the flexibility to adapt the base model is essential. Transcribing “Bob Loblaw” as “blah blah blah” may not matter if you’re transcribing lecture notes where “Bob Loblaw” appears as just a small example. However, it does matter if you’re building an application for Fox and Netflix, which own the rights to Arrested Development, or for Loblaw, the Canadian retailer.
How do I develop a custom speech-to-text model?
Before starting to develop a custom speech-to-text model, the first question to answer is whether you need a custom model at all or only need to adapt an existing one. The general, i.e. base, models of modern transcription APIs and engines are trained on large amounts of voice data to ensure accuracy across various use cases. In most cases, training a custom model from scratch is not needed; fine-tuning the base model for the use case, as explained in the question above on improving transcription accuracy, is enough. However, in specific cases, for example those requiring recognition of children’s speech, local dialects or extinct languages, developing a custom speech-to-text model is needed. Such cases should be treated as developing a new language model.
How long does it take to transcribe 1-hour audio?
It depends on the method and tools used for transcription. It takes human transcribers approximately 4 hours to transcribe an hour-long audio file. For ASR Cloud APIs, it depends on various factors, including connectivity (how fast audio files are sent for processing), latency and the performance of the engine. The first two are not managed by the ASR API providers unless you deploy the ASR on-prem. If you have ever waited almost a minute for Alexa to answer what time it is, then you have already experienced the impact of connectivity and latency on response time. Leopard, by processing voice data on the device, offers reliable response times and zero network latency. The speed, i.e. the duration of processing data to convert voice to text, is measured by the Real Time Factor (RTF). RTF is the ratio of the time taken to transcribe an audio file to the duration of the audio, so the lower the RTF, the better. Although there is no standard, an RTF of at most 0.5 is generally preferred. In other words, transcribing hour-long audio should take 30 minutes or less. Leopard achieves a 0.05 RTF on an Ubuntu machine with an Intel(R) Core(TM) i5-6500 CPU @ 3.20GHz, which means transcribing hour-long audio takes Leopard 3 minutes.
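The RTF arithmetic above is simple enough to check directly; this is plain arithmetic, not Picovoice code:

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    # RTF = time taken to transcribe the audio / duration of the audio.
    # Lower is better; an RTF below 1.0 means faster than real time.
    return processing_seconds / audio_seconds

audio = 60 * 60  # an hour-long recording, in seconds

# Taking 30 minutes to transcribe one hour gives the commonly preferred 0.5 RTF.
print(real_time_factor(30 * 60, audio))  # 0.5

# Taking 3 minutes gives 0.05, the RTF cited for Leopard above.
print(real_time_factor(3 * 60, audio))  # 0.05
```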
Does Leopard Speech-to-Text run on Linux?
Yes, it does. Check out Leopard Speech-to-Text documentation to get started.
Does Leopard Speech-to-Text run on macOS?
Yes, Leopard Speech-to-Text supports macOS. Check out Leopard Speech-to-Text documentation to get started.
Does Leopard Speech-to-Text run on Windows?
Yes, Windows is supported. Check out Leopard Speech-to-Text documentation to get started.
Does Leopard Speech-to-Text run within web browsers such as Chrome, Safari and Firefox?
Yes, Leopard Speech-to-Text supports modern web browsers.
Does Leopard Speech-to-Text run on Android?
Yes, you can build a local transcription product for Android. Check out Leopard Speech-to-Text documentation to get started.
Does Leopard Speech-to-Text run on iOS?
Yes, on-device speech-to-text is also available for iOS. Check out Leopard Speech-to-Text documentation to get started.
Does Leopard Speech-to-Text run on Raspberry Pi?
Yes, Raspberry Pi 3 and 4 are supported for on-device voice-to-text conversion. Check out Leopard Speech-to-Text documentation to get started.
Is Leopard Speech-to-Text free?
Leopard Speech-to-Text offers free on-device voice recognition for both commercial and non-commercial projects under the Free Plan.
Does Leopard Speech-to-Text convert voice recording to text?
Yes, you can use Leopard Speech-to-Text to convert voice to text; it takes only a few lines of code to get started. Check out the Leopard docs and start now.
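As a minimal sketch, transcribing a file with the `pvleopard` Python SDK (`pip3 install pvleopard`) might look like the following. The AccessKey and file path are placeholders, and the exact return values may differ between SDK versions, so treat this as illustrative and consult the Leopard docs for the authoritative API:

```python
def summarize(transcript: str, num_words: int) -> str:
    # Small pure helper: one-line summary of a transcription result.
    return f"{num_words} words: {transcript}"

if __name__ == "__main__":
    # Requires the pvleopard SDK and a valid Picovoice AccessKey.
    import pvleopard

    leopard = pvleopard.create(access_key="${YOUR_ACCESS_KEY}")  # placeholder key
    # process_file transcribes a local audio file on-device; in recent SDK
    # versions it returns the transcript along with word-level metadata.
    transcript, words = leopard.process_file("recording.wav")  # placeholder path
    print(summarize(transcript, len(words)))
    leopard.delete()  # release native resources
```

Because processing happens on-device, no audio leaves the machine during the `process_file` call.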
What’s the benefit of Leopard Speech-to-Text over speech-to-text cloud APIs?
Leopard Speech-to-Text brings control over data and costs back to enterprises. Using transcription APIs means relying on a 3rd-party cloud for voice processing unless you can get it deployed on your premises. The cloud comes with costs: hefty bills at scale, privacy and environmental costs, and productivity losses due to network outages or delays. Even on-prem deployment does not help with cost at scale, and audio transcription API providers offer on-prem deployment only to large-volume customers, who suffer the cost at scale the most.
How do I choose the best speech-to-text for my project?
The best speech-to-text depends on business requirements and resources. While free and open-source audio transcription software converts files to text at no charge, it requires additional resources, and the total cost of ownership may be higher than that of paid options. Also, every use case requires different features and different levels of support. Check out our blog post on how to evaluate audio transcription engines.
Which languages does Leopard Speech-to-Text support?
Leopard Speech-to-Text only supports English for now; more languages are coming soon. Reach out to Picovoice Consulting to tell us about your commercial endeavour if you require support for additional languages. Don’t forget to include the use case, business requirements and project details. The Picovoice team will respond to you.
Can I use Leopard Speech-to-Text in the field of telephony such as call centers?
Yes. Leopard can be used to transcribe any audio recording, including call center conversations. Picovoice software supports 16 kHz audio; for 8 kHz audio, please contact Picovoice Consulting.
Does Leopard Speech-to-Text support real-time transcription?
Leopard doesn’t, but Cheetah does. Learn more about Cheetah Speech-to-Text.