How to Improve Speech-to-Text Accuracy

Maintaining the accuracy of Speech-to-Text (STT) in custom domains is a challenge. Automatic Speech Recognition (ASR) engines that achieve a low Word Error Rate (WER) on general speech often struggle in specialized contexts such as medical transcription, sales coaching, and customer service. Bringing the WER back down requires engineering effort. The level of investment depends on the strategy and ranges from hours to months of R&D. Below is a pragmatic workflow that starts with the lowest-cost tasks.
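Before investing in any of the steps below, it helps to measure WER on your own domain data so you can tell whether a change actually helped. The following is a minimal sketch of the usual definition (word-level edit distance divided by the number of reference words). The function name and the lowercase, whitespace-based tokenization are assumptions made for illustration; production benchmarks typically apply more careful text normalization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # A domain term split into two similar-sounding words costs two edits.
    print(word_error_rate("the patient has tachycardia",
                          "the patient has tacky cardia"))
```

In this example the single misrecognized term accounts for one substitution and one insertion against a four-word reference, which is why one unknown word can move WER noticeably on short utterances.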

Add Custom Words

Every Speech-to-Text engine understands a large but limited set of words known as the Lexicon. A word that is not in the Lexicon is called Out-of-Vocabulary (OOV). When the Lexicon doesn't contain an uttered word, the STT engine transcribes it as one or more similar-sounding words, increasing WER.

If you have an in-house STT engine, add OOVs to your Language Model (LM) pipeline. If you use a third-party vendor, ask whether they offer an API or tool for adding OOVs. For example, developers can add custom words (OOVs) via Picovoice Console to the Picovoice Leopard Speech-to-Text and Cheetah Streaming Speech-to-Text engines.
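To decide which custom words to add, you first need a list of candidates. One way to get it, sketched below, is to compare domain text (for example, reference transcripts or product documentation) against the engine's Lexicon. The `find_oov` helper and the idea that you can export the Lexicon as a set of lowercase words are assumptions for illustration; whether and how a Lexicon can be exported depends on your engine or vendor.

```python
import re
from collections import Counter


def find_oov(text: str, lexicon: set[str]) -> list[str]:
    """Return words in `text` that are missing from `lexicon`, most frequent first."""
    # Simple tokenization: lowercase runs of letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(word for word in words if word not in lexicon)
    return [word for word, _ in counts.most_common()]


if __name__ == "__main__":
    # Hypothetical, tiny lexicon; a real one contains many thousands of words.
    lexicon = {"administer", "then", "recheck", "levels", "and", "the"}
    domain_text = "Administer apixaban, then recheck apixaban levels and the INR."
    print(find_oov(domain_text, lexicon))
```

Sorting by frequency puts the words that are likely to hurt WER the most at the top of the list, which is useful when you only have time to add a few.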

Boost Phrases

Some phrases are frequent in a given specialized domain but rare elsewhere. For example, a keynote speaker at Interspeech uses terms like Word Error Rate, Acoustic Model, Language Model, and End-to-End Speech Recognition, but these word sequences are not frequent in daily conversations.

You can provide these specialized phrases to the ASR model so that it pays closer attention to them. If you are building your LM in-house, these phrases should be frequent in your Text Corpus. If you are using a third-party vendor, ask them about their API for Keyword Boosting. For example, developers can boost keywords via Picovoice Console for the Picovoice Leopard Speech-to-Text and Cheetah Streaming Speech-to-Text engines.
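To make the idea of boosting concrete, here is a toy sketch that re-ranks a list of candidate transcripts by adding a bonus to the score of any candidate containing a boosted phrase. This is an illustration only, not how Picovoice or any particular vendor implements boosting; real engines usually apply the bias inside the decoder. The `rescore` function, the `bonus` value, and the scores in the example are all hypothetical.

```python
def rescore(hypotheses: list[tuple[str, float]],
            boost_phrases: list[str],
            bonus: float = 2.0) -> str:
    """Return the best transcript after rewarding boosted phrases.

    `hypotheses` pairs each candidate transcript with a log-score
    (higher is better). Each occurrence of a boosted phrase adds `bonus`.
    """
    def boosted_score(item: tuple[str, float]) -> float:
        text, score = item
        # Pad with spaces so phrases only match on whole-word boundaries.
        padded = f" {text.lower()} "
        hits = sum(padded.count(f" {phrase.lower()} ") for phrase in boost_phrases)
        return score + bonus * hits

    return max(hypotheses, key=boosted_score)[0]


if __name__ == "__main__":
    # Hypothetical candidates: the engine slightly prefers the wrong one.
    candidates = [
        ("the word era rate dropped", -4.0),
        ("the word error rate dropped", -5.0),
    ]
    print(rescore(candidates, boost_phrases=["Word Error Rate"]))
```

The bonus has to be tuned: too small and the domain phrase still loses, too large and the engine starts hearing the boosted phrase where it was never said.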

Language Model Adaptation

If adding OOVs and boosting keywords doesn't suffice, adapting (training) the Language Model (LM) is the next logical step. It requires a large text corpus (think millions of words) that represents the expected utterances. LM Adaptation is feasible if you have an in-house solution or a strong partnership with your ASR vendor.
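One common adaptation technique, among several, is to interpolate a general-purpose LM with one estimated from the domain corpus. The sketch below shows the idea with unigram models on toy strings. The function names, the 0.5 weight, and the sample corpora are assumptions for illustration; a real pipeline would use higher-order or neural models, smoothing, and a weight tuned on held-out domain data.

```python
from collections import Counter


def unigram_probs(corpus: str) -> dict[str, float]:
    """Maximum-likelihood unigram probabilities from whitespace-split text."""
    counts = Counter(corpus.lower().split())
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def interpolate(general: dict[str, float],
                domain: dict[str, float],
                weight: float = 0.5) -> dict[str, float]:
    """Linear interpolation: (1 - weight) * general + weight * domain."""
    vocab = set(general) | set(domain)
    return {
        word: (1 - weight) * general.get(word, 0.0) + weight * domain.get(word, 0.0)
        for word in vocab
    }


if __name__ == "__main__":
    # Toy corpora; the text above calls for millions of words in practice.
    general = unigram_probs("the cat sat on the mat")
    domain = unigram_probs("the stent the stent placement")
    adapted = interpolate(general, domain, weight=0.5)
    # "stent" is unseen in the general model but gains probability after adaptation.
    print(general.get("stent", 0.0), adapted["stent"])
```

Because both inputs are probability distributions, the interpolated values still sum to one, and the weight controls how far the adapted model drifts from general-purpose language toward the domain.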

Picovoice doesn't provide an automated facility for adapting or training Language Models. This decision is intentional: the steps involved in adapting an LM are complex, and a linguist needs to examine the properties of the adaptation corpus.

Acoustic Model Adaptation

Acoustic Model Adaptation is the last step. It is expensive if done right but, thankfully, rarely required. Only a handful of scenarios call for it, such as ASR for kids.

The self-service Picovoice Console is freely available to any developer who wants to improve the accuracy of Leopard Speech-to-Text and Cheetah Streaming Speech-to-Text without writing a single line of code. If you have a domain-specific use case with unique jargon consisting of thousands of words, such as medical dictation, or need Acoustic Model Adaptation, engage with Picovoice Consulting.
