Get Word-Level Confidence in Speech-to-Text with Python

TLDR: Word-level confidence scores tell you how certain a speech-to-text model is about each transcribed word. This tutorial shows you how to get word-level confidence scores in Python and how to set word-confidence thresholds to detect uncertain words.

What is a Confidence Score?

A confidence score is a numerical probability value that indicates how certain a machine learning model is about its prediction. In the context of speech recognition, confidence scores help measure the reliability of the model's output and identify potential errors in automated transcriptions.

What is a Word-Level Confidence Score?

Word-level confidence scores are numerical values between 0.0 and 1.0 that indicate how certain an Automatic Speech Recognition (ASR) model is about each transcribed word. These per-word scores help estimate the reliability of individual words and decide which ones need review. The two ends of the range are:

0.0: Lowest confidence (very uncertain)
1.0: Highest confidence (very certain)

Confidence scores vary across Automatic Speech Recognition (ASR) engines and indicate model certainty rather than guaranteed accuracy. An ASR model might be 90% accurate overall yet have poorly calibrated confidence scores, for example by assigning high confidence to words it misrecognizes. Different speech-to-text providers may return different word-level confidence values for the same transcribed word, and some systems, such as open-source Whisper in its default configuration, don't return word-level confidence scores at all.
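One way to see whether an engine's confidence scores are well calibrated is to compare them with human-verified results. The sketch below is a minimal illustration, assuming you already have (confidence, is_correct) pairs from a manually reviewed transcript. The function name calibration_table, the bin count, and the sample data are hypothetical and not part of any SDK.

```python
from collections import defaultdict


def calibration_table(scored_words, num_bins=5):
    """Group (confidence, is_correct) pairs into equal-width bins and
    return (bin_low, bin_high, word_count, observed_accuracy) per bin."""
    bins = defaultdict(list)
    for confidence, is_correct in scored_words:
        # Clamp the index so a confidence of exactly 1.0 lands in the top bin.
        index = min(int(confidence * num_bins), num_bins - 1)
        bins[index].append(is_correct)

    table = []
    for index in sorted(bins):
        outcomes = bins[index]
        accuracy = sum(outcomes) / len(outcomes)
        table.append((index / num_bins, (index + 1) / num_bins, len(outcomes), accuracy))
    return table


# Hypothetical data: (confidence, whether a human reviewer marked the word correct).
sample = [
    (0.99, True), (0.92, True), (0.89, True), (0.95, True), (0.78, False),
    (0.94, True), (0.97, True), (0.88, False), (0.96, True),
]

for low, high, count, accuracy in calibration_table(sample):
    print(f"{low:.1f}-{high:.1f}: {count} word(s), {accuracy:.0%} correct")
```

If the observed accuracy in a bin sits well below that bin's confidence range, the scores are overconfident for your audio, and thresholds should be chosen with that in mind.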

Why Confidence Scores Matter

Word-level confidence scores can improve an application's reliability and user experience. For example, a voice assistant can use low word-level confidence scores to ask "Did you say...?" instead of executing an uncertain command, and transcription software can flag uncertain words for human review to improve transcription quality and accuracy.
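The voice assistant case can be sketched in a few lines. This is only an illustration: the Word tuple is a stand-in for whatever word objects your speech-to-text engine returns, and respond_to_command and the 0.8 threshold are hypothetical choices.

```python
from collections import namedtuple

# Stand-in for the per-word results returned by a speech-to-text engine.
Word = namedtuple("Word", ["word", "confidence"])


def respond_to_command(words, threshold=0.8):
    """Ask for confirmation if any word is uncertain; otherwise execute."""
    if not words:
        return "Sorry, I didn't catch that."
    command = " ".join(w.word for w in words)
    # A single uncertain word is enough to change the meaning of a command.
    if any(w.confidence < threshold for w in words):
        return f'Did you say "{command}"?'
    return f"Executing: {command}"


print(respond_to_command([Word("turn", 0.97), Word("off", 0.62), Word("lights", 0.91)]))
print(respond_to_command([Word("turn", 0.97), Word("on", 0.93), Word("lights", 0.91)]))
```

Checking every word rather than the average keeps one badly recognized word from being hidden by several confident ones.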

Implement Word-Level Confidence Scores in Python

What You'll Need:

Python 3.9+
An audio file to transcribe (WAV, MP3, FLAC, or other common formats)
Picovoice AccessKey from the Picovoice Console

What You'll Build:

Python script to get transcriptions with word-level confidence scores
Python script to detect low-confidence words using a custom threshold

Install the Speech-to-Text SDK

Install the speech-to-text SDK using pip:

pip install pvleopard

Full Code to Get Word-Level Confidence Scores in Python

Here's the complete working code that transcribes an audio file and displays each word with its individual confidence score:

import pvleopard
import argparse

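def transcribe_with_confidence(access_key, audio_path):
    # Create the Leopard speech-to-text engine.
    leopard = pvleopard.create(access_key=access_key)
    try:
        # process_file returns the full transcript and a list of per-word results.
        transcript, words = leopard.process_file(audio_path)

        print("Final Transcript:")
        print(transcript)
        print("\nWord-Level Details:")
        for word in words:
            # Left-align each word in a 20-character column, followed by its score.
            print(f"{word.word:<20} {word.confidence:.2f}")
    finally:
        # Release the engine's resources even if transcription fails.
        leopard.delete()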

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--access_key", required=True, help="Picovoice AccessKey")
    parser.add_argument("--audio_path", required=True, help="Path to audio file")
    args = parser.parse_args()
    
    transcribe_with_confidence(args.access_key, args.audio_path)

if __name__ == "__main__":
    main()

Run the Word Confidence Python Script

Save the code as word_confidence.py and run it from the command line with your Picovoice AccessKey and audio file path:

python word_confidence.py --access_key ${YOUR_ACCESS_KEY_HERE} --audio_path path/to/your/audio.wav

Python Code Output: Word-Level Confidence Scores

Final Transcript:
the quick brown fox jumps over the lazy dog

Word-Level Details:
the                  0.99
quick                0.92
brown                0.89
fox                  0.95
jumps                0.78
over                 0.94
the                  0.97
lazy                 0.88
dog                  0.96

Each word includes a confidence score (0.0 to 1.0) associated with it. Lower confidence scores indicate uncertain recognition and potentially misrecognized words; higher confidence scores indicate confident recognition and greater reliability.

Detect Low-Confidence Words in Speech-to-Text (Word Confidence Threshold)

One of the most practical uses of word-level confidence scores is identifying uncertain words that may need review. Here's the Python code to automatically flag low-confidence words using a custom confidence threshold:

import pvleopard
import argparse

CONFIDENCE_THRESHOLD = 0.90

def identify_low_confidence_words(access_key, audio_path, threshold=CONFIDENCE_THRESHOLD):
    leopard = pvleopard.create(access_key=access_key)
    try:
        transcript, words = leopard.process_file(audio_path)

        # Keep only the words whose confidence falls below the threshold.
        low_confidence_words = [word for word in words if word.confidence < threshold]

        if low_confidence_words:
            print(f"Found {len(low_confidence_words)} word(s) below {threshold}:")
            for word in low_confidence_words:
                print(f"  '{word.word}' - {word.confidence:.2f} (at {word.start_sec:.2f}s)")
        else:
            print(f"All words at or above {threshold}")
    finally:
        # Release the engine's resources even if transcription fails.
        leopard.delete()

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--access_key", required=True, help="Picovoice AccessKey")
    parser.add_argument("--audio_path", required=True, help="Path to audio file")
    args = parser.parse_args()
    
    identify_low_confidence_words(args.access_key, args.audio_path)

if __name__ == "__main__":
    main()

Set the confidence threshold according to the application's tolerance for errors. High-stakes applications such as medical or legal transcription can use a higher threshold so that more potential errors are sent for review. Casual applications such as notes or captions can use a lower threshold, which flags fewer words and keeps the workflow efficient.
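To get a feel for this trade-off before picking a value, it can help to count how many words each candidate threshold would flag. The sketch below reuses the scores from the sample output earlier in this tutorial. The flagged_at helper and the three candidate thresholds are illustrative assumptions, not recommended values.

```python
# (word, confidence) pairs copied from the sample output above.
SAMPLE_WORDS = [
    ("the", 0.99), ("quick", 0.92), ("brown", 0.89), ("fox", 0.95), ("jumps", 0.78),
    ("over", 0.94), ("the", 0.97), ("lazy", 0.88), ("dog", 0.96),
]


def flagged_at(words, threshold):
    """Return the words whose confidence is strictly below the threshold."""
    return [word for word, confidence in words if confidence < threshold]


# A higher threshold flags more words and therefore means more review work.
for threshold in (0.80, 0.90, 0.95):
    flagged = flagged_at(SAMPLE_WORDS, threshold)
    print(f"threshold {threshold:.2f}: {len(flagged)} of {len(SAMPLE_WORDS)} flagged -> {flagged}")
```

On this sample, raising the threshold from 0.80 to 0.95 grows the review list from one word to five, which is the cost a high-stakes application accepts in exchange for catching more potential errors.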

Run the Word Confidence Threshold Detection Code

Save the code as low_confidence_detector.py and run it from the command line with your AccessKey and audio file path:

python low_confidence_detector.py --access_key ${YOUR_ACCESS_KEY_HERE} --audio_path path/to/your/audio.wav

Python Code Output: Word Confidence Threshold

Found 3 word(s) below 0.9:
  'brown' - 0.89 (at 0.42s)
  'jumps' - 0.78 (at 0.96s)
  'lazy' - 0.88 (at 1.58s)

The script filters words below the 0.90 confidence threshold and displays each flagged word with its confidence score and timestamp. This makes it easy to locate uncertain words in the audio for manual review or further processing.
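For manual review, it can also be convenient to mark uncertain words directly in the transcript rather than listing them separately. The following is a minimal sketch under stated assumptions: the Word tuple mirrors the word, start_sec, end_sec, and confidence fields used above so the example runs without an AccessKey, and mark_uncertain and its bracket format are hypothetical. The words returned by process_file should work in its place, since only those attributes are read.

```python
from collections import namedtuple

# Stand-in with the same attribute names used by the scripts above.
Word = namedtuple("Word", ["word", "start_sec", "end_sec", "confidence"])


def mark_uncertain(words, threshold=0.90):
    """Rebuild the transcript, wrapping low-confidence words in brackets
    together with their start time so a reviewer can jump to them."""
    parts = []
    for w in words:
        if w.confidence < threshold:
            parts.append(f"[{w.word}? {w.start_sec:.2f}s]")
        else:
            parts.append(w.word)
    return " ".join(parts)


# Illustrative values only; the timings are made up for this example.
words = [
    Word("the", 0.00, 0.12, 0.99),
    Word("quick", 0.18, 0.36, 0.92),
    Word("brown", 0.42, 0.66, 0.89),
    Word("fox", 0.72, 0.90, 0.95),
]

print(mark_uncertain(words))
```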

Leopard Speech-to-Text offers additional production-ready features such as speaker diarization, timestamps, and automatic punctuation. Explore all speech-to-text features to get reliable, high-quality transcriptions.

Start building production-ready speech recognition applications with word-level confidence scores for better transcription quality control today!
