TL;DR: Build a real-time, on-device medical transcription system in Python. Use a custom speech-to-text model with clinical vocabulary and evaluate its accuracy using word error rate (WER).
Real-time medical transcription requires both accuracy and speed. However, cloud speech APIs add latency, and generic speech recognition models often misinterpret medical terminology or drug names, causing documentation errors that affect patient care. Cheetah Streaming Speech-to-Text addresses these challenges by running fully on-device, enabling fast, HIPAA-compliant transcription with custom vocabulary.
Custom vocabulary lets developers adapt the speech-to-text engine to specialized domains. It defines new terms, abbreviations, or context-specific language that the base model may not recognize.
Train a Custom Medical Speech Recognition Model
- Sign up for a Picovoice Console account and navigate to the Leopard & Cheetah page.
- Click "New Model", give the model a name, choose the target language, and click "Create Model".
- Import the medical-dictionary.yml to add custom vocabulary to the model.
medical-dictionary.yml is a curated medical vocabulary for the real-time transcription model, built with the help of the Common Medical Words dataset. Learn how to generate your own in the Custom Speech-to-Text Model guide.
- Test the model using the microphone button.
- Download the model.
To further improve recognition accuracy, you can add boost words to your .yml file. Boost words increase the likelihood that important medical phrases are detected correctly, improving transcription of frequently used clinical terminology.
Implement the Medical Transcription System in Python
Now that you have the custom medical model downloaded from the Picovoice Console, let's use it to implement a medical transcription system in Python.
Install the Cheetah Python Package
Install the pvcheetah Python package using pip:
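```shell
pip3 install pvcheetah
```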
Python Code for Medical Transcription
This script processes audio with the medical speech-to-text model:
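A minimal sketch of such a script, assuming the pvcheetah package is installed; the `read_frames` helper and the file names are illustrative, and the audio must be 16 kHz, 16-bit, mono:

```python
# Stream a WAV file through Cheetah with a custom medical model.
import struct
import sys
import wave


def read_frames(wav_path, frame_length):
    """Yield fixed-size frames of 16-bit PCM samples from a mono 16 kHz WAV file."""
    with wave.open(wav_path, "rb") as wav:
        assert wav.getnchannels() == 1, "audio must be mono"
        assert wav.getsampwidth() == 2, "audio must be 16-bit"
        assert wav.getframerate() == 16000, "audio must be 16 kHz"
        while True:
            data = wav.readframes(frame_length)
            if len(data) < frame_length * 2:  # drop any trailing partial frame
                break
            yield struct.unpack("<%dh" % frame_length, data)


def transcribe(access_key, model_path, wav_path):
    import pvcheetah

    cheetah = pvcheetah.create(access_key=access_key, model_path=model_path)
    try:
        for frame in read_frames(wav_path, cheetah.frame_length):
            partial, is_endpoint = cheetah.process(frame)
            print(partial, end="", flush=True)  # partial results arrive in real time
            if is_endpoint:
                print(cheetah.flush())
        print(cheetah.flush())  # finalize any remaining audio
    finally:
        cheetah.delete()


if __name__ == "__main__":
    if len(sys.argv) != 4:
        print("usage: transcribe.py ${ACCESS_KEY} <model_path> <audio_path>")
    else:
        transcribe(sys.argv[1], sys.argv[2], sys.argv[3])
```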
Run the Medical Transcription System
Replace ${ACCESS_KEY} with your AccessKey from the Picovoice Console and update the model and audio paths with your own, using an audio file recorded at 16 kHz, 16-bit, mono:
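For example, assuming the transcription script is saved as transcribe.py (the model and audio file names below are placeholders):

```shell
python3 transcribe.py ${ACCESS_KEY} cheetah-medical.pv sample.wav
```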
Benchmark Medical Transcription Accuracy
To measure transcription accuracy, use Word Error Rate (WER) as the key metric. WER compares the generated transcript to a reference transcript: it is the number of word-level errors (substitutions, deletions, and insertions) divided by the number of words in the reference. A lower WER means better accuracy.
Python Code to Calculate Transcription Accuracy with WER
Use this Python script to calculate WER:
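The sketch below computes WER via word-level Levenshtein distance. It lowercases and whitespace-splits the text, so strip punctuation from both files beforehand if your transcripts include it:

```python
# Compute Word Error Rate: WER = (S + D + I) / N, where S, D, I are word
# substitutions, deletions, and insertions, and N is the reference word count.
import sys


def word_error_rate(reference, hypothesis):
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # Word-level Levenshtein distance, computed row by row.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution / match
        prev = curr
    return prev[-1] / max(len(ref), 1)


if __name__ == "__main__":
    if len(sys.argv) != 3:
        print("usage: calculate_accuracy.py <reference.txt> <transcript.txt>")
    else:
        with open(sys.argv[1]) as f:
            reference = f.read()
        with open(sys.argv[2]) as f:
            hypothesis = f.read()
        print("WER: %.1f%%" % (100 * word_error_rate(reference, hypothesis)))
```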
Run the code for WER Calculation
Save the script as calculate_accuracy.py and run the following command with reference.txt and transcript.txt:
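```shell
python3 calculate_accuracy.py reference.txt transcript.txt
```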
Custom Medical Transcription Model Performance: Accuracy Comparison Results
We tested both the base and custom medical models on the same medical audio to measure the impact of custom vocabulary. The example uses audio from a medical education video containing clinical terminology, illustrating how the system processes domain-specific speech.
The custom Cheetah Streaming Speech-to-Text medical model achieved a WER of 10.0%, compared to 23.0% for the base model, a 57% relative reduction in word errors.
Example Transcription Comparison
From the medical education audio, here’s an example sentence that the models transcribed:
- Ground Truth:
- Custom Medical Model:
- Base Cheetah Model:
Unlike the base model, the custom medical model correctly identified medical terms such as "angiotensin-two," "efferent arteriole," and "afferent arteriole."
Start Building Real-Time Medical Transcription Software
Ready to build your own HIPAA-compliant medical transcription software? Create a custom model on the Picovoice Console and test it with your domain vocabulary.
Frequently Asked Questions
Cheetah Streaming Speech-to-Text requires Python 3.9 or higher and runs on Linux (x86_64), macOS (x86_64, arm64), Windows (x86_64, arm64), Android, iOS, Web and Raspberry Pi (3, 4, 5). The engine processes audio entirely on-device without requiring internet connectivity for transcription, though an internet connection is needed once to validate your AccessKey.
Cheetah Streaming Speech-to-Text transcribes all spoken content including patient names, dates of birth, medical record numbers, and other PHI without automatic redaction, filtering, or de-identification. The engine runs entirely on-device and does not retain audio or transcript data after processing. All PHI remains within the local application environment where the transcription occurs.
Yes, Cheetah Streaming Speech-to-Text supports automatic punctuation. It can be enabled when creating the model instance. Once enabled, the engine inserts punctuation marks (such as periods, commas, and question marks) and applies true-casing to the transcript to improve readability.
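For example, a minimal sketch in Python (the model path argument is illustrative):

```python
def create_punctuating_cheetah(access_key, model_path):
    """Create a Cheetah instance with automatic punctuation enabled."""
    import pvcheetah

    return pvcheetah.create(
        access_key=access_key,
        model_path=model_path,  # e.g. path to the custom medical model
        enable_automatic_punctuation=True,
    )
```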
Cheetah Streaming Speech-to-Text currently supports English, French, German, Italian, Portuguese, and Spanish. Each language has its own base model, and you can add custom vocabulary specific to that language.
No. With Cheetah Streaming Speech-to-Text, you need to create a new model version on the Picovoice Console with your updated vocabulary, then download the updated model file to replace your existing one. You can, however, compare different vocabulary versions by running the same audio through multiple models on the Console.