TL;DR: Build a real-time, on-device medical transcription system in Python. Use a custom speech-to-text model with clinical vocabulary and evaluate its accuracy using word error rate (WER).
Real-time medical transcription requires both accuracy and speed. However, cloud speech APIs add latency, and generic speech recognition models often misinterpret medical terminology or drug names, causing documentation errors that affect patient care. Cheetah Streaming Speech-to-Text addresses these challenges by running fully on-device, enabling fast, HIPAA-compliant transcription with custom vocabulary.
Custom vocabulary lets developers adapt the speech-to-text engine to specialized domains. It defines new terms, abbreviations, or context-specific language that the base model may not recognize.
Train a Custom Medical Speech Recognition Model
- Sign up for a Picovoice Console account and navigate to the Leopard & Cheetah page.
- Click "New Model", give the model a name, choose the target language, and click "Create Model".
- Import the medical-dictionary.yml to add custom vocabulary to the model.
medical-dictionary.yml is a curated medical vocabulary for the real-time transcription model, built with the help of the Common Medical Words dataset. Learn how to generate your own in the Custom Speech-to-Text Model guide.
- Test the model using the microphone button.
- Download the model.
To further improve recognition accuracy, you can add boost words to your .yml file. Boost words increase the likelihood that important medical phrases are detected correctly, improving transcription of frequently used clinical terminology.
Implement the Medical Transcription System in Python
Now that you have the custom medical model downloaded from the Picovoice Console, let's use it to implement a medical transcription system in Python.
Install the Cheetah Python Package
Install the pvcheetah Python package using pip:
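```shell
pip3 install pvcheetah
```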
Python Code for Medical Transcription
This script processes audio with the medical speech-to-text model:
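A minimal sketch of such a script, assuming the pvcheetah package is installed; the `read_frames` helper and the file names are illustrative, and the audio must be 16 kHz, 16-bit, mono:

```python
# Stream a WAV file through Cheetah with a custom medical model.
import struct
import sys
import wave


def read_frames(wav_path, frame_length):
    """Yield fixed-size frames of 16-bit PCM samples from a mono 16 kHz WAV file."""
    with wave.open(wav_path, "rb") as wav:
        assert wav.getnchannels() == 1, "audio must be mono"
        assert wav.getsampwidth() == 2, "audio must be 16-bit"
        assert wav.getframerate() == 16000, "audio must be 16 kHz"
        while True:
            data = wav.readframes(frame_length)
            if len(data) < frame_length * 2:  # drop any trailing partial frame
                break
            yield struct.unpack("<%dh" % frame_length, data)


def transcribe(access_key, model_path, wav_path):
    import pvcheetah

    cheetah = pvcheetah.create(access_key=access_key, model_path=model_path)
    try:
        for frame in read_frames(wav_path, cheetah.frame_length):
            partial, is_endpoint = cheetah.process(frame)
            print(partial, end="", flush=True)  # partial results arrive in real time
            if is_endpoint:
                print(cheetah.flush())
        print(cheetah.flush())  # finalize any remaining audio
    finally:
        cheetah.delete()


if __name__ == "__main__":
    if len(sys.argv) != 4:
        print("usage: transcribe.py ${ACCESS_KEY} <model_path> <audio_path>")
    else:
        transcribe(sys.argv[1], sys.argv[2], sys.argv[3])
```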
Run the Medical Transcription System
Replace ${ACCESS_KEY} with your AccessKey from the Picovoice Console and update the model and audio paths with your own, using an audio file recorded at 16 kHz, 16-bit, mono:
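For example, assuming the transcription script is saved as transcribe.py (the model and audio file names below are placeholders):

```shell
python3 transcribe.py ${ACCESS_KEY} cheetah-medical.pv sample.wav
```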
Benchmark Medical Transcription Accuracy
To measure transcription accuracy, use Word Error Rate (WER) as the key metric. WER compares the generated transcript to a reference transcript: it is the number of word-level errors (substitutions, deletions, and insertions) divided by the number of words in the reference. A lower WER means better accuracy.
Python Code to Calculate Transcription Accuracy with WER
Use this Python script to calculate WER:
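The sketch below computes WER via word-level Levenshtein distance. It lowercases and whitespace-splits the text, so strip punctuation from both files beforehand if your transcripts include it:

```python
# Compute Word Error Rate: WER = (S + D + I) / N, where S, D, I are word
# substitutions, deletions, and insertions, and N is the reference word count.
import sys


def word_error_rate(reference, hypothesis):
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # Word-level Levenshtein distance, computed row by row.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution / match
        prev = curr
    return prev[-1] / max(len(ref), 1)


if __name__ == "__main__":
    if len(sys.argv) != 3:
        print("usage: calculate_accuracy.py <reference.txt> <transcript.txt>")
    else:
        with open(sys.argv[1]) as f:
            reference = f.read()
        with open(sys.argv[2]) as f:
            hypothesis = f.read()
        print("WER: %.1f%%" % (100 * word_error_rate(reference, hypothesis)))
```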
Run the code for WER Calculation
Save the script as calculate_accuracy.py and run the following command with reference.txt and transcript.txt:
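```shell
python3 calculate_accuracy.py reference.txt transcript.txt
```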
Custom Medical Transcription Model Performance: Accuracy Comparison Results
We tested both the base and custom medical models on the same medical audio to measure the impact of custom vocabulary. The example uses audio from a medical education video containing clinical terminology, illustrating how the system processes domain-specific speech.
The custom Cheetah Streaming Speech-to-Text medical model achieved a WER of 10.0%, compared to 23.0% for the base model, a 57% relative reduction in word errors.
Example Transcription Comparison
From the medical education audio, here’s an example sentence that the models transcribed:
- Ground Truth:
- Custom Medical Model:
- Base Cheetah Model:
Unlike the base model, the custom medical model correctly identified medical terms such as "angiotensin-two," "efferent arteriole," and "afferent arteriole."
Start Building Real-Time Medical Transcription Software
Ready to build your own HIPAA-compliant medical transcription software? Create a custom model on the Picovoice Console and test it with your domain vocabulary.
Frequently Asked Questions
Cheetah Streaming Speech-to-Text requires Python 3.9 or higher and runs on Linux (x86_64), macOS (x86_64, arm64), Windows (x86_64, arm64), Android, iOS, Web and Raspberry Pi (3, 4, 5). The engine processes audio entirely on-device without requiring internet connectivity for transcription, though an internet connection is needed once to validate your AccessKey.
Cheetah Streaming Speech-to-Text transcribes all spoken content including patient names, dates of birth, medical record numbers, and other PHI without automatic redaction, filtering, or de-identification. The engine runs entirely on-device and does not retain audio or transcript data after processing. All PHI remains within the local application environment where the transcription occurs.
Yes, Cheetah Streaming Speech-to-Text supports automatic punctuation. It can be enabled when creating the model instance. Once enabled, the engine inserts punctuation marks (such as periods, commas, and question marks) and applies true-casing to the transcript to improve readability.
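For example, a minimal sketch in Python (the model path argument is illustrative):

```python
def create_punctuating_cheetah(access_key, model_path):
    """Create a Cheetah instance with automatic punctuation enabled."""
    import pvcheetah

    return pvcheetah.create(
        access_key=access_key,
        model_path=model_path,  # e.g. path to the custom medical model
        enable_automatic_punctuation=True,
    )
```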
Cheetah Streaming Speech-to-Text currently supports English, French, German, Italian, Portuguese, and Spanish. Each language has its own base model, and you can add custom vocabulary specific to that language.
No. With Cheetah Streaming Speech-to-Text, you need to create a new model version on the Picovoice Console with your updated vocabulary, then download the updated model file to replace your existing one. You can, however, compare different vocabulary versions by running the same audio through multiple models on the Console.