TL;DR: Build a real-time, on-device medical transcription system in Python. Use a custom speech-to-text model with clinical vocabulary and evaluate its accuracy using word error rate (WER).
Real-time medical transcription requires both accuracy and speed. However, cloud speech APIs add latency, and generic speech recognition models often misinterpret medical terminology or drug names, causing documentation errors that affect patient care. Cheetah Streaming Speech-to-Text addresses these challenges by running fully on-device, enabling fast, HIPAA-compliant transcription with custom vocabulary.
Custom vocabulary lets developers adapt the speech-to-text engine to specialized domains. It defines new terms, abbreviations, or context-specific language that the base model may not recognize.
To further improve speech-to-text accuracy, you can add boost words to your custom vocabulary .yml file. Boost words increase the likelihood of correctly detecting important medical phrases, improving transcription accuracy for frequently used clinical terminology.
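The boost-word file format is defined by the Picovoice Console; purely as a hypothetical illustration (this exact schema is our assumption, not the documented format), such a file could list clinical phrases like:

```yaml
# Hypothetical layout only — follow the format shown in the Picovoice Console.
boost_words:
  - angiotensin
  - efferent arteriole
  - afferent arteriole
  - glomerular filtration rate
```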
Implement the Medical Transcription System in Python
Now that you have the custom medical model downloaded from the Picovoice Console, let's use it to implement a medical transcription system in Python.
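The transcriber can be sketched as follows. Here, read_frames and transcribe are our own helper names, while pvcheetah.create(), process(), and flush() are the Cheetah Python SDK's streaming API (install it with pip3 install pvcheetah). This is a minimal sketch assuming a 16 kHz, 16-bit, mono WAV input:

```python
import struct
import wave


def read_frames(audio_path, frame_length):
    """Yield fixed-size frames of 16-bit mono PCM samples from a WAV file."""
    with wave.open(audio_path, "rb") as wav:
        n = wav.getnframes()
        pcm = struct.unpack("%dh" % n, wav.readframes(n))
    for start in range(0, len(pcm) - frame_length + 1, frame_length):
        yield pcm[start:start + frame_length]


def transcribe(access_key, model_path, audio_path):
    """Stream a WAV file through Cheetah and return the full transcript."""
    import pvcheetah  # pip3 install pvcheetah

    cheetah = pvcheetah.create(access_key=access_key, model_path=model_path)
    try:
        transcript = ""
        for frame in read_frames(audio_path, cheetah.frame_length):
            partial, _ = cheetah.process(frame)  # (partial transcript, is_endpoint)
            transcript += partial
        transcript += cheetah.flush()  # final transcript for any remaining audio
        return transcript
    finally:
        cheetah.delete()
```

Because Cheetah is a streaming engine, process() is called once per frame and returns partial transcripts as they become available, so the same loop works for live microphone audio as well as files.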
Replace ${ACCESS_KEY} with your AccessKey from the Picovoice Console and update the model and audio paths with your own, using an audio file recorded at 16 kHz, 16-bit, mono:
python medical_transcriber.py \
--access_key ${ACCESS_KEY} \
--model_path /path/to/medical_model.pv \
--audio_path /path/to/audio.wav
Benchmark Medical Transcription Accuracy
To measure transcription accuracy, use Word Error Rate (WER) as the key metric. WER compares the generated transcript to a reference transcript: WER = (substitutions + deletions + insertions) / number of reference words. A lower WER means better accuracy.
Python Code to Calculate Transcription Accuracy with WER
Use this Python script to calculate WER:
import argparse
# Calculate minimum edit distance between two word sequences
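Filling in the edit-distance computation sketched above, a complete, self-contained version (the edit_distance, word_error_rate, and main helpers are our own names, matching the --reference and --transcript flags used to run the script) might look like:

```python
import argparse
import sys


# Calculate minimum edit distance between two word sequences
def edit_distance(ref_words, hyp_words):
    dist = [[0] * (len(hyp_words) + 1) for _ in range(len(ref_words) + 1)]
    for i in range(len(ref_words) + 1):
        dist[i][0] = i  # deleting i reference words
    for j in range(len(hyp_words) + 1):
        dist[0][j] = j  # inserting j hypothesis words
    for i in range(1, len(ref_words) + 1):
        for j in range(1, len(hyp_words) + 1):
            cost = 0 if ref_words[i - 1] == hyp_words[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
                dist[i - 1][j - 1] + cost,  # substitution or match
            )
    return dist[-1][-1]


def word_error_rate(reference, transcript):
    # WER = (substitutions + deletions + insertions) / reference word count
    ref_words = reference.lower().split()
    hyp_words = transcript.lower().split()
    return edit_distance(ref_words, hyp_words) / len(ref_words)


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--reference", required=True, help="path to the reference transcript")
    parser.add_argument("--transcript", required=True, help="path to the generated transcript")
    args = parser.parse_args()

    with open(args.reference) as f:
        reference = f.read()
    with open(args.transcript) as f:
        transcript = f.read()

    print("WER: %.1f%%" % (100 * word_error_rate(reference, transcript)))


if __name__ == "__main__" and len(sys.argv) > 1:
    main()  # run only when CLI arguments are provided
```

Both texts are lowercased before comparison so that casing differences are not counted as errors; punctuation handling can be added to the normalization step if your reference transcript is unpunctuated.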
Save the script as calculate_accuracy.py and run the following command with reference.txt and transcript.txt:
python calculate_accuracy.py \
--reference reference.txt \
--transcript transcript.txt
Custom Medical Transcription Model Performance: Accuracy Comparison Results
We tested both the base and custom medical models on the same medical audio to measure the impact of custom vocabulary. The example uses audio from a medical education video containing clinical terminology, illustrating how the system processes domain-specific speech.
The custom Cheetah Streaming Speech-to-Text medical model achieved a WER of 10.0%, compared to 23.0% for the base model. That is a 57% relative reduction in word errors ((23.0 - 10.0) / 23.0 ≈ 57%).
Example Transcription Comparison
From the medical education audio, here’s an example sentence that the models transcribed:
Ground Truth:
"Angiotensin-two causes the efferent arteriole to constrict more than afferent arteriole which increases the glomerular filtration rate."
Custom Medical Model:
"Angiotensin-two causes the efferent arteriole to construct more than the afferent arteriole, which increases the glomerular filtration rate."
Base Cheetah Model:
"Angie attention to causes the front arterial to construct more than the apparent arterial, which increases the glomerular filtration rate."
Unlike the base model, the custom medical model correctly identified medical terms such as "angiotensin-two," "efferent arteriole," and "afferent arteriole."
Start Building Real-Time Medical Transcription Software
Ready to build your own HIPAA-compliant medical transcription software? Create a custom model on the Picovoice Console and test it with your domain vocabulary.
What are the system requirements for Cheetah Streaming Speech-to-Text?
Cheetah Streaming Speech-to-Text requires Python 3.9 or higher and runs on Linux (x86_64), macOS (x86_64, arm64), Windows (x86_64, arm64), Android, iOS, Web, and Raspberry Pi (3, 4, 5). The engine processes audio entirely on-device without requiring internet connectivity for transcription, though an internet connection is needed once to validate your AccessKey.
How is Protected Health Information (PHI) handled in medical transcripts generated by Cheetah Streaming Speech-to-Text?
Cheetah Streaming Speech-to-Text transcribes all spoken content including patient names, dates of birth, medical record numbers, and other PHI without automatic redaction, filtering, or de-identification. The engine runs entirely on-device and does not retain audio or transcript data after processing. All PHI remains within the local application environment where the transcription occurs.
Does Cheetah Streaming Speech-to-Text automatically add punctuation to transcripts?
Yes. Cheetah Streaming Speech-to-Text supports automatic punctuation, which can be enabled when creating the engine instance (in the Python SDK, via the enable_automatic_punctuation option). Once enabled, the engine inserts punctuation marks (such as periods, commas, and question marks) and applies true-casing to the transcript to improve readability.
What languages are supported for Cheetah Streaming Speech-to-Text?
Cheetah Streaming Speech-to-Text currently supports English, French, German, Italian, Portuguese, and Spanish. Each language has its own base model, and you can add custom vocabulary specific to that language.
Can I update my custom vocabulary without creating a new model?
No. With Cheetah Streaming Speech-to-Text, you need to create a new model version on the Picovoice Console with your updated vocabulary, then download the updated model file to replace the existing one. You can, however, compare different vocabulary versions by running the same audio through multiple models on the Console.