TLDR: Build voice AI agents for patient triage, appointment scheduling, and medical billing using Python with on-device speech processing. This tutorial implements a HIPAA-compliant medical voice assistant with wake word detection, real-time speech-to-text optimized for clinical vocabulary, and voice synthesis.
Cloud-based voice AI creates deployment challenges for healthcare: 1-2 seconds of latency disrupts conversation flow, and transmitting patient voice data requires extensive HIPAA compliance infrastructure. The 2023 Perry Johnson & Associates breach exposed roughly nine million patient records despite compliance measures, demonstrating the risks of centralized cloud storage.
On-device speech processing provides a secure alternative. By handling audio entirely on the edge and transmitting only anonymized text for reasoning, this approach minimizes network latency while ensuring Protected Health Information (PHI) remains secured within the local infrastructure, mitigating cloud compliance risks.
This tutorial demonstrates this hybrid edge architecture by building a fully functional triage agent. Wake word detection, speech recognition, and voice synthesis run entirely on-device, while GPT-4 is used strictly for medical reasoning on sanitized text.
What You'll Build:
- Wake word activation ("Hey Doctor") for hands-free operation
- Real-time speech transcription optimized for medical vocabulary
- GPT-4 reasoning layer
- Natural voice synthesis for patient responses
- Complete Python implementation deployable on edge devices
What You'll Need:
- Python 3.9+
- Microphone and speakers
- Picovoice AccessKey from the Picovoice Console
- OpenAI API key from the OpenAI Platform page
System Architecture
The medical triage agent operates on a strict privacy-first pipeline, ensuring patient audio never leaves the device.
- Wake Word Detection: The system remains in a passive listening state using Porcupine Wake Word, waiting for a specific medical phrase (e.g., "Hey Doctor") to trigger activation without sending audio to the cloud.
- Local Transcription: Once triggered, Cheetah Streaming Speech-to-Text transcribes the patient's speech in real-time. This engine is optimized with a custom vocabulary to accurately capture clinical terminology, achieving higher accuracy than generic models.
- Sanitization & Reasoning: A local function strips personally identifiable information (PII) such as names and dates from the transcript, then sends only the anonymized text to the OpenAI API for medical assessment.
- Voice Response: The text-based triage advice is converted back into natural speech using Orca Streaming Text-to-Speech and played to the patient.
For deployments requiring zero cloud transmission of patient data, replace OpenAI with picoLLM to run the reasoning layer entirely on-device.
Create Custom Wake Word for Medical Assistant
- Sign up for a Picovoice Console account and navigate to the Porcupine page.
- Enter your wake phrase such as "Hey Doctor" and test it using the microphone button.
- Click "Train", select the target platform, and download the .ppn model file.
For tips on designing an effective wake word, review the choosing a wake word guide.
Optimizing Speech Recognition for Clinical Vocabulary
- Sign up for a Picovoice Console account and navigate to the Leopard & Cheetah page.
- Click "New Model", give the model a name, choose the target language, and click "Create Model".
- Import the medical-dictionary.yml to add custom vocabulary to the model.
medical-dictionary.yml is a curated medical vocabulary for the real-time transcription model, built with the help of the Common Medical Words dataset. Learn how to generate your own in the Custom Speech-to-Text Model guide.
- Test the model using the microphone button.
- Download the model.
To further improve accuracy for speech-to-text, you can add boost words to your .yml file. Boost words increase the likelihood of correctly detecting important medical phrases, improving transcription accuracy for frequently used clinical terminology.
Install Python Dependencies
Install all required Python SDKs and dependencies with a single terminal command:
- Porcupine Wake Word Python SDK: pvporcupine
- Cheetah Streaming Speech-to-Text Python SDK: pvcheetah
- Orca Streaming Text-to-Speech Python SDK: pvorca
- Picovoice Python Recorder library: pvrecorder
- Picovoice Python Speaker library: pvspeaker
- OpenAI Python library: openai, used to access the OpenAI API for reasoning
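The single terminal command referenced above would be, using pip:

```shell
pip install pvporcupine pvcheetah pvorca pvrecorder pvspeaker openai
```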
Implement Wake Word Detection
Implement wake word detection to activate the agent hands-free:
Add Real-Time Medical Speech Recognition
Once the wake word has been detected, capture audio frames and transcribe them in real-time with Cheetah Streaming Speech-to-Text:
Implement AI-Powered Symptom Assessment
The triage engine takes the transcribed text output from Cheetah, sanitizes it to remove PII, and sends the anonymized query to the LLM for reasoning. The following code strips any personally identifiable information to protect patient privacy.
The modular architecture allows swapping GPT-4 for different reasoning models based on your deployment requirements.
Add Text-to-Speech Voice Responses
Convert triage assessments into natural speech:
Orca Streaming Text-to-Speech provides high-quality voice synthesis that runs entirely on-device, keeping patient interactions private.
Complete Medical Triage Voice Agent Code
Here's the full implementation combining all components:
Run the Medical Triage Agent
You will need the Picovoice AccessKey to use the SDKs. Copy it from the Picovoice Console.
Run the following command in your terminal. Replace the placeholder values with your own ACCESS_KEY, OPENAI_KEY and the file paths to your models.
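Assuming the agent is saved as medical_triage_agent.py (a hypothetical filename) and reads its keys from environment variables, a launch might look like:

```shell
export PICOVOICE_ACCESS_KEY="${ACCESS_KEY}"  # your Picovoice AccessKey
export OPENAI_API_KEY="${OPENAI_KEY}"        # your OpenAI API key
python medical_triage_agent.py
```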
The medical triage agent is now ready and listening for the wake word.
Example: Emergency Symptom Detection
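One way to harden this path is a local keyword screen that flags red-flag symptoms before the LLM call, so a network failure can never delay an emergency escalation. The term list below is illustrative, not clinically validated:

```python
# Illustrative red-flag terms; a real deployment would use a clinically reviewed list.
EMERGENCY_TERMS = (
    "chest pain",
    "difficulty breathing",
    "shortness of breath",
    "severe bleeding",
    "loss of consciousness",
    "stroke",
)


def screen_for_emergency(transcript: str) -> list:
    """Return the red-flag terms found in the transcript, if any."""
    text = transcript.lower()
    return [term for term in EMERGENCY_TERMS if term in text]


hits = screen_for_emergency("Hey Doctor, I have crushing chest pain and shortness of breath")
if hits:
    print(f"EMERGENCY detected ({', '.join(hits)}): advise calling emergency services now")
```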
Medical Voice AI for Appointment Scheduling, Billing, and Prescription Refills
The voice AI architecture demonstrated in this tutorial can be adapted for various medical applications beyond triage. Here are common implementations using the same core pipeline:
Appointment Scheduling Agent
Purpose: Automates patient appointment booking, rescheduling, and cancellations through natural conversation.
Implementation approach:
- Integrate with calendar/scheduling APIs (custom EHR systems)
- Add slot availability checking and confirmation workflows
Example interaction:
Billing Support Agent
Purpose: Handles insurance inquiries, payment processing, and billing questions.
Implementation approach:
- Integrate with billing systems and payment processors
- Implement secure payment collection workflows
Example interaction:
Prescription Refill Agent
Purpose: Automates prescription refill requests and pharmacy coordination.
Implementation approach:
- Integrate with pharmacy management systems
- Verify patient medication lists and refill eligibility through your EHR
Example interaction:
Medical Records Request Agent
Purpose: Processes requests for medical records, test results, and documentation.
Implementation approach:
- Integrate with EHR systems for record retrieval
- Provide secure delivery options (patient portal, fax, mail)
Example interaction:
Each of these implementations uses the same core voice pipeline with domain-specific system prompts, custom vocabulary for their domain, and integrations with relevant healthcare systems.
Next Steps: Customization and Integration
Multi-Language Support: Picovoice supports multiple languages across all of its models. Download language-specific Porcupine, Cheetah, and Orca models to serve different patient populations.
Interactive Follow-Up: Add follow-up questions based on initial symptoms. For instance, if a patient mentions pain, ask about severity on a 1-10 scale. If they mention fever, ask about temperature readings.
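The simplest version of this is a rule-based map from detected symptom keywords to follow-up questions; the mapping below is hypothetical:

```python
# Hypothetical mapping from symptom keywords to follow-up questions.
FOLLOW_UPS = {
    "pain": "On a scale of 1 to 10, how severe is the pain?",
    "fever": "Have you taken your temperature? What was the reading?",
    "cough": "Is the cough dry, or are you bringing anything up?",
}


def next_follow_up(transcript: str):
    """Return the first follow-up question matching the transcript, or None."""
    text = transcript.lower()
    for symptom, question in FOLLOW_UPS.items():
        if symptom in text:
            return question
    return None
```

The returned question can be fed straight into the text-to-speech stage, keeping the whole follow-up exchange on-device except the final reasoning call.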
EHR Integration: Connect with electronic health records to access patient history and save triage assessments. In production deployments, the system would query your EHR's API for relevant medical history and write back triage results as encounter notes.
Phone System Integration: Integrate with existing phone infrastructure using Twilio or similar services. Incoming calls trigger the triage agent, and responses are delivered through the phone system's audio interface.
Multi-Agent Systems: Combine multiple agent types into a unified voice AI system. A patient could start with triage, get routed to appointment scheduling, and finish with billing questions—all in one continuous conversation.