Smart IVR: Python Tutorial for AI Call Center Automation


TLDR: Build a smart IVR system for call center automation in Python. This tutorial shows how to implement low-latency conversational IVR with intent recognition, intelligent call routing, and LLM reasoning for AI-powered customer service automation.

Why Smart IVR Systems Matter for Contact Center Automation

A smart IVR (Interactive Voice Response) uses voice AI to understand speech, route calls intelligently, and resolve customer requests without rigid menu trees or keypad inputs. Unlike traditional IVR systems that rely on fixed flows, smart IVRs combine speech recognition, intent detection, and AI-driven reasoning to handle requests dynamically and reduce friction in customer interactions.

In live customer service calls, even small delays compound quickly and degrade the overall experience. Callers navigate numbered options, repeat information multiple times, and wait through network round-trips that add 1–2 seconds of latency per interaction. Cloud-based voice APIs compound this latency. For example, Amazon’s cloud STT and TTS add substantial processing time: automatic speech recognition takes 920 ms with Amazon Transcribe Streaming, and speech synthesis adds 1540 ms with Amazon Polly. Additionally, text-based intent classification adds an extra transcription step compared to direct speech-to-intent pipelines, increasing end-to-end latency in conversational IVR.

Running speech recognition, intent detection, and language model reasoning locally within the IVR application server eliminates cloud speech API round-trips and delivers faster, more predictable response times.

This tutorial shows how to build a Python IVR system for an AI call center that routes customer service queries between intent recognition and LLM reasoning. It uses voice AI models that run locally on the IVR application server without cloud API dependencies. Cobra Voice Activity Detection handles speech gating, and Rhino Speech-to-Intent handles intent recognition. Complex queries go through Cheetah Streaming Speech-to-Text and picoLLM, and responses are spoken with Orca Streaming Text-to-Speech.

Picovoice AI models can run on-prem, in the cloud, and on-device across platforms including Linux, macOS, Windows, Android, iOS, and web browsers.

What You'll Build:

A conversational IVR system that:

Detects caller speech activity to avoid processing silence
Handles common queries instantly using speech-to-intent recognition
Routes unrecognized queries to an LLM for reasoning
Responds with natural speech synthesis

What You'll Need:

Python 3.9+
A desktop or laptop with microphone and speakers for testing
Picovoice AccessKey from the Picovoice Console

This tutorial focuses on the speech processing and call routing logic. In production, the same pipeline typically runs on an IVR application server (cloud, on-premises, or private infrastructure) that receives audio streams from a telephony system and returns prompts or routing decisions.

Smart IVR Architecture: Intelligent Call Routing with Speech Recognition

The smart IVR system uses the following approach to handle customer queries efficiently:

Voice Activity Detection: Cobra Voice Activity Detection monitors the audio stream and detects when the caller begins speaking. This prevents the system from routing silence or background noise through the speech recognition pipeline.

Intent Recognition: When a customer speaks, Rhino Speech-to-Intent processes the audio directly. If the customer service voicebot recognizes a known intent with required parameters (e.g., "check order status for order 12345"), it responds immediately. This handles the majority of routine customer service queries with minimal latency.

LLM Reasoning: If Rhino returns is_understood=False for ambiguous or complex queries (e.g., "why was I charged twice when I cancelled my order?"), the system prompts the customer to provide more details, then uses Cheetah Streaming Speech-to-Text to transcribe the explanation and routes it to picoLLM for intelligent reasoning.

This AI IVR architecture optimizes for common cases while handling edge cases flexibly.
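
As a rough illustration of this two-path design, the sketch below sends a recognition result either to a fast path or to an LLM fallback. `Inference` is a hypothetical stand-in dataclass that mirrors the `is_understood`, `intent`, and `slots` fields used later in this tutorial; it is not the Rhino class, and both handler callables are placeholders.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional


@dataclass
class Inference:
    """Hypothetical stand-in for a speech-to-intent result."""
    is_understood: bool
    intent: Optional[str] = None
    slots: Dict[str, str] = field(default_factory=dict)


def route(inference: Inference,
          fast_path: Callable[[str, Dict[str, str]], str],
          llm_fallback: Callable[[], str]) -> str:
    """Send understood intents to the fast path, everything else to the fallback."""
    if inference.is_understood and inference.intent is not None:
        return fast_path(inference.intent, inference.slots)
    return llm_fallback()


if __name__ == "__main__":
    known = Inference(True, "checkOrderStatus", {"orderId": "12345"})
    unknown = Inference(False)
    print(route(known, lambda intent, slots: f"fast path: {intent}", lambda: "LLM fallback"))
    print(route(unknown, lambda intent, slots: f"fast path: {intent}", lambda: "LLM fallback"))
```

Keeping the decision in one small function makes it easy to unit-test the routing rule separately from the audio pipeline.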

Create Custom Voice Commands for Customer Service Automation

Rhino requires a context file that defines the specific intents the smart IVR will handle. A context specifies the phrases customers might say and what structured data to extract.

Sign up for a Picovoice Console account and navigate to the Rhino page.
Click "Create New Context" and name it CustomerService.
Click the "Import YAML" button in the top-right corner and paste the following context definition:

context:
  expressions:
    checkOrderStatus:
      - "@lookup (the) status [of, for] order $pv.Alphanumeric:orderId"
      - "[track, find] order (number) $pv.Alphanumeric:orderId"
    checkAccountBalance:
      - "@lookup (my) account balance"
    returnPolicy:
      - "@lookup (the) return policy"
      - "@action [return, send back, exchange] [an, the] [order, item]"
    speakToHuman:
      - "(@action) [speak, talk] to (a, an) $department:dept [agent, representative]"
      - "[connect, transfer] (me) (to) (a, an) $department:dept [agent, representative]"
  slots:
    department:
      - billing
      - technical support
      - returns
      - sales
      - customer service
  macros:
    lookup:
      - check
      - what is
      - tell me
      - I want
      - get
    action:
      - can I
      - could I
      - I want to
      - I'd like to

Test the context in the browser using the microphone button.
Download the .rhn context file for your target platform.

For production-ready customer service voicebots, expand the context to cover 10-15 common intents. Rhino's expression syntax supports optional phrases, synonyms, and slot types like numbers and dates. See the Rhino Expression Syntax Cheat Sheet for details.

Set Up a Local LLM

picoLLM runs compressed language models locally in your environment (for example on an IVR application server), so audio and transcripts can be processed without sending data to the cloud. Download a model from the picoLLM Console:

Sign in to Picovoice Console and navigate to picoLLM.
Select a model. This tutorial uses llama-3.2-3b-instruct-505.pllm.
Click "Download" and place it in your project directory.

Set Up the Python Environment

Install the required SDKs:

Cobra Voice Activity Detection Python SDK: pvcobra
Rhino Speech-to-Intent Python SDK: pvrhino
Cheetah Streaming Speech-to-Text Python SDK: pvcheetah
picoLLM Python SDK: picollm
Orca Text-to-Speech Python SDK: pvorca
Picovoice Python Recorder SDK: pvrecorder
Picovoice Python Speaker SDK: pvspeaker

pip install pvcobra pvrhino pvcheetah picollm pvorca pvrecorder pvspeaker
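
To confirm the installation before running any audio code, a quick check like the following can help. It is a minimal sketch that only uses the standard library and the import names that appear in this tutorial; `missing_modules` is a hypothetical helper, not part of any Picovoice SDK.

```python
from importlib.util import find_spec
from typing import Iterable, List

# Import names used throughout this tutorial.
REQUIRED_MODULES = [
    "pvcobra", "pvrhino", "pvcheetah", "picollm",
    "pvorca", "pvrecorder", "pvspeaker",
]


def missing_modules(names: Iterable[str]) -> List[str]:
    """Return the module names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are installed.")
```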

Add Voice Activity Detection for Caller Speech Gating

Initialize Cobra Voice Activity Detection and wait until the caller starts speaking:

import pvcobra
from pvrecorder import PvRecorder

ACCESS_KEY = "${ACCESS_KEY}"

cobra = pvcobra.create(access_key=ACCESS_KEY)
recorder = PvRecorder(cobra.frame_length)
recorder.start()

print("Waiting for caller to speak...")

voice_probability = 0.0
while voice_probability <= 0.5:
    pcm = recorder.read()
    voice_probability = cobra.process(pcm)

print("Voice detected — routing to speech recognition...")

recorder.stop()
recorder.delete()
cobra.delete()
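
A single frame above the threshold can be a cough or line noise. One possible refinement, sketched below, is to require several consecutive frames above the threshold before treating the audio as speech. The function accepts any iterable of probabilities so it can be tried without a microphone; `min_consecutive` and its default value are assumptions for illustration, not Cobra settings.

```python
from typing import Iterable, Optional


def first_speech_frame(probabilities: Iterable[float],
                       threshold: float = 0.5,
                       min_consecutive: int = 3) -> Optional[int]:
    """Return the index where a run of `min_consecutive` frames above
    `threshold` begins, or None if no such run occurs."""
    run_length = 0
    for index, probability in enumerate(probabilities):
        if probability > threshold:
            run_length += 1
            if run_length == min_consecutive:
                return index - min_consecutive + 1
        else:
            run_length = 0  # the run was interrupted, start counting again
    return None
```

In the live loop above, the same counter could be updated with each value returned by `cobra.process(pcm)`. A higher `min_consecutive` rejects more noise but delays detection by the same number of frames.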

Implement Real-Time Intent Recognition

The conversational IVR captures audio and processes it through Rhino Speech-to-Intent to detect known intents:

import pvcobra
import pvrhino
from pvrecorder import PvRecorder

ACCESS_KEY = "${ACCESS_KEY}"
CONTEXT_PATH = "./models/customer-service.rhn"

cobra = pvcobra.create(access_key=ACCESS_KEY)
rhino = pvrhino.create(
    access_key=ACCESS_KEY,
    context_path=CONTEXT_PATH)

# Stage 1: Wait for voice activity
recorder = PvRecorder(frame_length=cobra.frame_length)
recorder.start()

print("Waiting for caller to speak...")

voice_probability = 0.0
while voice_probability <= 0.5:
    pcm = recorder.read()
    voice_probability = cobra.process(pcm)

print("Voice detected — processing intent...")

recorder.stop()
recorder.delete()

# Stage 2: Process with Rhino
recorder = PvRecorder(frame_length=rhino.frame_length)
recorder.start()

is_finalized = False
while not is_finalized:
    pcm = recorder.read()
    is_finalized = rhino.process(pcm)

inference = rhino.get_inference()
print(f"[INTENT] {inference.intent if inference.is_understood else 'Not understood'}")

recorder.stop()
recorder.delete()

Build Intelligent Call Routing Logic for AI IVR Systems

The intelligent call routing logic determines whether to handle the query with intent recognition or route to picoLLM for reasoning:

def handle_customer_query(inference):
    if inference.is_understood:
        # Fast path: structured intent with slots
        return handle_structured_intent(inference.intent, inference.slots)
    else:
        # Fallback: prompt for more details and use LLM
        return None  # Signal to prompt user

def handle_structured_intent(intent: str, slots: dict[str, str]) -> str:
    """Handle known intents with direct data retrieval"""
    if intent == "checkOrderStatus":
        order_id = slots.get("orderId", "unknown")
        return f"Order {order_id} is currently in transit and expected to arrive on February 2nd."

    elif intent == "checkAccountBalance":
        return "Your current account balance is $127.50."

    elif intent == "returnPolicy":
        return "You can return items within 30 days of purchase for a full refund with the original receipt."

    elif intent == "speakToHuman":
        dept = slots["dept"]
        return f"I'm connecting you to a {dept} representative now. Please hold."

    return "I can help with that. Let me look that up for you."

response = handle_customer_query(inference)
if response:
    print(f"[RESPONSE] {response}")
else:
    print("[ROUTING] Query not understood, will prompt for details...")
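
As the number of intents grows, the if/elif chain can become hard to maintain. One alternative is a dispatch table that maps intent names to handler functions. The sketch below reuses the mock responses from this section; the handler names and `INTENT_HANDLERS` are illustrative, and the default department is an assumption.

```python
from typing import Callable, Dict

Slots = Dict[str, str]


def order_status(slots: Slots) -> str:
    order_id = slots.get("orderId", "unknown")
    return f"Order {order_id} is currently in transit."


def account_balance(slots: Slots) -> str:
    return "Your current account balance is $127.50."


def speak_to_human(slots: Slots) -> str:
    # Fall back to a general queue if the slot is missing (assumed default).
    dept = slots.get("dept", "customer service")
    return f"I'm connecting you to a {dept} representative now. Please hold."


# Maps intent names from the Rhino context to handler functions.
INTENT_HANDLERS: Dict[str, Callable[[Slots], str]] = {
    "checkOrderStatus": order_status,
    "checkAccountBalance": account_balance,
    "speakToHuman": speak_to_human,
}


def dispatch(intent: str, slots: Slots) -> str:
    """Look up the handler for an intent; fall back to a generic reply."""
    handler = INTENT_HANDLERS.get(intent)
    if handler is None:
        return "I can help with that. Let me look that up for you."
    return handler(slots)
```

Adding a new intent then means writing one function and one dictionary entry, with no change to the routing code.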

Handle Complex Queries with Speech-to-Text

When Rhino Speech-to-Intent doesn't recognize an intent, prompt the customer for more details and use Cheetah Streaming Speech-to-Text to transcribe their explanation:

def get_detailed_transcript(cobra, cheetah) -> str:
    """Wait for voice activity, then transcribe detailed explanation"""
    prompt = "I'd be happy to help. Can you explain your question in more detail?"
    print(f"[PROMPT] {prompt}")
    
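    # wait_for_voice_activity() wraps the Cobra loop shown earlier (defined in the complete code below)
    wait_for_voice_activity(cobra)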
    
    recorder = PvRecorder(frame_length=cheetah.frame_length)
    recorder.start()
    
    print("Transcribing...")
    transcript = ""
    
    is_endpoint = False
    while not is_endpoint:
        pcm = recorder.read()
        partial_transcript, is_endpoint = cheetah.process(pcm)
        transcript += partial_transcript
        print(partial_transcript, end="", flush=True)
    
    final_transcript = cheetah.flush()
    transcript += final_transcript
    
    recorder.stop()
    recorder.delete()
    
    print(f"\n[TRANSCRIPT] {transcript}")
    return transcript
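
The loop above runs until Cheetah reports an endpoint, so a caller who never pauses, or a noisy line, could keep it going for a long time. One way to bound it is a frame budget. The sketch below takes the read, process, and flush steps as callables so it can be exercised with stubs; in this tutorial they would correspond to `recorder.read`, `cheetah.process`, and `cheetah.flush`. The `max_frames` parameter is a hypothetical addition.

```python
from typing import Callable, List, Sequence, Tuple


def transcribe_bounded(read_frame: Callable[[], Sequence[int]],
                       process: Callable[[Sequence[int]], Tuple[str, bool]],
                       flush: Callable[[], str],
                       max_frames: int) -> str:
    """Accumulate partial transcripts until an endpoint is detected
    or `max_frames` frames have been processed, then flush."""
    parts: List[str] = []
    for _ in range(max_frames):
        partial, is_endpoint = process(read_frame())
        parts.append(partial)
        if is_endpoint:
            break
    # Flush in both cases so buffered text is not lost.
    parts.append(flush())
    return "".join(parts)
```

The budget follows from the frame size: assuming 512-sample frames at 16 kHz, there are about 31 frames per second, so a 30-second cap is roughly 940 frames.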

Add LLM Reasoning for Complex Queries

When Rhino Speech-to-Intent cannot extract a structured intent, picoLLM provides intelligent reasoning while keeping inference local to the IVR application server:

import picollm

ACCESS_KEY = "${ACCESS_KEY}"
MODEL_PATH = "./models/llama-3.2-3b-instruct-505.pllm"

# Initialize picoLLM once at startup
llm = picollm.create(
    access_key=ACCESS_KEY,
    model_path=MODEL_PATH)

def handle_with_llm(transcript: str) -> str:
    """Generate response using local language model"""
    system_prompt = """You are a helpful customer service assistant. 
Provide clear, concise answers to customer questions. 
If you need information you don't have, acknowledge the limitation and offer to connect them with a human agent."""
    
    dialog = llm.get_dialog(system=system_prompt)
    dialog.add_human_request(transcript)
    
    res = llm.generate(prompt=dialog.prompt(), completion_token_limit=150)
    
    return res.completion

# Example complex query
complex_transcript = "I was charged twice for an order I cancelled. Why did that happen?"
response = handle_with_llm(complex_transcript)
print(f"[LLM RESPONSE] {response}")
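
Because `completion_token_limit` caps the output length, a completion may stop mid-sentence, which sounds abrupt when spoken. A possible post-processing step, sketched here under the assumption that responses are plain English prose, trims the text back to the last sentence-ending punctuation mark. `trim_to_last_sentence` is a hypothetical helper, not part of picoLLM.

```python
def trim_to_last_sentence(text: str) -> str:
    """Drop any trailing fragment after the last '.', '!' or '?'.
    Returns the stripped text unchanged if no terminator is found."""
    stripped = text.strip()
    last = max(stripped.rfind(mark) for mark in ".!?")
    if last == -1:
        return stripped
    return stripped[: last + 1]


if __name__ == "__main__":
    print(trim_to_last_sentence("We have refunded you. The second charge is"))
```

This heuristic is deliberately simple: a decimal point or an abbreviation near the end of the text would also count as a terminator, so a production system would likely need a more careful sentence splitter.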

Add Text-to-Speech for Conversational IVR

The conversational IVR converts text responses into natural speech using Orca:

import pvorca
from pvspeaker import PvSpeaker
from collections import deque

ACCESS_KEY = "${ACCESS_KEY}"

orca = pvorca.create(access_key=ACCESS_KEY)
speaker = PvSpeaker(
    sample_rate=orca.sample_rate, 
    bits_per_sample=16)

def speak_response(text: str) -> None:
    """Convert text to speech and play audio"""
    pcm_out, _ = orca.synthesize(text)
    
    speaker.start()
    pcm_buffer = deque()
    pcm_buffer.append(pcm_out)
    
    while len(pcm_buffer) > 0:
        pcm = pcm_buffer.popleft()
        written = speaker.write(pcm)
        if written < len(pcm):
            pcm_buffer.appendleft(pcm[written:])
    
    speaker.flush()
    speaker.stop()

# Example usage
speak_response("Your order is on its way and will arrive tomorrow.")
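
Fixed prompts such as the greeting are identical on every call, so their audio could be synthesized once and reused. The sketch below is one way to do that with a small in-memory cache. `PromptCache` is a hypothetical helper; the synthesizer is passed in as a callable, which in this tutorial could be `lambda text: orca.synthesize(text)[0]`.

```python
from typing import Callable, Dict, List


class PromptCache:
    """Caches synthesized audio for repeated prompts, keyed by text."""

    def __init__(self, synthesize: Callable[[str], List[int]]) -> None:
        self._synthesize = synthesize
        self._cache: Dict[str, List[int]] = {}
        self.misses = 0  # number of times the synthesizer was actually called

    def get(self, text: str) -> List[int]:
        """Return cached PCM for `text`, synthesizing it on first use."""
        if text not in self._cache:
            self.misses += 1
            self._cache[text] = self._synthesize(text)
        return self._cache[text]
```

Caching only pays off for static prompts. Dynamic responses, such as an order status, change on every call and are better synthesized directly.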

Complete Python Code for Call Center Automation

This complete implementation combines all components into a smart IVR for call center automation:

import argparse
import sys
from collections import deque
import pvcobra
import pvcheetah
import pvrhino
import pvorca
from pvrecorder import PvRecorder
from pvspeaker import PvSpeaker
import picollm

VOICE_ACTIVITY_THRESHOLD = 0.5


def wait_for_voice_activity(cobra) -> None:
    """Use Cobra VAD to wait until the caller starts speaking"""
    recorder = PvRecorder(frame_length=cobra.frame_length)
    recorder.start()

    print("Waiting for caller to speak...")

    voice_probability = 0.0
    while voice_probability <= VOICE_ACTIVITY_THRESHOLD:
        pcm = recorder.read()
        voice_probability = cobra.process(pcm)

    print("Voice detected — processing...")

    recorder.stop()
    recorder.delete()


def handle_structured_intent(intent: str, slots: dict[str, str]) -> str:
    """Fast path: handle known intents with direct responses"""
    if intent == "checkOrderStatus":
        order_id = slots.get("orderId", "unknown")
        return f"Order {order_id} is currently in transit and expected to arrive on February 2nd."

    elif intent == "checkAccountBalance":
        return "Your current account balance is $127.50."

    elif intent == "returnPolicy":
        return "You can return items within 30 days of purchase for a full refund with the original receipt."

    elif intent == "speakToHuman":
        dept = slots["dept"]
        return f"I'm connecting you to a {dept} representative now. Please hold."

    return "I can help with that. Let me look that up for you."


def get_detailed_transcript(cobra, cheetah) -> str:
    """Wait for voice activity, then capture detailed explanation using Cheetah"""

    wait_for_voice_activity(cobra)

    recorder = PvRecorder(frame_length=cheetah.frame_length)
    recorder.start()

    print("Transcribing...")
    transcript = ""

    is_endpoint = False
    while not is_endpoint:
        pcm = recorder.read()
        partial_transcript, is_endpoint = cheetah.process(pcm)
        transcript += partial_transcript
        print(partial_transcript, end="", flush=True)

    final_transcript = cheetah.flush()
    transcript += final_transcript

    recorder.stop()
    recorder.delete()

    print(f"\n[TRANSCRIPT] {transcript}")
    return transcript


def handle_with_llm(llm, transcript: str) -> str:
    """Fallback path: use LLM for complex queries"""
    system_prompt = """You are a helpful customer service assistant.
Provide clear, concise answers to customer questions.
If you need information you don't have, acknowledge the limitation and offer to connect them with a human agent.
Keep responses under 50 words when possible."""

    dialog = llm.get_dialog(system=system_prompt)
    dialog.add_human_request(transcript)

    res = llm.generate(prompt=dialog.prompt(), completion_token_limit=150)

    return res.completion


def speak_response(orca, speaker, text: str) -> None:
    """Convert text to speech and play audio"""
    pcm_out, _ = orca.synthesize(text)

    speaker.start()
    pcm_buffer = deque()
    pcm_buffer.append(pcm_out)

    while len(pcm_buffer) > 0:
        pcm = pcm_buffer.popleft()
        written = speaker.write(pcm)
        if written < len(pcm):
            pcm_buffer.appendleft(pcm[written:])

    speaker.flush()
    speaker.stop()


def main():
    parser = argparse.ArgumentParser(
        description="Smart IVR system with Picovoice for call center automation"
    )
    parser.add_argument("--access_key", required=True,
                        help="Picovoice AccessKey")
    parser.add_argument("--context_path", required=True,
                        help="Path to .rhn Rhino context file")
    parser.add_argument("--model_path", required=True,
                        help="Path to .pllm picoLLM model file")
    args = parser.parse_args()

    cobra = None
    cheetah = None
    rhino = None
    llm = None
    orca = None
    recorder = None
    speaker = None

    try:
        # Initialize engines
        cobra = pvcobra.create(access_key=args.access_key)
        cheetah = pvcheetah.create(
            access_key=args.access_key,
            endpoint_duration_sec=1.0)
        rhino = pvrhino.create(
            access_key=args.access_key,
            context_path=args.context_path)
        llm = picollm.create(
            access_key=args.access_key,
            model_path=args.model_path)
        orca = pvorca.create(access_key=args.access_key)

        print(f'Cobra version: {cobra.version}')
        print(f'Cheetah version: {cheetah.version}')
        print(f'Rhino version: {rhino.version}')
        print(f'Orca version: {orca.version}\n')

        speaker = PvSpeaker(
            sample_rate=orca.sample_rate,
            bits_per_sample=16)

        # Play initial greeting
        greeting = "Hello, how can I help you today?"
        print(f"[GREETING] {greeting}")
        speak_response(orca, speaker, greeting)

        while True:
            # Stage 1: Wait for voice activity with Cobra
            wait_for_voice_activity(cobra)

            # Stage 2: Intent recognition with Rhino
            recorder = PvRecorder(frame_length=rhino.frame_length)
            recorder.start()

            print("Processing intent...")

            is_finalized = False
            while not is_finalized:
                pcm = recorder.read()
                is_finalized = rhino.process(pcm)

            inference = rhino.get_inference()
            recorder.stop()
            recorder.delete()
            recorder = None  # avoid a second delete() in the finally block

            # Stage 3: Route query
            if inference.is_understood:
                print(f"[INTENT] {inference.intent}")
                response = handle_structured_intent(inference.intent, inference.slots)
            else:
                print("[ROUTING] Intent not recognized, prompting for details...")
                prompt = "I'd be happy to help. Can you explain your question in more detail?"
                speak_response(orca, speaker, prompt)

                transcript = get_detailed_transcript(cobra, cheetah)
                if not transcript.strip():
                    response = "I didn't catch that. Could you try again?"
                else:
                    response = handle_with_llm(llm, transcript)

            print(f"[RESPONSE] {response}\n")

            # Stage 4: Speak response
            speak_response(orca, speaker, response)

            rhino.reset()

    except KeyboardInterrupt:
        print("\n[EXIT] Stopping...")
    except (pvcobra.CobraActivationLimitError,
            pvcheetah.CheetahActivationLimitError,
            pvrhino.RhinoActivationLimitError,
            pvorca.OrcaActivationLimitError):
        print("AccessKey has reached its processing limit")
    finally:
        if speaker is not None:
            speaker.delete()
        if recorder is not None:
            recorder.delete()
        if orca is not None:
            orca.delete()
        if llm is not None:
            llm.release()
        if rhino is not None:
            rhino.delete()
        if cheetah is not None:
            cheetah.delete()
        if cobra is not None:
            cobra.delete()

    return 0


if __name__ == "__main__":
    sys.exit(main())

Run the Smart IVR System

To run the smart IVR, save the complete code above as smart_ivr.py, update the model paths to match your local files, and have your Picovoice AccessKey ready:

python3 smart_ivr.py \
  --access_key "$ACCESS_KEY" \
  --context_path ./models/customer-service.rhn \
  --model_path ./models/llama-3.2-3b-instruct-505.pllm

The customer service voicebot will greet the caller, process customer queries with intelligent call routing, and respond with natural speech.

Extending the AI Customer Service Voicebot

Connect to Phone Systems:

Integrate with VoIP platforms like Twilio or Asterisk to handle inbound calls.
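
A telephony integration may deliver audio in chunk sizes set by the network rather than by the speech engine, while the engines in this tutorial are fed exactly `frame_length` samples per call. The sketch below regroups arbitrary chunks into fixed-size frames. It assumes the samples are already 16-bit PCM at the engine's sample rate; codec decoding and resampling are outside its scope, and `rechunk` is a hypothetical helper.

```python
from typing import Iterable, Iterator, List


def rechunk(chunks: Iterable[List[int]], frame_length: int) -> Iterator[List[int]]:
    """Regroup arbitrarily sized sample chunks into fixed-size frames.
    Leftover samples shorter than one frame are discarded at the end."""
    buffer: List[int] = []
    for chunk in chunks:
        buffer.extend(chunk)
        while len(buffer) >= frame_length:
            yield buffer[:frame_length]
            buffer = buffer[frame_length:]


if __name__ == "__main__":
    incoming = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
    for frame in rechunk(incoming, frame_length=4):
        print(frame)
```

Each yielded frame could then be passed to an engine's `process()` call in place of `recorder.read()`.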

Add Multilingual Support:

Rhino supports multiple languages for intent recognition, so you can create a separate Speech-to-Intent context for each language you serve.
Orca Streaming Text-to-Speech also supports multiple languages for voice responses.

Database Integration:

Replace the mock responses in handle_structured_intent() with actual database queries to retrieve real customer data, order statuses, and account information.
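
As a minimal sketch of what that replacement might look like, the example below looks up an order in SQLite using the standard library. The `orders` table, its columns, and `get_order_status` are assumptions made for illustration; a real deployment would query whatever order system is already in place.

```python
import sqlite3


def get_order_status(conn: sqlite3.Connection, order_id: str) -> str:
    """Look up an order and phrase the result for speech output."""
    # A parameterized query keeps the spoken order ID out of the SQL text.
    row = conn.execute(
        "SELECT status FROM orders WHERE order_id = ?", (order_id,)
    ).fetchone()
    if row is None:
        return f"I couldn't find order {order_id}."
    return f"Order {order_id} is currently {row[0]}."


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (order_id TEXT PRIMARY KEY, status TEXT)")
    conn.execute("INSERT INTO orders VALUES ('12345', 'in transit')")
    print(get_order_status(conn, "12345"))
    print(get_order_status(conn, "99999"))
```

Returning a spoken "not found" message rather than raising keeps the call flow going when a caller misstates an order number.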

Conversation Analytics:

Log all transcripts, detected intents, and LLM responses to track common queries, measure resolution rates, and identify areas where the context needs expansion or LLM responses need refinement.
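
One lightweight way to start is a JSON Lines log with one record per conversational turn. The sketch below writes records to any text stream and counts how often each route was taken. The field names and both helper functions are assumptions for illustration.

```python
import io
import json
import time
from collections import Counter
from typing import Iterable, Optional, TextIO


def log_turn(stream: TextIO, route: str, intent: Optional[str],
             transcript: str, response: str) -> None:
    """Append one conversational turn as a single JSON line."""
    record = {
        "ts": time.time(),
        "route": route,          # "intent" for the fast path, "llm" for the fallback
        "intent": intent,
        "transcript": transcript,
        "response": response,
    }
    stream.write(json.dumps(record) + "\n")


def count_routes(lines: Iterable[str]) -> Counter:
    """Count how many logged turns took each route."""
    return Counter(json.loads(line)["route"] for line in lines if line.strip())


if __name__ == "__main__":
    buffer = io.StringIO()
    log_turn(buffer, "intent", "returnPolicy", "", "You can return items within 30 days.")
    log_turn(buffer, "llm", None, "why was I charged twice", "Let me look into that.")
    print(count_routes(buffer.getvalue().splitlines()))
```

A high share of `llm` turns for similar questions suggests an intent worth adding to the Rhino context.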

Human Handoff:

Implement a queue system for the speakToHuman intent that connects to your existing call center software or creates tickets for callback scheduling.
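
A very small in-memory version of such a queue might look like the sketch below, with one first-in, first-out queue per department. `HandoffRequest`, `HandoffQueues`, and the default department are hypothetical; a real system would hand these requests to the call center platform instead of holding them in memory.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, Dict, Iterable, Optional


@dataclass
class HandoffRequest:
    caller_id: str
    department: str


class HandoffQueues:
    """One FIFO queue per department; unknown departments use a default queue."""

    def __init__(self, departments: Iterable[str],
                 default: str = "customer service") -> None:
        self._default = default
        self._queues: Dict[str, Deque[HandoffRequest]] = {
            name: deque() for name in departments
        }
        self._queues.setdefault(default, deque())

    def enqueue(self, caller_id: str, department: str) -> int:
        """Add a caller and return their 1-based position in the queue."""
        name = department if department in self._queues else self._default
        self._queues[name].append(HandoffRequest(caller_id, name))
        return len(self._queues[name])

    def next_caller(self, department: str) -> Optional[HandoffRequest]:
        """Pop the longest-waiting caller, or None if the queue is empty or unknown."""
        queue = self._queues.get(department)
        return queue.popleft() if queue else None
```

The position returned by `enqueue()` could be spoken back to the caller, for example as part of the hold message.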

You can start building your own commercial or non-commercial call center automation projects using Picovoice's self-service Console.

To learn more about the advantages and challenges of voice AI agents in customer service, see: Voice AI Agents in Customer Service.

Frequently Asked Questions

What does IVR stand for?

IVR stands for Interactive Voice Response. It's a technology that allows callers to interact with a phone system through voice commands or keypad inputs. Traditional IVR systems use pre-recorded menus and numbered options, while smart IVR systems use AI to understand natural speech and provide conversational experiences.

What is a smart IVR?

A smart IVR is an AI-powered phone system that understands natural language, allowing callers to speak requests directly rather than navigating numbered menu options. It uses speech recognition to interpret caller intent and provide relevant responses or route calls appropriately. Picovoice offers speech-to-text, intent recognition, and LLM capabilities that run locally on your infrastructure to build smart IVR systems without cloud dependencies.

Which company has the best IVR?

The best IVR solution depends on your specific requirements. For call centers prioritizing low latency and data privacy, Picovoice's AI models enable you to build custom smart IVR systems that run speech recognition, intent detection, LLM reasoning, and voice synthesis locally on your infrastructure. This eliminates cloud API round-trips that add 1-2 seconds per interaction, and keeps caller audio and transcripts on your servers.

What are common IVR problems?

Traditional IVR systems can frustrate callers with rigid menu trees, slow response times, and poor speech recognition that forces people to repeat themselves. Smart IVR systems address these issues with AI-powered natural language understanding and flexible conversation flows. For call centers prioritizing speed and privacy, processing speech locally on your infrastructure eliminates cloud API latency and keeps caller data on your servers.