TLDR: Build a hands-free factory voice agent using Python and on-device AI. Enables equipment control and production queries through wake word detection, speech-to-intent, and local LLM processing. All processing stays on-premises for privacy compliance.
Hands-free voice control can streamline factory operations, allowing workers to manage equipment, query production data, and report maintenance issues without interrupting their tasks. This Smart Factory Voice Agent in Python runs entirely on-device, combining wake word detection, speech-to-intent, streaming speech-to-text, text-to-speech and local LLM processing to provide instant equipment control and real-time production insights, all while keeping data on-premises for full privacy compliance. In this tutorial, you’ll learn how to build a fully operational factory voice assistant using Python and Picovoice’s on-device AI stack.
What You'll Build:
A factory voice agent that:
- Activates using two custom wake phrases - one for equipment commands (e.g., "Hey Factory") and one for production queries (e.g., "Hey Assistant")
- Controls equipment instantly without manual intervention
- Queries real-time production data (output rates, machine status, inventory levels)
- Handles maintenance requests and troubleshooting information
The voice agent’s fully on-device architecture ensures that it:
- Achieves high accuracy and low-latency responses, with all speech recognition processed locally using engines trained for noisy factory environments and multiple accents.
- Meets strict privacy standards, as all voice data stays on-premises and never leaves the facility.
Requirements for Building a Manufacturing Voice Agent:
- Python 3.9+
- Microphone
- Speakers or headset for audio feedback
- Picovoice AccessKey from the Picovoice Console
Smart Factory Voice Agent Workflow
This Python-based factory voice agent uses an on-device architecture designed for reliability and low latency:
How it works:
Always-Listening Activation - The factory voice agent sits in a low-power, idle state using Porcupine Wake Word to monitor the audio stream for two distinct wake phrases. Detecting "Hey Factory" routes to instant equipment control, while "Hey Assistant" routes to the conversational AI for detailed queries. This dual-keyword approach lets workers choose the right path upfront.
Intent Understanding for Equipment Control - When "Hey Factory" is detected, the audio is analyzed by Rhino Speech-to-Intent. Instead of transcribing words one by one, it maps the speech directly to a pre-defined command (such as an emergency stop). The system executes the action immediately without further processing.
Speech-to-Text for Conversational Queries - When "Hey Assistant" is detected, the system routes directly to Cheetah Streaming Speech-to-Text. This engine converts natural, open-ended speech into a text string, capturing the full detail of complex questions or reports.
On-Device Language Model - The transcribed text is passed to picoLLM, which runs a specialized language model locally on the device. It interprets the user's question using specific factory context, such as shift data or machine specs, to generate a relevant, intelligent text response.
Voice Response Generation - Finally, Orca Streaming Text-to-Speech converts the AI's text response into spoken audio. This provides the worker with immediate verbal confirmation or information, completing the hands-free loop.
The Voice Agent routes time-critical equipment commands for instant execution while handling complex production data queries through the LLM pipeline. All processing runs locally on industrial PCs or edge devices, eliminating network latency and ensuring reliable operation.
All Picovoice models such as Porcupine Wake Word, Rhino Speech-to-Intent, Cheetah Streaming Speech-to-Text, and Orca Streaming Text-to-Speech support multiple languages including English, Spanish, French, German and more. Build multilingual factory voice agents to serve international workforces by training models in the languages your teams speak.
Train Custom Wake Words for Factory Voice Agent
- Sign up for a Picovoice Console account and navigate to the Porcupine page.
- Enter your wake phrase such as "Hey Factory", and test it using the microphone button.
- Click "Train," select the target platform, and download the
.ppnmodel file. - Repeat steps 2 & 3 for to train an additional wake word for any production queries (e.g., "Hey Assistant")
Porcupine can detect multiple wake words simultaneously. For instance, it can listen for both "Hey Factory" and "Hey Assistant" to handle different tasks. For tips on designing an effective wake word, review the choosing a wake word guide.
Define Voice Commands for Equipment Control
- Create an empty Rhino Speech-to-Intent Context.
- Click the "Import YAML" button in the top-right corner of the console and paste the YAML provided below to define intents for factory equipment commands.
- Test the model with the microphone button and download the .rhn context file for your target platform.
You can refer to the Rhino Syntax Cheat Sheet for more details on building custom contexts.
YAML Context for Factory Equipment Commands:
This context handles critical equipment control commands.
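The original context isn't reproduced here, so the following is a minimal sketch of what such a context could look like; the intent names (startMachine, stopMachine, emergencyStop) and the machine slot values are illustrative assumptions to adapt to your own equipment.

```yaml
context:
  expressions:
    startMachine:
      - "[start, turn on, power up] (the) $machine:machine"
    stopMachine:
      - "[stop, shut down, turn off] (the) $machine:machine"
    emergencyStop:
      - "emergency stop"
      - "stop everything (now)"
  slots:
    machine:
      - "conveyor"
      - "press"
      - "packaging line"
```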
Set Up Local Large Language Model
- Navigate to the picoLLM page in Picovoice Console.
- Select a model. This tutorial uses llama-3.2-3b-instruct-505.pllm.
- Download the .pllm file and place it in your project directory.
Install Required Python Libraries for Factory Voice Control
Install all required Python SDKs and dependencies using pip:
- Porcupine Wake Word Python SDK: pvporcupine
- Rhino Speech-to-Intent Python SDK: pvrhino
- Cheetah Streaming Speech-to-Text Python SDK: pvcheetah
- picoLLM Python SDK: picollm
- Orca Streaming Text-to-Speech Python SDK: pvorca
- Picovoice Python Recorder library: pvrecorder
- Picovoice Python Speaker library: pvspeaker
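All of them can be installed with a single pip command:

```bash
pip3 install pvporcupine pvrhino pvcheetah picollm pvorca pvrecorder pvspeaker
```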
Add Wake Word Detection for Hands-Free Activation
The following code captures audio from your microphone and detects the custom wake word locally:
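The snippet below is a minimal sketch: the .ppn file names and the AccessKey placeholder are assumptions to replace with your own models and key.

```python
import pvporcupine
from pvrecorder import PvRecorder

ACCESS_KEY = "${YOUR_ACCESS_KEY}"  # from Picovoice Console

# Listen for both wake words at once; the index returned by process()
# tells us which one was spoken.
porcupine = pvporcupine.create(
    access_key=ACCESS_KEY,
    keyword_paths=["hey-factory.ppn", "hey-assistant.ppn"])  # your trained models

recorder = PvRecorder(frame_length=porcupine.frame_length)
recorder.start()

try:
    while True:
        keyword_index = porcupine.process(recorder.read())
        if keyword_index == 0:
            print("'Hey Factory' detected -> equipment control path")
        elif keyword_index == 1:
            print("'Hey Assistant' detected -> production query path")
finally:
    recorder.stop()
    recorder.delete()
    porcupine.delete()
```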
Porcupine Wake Word processes each audio frame on-device with acoustic models trained to reject machinery noise and false positives. By listening for multiple wake words simultaneously, it routes workers to the right system path instantly - equipment control or production queries - without wasted processing.
Process Equipment Control Commands
Once the wake word is detected, Rhino Speech-to-Intent listens for structured equipment commands:
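A minimal sketch with pvrhino; the context path is a placeholder for the .rhn file downloaded earlier.

```python
import pvrhino
from pvrecorder import PvRecorder

ACCESS_KEY = "${YOUR_ACCESS_KEY}"

rhino = pvrhino.create(
    access_key=ACCESS_KEY,
    context_path="factory_equipment.rhn")  # your downloaded context

recorder = PvRecorder(frame_length=rhino.frame_length)
recorder.start()

try:
    # Feed audio frames until Rhino has finalized an inference.
    while not rhino.process(recorder.read()):
        pass
    inference = rhino.get_inference()
    if inference.is_understood:
        print(f"intent: {inference.intent}, slots: {inference.slots}")
    else:
        print("Command not understood")
finally:
    recorder.stop()
    recorder.delete()
    rhino.delete()
```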
Rhino Speech-to-Intent directly infers intent from speech without requiring a separate transcription step, enabling instant equipment control.
Handle Production Data Queries with AI
When users say "Hey Assistant," the system routes directly to streaming speech-to-text and local LLM for natural language queries:
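Here is a sketch of that pipeline, assuming the .pllm model downloaded earlier sits in the working directory; the prompt format and the production-context string are illustrative stand-ins for your own data source.

```python
import pvcheetah
import picollm
from pvrecorder import PvRecorder

ACCESS_KEY = "${YOUR_ACCESS_KEY}"

cheetah = pvcheetah.create(access_key=ACCESS_KEY, endpoint_duration_sec=1.0)
pllm = picollm.create(access_key=ACCESS_KEY, model_path="llama-3.2-3b-instruct-505.pllm")

recorder = PvRecorder(frame_length=cheetah.frame_length)
recorder.start()

# Transcribe until Cheetah detects the end of the utterance.
transcript = ""
while True:
    partial, is_endpoint = cheetah.process(recorder.read())
    transcript += partial
    if is_endpoint:
        transcript += cheetah.flush()
        break
recorder.stop()
recorder.delete()

# Ground the LLM in whatever production data you have on hand (illustrative values).
production_context = "Line 3 output: 94 units/hour. Press 2: down for maintenance."
prompt = (
    "You are a factory assistant. Answer briefly using this data:\n"
    f"{production_context}\n"
    f"Worker: {transcript}\nAssistant:")
response = pllm.generate(prompt, completion_token_limit=128)
print(response.completion)

cheetah.delete()
pllm.release()
```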
This approach uses Cheetah Streaming Speech-to-Text to transcribe natural speech with acoustic models optimized for industrial noise, then picoLLM to process the query and generate responses based on real-time production data.
Add AI Voice Response Generation for Smart Factories
Transform text responses into audible speech optimized for noisy environments:
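A minimal sketch pairing pvorca with pvspeaker for playback, assuming Orca's streaming API (stream_open / synthesize / flush); in the full agent you would stream text into Orca as picoLLM produces it.

```python
import pvorca
from pvspeaker import PvSpeaker

ACCESS_KEY = "${YOUR_ACCESS_KEY}"

orca = pvorca.create(access_key=ACCESS_KEY)
speaker = PvSpeaker(sample_rate=orca.sample_rate, bits_per_sample=16)
speaker.start()

# Streaming synthesis: audio is produced (and played) before the full sentence is finished.
stream = orca.stream_open()
for text_chunk in ["Line three is running at ", "ninety-four units per hour."]:
    pcm = stream.synthesize(text_chunk)
    if pcm is not None:
        speaker.write(pcm)
pcm = stream.flush()
if pcm is not None:
    speaker.write(pcm)

speaker.flush()
speaker.stop()
stream.close()
orca.delete()
```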
Orca Streaming Text-to-Speech generates clear voice responses with first audio output in under 130ms, enabling seamless communication on the factory floor.
Execute Equipment Control Commands and Integrate with Manufacturing Systems
Route structured intents to manufacturing systems:
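A sketch of such a dispatcher; the intent and slot names match the example context above, and the print statements stand in for your real PLC/MES integration.

```python
def handle_equipment_command(inference):
    """Map a Rhino inference to a factory action (intent/slot names are illustrative)."""
    if not inference.is_understood:
        return "Sorry, I didn't catch that command."

    if inference.intent == "emergencyStop":
        # Replace with your real safety-system call (e.g., an OPC UA or Modbus write).
        print("[PLC] emergency stop issued")
        return "Emergency stop engaged."

    if inference.intent == "startMachine":
        machine = inference.slots.get("machine", "the machine")
        print(f"[PLC] start command sent to {machine}")
        return f"Starting {machine}."

    if inference.intent == "stopMachine":
        machine = inference.slots.get("machine", "the machine")
        print(f"[PLC] stop command sent to {machine}")
        return f"Stopping {machine}."

    return "That command isn't configured yet."
```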
Complete Python Code for Smart Factory Voice Agent
This implementation combines all components for a production-ready factory voice agent:
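The full original listing isn't reproduced here; the skeleton below shows one way to wire the pieces together under the same assumptions as the snippets above (file names, intents, and the prompt are placeholders).

```python
import picollm
import pvcheetah
import pvorca
import pvporcupine
import pvrhino
from pvrecorder import PvRecorder
from pvspeaker import PvSpeaker

ACCESS_KEY = "${YOUR_ACCESS_KEY}"

porcupine = pvporcupine.create(
    access_key=ACCESS_KEY,
    keyword_paths=["hey-factory.ppn", "hey-assistant.ppn"])
rhino = pvrhino.create(access_key=ACCESS_KEY, context_path="factory_equipment.rhn")
cheetah = pvcheetah.create(access_key=ACCESS_KEY, endpoint_duration_sec=1.0)
pllm = picollm.create(access_key=ACCESS_KEY, model_path="llama-3.2-3b-instruct-505.pllm")
orca = pvorca.create(access_key=ACCESS_KEY)

# All Picovoice engines share the same 16 kHz, 512-sample frame, so one recorder feeds them all.
recorder = PvRecorder(frame_length=porcupine.frame_length)
speaker = PvSpeaker(sample_rate=orca.sample_rate, bits_per_sample=16)

def speak(text):
    # Single-shot synthesis for brevity; see the streaming Orca snippet above.
    pcm, _ = orca.synthesize(text)
    speaker.start()
    speaker.write(pcm)
    speaker.flush()
    speaker.stop()

def run_equipment_command():
    while not rhino.process(recorder.read()):
        pass
    inference = rhino.get_inference()
    if inference.is_understood:
        speak(f"Executing {inference.intent}.")  # or handle_equipment_command(inference)
    else:
        speak("Sorry, I didn't catch that command.")

def run_production_query():
    transcript = ""
    while True:
        partial, is_endpoint = cheetah.process(recorder.read())
        transcript += partial
        if is_endpoint:
            transcript += cheetah.flush()
            break
    prompt = f"You are a factory assistant. Answer briefly.\nWorker: {transcript}\nAssistant:"
    speak(pllm.generate(prompt, completion_token_limit=128).completion)

recorder.start()
print("Listening for 'Hey Factory' or 'Hey Assistant'...")
try:
    while True:
        keyword_index = porcupine.process(recorder.read())
        if keyword_index == 0:
            run_equipment_command()
        elif keyword_index == 1:
            run_production_query()
finally:
    recorder.stop()
    for engine in (porcupine, rhino, cheetah, orca):
        engine.delete()
    pllm.release()
```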
Run the Smart Factory Voice Agent
To run the factory voice agent in Python, update the model paths to match your local files and have your Picovoice AccessKey ready.
Example interactions:
- Equipment Control: "Hey Factory, emergency stop" is mapped directly to the emergency-stop intent and executed immediately.
- Production Query: "Hey Assistant, what's the current output rate on line three?" is transcribed by Cheetah, answered by picoLLM from the production context, and spoken back by Orca.
Looking to integrate voice with other manufacturing applications? Read about how Picovoice Enables Voice Picking to improve warehouse efficiency and accuracy.
You can start building your own commercial or non-commercial projects using Picovoice's self-service Console.
Start Building