Build a Low-Latency ChatGPT Voice Assistant in Python

Building a ChatGPT voice assistant requires more than just connecting to the OpenAI API. ChatGPT voice applications need instant responses to feel natural in conversation. While the consumer version of ChatGPT supports a built-in real-time ChatGPT Voice Mode, developers using the OpenAI API must build the speech-processing pipeline themselves.

OpenAI’s Speech-to-Text API provides speech recognition for ChatGPT voice chat through OpenAI Whisper and gpt-4o-transcribe models. Both solutions process audio in the cloud, adding network latency to every voice interaction and disrupting natural conversation flow.

The newer OpenAI Realtime API supports streaming audio input and output, enabling real-time, speech-to-speech interactions for ChatGPT voice applications. However, it is provided as a single integrated pipeline. Developers cannot customize components of the speech pipeline to match their specific requirements for building their ChatGPT voice agents.

This tutorial shows how to build a ChatGPT voice assistant in Python, inspired by ChatGPT Voice Mode but implemented with on-device speech processing. It follows a modular architecture, so each component of the speech pipeline can be customized and optimized for a specific use case. The assistant activates hands-free with a custom wake word, transcribes speech locally in real time, and generates natural, low-latency voice responses. Only the text query to ChatGPT goes over the network, so the speech stages gain both speed and flexibility without relying on cloud-based speech processing.

What You'll Build:

A hands-free ChatGPT voice assistant that:
- Activates with a custom wake word
- Transcribes speech to text in real time
- Sends text queries to ChatGPT via the OpenAI API
- Responds with natural, real-time voice output

What You'll Need:

- Python 3.9+
- Microphone and speakers
- Picovoice AccessKey from the Picovoice Console
- OpenAI API key from the OpenAI Platform page

The solution integrates ChatGPT with three on-device voice AI engines: Porcupine Wake Word for activation, Cheetah Streaming Speech-to-Text for transcription, and Orca Streaming Text-to-Speech for spoken responses.

Looking to integrate voice with other AI chatbots? Check out our guides to build Claude Voice Assistant and Perplexity Voice Assistant.

Train a Custom Wake Word for ChatGPT Voice Assistant

1. Sign up for a Picovoice Console account and navigate to the Porcupine page.
2. Enter your wake phrase, such as "Hey Chat G P T", and test it using the microphone button.
3. Click "Train", select the target platform, and download the .ppn model file.
4. Repeat steps 2 and 3 for any additional wake words you would like to support (e.g., "Hey Chatbot").

For tips on designing an effective wake word, review the choosing a wake word guide.

Set Up the Python Environment

Install all required Python SDKs and dependencies with a single terminal command:

- Porcupine Wake Word Python SDK: pvporcupine
- Cheetah Streaming Speech-to-Text Python SDK: pvcheetah
- Orca Text-to-Speech Python SDK: pvorca
- Picovoice Python Recorder library: pvrecorder
- Picovoice Python Speaker library: pvspeaker
- OpenAI Python library: openai (used for ChatGPT's OpenAI API integration)

pip install pvporcupine pvcheetah pvorca pvrecorder pvspeaker openai
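
After installing, a quick check can confirm that every package is importable from the interpreter you plan to use. This is a small sketch using only the standard library; `missing_packages` is an illustrative helper name, and the module names are taken from the import statements used later in this tutorial.

```python
from importlib.util import find_spec

# Top-level module names imported by the code in this tutorial
REQUIRED = ["pvporcupine", "pvcheetah", "pvorca", "pvrecorder", "pvspeaker", "openai"]


def missing_packages(names):
    """Return the names that cannot be found by the current interpreter."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are installed.")
```

If anything is reported missing, the usual cause is running pip against a different Python installation or virtual environment than the one executing the script.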

Add Wake Word Detection to ChatGPT

The following code captures audio from your default microphone and detects the custom wake word locally:

import pvporcupine
from pvrecorder import PvRecorder

ACCESS_KEY = "${ACCESS_KEY}"

# Path to your Porcupine wake-word model file (.ppn) that triggers activation
# e.g., "./models/hey-chatgpt.ppn" 
KEYWORD_PATH = "${KEYWORD_PATH}"

porcupine = pvporcupine.create(access_key=ACCESS_KEY, keyword_paths=[KEYWORD_PATH])
recorder = PvRecorder(frame_length=porcupine.frame_length)
recorder.start()

print("Listening for wake word...")
while True:
    pcm = recorder.read()
    keyword_index = porcupine.process(pcm)
    if keyword_index >= 0:
        print("Wake word detected.")
        break

# Release the microphone and the wake word engine
recorder.stop()
recorder.delete()
porcupine.delete()

Porcupine Wake Word processes each audio frame on-device and triggers when the keyword is recognized, providing a signal that can be used to start the transcription phase.
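
If you pass more than one .ppn file in `keyword_paths`, the index returned by `porcupine.process()` tells you which keyword fired (it is -1 when nothing was detected). The sketch below maps that index to a readable name; `keyword_label` and the two file names are hypothetical and only serve the illustration.

```python
from pathlib import Path

# Hypothetical model files; use the paths of the .ppn files you downloaded
KEYWORD_PATHS = ["./models/hey-chatgpt.ppn", "./models/hey-chatbot.ppn"]


def keyword_label(keyword_index, keyword_paths):
    """Map a Porcupine keyword index to the model file's base name.

    Returns None when the index is negative, meaning no keyword was detected.
    """
    if keyword_index < 0:
        return None
    return Path(keyword_paths[keyword_index]).stem
```

Inside the detection loop, `keyword_label(keyword_index, KEYWORD_PATHS)` could be used to log which phrase woke the assistant or to route different phrases to different behaviors.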

Add Streaming Speech-to-Text to ChatGPT Voice Assistant

Once the wake word has been detected, capture audio frames and transcribe them in real-time with Cheetah Streaming Speech-to-Text:

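import pvcheetah
from pvrecorder import PvRecorder

ACCESS_KEY = "${ACCESS_KEY}"

# endpoint_duration_sec: seconds of trailing silence that mark the end of an utterance
cheetah = pvcheetah.create(
    access_key=ACCESS_KEY,
    endpoint_duration_sec=1.0)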

recorder = PvRecorder(frame_length=cheetah.frame_length)
recorder.start()

print("Speak your message…")
transcript = ""
while True:
    pcm = recorder.read()
    partial_transcript, is_endpoint = cheetah.process(pcm)
    transcript += partial_transcript
    print(partial_transcript, end="", flush=True)
    if is_endpoint:
        final_transcript = cheetah.flush()
        transcript += final_transcript
        print(final_transcript)
        break

# Release the microphone and the speech-to-text engine
recorder.stop()
recorder.delete()
cheetah.delete()

Once you make a natural pause in your speech, such as after asking a question, Cheetah detects it as an endpoint, signaling that you've finished speaking.
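
The endpoint logic can be isolated in a small function, which also makes it possible to exercise the loop without a microphone. The following is a sketch under assumptions: `transcribe_until_endpoint`, `max_frames`, and `FakeEngine` are illustrative names, and the engine argument only needs `process()` and `flush()` methods that behave like the Cheetah calls shown above.

```python
def transcribe_until_endpoint(engine, read_frame, max_frames=None):
    """Feed audio frames to a Cheetah-like engine until it reports an endpoint.

    engine     -- object with process(pcm) -> (text, is_endpoint) and flush() -> text
    read_frame -- zero-argument callable returning the next audio frame
    max_frames -- optional cap so the loop ends even if no pause is detected
    """
    parts = []
    frames_read = 0
    while max_frames is None or frames_read < max_frames:
        partial, is_endpoint = engine.process(read_frame())
        parts.append(partial)
        frames_read += 1
        if is_endpoint:
            break
    # flush() returns any text the engine was still holding back
    parts.append(engine.flush())
    return "".join(parts)


class FakeEngine:
    """Stand-in for testing: replays a scripted list of (text, is_endpoint) results."""

    def __init__(self, script, tail=""):
        self.script = list(script)
        self.tail = tail

    def process(self, pcm):
        return self.script.pop(0)

    def flush(self):
        return self.tail
```

With the real objects, the call would look like `transcribe_until_endpoint(cheetah, recorder.read, max_frames=n)`, where `n` frames correspond to roughly `n * cheetah.frame_length / cheetah.sample_rate` seconds of audio.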

Send Voice Prompts to ChatGPT via OpenAI API

Send your prompt to ChatGPT using OpenAI's chat completions endpoint:

from openai import OpenAI

OPENAI_API_KEY = "${OPENAI_API_KEY}"

def ask_chatgpt(api_key: str, prompt: str, model: str = "gpt-4o") -> str:
    try:
        client = OpenAI(api_key=api_key)
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            timeout=45
        )
        return response.choices[0].message.content
    except Exception as e:
        return f"I'm having trouble connecting to ChatGPT. Error: {e}"

# transcript = "example user request"
reply = ask_chatgpt(OPENAI_API_KEY, transcript)

This minimal integration sends text to ChatGPT while all speech processing remains local, reducing latency.

Convert ChatGPT Responses to Speech Locally

Convert ChatGPT's text response into natural speech using Orca Streaming Text-to-Speech and PvSpeaker:

import pvorca
from pvspeaker import PvSpeaker
from collections import deque

ACCESS_KEY = "${ACCESS_KEY}"
orca = pvorca.create(access_key=ACCESS_KEY)
speaker = PvSpeaker(sample_rate=orca.sample_rate, bits_per_sample=16)

# Synthesize speech
# reply = response from chatgpt
pcm_out, _ = orca.synthesize(reply)

# Play audio
speaker.start()

pcm_buffer = deque()
pcm_buffer.append(pcm_out)

while len(pcm_buffer) > 0:
    pcm = pcm_buffer.popleft()
    written = speaker.write(pcm)
    if written < len(pcm):
        pcm_buffer.appendleft(pcm[written:])

speaker.flush()
speaker.stop()

# Cleanup
speaker.delete()
orca.delete()

In this example, Orca synthesizes the whole reply in a single call because the chat completions request above is not streamed and returns the full response at once. When the response is streamed from the model, Orca can generate and play audio in real time through streaming synthesis, starting playback before the full reply is available and enabling significantly lower latency than cloud-based alternatives.
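
As a rough sketch of the streaming path, the function below forwards text chunks to a streaming synthesizer as they arrive and plays whatever audio comes back. It is written against duck-typed stand-ins so it runs without any SDK installed: `speak_streamed`, `FakeStream`, and the `play` callback are illustrative names, not part of Orca or the OpenAI library. The commented wiring at the end assumes that Orca exposes `stream_open()` returning an object with `synthesize()`, `flush()`, and `close()`, and that an OpenAI request made with `stream=True` yields chunks carrying `choices[0].delta.content`; check both against the current SDK documentation before relying on them.

```python
def speak_streamed(text_chunks, tts_stream, play):
    """Send text chunks to a streaming TTS object and play audio as it is produced.

    text_chunks -- iterable of str or None (e.g. deltas from a streamed reply)
    tts_stream  -- object with synthesize(text) and flush(), each returning
                   a sequence of PCM samples or None when no audio is ready
    play        -- callable that accepts a sequence of PCM samples
    """
    for chunk in text_chunks:
        if not chunk:
            continue  # skip empty or None deltas
        pcm = tts_stream.synthesize(chunk)
        if pcm is not None and len(pcm) > 0:
            play(pcm)
    # Retrieve audio for any text the synthesizer was still buffering
    pcm = tts_stream.flush()
    if pcm is not None and len(pcm) > 0:
        play(pcm)


class FakeStream:
    """Stand-in for testing: emits one sample per buffered character
    once at least 8 characters have accumulated."""

    def __init__(self):
        self.buffer = ""

    def synthesize(self, text):
        self.buffer += text
        if len(self.buffer) < 8:
            return None
        pcm, self.buffer = [0] * len(self.buffer), ""
        return pcm

    def flush(self):
        pcm, self.buffer = [0] * len(self.buffer), ""
        return pcm or None


# Assumed wiring with the real libraries (verify names against the docs):
#   stream = orca.stream_open()
#   chunks = client.chat.completions.create(model="gpt-4o", messages=..., stream=True)
#   deltas = (c.choices[0].delta.content for c in chunks if c.choices)
#   speak_streamed(deltas, stream, speaker.write)
#   stream.close()
```

Passing `speaker.write` directly as `play` ignores partial writes; a production version would likely reuse the deque-based write loop shown earlier so that no samples are dropped.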

Full Python Code for Voice-Enabled ChatGPT Assistant

This solution combines three Picovoice engines: Porcupine Wake Word, Cheetah Streaming Speech-to-Text, and Orca Streaming Text-to-Speech for seamless, real-time voice interactions.

import argparse
from collections import deque
import sys
from openai import OpenAI
import pvporcupine
import pvcheetah
import pvorca
from pvrecorder import PvRecorder
from pvspeaker import PvSpeaker


def main() -> int:
    parser = argparse.ArgumentParser(
        description="Porcupine + Cheetah + Orca voice interface for ChatGPT"
    )
    parser.add_argument("--access_key", required=True, help="Picovoice AccessKey")
    parser.add_argument("--keyword_paths", nargs='+', required=True, 
                       help="Path(s) to .ppn wake-word model(s)")
    parser.add_argument("--openai_key", required=True, help="OpenAI API key")
    args = parser.parse_args()

    porcupine = None
    cheetah = None
    orca = None
    recorder = None
    speaker = None

    try:
        # Initialize engines
        porcupine = pvporcupine.create(
            access_key=args.access_key, 
            keyword_paths=args.keyword_paths)
        cheetah = pvcheetah.create(access_key=args.access_key, endpoint_duration_sec=1.0)
        orca = pvorca.create(access_key=args.access_key)

        print(f'Porcupine version: {porcupine.version}')
        print(f'Cheetah version: {cheetah.version}')
        print(f'Orca version: {orca.version}\n')

        # Initialize speaker
        speaker = PvSpeaker(
            sample_rate=orca.sample_rate, 
            bits_per_sample=16)

        # Initialize recorder
        recorder = PvRecorder(frame_length=porcupine.frame_length)
        recorder.start()

        print("Ready. Say the wake word… (Ctrl+C to stop)")

        # Wait for wake word
        while True:
            pcm = recorder.read()
            keyword_index = porcupine.process(pcm)
            if keyword_index >= 0:
                print("[EVENT] Wake word detected")
                break

        recorder.stop()
        recorder.delete()
        recorder = PvRecorder(frame_length=cheetah.frame_length)
        recorder.start()

        # Stream STT with Cheetah
        print("Speak your message…")
        transcript = ""
        while True:
            pcm = recorder.read()
            partial_transcript, is_endpoint = cheetah.process(pcm)
            transcript += partial_transcript
            print(partial_transcript, end="", flush=True)
            if is_endpoint:
                final_transcript = cheetah.flush()
                transcript += final_transcript
                print(final_transcript)
                break

        # Stop capturing while waiting for ChatGPT and playing the reply
        recorder.stop()

        print("[TRANSCRIPT]", transcript)

        # Call ChatGPT
        client = OpenAI(api_key=args.openai_key)
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": transcript}],
            timeout=45
        )

        reply = response.choices[0].message.content
        print("[REPLY]", reply)

        # Synthesize speech with Orca
        pcm_out, _ = orca.synthesize(reply)

        # Play audio
        speaker.start()

        pcm_buffer = deque()
        pcm_buffer.append(pcm_out)

        while len(pcm_buffer) > 0:
            pcm = pcm_buffer.popleft()
            written = speaker.write(pcm)
            if written < len(pcm):
                pcm_buffer.appendleft(pcm[written:])

        speaker.flush()
        speaker.stop()

    except KeyboardInterrupt:
        print("\n[EXIT] Stopping…")
    except pvporcupine.PorcupineActivationLimitError:
        print("AccessKey has reached its processing limit")
    except pvcheetah.CheetahActivationLimitError:
        print("AccessKey has reached its processing limit")
    except pvorca.OrcaActivationLimitError:
        print("AccessKey has reached its processing limit")
    finally:
        # Cleanup
        if speaker is not None:
            speaker.delete()
        
        if recorder is not None:
            recorder.delete()
        
        if orca is not None:
            orca.delete()
        
        if cheetah is not None:
            cheetah.delete()
        
        if porcupine is not None:
            porcupine.delete()

    return 0


if __name__ == "__main__":
    sys.exit(main())

Run the ChatGPT Voice Assistant

To run the voice-enabled ChatGPT assistant, save the full code above as voice_chatgpt.py, update the keyword path to match your local .ppn file, and have both API keys ready:

- Picovoice AccessKey (copy it from the Picovoice Console)
- OpenAI API key

python3 voice_chatgpt.py \
  --access_key "$ACCESS_KEY" \
  --keyword_paths ./models/hey-chatgpt.ppn \
  --openai_key "$OPENAI_API_KEY"
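
The command above expands the shell variables ACCESS_KEY and OPENAI_API_KEY into command-line arguments, which can leave the keys visible in the process list. One possible alternative, sketched here as an assumption rather than as part of the original script, is to let argparse fall back to those environment variables when the flags are omitted. `parse_args` and its `env` parameter are illustrative names.

```python
import argparse
import os


def parse_args(argv=None, env=None):
    """Parse CLI arguments, falling back to environment variables for the keys."""
    env = os.environ if env is None else env
    parser = argparse.ArgumentParser(
        description="Porcupine + Cheetah + Orca voice interface for ChatGPT"
    )
    parser.add_argument("--access_key", default=env.get("ACCESS_KEY"),
                        help="Picovoice AccessKey (default: $ACCESS_KEY)")
    parser.add_argument("--openai_key", default=env.get("OPENAI_API_KEY"),
                        help="OpenAI API key (default: $OPENAI_API_KEY)")
    parser.add_argument("--keyword_paths", nargs="+", required=True,
                        help="Path(s) to .ppn wake-word model(s)")
    args = parser.parse_args(argv)

    # Fail early with a clear message if a key is set neither way
    missing = [name for name in ("access_key", "openai_key") if not getattr(args, name)]
    if missing:
        parser.error("missing value for: " + ", ".join("--" + name for name in missing))
    return args
```

With this in place, exporting both variables once in the shell would allow running the script with only `--keyword_paths`, while explicit flags still take precedence.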

You can start building your own commercial or non-commercial projects leveraging Picovoice's self-service Console.

Frequently Asked Questions

Will the voice assistant work accurately in noisy environments, with different accents, or with specialized terminology?

Yes. Porcupine Wake Word and Cheetah Streaming Speech-to-Text are designed to work reliably in real-world conditions, including background noise and a range of accents across supported languages. To improve accuracy on domain-specific terminology or brand names, you can add boost words and custom vocabulary to Cheetah Streaming Speech-to-Text.

Can I use a different wake word instead of 'Hey chatbot' for my voice assistant?

Yes. You can train any custom wake word using Picovoice Console in seconds without collecting training data. Simply enter your desired phrase (e.g., "Hey Computer", or your brand name), and download the trained model. The wake word guide provides best practices for selecting effective wake phrases. You can also detect multiple wake words simultaneously to support different commands.

What will happen to the voice assistant if ChatGPT API calls fail or time out during a conversation?

Network timeouts, rate limits, or API outages can cause the ChatGPT request to fail. You can catch these exceptions and use Orca Streaming Text-to-Speech to provide voice feedback like "I'm having trouble connecting, please try again." Since Porcupine Wake Word and Cheetah Streaming Speech-to-Text run entirely on-device, the voice interface remains functional during API failures and only ChatGPT responses are unavailable.
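
A minimal sketch of that fallback, assuming the ChatGPT call is wrapped in a function that takes a prompt and returns text: `reply_or_fallback` and its parameters are hypothetical names. It catches any exception for brevity; the openai library also defines more specific exception classes that a fuller version could handle separately.

```python
FALLBACK_MESSAGE = "I'm having trouble connecting, please try again."


def reply_or_fallback(ask, prompt, fallback=FALLBACK_MESSAGE):
    """Return ask(prompt), or a fixed spoken fallback if the call fails.

    ask      -- callable that sends the prompt to ChatGPT and returns the reply text
    fallback -- text to speak when the request raises or returns nothing
    """
    try:
        reply = ask(prompt)
    except Exception as error:
        # Log the technical detail, but keep it out of the spoken response
        print(f"[ERROR] ChatGPT request failed: {error}")
        return fallback
    return reply if reply else fallback
```

The returned string can be passed to `orca.synthesize()` exactly like a normal reply, so the user hears a short apology instead of silence or a raw error message.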