
DeepSeek's R1 model delivers GPT-4-level reasoning at a fraction of the cost, but the API only accepts text input; it doesn't have a native voice mode. Developers need to add voice capabilities by integrating separate speech recognition and synthesis services. Cloud-based voice AI solutions add at least 1-2 seconds of latency on top of DeepSeek's already lengthy 5+ second reasoning time, making interactions feel sluggish and unnatural.

Lightweight on-device AI models eliminate network latency entirely while keeping compute latency low. The performance difference is significant: Orca Streaming Text-to-Speech generates the first byte of audio in 130 ms versus 840 ms for ElevenLabs, while Cheetah Streaming Speech-to-Text transcribes a word 580 ms after it is uttered versus 920 ms for Amazon Transcribe Streaming.

This tutorial demonstrates how to build a complete voice interface for DeepSeek R1 using on-device speech processing in Python. The implementation uses Porcupine Wake Word for voice-activated commands, Cheetah Streaming Speech-to-Text for real-time transcription, and Orca Streaming Text-to-Speech for natural responses, achieving the lowest latency for voice components while preserving DeepSeek's advanced reasoning capabilities.

What You'll Build:

A hands-free DeepSeek voice mode that:

  • Activates using a custom wake word
  • Transcribes speech in real-time locally
  • Sends recognized text to DeepSeek for reasoning
  • Speaks DeepSeek's response using local text-to-speech

This design enables interactive voice applications such as multilingual voicebots and AI-powered voice agents.

What You'll Need:

  • A Picovoice Console account and AccessKey
  • A DeepSeek API key
  • Python 3 and pip
  • A machine with a microphone and speaker

Looking to integrate voice with other AI chatbots? See our guides for Claude Voice Assistant and Perplexity Voice Assistant.

Train a Custom Wake Word for DeepSeek Activation

  1. Sign up for a Picovoice Console account and navigate to the Porcupine page.
  2. Enter your wake phrase such as "Hey Deep Seek" and test it using the microphone button.
  3. Click "Train", select the target platform, and download the .ppn model file.
  4. Repeat steps 2 & 3 for any additional wake words you would like to support (e.g., "Hey Assistant").

Porcupine can detect multiple wake words with no added runtime footprint. For instance, use "Hey Assistant" and "Hey Deep Seek" simultaneously to activate the DeepSeek voice assistant. For tips on designing an effective wake word, review the choosing a wake word guide.

Set Up the Python Environment

Install all required Python SDKs and dependencies with a single command in the terminal:
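
A minimal sketch of that command, assuming the openai package is used to call DeepSeek's OpenAI-compatible API alongside the Picovoice SDKs:

```console
pip3 install pvporcupine pvcheetah pvorca pvrecorder pvspeaker openai
```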

Add Wake Word Detection to DeepSeek

The following code captures audio from your default microphone and detects the custom wake word locally:
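
A minimal sketch, assuming the trained model was downloaded as Hey-Deep-Seek.ppn and ${ACCESS_KEY} stands in for your Picovoice Console AccessKey:

```python
import pvporcupine
from pvrecorder import PvRecorder

# Placeholders: replace with your AccessKey and the path to the .ppn file
# downloaded from Picovoice Console.
porcupine = pvporcupine.create(
    access_key='${ACCESS_KEY}',
    keyword_paths=['Hey-Deep-Seek.ppn'])

# PvRecorder captures 16-bit PCM frames from the default microphone.
recorder = PvRecorder(frame_length=porcupine.frame_length)
recorder.start()

print('Listening for wake word ...')
while True:
    # process() returns the index of the detected keyword, or -1 if none.
    if porcupine.process(recorder.read()) >= 0:
        print('Wake word detected')
        break
```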

Porcupine Wake Word processes each audio frame on-device and triggers when the keyword is recognized, providing a signal that can be used to start the transcription phase.

Generate Transcriptions for the DeepSeek Voice Mode

Once the wake word has been detected, the transcription loop is activated. The code captures short audio frames and transcribes them using Cheetah Streaming Speech-to-Text:
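
A sketch of that loop, reusing the recorder from the wake word step; the endpoint_duration_sec value (how long a pause counts as an endpoint) is illustrative:

```python
import pvcheetah

cheetah = pvcheetah.create(
    access_key='${ACCESS_KEY}',
    endpoint_duration_sec=1.0)

transcript = ''
while True:
    # Each call consumes one audio frame and returns newly decoded text plus
    # a flag indicating whether an endpoint (a natural pause) was detected.
    partial, is_endpoint = cheetah.process(recorder.read())
    transcript += partial
    if is_endpoint:
        # flush() returns any remaining buffered text.
        transcript += cheetah.flush()
        break

print('You: ' + transcript)
```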

When you make a natural pause in your speech, such as after asking a question, Cheetah detects the pause as an endpoint, signaling that you've finished speaking.

Send Transcribed Text to DeepSeek

Once the text is transcribed, the DeepSeek API processes the text prompt:
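
A sketch assuming the openai Python package pointed at DeepSeek's OpenAI-compatible endpoint; deepseek-reasoner is the R1 model name in DeepSeek's API:

```python
from openai import OpenAI

client = OpenAI(
    api_key='${DEEPSEEK_API_KEY}',
    base_url='https://api.deepseek.com')  # DeepSeek's OpenAI-compatible endpoint

response = client.chat.completions.create(
    model='deepseek-reasoner',  # DeepSeek R1
    messages=[{'role': 'user', 'content': transcript}])
answer = response.choices[0].message.content
```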

Generate Voice Output from DeepSeek's Responses

The system transforms DeepSeek's response into natural speech using Orca Streaming Text-to-Speech and PvSpeaker:
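
A sketch of single synthesis followed by playback, where answer is the response text from the previous step:

```python
import pvorca
from pvspeaker import PvSpeaker

orca = pvorca.create(access_key='${ACCESS_KEY}')
speaker = PvSpeaker(sample_rate=orca.sample_rate, bits_per_sample=16)

# Single synthesis: the full response text goes in, the full waveform comes out.
pcm, _ = orca.synthesize(answer)

speaker.start()
speaker.write(pcm)
speaker.flush()  # block until all buffered audio has been played
speaker.stop()
```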

In this example, Orca performs single synthesis because the DeepSeek API returns the full response at once. When used with streaming models, Orca Streaming Text-to-Speech can generate and play audio in real time through streaming synthesis, enabling significantly lower latency than cloud-based alternatives.
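
As a sketch of that streaming variant, pairing DeepSeek's stream=True token stream with Orca's streaming synthesis and reusing the client, orca, and speaker handles from above (R1's intermediate reasoning tokens are skipped here):

```python
stream = orca.stream_open()
speaker.start()

for chunk in client.chat.completions.create(
        model='deepseek-reasoner',
        messages=[{'role': 'user', 'content': transcript}],
        stream=True):
    token = chunk.choices[0].delta.content  # None while R1 is still reasoning
    if token:
        # Orca returns audio as soon as enough text has accumulated.
        pcm = stream.synthesize(token)
        if pcm is not None:
            speaker.write(pcm)

pcm = stream.flush()  # synthesize whatever text remains buffered
if pcm is not None:
    speaker.write(pcm)

stream.close()
speaker.flush()
speaker.stop()
```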

Full Python Code for the DeepSeek Voice Mode

This implementation combines three Picovoice engines: Porcupine Wake Word, Cheetah Streaming Speech-to-Text, and Orca Streaming Text-to-Speech. The voice processing happens entirely on-device, while only text queries are sent to the DeepSeek API.
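
A self-contained sketch stitched together from the snippets above; the placeholders, file names, and loop structure are illustrative rather than an official sample:

```python
import pvcheetah
import pvorca
import pvporcupine
from openai import OpenAI
from pvrecorder import PvRecorder
from pvspeaker import PvSpeaker

ACCESS_KEY = '${PICOVOICE_ACCESS_KEY}'  # from Picovoice Console
DEEPSEEK_API_KEY = '${DEEPSEEK_API_KEY}'
KEYWORD_PATH = 'Hey-Deep-Seek.ppn'  # trained wake word model

porcupine = pvporcupine.create(access_key=ACCESS_KEY, keyword_paths=[KEYWORD_PATH])
cheetah = pvcheetah.create(access_key=ACCESS_KEY, endpoint_duration_sec=1.0)
orca = pvorca.create(access_key=ACCESS_KEY)
client = OpenAI(api_key=DEEPSEEK_API_KEY, base_url='https://api.deepseek.com')

# Porcupine and Cheetah share the same frame length, so one recorder serves both.
recorder = PvRecorder(frame_length=porcupine.frame_length)
speaker = PvSpeaker(sample_rate=orca.sample_rate, bits_per_sample=16)

try:
    recorder.start()
    while True:
        # 1. Wait for the wake word (fully on-device).
        print('Listening for wake word ...')
        while porcupine.process(recorder.read()) < 0:
            pass

        # 2. Transcribe speech until Cheetah detects an endpoint.
        transcript = ''
        while True:
            partial, is_endpoint = cheetah.process(recorder.read())
            transcript += partial
            if is_endpoint:
                transcript += cheetah.flush()
                break
        print('You: ' + transcript)

        # 3. Send the transcribed text to DeepSeek for reasoning.
        response = client.chat.completions.create(
            model='deepseek-reasoner',
            messages=[{'role': 'user', 'content': transcript}])
        answer = response.choices[0].message.content
        print('DeepSeek: ' + answer)

        # 4. Speak the answer locally with Orca (single synthesis).
        recorder.stop()  # pause the microphone so the assistant doesn't hear itself
        pcm, _ = orca.synthesize(answer)
        speaker.start()
        speaker.write(pcm)
        speaker.flush()
        speaker.stop()
        recorder.start()
finally:
    recorder.delete()
    speaker.delete()
    porcupine.delete()
    cheetah.delete()
    orca.delete()
```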

Run the DeepSeek Voice Assistant

To run the voice assistant, update the model path to match your local file and have both API keys ready:
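
Assuming the full script is saved as deepseek_voice_assistant.py (a file name chosen here for illustration):

```console
python3 deepseek_voice_assistant.py
```

Say the wake word, ask your question, and wait for the spoken response.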

You can start building your own commercial or non-commercial projects using Picovoice's self-service Console.


Frequently Asked Questions

Will the voice assistant work accurately in noisy environments, with different accents, or with specialized terminology?
Yes. Porcupine Wake Word and Cheetah Streaming Speech-to-Text are designed to work reliably in real-world conditions with background noise and various accents across supported languages. For domain-specific terminology or brand names, you can add boost words and custom vocabulary to Cheetah Streaming Speech-to-Text.
Can I use any wake word phrase for DeepSeek activation?
Yes. You can train custom wake words using Picovoice Console in seconds without collecting training data. Simply enter your desired phrase (e.g., "Hey Computer" or your brand name) and download the trained model.
What happens if the DeepSeek API calls fail or time out?
The local voice processing components (wake word detection, speech-to-text, and text-to-speech) continue functioning independently. You can catch API exceptions and use Orca Streaming Text-to-Speech to provide voice feedback like "I'm having trouble connecting, please try again."
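
A minimal sketch of that fallback, reusing the client and orca handles from the tutorial (the error message is illustrative):

```python
try:
    response = client.chat.completions.create(
        model='deepseek-reasoner',
        messages=[{'role': 'user', 'content': transcript}])
    answer = response.choices[0].message.content
except Exception:
    # Fall back to a locally synthesized error message; Orca keeps working
    # even when the network or the DeepSeek API does not.
    answer = "I'm having trouble connecting, please try again."

pcm, _ = orca.synthesize(answer)
```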