
Mistral AI released Voxtral, its open-source speech understanding model that powers voice mode in the Le Chat assistant. Voxtral is available through Mistral's API for cloud-based transcription and as downloadable models for on-device deployment. However, building a production-ready Mistral voice assistant requires additional components beyond what Voxtral provides: wake word detection for hands-free voice activation, real-time speech-to-text that processes audio as users speak, and text-to-speech synthesis for natural voice responses.

Additionally, self-hosting Voxtral for real-time voice mode can require GPU infrastructure and machine learning integration expertise, while Mistral's transcription API still requires separate wake word and voice output solutions. These integration challenges create barriers for developers who want to build voice chat applications with Mistral AI.

This tutorial demonstrates how to build a complete Mistral AI voice assistant with voice mode capabilities in Python. By combining Porcupine Wake Word for custom wake word detection, Cheetah Streaming Speech-to-Text for low-latency speech recognition, and Orca Streaming Text-to-Speech for natural voice responses, developers can create conversational AI voice interfaces that run locally with low latency. No GPU infrastructure or ML expertise required.

What You'll Build:

A hands-free Mistral voice assistant that:

  • Activates using a custom wake word
  • Transcribes speech in real-time locally
  • Sends recognized text to Mistral AI for intelligent responses
  • Speaks Mistral's response using local text-to-speech

This architecture supports building multilingual voicebots, AI-powered voice agents, and interactive voice applications for enterprise and consumer use cases.

What You'll Need:

  • Python 3 and pip installed
  • A Picovoice Console account and its AccessKey
  • A Mistral AI API key
  • A working microphone and speaker

Looking to integrate voice with other AI chatbots? See our guides for ChatGPT Voice Assistant and DeepSeek Voice Assistant.

Train a Custom Wake Word for Mistral Activation

  1. Sign up for a Picovoice Console account and navigate to the Porcupine page.
  2. Enter your wake phrase such as "Hey Chatbot" and test it using the microphone button.
  3. Click "Train", select the target platform, and download the .ppn model file.
  4. Repeat steps 2 & 3 for any additional wake words you would like to support (e.g., "Hey Assistant").

Porcupine can detect multiple wake words with no added runtime footprint. For instance, use "Hey Chatbot" and "Dis le chat" ("Dee luh shah") simultaneously to activate the Mistral voice assistant. For tips on designing an effective wake word, review the choosing a wake word guide.

Set Up the Python Environment

Install all required Python SDKs and dependencies with a single command in the terminal:
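The exact command depends on your environment, but a minimal set covering the SDKs used in this tutorial looks like the following; the mistralai client library is assumed here, and any HTTP client that can call Mistral's REST API works as well:

```bash
pip3 install pvporcupine pvcheetah pvorca pvrecorder pvspeaker mistralai
```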

Implement Wake Word Detection

The following code captures audio from your default microphone and detects the custom wake word locally:
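The snippet below is a minimal sketch of this step. It assumes your Picovoice AccessKey is stored in a PICOVOICE_ACCESS_KEY environment variable and that the trained keyword file is saved as hey-chatbot.ppn; both names are placeholders to adjust for your setup.

```python
import os

import pvporcupine
from pvrecorder import PvRecorder

# Porcupine runs entirely on-device; the AccessKey only validates your Picovoice account.
porcupine = pvporcupine.create(
    access_key=os.environ["PICOVOICE_ACCESS_KEY"],
    keyword_paths=["hey-chatbot.ppn"],  # list several .ppn files to detect multiple wake words
)

# PvRecorder captures single-channel, 16 kHz audio in frames sized for Porcupine.
recorder = PvRecorder(frame_length=porcupine.frame_length)
recorder.start()

print("Listening for the wake word ...")
try:
    while True:
        frame = recorder.read()
        # process() returns the index of the detected keyword, or -1 if none was heard.
        if porcupine.process(frame) >= 0:
            print("Wake word detected")
            break
finally:
    recorder.stop()
    recorder.delete()
    porcupine.delete()
```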

Add Real-Time Speech-to-Text Transcription

After wake word detection, capture audio frames and transcribe them in real-time with Cheetah Streaming Speech-to-Text:
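The sketch below reuses the PICOVOICE_ACCESS_KEY environment variable from the previous step and keeps transcribing until Cheetah reports an endpoint:

```python
import os

import pvcheetah
from pvrecorder import PvRecorder

cheetah = pvcheetah.create(
    access_key=os.environ["PICOVOICE_ACCESS_KEY"],
    endpoint_duration_sec=1.0,  # seconds of silence that mark the end of an utterance
)

recorder = PvRecorder(frame_length=cheetah.frame_length)
recorder.start()

print("Listening ...")
transcript = ""
try:
    while True:
        # process() returns the newly transcribed text and whether an endpoint was reached.
        partial, is_endpoint = cheetah.process(recorder.read())
        transcript += partial
        if is_endpoint:
            transcript += cheetah.flush()  # collect any remaining buffered text
            break
finally:
    recorder.stop()
    recorder.delete()
    cheetah.delete()

print(f"You said: {transcript}")
```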

Once you make a natural pause in your speech, such as after asking a question, Cheetah detects it as an endpoint, signaling that you've finished speaking.

Send Transcribed Text to Mistral AI API

Send the transcribed text to Mistral AI using the chat completions endpoint:
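The example below assumes the mistralai Python client (v1.x), an API key stored in a MISTRAL_API_KEY environment variable, and mistral-small-latest as a stand-in model name; swap in whichever Mistral chat model fits your use case.

```python
import os

from mistralai import Mistral

client = Mistral(api_key=os.environ["MISTRAL_API_KEY"])


def ask_mistral(prompt: str) -> str:
    # Calls Mistral's chat completions endpoint with a single user message.
    response = client.chat.complete(
        model="mistral-small-latest",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content


print(ask_mistral("In one sentence, what is Voxtral?"))
```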

Convert Mistral AI's Response to Speech

Transform Mistral's text response into natural speech using Orca Streaming Text-to-Speech and PvSpeaker:
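A minimal sketch, assuming the current pvorca API (synthesize() returns 16-bit PCM samples together with word alignments) and a placeholder answer string standing in for Mistral's reply:

```python
import os

import pvorca
from pvspeaker import PvSpeaker

orca = pvorca.create(access_key=os.environ["PICOVOICE_ACCESS_KEY"])

# Match the speaker's sample rate and sample width to Orca's 16-bit PCM output.
speaker = PvSpeaker(sample_rate=orca.sample_rate, bits_per_sample=16)
speaker.start()

answer = "Voxtral is Mistral AI's open speech understanding model."  # placeholder for Mistral's reply
pcm, _ = orca.synthesize(answer)  # single synthesis: the whole response at once

# write() only buffers as much audio as currently fits, so loop until every sample is queued.
written = 0
while written < len(pcm):
    written += speaker.write(pcm[written:])

speaker.flush()  # block until playback finishes
speaker.stop()
speaker.delete()
orca.delete()
```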

In this example, Orca performs single synthesis because the Mistral API call returns the full response at once. When the response is streamed chunk by chunk instead, Orca Streaming Text-to-Speech can generate and play audio in real time through streaming synthesis, enabling significantly lower latency than cloud-based alternatives.
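For reference, a streaming variant would look roughly like the sketch below. It continues with the orca and speaker objects created above and assumes a hypothetical mistral_text_chunks() generator that yields partial response text as it arrives:

```python
# Stream text into Orca as it arrives instead of waiting for the full reply.
stream = orca.stream_open()

for text_chunk in mistral_text_chunks():  # hypothetical generator of partial response text
    pcm = stream.synthesize(text_chunk)
    if pcm is not None:  # Orca returns audio once it has buffered enough text to synthesize
        speaker.write(pcm)

remaining = stream.flush()  # synthesize whatever text is still buffered
if remaining is not None:
    speaker.write(remaining)

stream.close()
```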

Full Python Code for Mistral Voice Assistant

This complete implementation integrates wake word detection, streaming speech-to-text, Mistral API calls, and text-to-speech synthesis:
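The listing below is one way to put the pieces together, under the same assumptions as the earlier snippets: PICOVOICE_ACCESS_KEY and MISTRAL_API_KEY environment variables, a hey-chatbot.ppn keyword file, the mistralai v1 client, and mistral-small-latest as the model.

```python
import os

import pvcheetah
import pvorca
import pvporcupine
from mistralai import Mistral
from pvrecorder import PvRecorder
from pvspeaker import PvSpeaker

# Assumed configuration: adjust the keyword path and model name to match your setup.
ACCESS_KEY = os.environ["PICOVOICE_ACCESS_KEY"]
MISTRAL_API_KEY = os.environ["MISTRAL_API_KEY"]
KEYWORD_PATH = "hey-chatbot.ppn"
MISTRAL_MODEL = "mistral-small-latest"


def listen_for_wake_word(porcupine, recorder):
    # Block until the custom wake word is detected in the microphone stream.
    while True:
        if porcupine.process(recorder.read()) >= 0:
            return


def transcribe_until_endpoint(cheetah, recorder):
    # Stream audio frames into Cheetah and return the transcript once an endpoint is detected.
    transcript = ""
    while True:
        partial, is_endpoint = cheetah.process(recorder.read())
        transcript += partial
        if is_endpoint:
            return transcript + cheetah.flush()


def ask_mistral(client, prompt):
    response = client.chat.complete(
        model=MISTRAL_MODEL,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content


def speak(orca, speaker, text):
    # Single synthesis: convert the full response to PCM, then queue it for playback.
    pcm, _ = orca.synthesize(text)
    written = 0
    while written < len(pcm):
        written += speaker.write(pcm[written:])
    speaker.flush()


def main():
    porcupine = pvporcupine.create(access_key=ACCESS_KEY, keyword_paths=[KEYWORD_PATH])
    cheetah = pvcheetah.create(access_key=ACCESS_KEY, endpoint_duration_sec=1.0)
    orca = pvorca.create(access_key=ACCESS_KEY)
    client = Mistral(api_key=MISTRAL_API_KEY)

    # Porcupine and Cheetah both consume 512-sample frames at 16 kHz, so one recorder feeds both.
    recorder = PvRecorder(frame_length=porcupine.frame_length)
    speaker = PvSpeaker(sample_rate=orca.sample_rate, bits_per_sample=16)

    recorder.start()
    speaker.start()
    print("Say the wake word to start a conversation (Ctrl+C to exit).")

    try:
        while True:
            listen_for_wake_word(porcupine, recorder)
            print("Wake word detected - listening ...")

            prompt = transcribe_until_endpoint(cheetah, recorder)
            print(f"You: {prompt}")

            answer = ask_mistral(client, prompt)
            print(f"Mistral: {answer}")

            speak(orca, speaker, answer)
    except KeyboardInterrupt:
        pass
    finally:
        recorder.delete()
        speaker.delete()
        porcupine.delete()
        cheetah.delete()
        orca.delete()


if __name__ == "__main__":
    main()
```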

Run the Mistral Voice Assistant

To run the voice assistant, update the model path to match your local file and have both API keys ready:
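Assuming the full script above is saved as mistral_voice_assistant.py and reads both keys from environment variables as shown:

```bash
export PICOVOICE_ACCESS_KEY="..."   # AccessKey from Picovoice Console
export MISTRAL_API_KEY="..."        # API key from Mistral's La Plateforme
python3 mistral_voice_assistant.py
```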

You can start building your own commercial or non-commercial projects using Picovoice's self-service Console.


Frequently Asked Questions

Will the voice assistant work accurately in noisy environments, with different accents, or with specialized terminology?
Yes. Porcupine Wake Word and Cheetah Streaming Speech-to-Text are designed to work reliably in real-world conditions with background noise and various accents across supported languages. For domain-specific terminology or brand names, you can add boost words and custom vocabulary to Cheetah Streaming Speech-to-Text.

Can I use any wake word phrase for Mistral activation?
Yes. You can train custom wake words using Picovoice Console in seconds without collecting training data. Simply enter your desired phrase (e.g., "Hey Assistant" or your brand name) and download the trained model.

What happens if the Mistral API is unavailable?
The local voice processing components (wake word detection, speech-to-text, and text-to-speech) continue functioning independently. You can catch API exceptions and use Orca Text-to-Speech to provide voice feedback like "I'm having trouble connecting, please try again."
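For example, reusing the hypothetical ask_mistral and speak helpers from the full listing above, a simple fallback could look like this:

```python
# Fall back to a spoken error message if the Mistral API call fails.
try:
    answer = ask_mistral(client, prompt)
except Exception:
    answer = "I'm having trouble connecting, please try again."

speak(orca, speaker, answer)
```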