Build an AI Voice Note-Taking App with Python

TLDR: Learn how to build a fully hands-free Python voice note-taking app. This tutorial covers setting up voice commands to start and stop recording, processing audio with offline speech-to-text, and generating structured summaries using AI.

Voice note-taking applications help users transcribe interviews, capture lecture summaries, and log voice memos. However, manual interaction during these sessions can disrupt the user's focus. This tutorial demonstrates how to build a voice-activated note-taking app that uses distinct start and stop commands for completely hands-free operation.

The implementation uses Porcupine Wake Word for voice activation and Leopard Speech-to-Text for local transcription. Porcupine Wake Word manages the control flow by detecting two custom phrases: a wake word to begin recording (e.g., "Hey Notes") and a stop phrase to finish (e.g., "Done Notes"). Because recording ends only when the stop phrase is heard, pauses in speech do not cut a note off early, and no manual interaction is needed. Once recording stops, the audio is transcribed locally with Leopard Speech-to-Text, and the text is sent to OpenAI for formatting. This keeps heavy speech processing on-device and uses the cloud only for final summarization. Running speech recognition on-device removes network latency from the transcription step, so transcription performance does not depend on connection quality.

What You'll Build:

A voice note application that:
- Activates with a custom wake word and stops with a specific phrase
- Captures complete voice notes
- Transcribes recordings on-device
- Generates structured summaries from transcripts
- Operates hands-free

What You'll Need:

Python 3.8+
Microphone
Picovoice AccessKey from the Picovoice Console
OpenAI API key from the OpenAI Platform

Looking for real-time AI summarization? Check out our guide for Meeting Summarization with real-time transcription.

Train a Custom Wake Word and Stop Phrase

Sign up for a Picovoice Console account and navigate to the Porcupine page.
Train your wake word (e.g., "Hey Notes" or "Start Recording"):
- Enter the phrase and test it using the microphone button
- Click "Train", select the target platform, and download the .ppn model file as start-recording.ppn
Train your stop phrase (e.g., "Done Notes" or "Stop Recording"):
- Enter the phrase and test it using the microphone button
- Click "Train", select the target platform, and download the model file as stop-recording.ppn

Select phrases that are phonetically distinct to minimize false positives. See the choosing a wake word guide for best practices.

Set Up the Python Environment

Install the required Python SDKs:

Porcupine Wake Word Python SDK: pvporcupine
Leopard Speech-to-Text Python SDK: pvleopard
Picovoice Python Recorder library: pvrecorder
OpenAI Python library: openai

pip install pvporcupine pvleopard pvrecorder openai

Implement Voice-Activated Controls

The following code captures audio from the default microphone and listens for specific start and stop commands:

import pvporcupine
from pvrecorder import PvRecorder

ACCESS_KEY = "${ACCESS_KEY}"

# Paths to your keyword model files (.ppn)
START_KEYWORD_PATH = "${START_KEYWORD_PATH}"  # e.g., "./models/hey-notes.ppn"
STOP_KEYWORD_PATH = "${STOP_KEYWORD_PATH}"     # e.g., "./models/done-notes.ppn"

# Initialize Porcupine with both keywords
porcupine = pvporcupine.create(
    access_key=ACCESS_KEY,
    keyword_paths=[START_KEYWORD_PATH, STOP_KEYWORD_PATH]
)

recorder = PvRecorder(frame_length=porcupine.frame_length)
recorder.start()

print("Listening for wake word...")

# Wait for start keyword
while True:
    pcm = recorder.read()
    keyword_index = porcupine.process(pcm)
    
    if keyword_index == 0:  # Wake word detected
        print("Wake word detected! Recording...")
        break

# Record audio while listening for stop keyword
audio_buffer = []

while True:
    pcm = recorder.read()
    audio_buffer.extend(pcm)  # Buffer audio for transcription
    
    keyword_index = porcupine.process(pcm)
    if keyword_index == 1:  # Stop phrase detected
        print("Stop phrase detected!")
        break

recorder.stop()
porcupine.delete()

This logic gives the user explicit control over the recording session: recording starts and stops only on a voice command.
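The two loops above amount to a small state machine. The sketch below is one way to express that control flow so it can be exercised without a microphone. The names `State`, `next_state`, and `run_session` are hypothetical helpers introduced for illustration and are not part of any Picovoice SDK. It assumes the detector returns -1 when nothing is detected and otherwise the index of the keyword in `keyword_paths`, as `porcupine.process` does.

```python
from enum import Enum


class State(Enum):
    IDLE = "idle"            # waiting for the start keyword
    RECORDING = "recording"  # buffering audio, waiting for the stop keyword
    DONE = "done"            # stop keyword heard, ready to transcribe


START_INDEX = 0  # position of the start model in keyword_paths
STOP_INDEX = 1   # position of the stop model in keyword_paths


def next_state(state, keyword_index):
    """Return the state that follows one processed audio frame."""
    if state is State.IDLE and keyword_index == START_INDEX:
        return State.RECORDING
    if state is State.RECORDING and keyword_index == STOP_INDEX:
        return State.DONE
    return state  # no detection, or a keyword heard out of place


def run_session(frames, detect):
    """Feed frames through the state machine and return (state, samples).

    `frames` is any iterable of PCM frames; `detect` maps one frame to a
    keyword index (for example, porcupine.process).
    """
    state = State.IDLE
    audio_buffer = []
    for frame in frames:
        if state is State.RECORDING:
            audio_buffer.extend(frame)  # buffer before checking for stop
        state = next_state(state, detect(frame))
        if state is State.DONE:
            break
    return state, audio_buffer
```

Keeping the transition rule in a pure function makes it easy to check edge cases, such as a stop phrase spoken before the wake word, with fake frames and a stub detector.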

Transcribe Audio

Leopard Speech-to-Text performs batch transcription to convert the audio into text:

import pvleopard

ACCESS_KEY = "${ACCESS_KEY}"

leopard = pvleopard.create(access_key=ACCESS_KEY)

# Transcribe from raw PCM buffer
transcript, words = leopard.process(audio_buffer)

print(f"Transcript: {transcript}")

leopard.delete()

Batch transcription processes the entire recording in a single pass. This generally yields higher accuracy than real-time streaming because the engine can use the full context of each sentence to resolve ambiguities.
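Because the recording loop buffers every frame until the stop phrase is detected, the spoken stop phrase is part of the audio and may show up at the end of the transcript. The sketch below removes it with a plain string match. `strip_trailing_phrase` is a hypothetical helper, and it assumes the engine transcribes the phrase exactly as written (for example, "done notes"), which will not always hold.

```python
import re


def strip_trailing_phrase(transcript, phrase):
    """Remove `phrase` from the end of `transcript`, ignoring case and punctuation."""
    words = re.findall(r"[A-Za-z0-9']+", phrase)
    if not words:
        return transcript.strip()
    # Allow any non-word separators between the phrase words and after the last one.
    body = r"[\W_]*".join(re.escape(word) for word in words)
    trimmed = re.sub(r"\b" + body + r"[\W_]*$", "", transcript, flags=re.IGNORECASE)
    return trimmed.strip()


# Example: clean the transcript before sending it for summarization.
# transcript = strip_trailing_phrase(transcript, "Done Notes")
```

If the phrase is transcribed differently, the transcript is returned unchanged apart from surrounding whitespace.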

Leopard Speech-to-Text can also transcribe directly from an audio file.
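To use the file-based path with audio captured in this tutorial, the buffered samples first need to be written to disk. The sketch below does this with Python's standard `wave` module. `save_pcm_to_wav` is a hypothetical helper, and the 16 kHz, mono, 16-bit format is an assumption based on the audio PvRecorder produces; check the sample rate your engine instance reports before relying on it.

```python
import struct
import wave


def save_pcm_to_wav(pcm, path, sample_rate=16000):
    """Write a list of 16-bit integer samples to a mono WAV file."""
    with wave.open(str(path), "wb") as wav_file:
        wav_file.setnchannels(1)            # mono
        wav_file.setsampwidth(2)            # 2 bytes per sample = 16-bit audio
        wav_file.setframerate(sample_rate)  # samples per second
        # Pack the integers as little-endian signed 16-bit values.
        wav_file.writeframes(struct.pack(f"<{len(pcm)}h", *pcm))


# Example (requires an initialized Leopard instance):
# save_pcm_to_wav(audio_buffer, "note.wav")
# transcript, words = leopard.process_file("note.wav")
```

Saving the audio also leaves a copy of the original recording that can be transcribed again later.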

Generate Structured AI-Powered Notes

Finally, the transcript is sent to GPT-4, which organizes the raw text into a structured format:

from openai import OpenAI

OPENAI_API_KEY = "${OPENAI_API_KEY}"
client = OpenAI(api_key=OPENAI_API_KEY)

def generate_notes(client, transcript):
    """Generate structured notes from transcript"""
    
    prompt = f"""Convert the following transcript into structured notes.

Transcript:
{transcript}

Generate:
1. A concise title (5-7 words)
2. Key points as bullet points
3. Any action items or follow-ups

Format the response as:
Title: [title]
Key Points:
- [point 1]
- [point 2]
Action Items:
- [item 1]
"""

    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "You are a note-taking assistant. Create clear, structured notes."},
            {"role": "user", "content": prompt}
        ]
    )
    
    return response.choices[0].message.content.strip()

Because the transcript is sent only after the user explicitly stops recording, the LLM receives the complete note rather than a fragment, which it needs for accurate summarization.
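Since the prompt asks for a fixed "Title / Key Points / Action Items" layout, the response can be turned into a dictionary for sorting or export. The sketch below is one way to do that. `parse_notes` is a hypothetical helper, and it assumes the model follows the requested layout; an LLM may deviate from it, in which case the corresponding fields simply stay empty.

```python
def parse_notes(text):
    """Split 'Title / Key Points / Action Items' text into a dictionary."""
    notes = {"title": "", "key_points": [], "action_items": []}
    section = None  # the list that bullet lines are currently added to
    for raw_line in text.splitlines():
        line = raw_line.strip()
        lowered = line.lower()
        if lowered.startswith("title:"):
            notes["title"] = line.split(":", 1)[1].strip()
            section = None
        elif lowered.startswith("key points"):
            section = "key_points"
        elif lowered.startswith("action items"):
            section = "action_items"
        elif line.startswith(("-", "*")) and section:
            # Drop the bullet marker and keep the item text.
            notes[section].append(line.lstrip("-* ").strip())
    return notes


# Example: structured = parse_notes(generate_notes(client, transcript))
```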

Full Python Code for AI-Powered Voice Note-Taking App

Here is the complete source code, integrating Porcupine Wake Word for voice commands, Leopard Speech-to-Text for transcription, and OpenAI for AI-powered summarization:

import argparse
import sys
from datetime import datetime
from pathlib import Path

import pvporcupine
import pvleopard
from pvrecorder import PvRecorder
from openai import OpenAI


def generate_notes(client, transcript):
    """Generate structured notes from transcript using OpenAI"""
    prompt = f"""Convert the following transcript into structured notes.

Transcript:
{transcript}

Generate:
1. A concise title (5-7 words)
2. Key points as bullet points
3. Any action items or follow-ups

Format the response as:
Title: [title]
Key Points:
- [point 1]
- [point 2]
Action Items:
- [item 1]
"""

    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "You are a note-taking assistant. Create clear, structured notes."},
            {"role": "user", "content": prompt}
        ]
    )

    return response.choices[0].message.content.strip()


def main():
    parser = argparse.ArgumentParser(
        description="Voice note-taking with wake word activation and stop phrase"
    )
    parser.add_argument("--access_key", required=True, help="Picovoice AccessKey")
    parser.add_argument("--start_keyword_path", required=True, help="Path to start wake word .ppn model")
    parser.add_argument("--stop_keyword_path", required=True, help="Path to stop phrase .ppn model")
    parser.add_argument("--openai_key", required=True, help="OpenAI API key")
    parser.add_argument("--output_dir", default="./notes", help="Directory to save notes")
    parser.add_argument("--audio_device_index", type=int, default=-1, help="Audio device index")
    parser.add_argument("--show_audio_devices", action="store_true")
    args = parser.parse_args()

    if args.show_audio_devices:
        for i, device in enumerate(PvRecorder.get_available_devices()):
            print(f"Device {i}: {device}")
        return

    # Create output directory
    output_dir = Path(args.output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)  # also create missing parent directories

    # Initialize engines
    porcupine = pvporcupine.create(
        access_key=args.access_key,
        keyword_paths=[args.start_keyword_path, args.stop_keyword_path]
    )

    leopard = pvleopard.create(access_key=args.access_key)

    openai_client = OpenAI(api_key=args.openai_key)

    # Single recorder for everything
    recorder = PvRecorder(
        frame_length=porcupine.frame_length,
        device_index=args.audio_device_index
    )

    print(f"Porcupine version: {porcupine.version}")
    print(f"Leopard version: {leopard.version}")
    print(f"Frame length: {porcupine.frame_length}")

    try:
        while True:
            recorder.start()
            print("\nListening for wake word... (Ctrl+C to exit)")

            # Phase 1: Wait for START wake word
            while True:
                pcm = recorder.read()
                keyword_index = porcupine.process(pcm)

                if keyword_index == 0:  # Start keyword detected
                    print(f"\n[{datetime.now()}] Wake word detected! Recording...")
                    break

            # Phase 2: Record audio while listening for STOP wake word
            audio_buffer = []
            print("Recording... (say stop phrase when done)")

            while True:
                pcm = recorder.read()
                audio_buffer.extend(pcm)  # Buffer audio for transcription

                keyword_index = porcupine.process(pcm)

                if keyword_index == 1:  # Stop keyword detected
                    print(f"\n[{datetime.now()}] Stop phrase detected! Processing...")
                    break

            recorder.stop()

            if not audio_buffer:
                print("No audio captured. Try again.")
                continue

            # Phase 3: Transcribe with Leopard (pass raw PCM directly)
            print("Transcribing...")
            transcript, words = leopard.process(audio_buffer)
            print(f"\n[TRANSCRIPT]\n{transcript}\n")

            if not transcript.strip():
                print("No speech detected in recording.")
                continue

            # Phase 4: Generate notes with OpenAI
            print("Generating structured notes...")
            notes = generate_notes(openai_client, transcript)

            # Save notes
            timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
            notes_path = output_dir / f"note_{timestamp}.txt"
            with open(notes_path, 'w', encoding='utf-8') as f:
                f.write(f"Timestamp: {timestamp}\n\n")
                f.write(f"Transcript:\n{transcript}\n\n")
                f.write(f"Notes:\n{notes}\n")

            print(f"\n{'=' * 60}")
            print("GENERATED NOTES:")
            print('=' * 60)
            print(notes)
            print('=' * 60)
            print(f"\nSaved to: {notes_path}")

            response = input("\nRecord another note? (y/n): ")
            if response.lower() != 'y':
                break

    except KeyboardInterrupt:
        print("\nStopping...")
    except pvporcupine.PorcupineActivationLimitError:
        print("AccessKey has reached its processing limit")
    except pvleopard.LeopardActivationLimitError:
        print("AccessKey has reached its processing limit")
    finally:
        recorder.delete()
        leopard.delete()
        porcupine.delete()


if __name__ == "__main__":
    sys.exit(main())

Run the Voice Note-Taking App

To run the AI note-taking application, update the model paths to match your local files and ensure both API keys are available:

Picovoice AccessKey (from Picovoice Console)
OpenAI API key

python3 voice_notes.py \
  --access_key "$ACCESS_KEY" \
  --start_keyword_path ./models/hey-notes.ppn \
  --stop_keyword_path ./models/done-notes.ppn \
  --openai_key "$OPENAI_API_KEY" \
  --output_dir ./my_notes

You can start building your own commercial or non-commercial projects leveraging Picovoice's self-service Console.

Frequently Asked Questions

Will voice notes work accurately in noisy environments?

Yes. Porcupine Wake Word and Leopard Speech-to-Text are designed for real-world conditions, including background noise.

Can I accidentally trigger the stop phrase while speaking?

False positives are minimized by selecting distinct phrases. Choose a stop phrase that is unlikely to come up in natural conversation, such as during a meeting. Testing keywords in the Picovoice Console before deployment helps confirm that they do not trigger on common words.

What happens if the OpenAI API fails?

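Network issues can occasionally interrupt the AI summarization step. However, because transcription occurs on-device, the raw text can be saved locally, and the summary generation can be retried later using the saved transcript file. Note that the full code above writes the note file only after summarization succeeds, so save the transcript before calling the API if you need this fallback.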
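One way to implement this fallback is sketched below: the transcript is written to disk first, then the summarization call is retried with exponential backoff. `summarize_with_retry` and its parameters are hypothetical names introduced for illustration, and catching a bare `Exception` is a simplification; in practice, narrow it to the error types your client raises.

```python
import time
from pathlib import Path


def summarize_with_retry(summarize, transcript, transcript_path,
                         attempts=3, base_delay=1.0, sleep=time.sleep):
    """Save the transcript, then call `summarize` with exponential backoff.

    Returns the notes, or None if every attempt fails.
    """
    # Persist the raw text first so it survives any API failure.
    Path(transcript_path).write_text(transcript, encoding="utf-8")
    for attempt in range(attempts):
        try:
            return summarize(transcript)
        except Exception as error:  # simplification: narrow this in real code
            print(f"Attempt {attempt + 1} failed: {error}")
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)  # waits 1s, 2s, 4s, ...
    return None


# Example, using the names from the full code above:
# notes = summarize_with_retry(
#     lambda text: generate_notes(openai_client, text),
#     transcript,
#     output_dir / "transcript.txt",
# )
```

If the function returns `None`, the saved transcript file can be passed to the summarizer again later.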

How does batch transcription differ from real-time transcription?

Batch transcription processes the full audio file after recording is complete, whereas real-time transcription processes audio as it is spoken. For note-taking, batch processing often yields higher accuracy because the engine analyzes the full context of sentences before finalizing text.

Can I customize the note format?

Yes. The prompt built in generate_notes can be modified to change the structure, for example by adding categories, tags, or priority levels. Local logic can also be implemented to sort notes automatically based on content.
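As a rough illustration of a customizable format, the sketch below builds the prompt from a list of section names instead of a hard-coded template. `build_prompt` and its `sections` parameter are hypothetical and not part of the tutorial's code; the wording mirrors the prompt used in generate_notes.

```python
def build_prompt(transcript, sections=("Key Points", "Action Items")):
    """Build a note-taking prompt with a configurable list of bulleted sections."""
    lines = [
        "Convert the following transcript into structured notes.",
        "",
        "Transcript:",
        transcript,
        "",
        "Format the response as:",
        "Title: [title]",
    ]
    for name in sections:
        lines.append(f"{name}:")   # section heading the model should emit
        lines.append("- [item]")   # placeholder bullet under that heading
    return "\n".join(lines)


# Example: add a "Tags" section and drop action items.
# prompt = build_prompt(transcript, sections=("Key Points", "Tags"))
```

The resulting string can be passed as the user message in place of the fixed prompt.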