OpenAI Whisper delivers highly accurate speech-to-text transcription, but it does not track speaker changes. Applications that rely on Whisper cannot determine who is speaking in a conversation. To add speaker labels such as "Speaker 1" and "Speaker 2," you can integrate Falcon Speaker Diarization with whisper.cpp, creating a fully local, offline, multi-speaker transcription system.
This tutorial explains how to add speaker segmentation to Whisper, so you can determine "who spoke when" and generate timestamps to produce labeled Whisper transcripts suitable for multi-speaker recordings. This approach is ideal for use cases such as podcast transcriptions, meeting transcription and summarization, or call center analytics, where speaker identification is essential for readability and downstream analysis.
Add Speaker Diarization to OpenAI Whisper CPP in 3 Steps:
- Transcribe audio using whisper.cpp, producing text with timestamps.
- Run Falcon Speaker Diarization on the same audio file to generate speaker-labeled segments.
- Merge the results into a single transcript with accurate speaker labels.
We'll use modern C++ with whisper.cpp for the transcription pipeline, dr_wav for audio file handling, and the Falcon C API for speaker diarization. The implementation works cross-platform on Linux, macOS, Windows, and Raspberry Pi, requires no cloud services, processes audio files locally for privacy, and runs efficiently on standard hardware.
An alternative solution to Whisper Speech-to-Text is Leopard Speech-to-Text, a highly accurate and efficient transcription engine that provides built-in speaker diarization through its word metadata feature.
Equivalent Implementation in Python: To see the Python version of this implementation, check out Whisper Falcon Integration in Python.
Table of Contents
- Prerequisites
- Getting Started
- Part 1: Speech Transcription with Whisper
- Part 2: Speaker Diarization with Falcon
- Part 3: Merge Transcript with Speaker Labels
- Complete Code
- Troubleshooting Common Issues
- Frequently Asked Questions
Prerequisites
- C++ compiler (for Whisper)
- C99-compatible compiler (for Falcon)
  - Windows: MinGW
- A Picovoice AccessKey (get your free AccessKey)
Supported Platforms
- Linux: x86_64
- macOS: x86_64 and arm64
- Windows: x86_64 and arm64
- Raspberry Pi: Models 3, 4, and 5
Getting Started
Recommended Project Structure
This is the folder structure used in this tutorial. You can organize your files differently if you like, but make sure to update the paths in the examples accordingly:
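The exact layout isn't required, but the paths used throughout this tutorial assume something like the following (names here are illustrative):

```
project-root/
├── CMakeLists.txt
├── whisper_falcon.cpp
├── dr_wav.h
├── whisper.cpp/                  # cloned whisper.cpp repository
│   └── models/
│       └── ggml-base.en.bin
├── falcon/                       # Falcon headers, model, and library
│   ├── pv_falcon.h
│   ├── falcon_params.pv
│   └── libpv_falcon.(so|dylib|dll)
└── audio/
    └── sample.wav
```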
Step 1: Add Dependencies
1a. Download Whisper and the Speech Recognition Model
whisper.cpp is the C/C++ port of OpenAI Whisper. It handles audio-to-text transcription.
From your project root, download a Whisper model (we'll use base.en):
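If you haven't already cloned whisper.cpp into your project root, do that first. The repository ships a download script for the models (the same script referenced in Troubleshooting below):

```sh
git clone https://github.com/ggerganov/whisper.cpp.git
cd whisper.cpp
sh ./models/download-ggml-model.sh base.en
cd ..
```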
After downloading, confirm the model file is located at whisper.cpp/models/ggml-base.en.bin.
1b. Add the WAV Audio File Handler
dr_wav gives us a WAV loader compatible with Whisper's required PCM float format.
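dr_wav is a single header file from the dr_libs repository. One way to fetch it into the project root (assuming the upstream raw URL is still current):

```sh
curl -L -o dr_wav.h https://raw.githubusercontent.com/mackron/dr_libs/master/dr_wav.h
```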
1c. Add Falcon Speaker Diarization
- Create a folder named falcon/ at your project root.
- Copy the header files from the Falcon GitHub repository into the falcon/ folder.
- Copy the Falcon model file (falcon_params.pv) and the correct library file for your platform (.so, .dylib, or .dll) into the falcon/ folder.
Step 2: Configure the Build System
Create a CMakeLists.txt file that links all components together:
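A minimal sketch, assuming the folder layout above. whisper.cpp's own CMake project defines the `whisper` target, and Falcon is loaded at runtime rather than linked:

```cmake
cmake_minimum_required(VERSION 3.16)
project(whisper_falcon CXX)

set(CMAKE_CXX_STANDARD 17)

# Build whisper.cpp from the cloned subdirectory; it defines the `whisper` target.
add_subdirectory(whisper.cpp)

add_executable(whisper_falcon whisper_falcon.cpp)

# dr_wav.h sits at the project root; pv_falcon.h in falcon/.
target_include_directories(whisper_falcon PRIVATE
        ${CMAKE_SOURCE_DIR}
        ${CMAKE_SOURCE_DIR}/falcon)

# Falcon is opened with dlopen/LoadLibrary at runtime, so only the
# dynamic-loader support library is needed on Unix-like systems.
target_link_libraries(whisper_falcon PRIVATE whisper ${CMAKE_DL_LIBS})
```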
Step 3: Set Up Your C++ Application
3a. Import Headers
Create the main application file (whisper_falcon.cpp) and import the required headers. Update the placeholder paths as needed:
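A sketch of the includes and configuration constants. The constant names match those used in the Build & Run section below; the placeholder values are assumptions to adjust for your machine:

```cpp
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <vector>

// DR_WAV_IMPLEMENTATION must be defined in exactly one translation unit.
#define DR_WAV_IMPLEMENTATION
#include "dr_wav.h"

#include "whisper.h"
#include "pv_falcon.h"

#if defined(_WIN32) || defined(_WIN64)
#include <windows.h>
#else
#include <dlfcn.h>
#endif

// Placeholder paths and key -- update these for your setup.
static const char *AUDIO_FILE_PATH = "audio/sample.wav";
static const char *WHISPER_MODEL_PATH = "whisper.cpp/models/ggml-base.en.bin";
static const char *PV_FALCON_MODEL_PATH = "falcon/falcon_params.pv";
static const char *PV_FALCON_LIBRARY_PATH = "falcon/libpv_falcon.so";
static const char *PV_ACCESS_KEY = "${YOUR_ACCESS_KEY}";
```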
- On Windows systems, `windows.h` provides the `LoadLibrary` function to load a shared library and `GetProcAddress` to retrieve individual function pointers.
- On Unix-based systems, `dlopen` and `dlsym` from the `dlfcn.h` header provide the same functionality.
The platform-specific headers enable dynamic loading of the Falcon library. Unlike static linking where libraries are bundled at compile time, dynamic loading loads shared libraries at runtime.
3b. Define dynamic loading helper functions
Define helper functions to open the shared library, load function symbols, close the library, and print platform-correct errors. We'll use these later when implementing Falcon Speaker Diarization.
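A sketch of these helpers, modeled on the pattern Picovoice's C demos use. `load_symbol` matches the helper name referenced later in this tutorial; the other names are this tutorial's own:

```cpp
#if defined(_WIN32) || defined(_WIN64)

// Open a shared library at runtime (LoadLibraryA avoids the wide-char mapping).
static void *open_dl(const char *path) {
    return LoadLibraryA(path);
}

// Look up a function by name inside an open library.
static void *load_symbol(void *handle, const char *name) {
    return reinterpret_cast<void *>(GetProcAddress(static_cast<HMODULE>(handle), name));
}

static void close_dl(void *handle) {
    FreeLibrary(static_cast<HMODULE>(handle));
}

static void print_dl_error(const char *message) {
    fprintf(stderr, "%s with code '%lu'.\n", message, GetLastError());
}

#else

static void *open_dl(const char *path) {
    return dlopen(path, RTLD_NOW);
}

static void *load_symbol(void *handle, const char *name) {
    return dlsym(handle, name);
}

static void close_dl(void *handle) {
    dlclose(handle);
}

static void print_dl_error(const char *message) {
    fprintf(stderr, "%s with '%s'.\n", message, dlerror());
}

#endif
```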
Part 1: Speech Transcription with Whisper
Step 4: Initialize the Whisper Model
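Load the model into a Whisper context. A minimal sketch using the whisper.cpp API:

```cpp
// Initialize a Whisper context from the downloaded model file.
struct whisper_context_params cparams = whisper_context_default_params();
struct whisper_context *whisper_ctx = whisper_init_from_file_with_params(WHISPER_MODEL_PATH, cparams);
if (whisper_ctx == nullptr) {
    fprintf(stderr, "Failed to initialize Whisper from '%s'.\n", WHISPER_MODEL_PATH);
    return 1;
}
```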
Step 5: Load and Prepare Audio Data
Whisper requires audio in a specific format: 16 kHz, mono, 32-bit floating-point PCM samples. Use dr_wav to load a WAV file and convert it automatically to the correct sample format:
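A sketch using dr_wav's file API. It assumes the recording is already sampled at 16 kHz and downmixes stereo to mono by averaging:

```cpp
// Open the WAV file; dr_wav converts samples to 32-bit float on read.
drwav wav;
if (!drwav_init_file(&wav, AUDIO_FILE_PATH, nullptr)) {
    fprintf(stderr, "Failed to open WAV file '%s'.\n", AUDIO_FILE_PATH);
    return 1;
}
if (wav.sampleRate != 16000) {
    fprintf(stderr, "Warning: Whisper expects 16 kHz audio, got %u Hz.\n", wav.sampleRate);
}

const uint32_t channels = wav.channels;
const uint64_t frames = wav.totalPCMFrameCount;

std::vector<float> pcmf32(frames * channels);
drwav_read_pcm_frames_f32(&wav, frames, pcmf32.data());
drwav_uninit(&wav);

// Downmix stereo to mono by averaging the two channels.
if (channels == 2) {
    for (uint64_t i = 0; i < frames; i++) {
        pcmf32[i] = 0.5f * (pcmf32[2 * i] + pcmf32[2 * i + 1]);
    }
    pcmf32.resize(frames);
}
```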
You can use other audio file formats, as long as you convert the audio to pcmf32 for Whisper.
Step 6: Generate the Transcript
Process the audio through Whisper using beam search:
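For example, with a beam size of 5 (the size is an arbitrary choice here):

```cpp
// Configure Whisper to decode with beam search.
whisper_full_params wparams = whisper_full_default_params(WHISPER_SAMPLING_BEAM_SEARCH);
wparams.beam_search.beam_size = 5;

if (whisper_full(whisper_ctx, wparams, pcmf32.data(), (int) pcmf32.size()) != 0) {
    fprintf(stderr, "Whisper failed to process the audio.\n");
    return 1;
}
```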
After processing, Whisper returns time-stamped segments, each containing transcribed text. Iterate through each segment to see the transcribed text:
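Whisper reports segment timestamps in 10 ms units; converting them to seconds makes them directly comparable to Falcon's timestamps later:

```cpp
const int n_segments = whisper_full_n_segments(whisper_ctx);
for (int i = 0; i < n_segments; i++) {
    const char *text = whisper_full_get_segment_text(whisper_ctx, i);
    // Timestamps are reported in units of 10 ms; convert to seconds.
    const float t0 = whisper_full_get_segment_t0(whisper_ctx, i) / 100.0f;
    const float t1 = whisper_full_get_segment_t1(whisper_ctx, i) / 100.0f;
    printf("[%.2fs -> %.2fs] %s\n", t0, t1, text);
}
```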
These segments will later be matched with speaker labels.
Part 2: Speaker Diarization with Falcon
With the transcription step complete, let's add speaker labels.
Step 7: Load Dynamic Library
7a. Open the Shared Library
Load the Falcon Speaker Diarization library:
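Using the helpers from Step 3b:

```cpp
void *falcon_dl = open_dl(PV_FALCON_LIBRARY_PATH);
if (falcon_dl == nullptr) {
    print_dl_error("Failed to open the Falcon library");
    return 1;
}
```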
7b. Load Required Functions
Load the Falcon API functions from the shared library by retrieving their addresses and storing them as function pointers:
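A sketch of the pattern. The parameter lists below mirror the declarations in pv_falcon.h as this tutorial assumes them; verify them against the header you copied into falcon/:

```cpp
// Function-pointer types mirroring the declarations in pv_falcon.h.
typedef pv_status_t (*pv_falcon_init_func)(const char *, const char *, const char *, pv_falcon_t **);
typedef pv_status_t (*pv_falcon_process_file_func)(pv_falcon_t *, const char *, int32_t *, pv_segment_t **);
typedef void (*pv_falcon_delete_func)(pv_falcon_t *);
typedef const char *(*pv_status_to_string_func)(pv_status_t);

// Resolve each symbol by name and cast it to the matching pointer type.
auto pv_falcon_init_fn = (pv_falcon_init_func) load_symbol(falcon_dl, "pv_falcon_init");
auto pv_falcon_process_file_fn = (pv_falcon_process_file_func) load_symbol(falcon_dl, "pv_falcon_process_file");
auto pv_falcon_delete_fn = (pv_falcon_delete_func) load_symbol(falcon_dl, "pv_falcon_delete");
auto pv_status_to_string_fn = (pv_status_to_string_func) load_symbol(falcon_dl, "pv_status_to_string");

if (!pv_falcon_init_fn || !pv_falcon_process_file_fn || !pv_falcon_delete_fn || !pv_status_to_string_fn) {
    print_dl_error("Failed to load a Falcon function symbol");
    return 1;
}
```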
Each function is loaded in two steps:
- use `typedef` to define a function pointer type
- use `load_symbol` to find the function by name in the library and cast it to the correct type
These pointers can then be called like regular functions throughout your code.
Step 8: Initialize Falcon Speaker Diarization
Create a Falcon instance using your AccessKey and model file (falcon_params.pv).
If you haven't already, sign up for a free account on Picovoice Console and copy your AccessKey from the dashboard.
The device parameter lets you choose what hardware the engine runs on.
You can set it to 'best' to automatically pick the most suitable option, or specify a device yourself. For example, 'gpu' uses the first available GPU, while 'gpu:0' or 'gpu:1' targets a specific GPU. If you want to run on the CPU, use 'cpu', or control the number of CPU threads with something like 'cpu:4'.
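Putting that together; the three-argument init shown here, with the device string before the output handle, is an assumption to check against pv_falcon.h:

```cpp
pv_falcon_t *falcon = nullptr;
// "best" lets the engine pick the most suitable device automatically.
pv_status_t status = pv_falcon_init_fn(PV_ACCESS_KEY, PV_FALCON_MODEL_PATH, "best", &falcon);
if (status != PV_STATUS_SUCCESS) {
    fprintf(stderr, "Failed to init Falcon: %s\n", pv_status_to_string_fn(status));
    return 1;
}
```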
See available devices (optional)
To see the list of available hardware devices, use pv_falcon_list_hardware_devices:
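A sketch; the exact signature is an assumption to verify against the header:

```cpp
// Hypothetical signature -- check pv_falcon.h for the exact declaration,
// including the matching function for freeing the returned list.
typedef pv_status_t (*pv_falcon_list_hardware_devices_func)(char ***, int32_t *);
auto list_devices_fn =
        (pv_falcon_list_hardware_devices_func) load_symbol(falcon_dl, "pv_falcon_list_hardware_devices");

char **devices = nullptr;
int32_t num_devices = 0;
if (list_devices_fn && list_devices_fn(&devices, &num_devices) == PV_STATUS_SUCCESS) {
    for (int32_t i = 0; i < num_devices; i++) {
        printf("Available device: %s\n", devices[i]);
    }
}
```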
Step 9: Label Speakers in the Audio
Process the audio file through Falcon. Falcon returns time-stamped segments, each labeled with a unique speaker ID.
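For example, assuming the file-processing entry point loaded in Step 7b:

```cpp
int32_t num_segments = 0;
pv_segment_t *segments = nullptr;
const pv_status_t process_status =
        pv_falcon_process_file_fn(falcon, AUDIO_FILE_PATH, &num_segments, &segments);
if (process_status != PV_STATUS_SUCCESS) {
    fprintf(stderr, "Falcon failed to process the audio: %s\n", pv_status_to_string_fn(process_status));
    return 1;
}
```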
Each Falcon segment contains:
- start_sec: When the speaker starts talking
- end_sec: When the speaker stops talking
- speaker_tag: A unique identifier for this speaker (0, 1, 2, etc.)
Part 3: Merge Transcript with Speaker Labels
Step 10: Match Speakers to Transcript Segments
Match each transcript segment with the corresponding speaker by comparing timestamps. When a Whisper segment overlaps with a Falcon segment, assign that speaker tag to the transcribed text.
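A simple overlap-based matcher: for each Whisper segment, pick the Falcon segment with the greatest time overlap and take its speaker tag:

```cpp
const int n_whisper_segments = whisper_full_n_segments(whisper_ctx);
for (int i = 0; i < n_whisper_segments; i++) {
    // Convert Whisper's 10 ms timestamp units to seconds.
    const float t0 = whisper_full_get_segment_t0(whisper_ctx, i) / 100.0f;
    const float t1 = whisper_full_get_segment_t1(whisper_ctx, i) / 100.0f;

    int32_t speaker_tag = -1;
    float best_overlap = 0.0f;
    for (int32_t j = 0; j < num_segments; j++) {
        // Overlap between [t0, t1] and the Falcon segment; negative means disjoint.
        const float overlap =
                std::min(t1, segments[j].end_sec) - std::max(t0, segments[j].start_sec);
        if (overlap > best_overlap) {
            best_overlap = overlap;
            speaker_tag = segments[j].speaker_tag;
        }
    }

    printf("Speaker %d: %s\n", speaker_tag, whisper_full_get_segment_text(whisper_ctx, i));
}
```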
Cleanup Resources
When done, clean up resources to free memory:
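For example (the name of the segment-freeing function is an assumption; check pv_falcon.h for the exact declaration):

```cpp
// Release the Falcon results and engine, then the Whisper context and library handle.
typedef void (*pv_falcon_segments_delete_func)(pv_segment_t *);
auto segments_delete_fn =
        (pv_falcon_segments_delete_func) load_symbol(falcon_dl, "pv_falcon_segments_delete");
if (segments_delete_fn) {
    segments_delete_fn(segments);
}
pv_falcon_delete_fn(falcon);
close_dl(falcon_dl);
whisper_free(whisper_ctx);
```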
Complete Code: Audio Transcription with Speaker Labels in C++
Here's the complete whisper_falcon.cpp file bringing all the steps together:
whisper_falcon.cpp
Build & Run the Application
Before building, verify:
- `AUDIO_FILE_PATH` points to a valid WAV file
- `WHISPER_MODEL_PATH` matches your downloaded model location
- `PV_FALCON_MODEL_PATH` points to falcon_params.pv
- `PV_FALCON_LIBRARY_PATH` points to:
  - Linux: x86_64/libpv_falcon.so
  - macOS: x86_64/libpv_falcon.dylib or arm64/libpv_falcon.dylib
  - Windows: amd64/libpv_falcon.dll or arm64/libpv_falcon.dll
- `PV_ACCESS_KEY` contains your Picovoice access key
Build
In the project root containing CMakeLists.txt:
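For example:

```sh
cmake -S . -B build
cmake --build build
```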
Run
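Then run the binary (the target name follows the CMakeLists sketch above):

```sh
./build/whisper_falcon
```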
Expected Output Example
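The exact text depends on your recording, but a two-speaker file yields output shaped like this (content here is purely illustrative):

```
Speaker 0: Good morning, everyone. Let's get started.
Speaker 1: Thanks. First on the agenda is the quarterly report.
```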
Troubleshooting Common Issues
whisper_init_from_file_with_params_no_state: failed to open
- Run the Whisper model download script: `sh ./models/download-ggml-model.sh base.en`
- Verify the model exists at: whisper.cpp/models/ggml-base.en.bin
- Update `WHISPER_MODEL_PATH` to match your actual model location
Failed to init Falcon
- Confirm falcon_params.pv is in the falcon/ directory
- Check that `PV_FALCON_MODEL_PATH` points to the correct location
- Download the model file from: Falcon model on GitHub
Failed to open WAV file
- Verify the WAV file exists at the path specified in `AUDIO_FILE_PATH`
- Ensure the audio file is a valid WAV file (Falcon supports other formats, but this example expects WAV)
Application runs but produces no output
- Ensure your audio file contains speech (test with a known recording)
- Check that the audio is not corrupted by playing it in a media player
Frequently Asked Questions
What audio formats are supported?
whisper.cpp requires audio in PCM float format, which this tutorial handles through dr_wav for WAV files. Falcon itself supports many formats including 3gp (AMR), FLAC, MP3, MP4/m4a (AAC), Ogg, WAV, and WebM. To use other formats, you'd need to add an appropriate audio decoder (like FFmpeg) to convert audio to PCM float for Whisper.
How many speakers can Falcon handle?
Falcon can diarize unlimited speakers in a conversation. It automatically assigns unique speaker tags (0, 1, 2, etc.) as it detects different voices.
Does this work with languages other than English?
Yes. Use the multilingual Whisper models (without .en suffix) like base, small, or large. Falcon performs speaker diarization independent of language—it analyzes voice characteristics, not speech content.
Is there a Python version of this integration?
Yes. Picovoice provides Python bindings for Falcon, and OpenAI Whisper has an official Python package. Check out our guide Adding Speaker Diarization to OpenAI Whisper in Python for guidance.