OpenAI Whisper delivers highly accurate speech-to-text transcription, but it does not track speaker changes. Applications that rely on Whisper cannot determine who is speaking in a conversation. To add speaker labels such as "Speaker 1" and "Speaker 2," you can integrate Falcon Speaker Diarization with whisper.cpp, creating a fully local, offline, multi-speaker transcription system.

This tutorial explains how to add speaker segmentation to Whisper so you can determine "who spoke when" and produce time-stamped, speaker-labeled Whisper transcripts for multi-speaker recordings. This approach is ideal for use cases such as podcast transcription, meeting transcription and summarization, or call center analytics, where speaker identification is essential for readability and downstream analysis.

Add Speaker Diarization to OpenAI Whisper CPP in 3 Steps:

  1. Transcribe audio using whisper.cpp, producing text with timestamps.
  2. Run Falcon Speaker Diarization on the same audio file to generate speaker-labeled segments.
  3. Merge the results into a single transcript with accurate speaker labels.

We'll use modern C++ with whisper.cpp for the transcription pipeline, dr_wav for audio file handling, and the Falcon C API for speaker diarization. The implementation works cross-platform on Linux, macOS, Windows, and Raspberry Pi, requires no cloud services, processes audio files locally for privacy, and runs efficiently on standard hardware.

An alternative solution to Whisper Speech-to-Text is Leopard Speech-to-Text, a highly accurate and efficient transcription engine that provides built-in speaker diarization through its word metadata feature.

Equivalent Implementation in Python: To see the Python version of this implementation, check out Whisper Falcon Integration in Python.

Prerequisites

  • C++ compiler (for Whisper)
  • C99-compatible compiler (for Falcon)
  • CMake 3.16+ (for the build in Step 2)
  • On Windows: MinGW
  • A Picovoice AccessKey (Get your free AccessKey)

Supported Platforms

  • Linux: x86_64
  • macOS: x86_64 and arm64
  • Windows: x86_64 and arm64
  • Raspberry Pi: Models 3, 4, and 5

Getting Started

This is the folder structure used in this tutorial. You can organize your files differently if you like, but make sure to update the paths in the examples accordingly:
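One possible layout, reconstructed from the paths used in this tutorial (names such as dr_wav/ and sample.wav are our own placeholders):

```
project-root/
├── CMakeLists.txt
├── whisper_falcon.cpp
├── sample.wav
├── dr_wav/
│   └── dr_wav.h
├── falcon/
│   ├── pv_falcon.h            (plus any other Falcon headers)
│   ├── falcon_params.pv
│   └── libpv_falcon.so        (.dylib on macOS, .dll on Windows)
└── whisper.cpp/               (cloned repository)
    └── models/
        └── ggml-base.en.bin
```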

Step 1: Add Dependencies

1a. Download Whisper and the Speech Recognition Model

whisper.cpp is the C/C++ port of OpenAI Whisper. It handles audio-to-text transcription.

From your project root, download a Whisper model (we'll use base.en):
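If you haven't cloned whisper.cpp yet, the commands below clone the repository and fetch the model using the download script that ships with it (the repository URL is the commonly used upstream; adjust if you use a fork):

```sh
git clone https://github.com/ggerganov/whisper.cpp.git
cd whisper.cpp
sh ./models/download-ggml-model.sh base.en
cd ..
```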

After downloading, confirm the model file is located at whisper.cpp/models/ggml-base.en.bin.

1b. Add the WAV Audio File Handler

dr_wav gives us a WAV loader compatible with Whisper's required PCM float format.
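dr_wav is a single-header library. One way to fetch it, assuming the mackron/dr_libs repository layout, is:

```sh
mkdir -p dr_wav
curl -L -o dr_wav/dr_wav.h https://raw.githubusercontent.com/mackron/dr_libs/master/dr_wav.h
```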

1c. Add Falcon Speaker Diarization

  1. Create a folder named falcon/ at your project root.
  2. Copy the header files from the Falcon GitHub repository into the falcon/ folder.
  3. Copy the Falcon model file (falcon_params.pv) and the correct library file for your platform (.so, .dylib, or .dll) into the same falcon/ folder.

Step 2: Configure the Build System

Create a CMakeLists.txt file that links all components together:
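A minimal sketch, assuming the folder layout shown in Getting Started. whisper.cpp can be added with add_subdirectory, which provides a whisper library target (and carries its own include paths); the Falcon library itself is loaded at runtime, so only the dynamic-loading support library is linked:

```cmake
cmake_minimum_required(VERSION 3.16)
project(whisper_falcon CXX)

set(CMAKE_CXX_STANDARD 17)
set(CMAKE_CXX_STANDARD_REQUIRED ON)

# Build whisper.cpp (provides the `whisper` library target).
add_subdirectory(whisper.cpp)

add_executable(whisper_falcon whisper_falcon.cpp)

# Header locations for dr_wav.h and pv_falcon.h.
target_include_directories(whisper_falcon PRIVATE
    ${CMAKE_SOURCE_DIR}/dr_wav
    ${CMAKE_SOURCE_DIR}/falcon)

target_link_libraries(whisper_falcon PRIVATE whisper)

# Falcon is opened at runtime via dlopen/LoadLibrary, so link only the
# dynamic-loading support library on Unix-like systems.
if(UNIX)
    target_link_libraries(whisper_falcon PRIVATE ${CMAKE_DL_LIBS})
endif()
```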

Step 3: Set Up Your C++ Application

3a. Import Headers

Create the main application file (whisper_falcon.cpp) and import the required headers. Update the placeholder paths as needed:
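A sketch of the top of the file, assuming the folder layout shown in Getting Started. The constants WHISPER_MODEL_PATH, PV_FALCON_MODEL_PATH, and AUDIO_FILE_PATH are the placeholders referenced throughout this tutorial; pv_falcon.h is the header copied in Step 1c:

```cpp
// whisper_falcon.cpp -- single-file demo combining Whisper and Falcon.

#define DR_WAV_IMPLEMENTATION   // compile the dr_wav implementation into this file
#include "dr_wav.h"

#include "whisper.h"
#include "pv_falcon.h"          // Falcon C API types (pv_status_t, pv_falcon_t, pv_segment_t)

#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <vector>

#if defined(_WIN32)
#include <windows.h>            // LoadLibrary / GetProcAddress
#else
#include <dlfcn.h>              // dlopen / dlsym
#endif

// Placeholder paths -- update these to match your setup.
static const char *WHISPER_MODEL_PATH = "whisper.cpp/models/ggml-base.en.bin";
static const char *PV_FALCON_MODEL_PATH = "falcon/falcon_params.pv";
static const char *AUDIO_FILE_PATH = "sample.wav";
static const char *ACCESS_KEY = "YOUR_ACCESS_KEY";  // from Picovoice Console
```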

  • On Windows systems, windows.h provides the LoadLibrary function to load a shared library and GetProcAddress to retrieve individual function pointers.
  • On Unix-based systems, dlopen and dlsym from the dlfcn.h header provide the same functionality.

The platform-specific headers enable dynamic loading of the Falcon library. Unlike static linking where libraries are bundled at compile time, dynamic loading loads shared libraries at runtime.

3b. Define dynamic loading helper functions

Define helper functions to open the shared library, load function symbols, close the library, and print platform-correct errors. We'll use these later when implementing Falcon Speaker Diarization.
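A minimal sketch of the four helpers. Only load_symbol is referenced by name later in the tutorial; the other names (open_dl, close_dl, print_dl_error) are our own:

```cpp
// Open a shared library; returns an opaque handle or nullptr on failure.
static void *open_dl(const char *path) {
#if defined(_WIN32)
    return LoadLibraryA(path);
#else
    return dlopen(path, RTLD_NOW);
#endif
}

// Look up a function by name inside an opened library.
static void *load_symbol(void *handle, const char *name) {
#if defined(_WIN32)
    return reinterpret_cast<void *>(GetProcAddress(static_cast<HMODULE>(handle), name));
#else
    return dlsym(handle, name);
#endif
}

// Release the library handle.
static void close_dl(void *handle) {
#if defined(_WIN32)
    FreeLibrary(static_cast<HMODULE>(handle));
#else
    dlclose(handle);
#endif
}

// Print the platform-specific error for the last dynamic-loading call.
static void print_dl_error(const char *message) {
#if defined(_WIN32)
    fprintf(stderr, "%s (error code %lu).\n", message, GetLastError());
#else
    fprintf(stderr, "%s (%s).\n", message, dlerror());
#endif
}
```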


Part 1: Speech Transcription with Whisper

Step 4: Initialize the Whisper Model
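Load the Whisper model from the file downloaded in Step 1a, using the whisper.cpp C API:

```cpp
// Load the Whisper model from disk.
whisper_context_params cparams = whisper_context_default_params();
whisper_context *ctx = whisper_init_from_file_with_params(WHISPER_MODEL_PATH, cparams);
if (ctx == nullptr) {
    fprintf(stderr, "Failed to load Whisper model from '%s'.\n", WHISPER_MODEL_PATH);
    return 1;
}
```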

Step 5: Load and Prepare Audio Data

Whisper requires audio in a specific format: 32-bit floating-point PCM samples. Use dr_wav to load a WAV file and convert it automatically to the correct format:
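A sketch using dr_wav's one-shot loader. Note that Whisper also expects 16 kHz mono input; this sketch downmixes to mono and warns on other sample rates (resampling is out of scope):

```cpp
// Decode the entire WAV file into interleaved 32-bit float samples.
unsigned int channels = 0;
unsigned int sample_rate = 0;
drwav_uint64 frame_count = 0;
float *samples = drwav_open_file_and_read_pcm_frames_f32(
        AUDIO_FILE_PATH, &channels, &sample_rate, &frame_count, nullptr);
if (samples == nullptr) {
    fprintf(stderr, "Failed to open WAV file '%s'.\n", AUDIO_FILE_PATH);
    return 1;
}
if (sample_rate != WHISPER_SAMPLE_RATE) {
    // Whisper expects 16 kHz input; resampling is out of scope for this sketch.
    fprintf(stderr, "Warning: expected %d Hz audio, got %u Hz.\n",
            WHISPER_SAMPLE_RATE, sample_rate);
}

// Downmix to mono, since Whisper operates on a single channel.
std::vector<float> pcmf32(frame_count);
for (drwav_uint64 i = 0; i < frame_count; i++) {
    float sum = 0.0f;
    for (unsigned int c = 0; c < channels; c++) {
        sum += samples[i * channels + c];
    }
    pcmf32[i] = sum / static_cast<float>(channels);
}
drwav_free(samples, nullptr);
```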

You can use other audio file formats, as long as you convert the audio to pcmf32 for Whisper.

Step 6: Generate the Transcript

Process the audio through Whisper using beam search:
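```cpp
// Use beam-search decoding rather than greedy sampling.
whisper_full_params wparams = whisper_full_default_params(WHISPER_SAMPLING_BEAM_SEARCH);
wparams.print_progress = false;

// Run the full Whisper pipeline (encoder + decoder) over the samples.
if (whisper_full(ctx, wparams, pcmf32.data(), static_cast<int>(pcmf32.size())) != 0) {
    fprintf(stderr, "Whisper failed to process the audio.\n");
    return 1;
}
```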

After processing, Whisper returns time-stamped segments, each containing transcribed text. Iterate through each segment to see the transcribed text:
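```cpp
// Each segment carries text plus start/end timestamps in centiseconds.
const int n_segments = whisper_full_n_segments(ctx);
for (int i = 0; i < n_segments; i++) {
    const double t0 = whisper_full_get_segment_t0(ctx, i) / 100.0;  // seconds
    const double t1 = whisper_full_get_segment_t1(ctx, i) / 100.0;  // seconds
    printf("[%6.2f -> %6.2f] %s\n", t0, t1, whisper_full_get_segment_text(ctx, i));
}
```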

These segments will later be matched with speaker labels.


Part 2: Speaker Diarization with Falcon

With the transcription step complete, let's add speaker labels.

Step 7: Load Dynamic Library

7a. Open the Shared Library

Load the Falcon Speaker Diarization library:
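A sketch using the helpers from Step 3b. The library file name is an assumption; match it to the file you copied in Step 1c:

```cpp
// Adjust for your platform: libpv_falcon.so (Linux), libpv_falcon.dylib (macOS),
// or libpv_falcon.dll (Windows). The exact file name is an assumption.
const char *FALCON_LIBRARY_PATH = "falcon/libpv_falcon.so";

void *falcon_lib = open_dl(FALCON_LIBRARY_PATH);
if (falcon_lib == nullptr) {
    print_dl_error("Failed to open the Falcon library");
    return 1;
}
```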

7b. Load Required Functions

Load the Falcon API functions from the shared library by retrieving their addresses and storing them as function pointers:
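An illustrative sketch of the pattern. The function names come from the Falcon C API, but the parameter lists below are assumptions; take the authoritative signatures from the pv_falcon.h header you copied in Step 1c:

```cpp
// Step 1 of the pattern: typedefs for the function-pointer types.
// NOTE: verify every signature against pv_falcon.h -- the parameter lists
// here (in particular the device argument's position) are assumptions.
typedef pv_status_t (*pv_falcon_init_fn)(
        const char *access_key, const char *model_path, const char *device,
        pv_falcon_t **object);
typedef void (*pv_falcon_delete_fn)(pv_falcon_t *object);
typedef pv_status_t (*pv_falcon_process_file_fn)(
        pv_falcon_t *object, const char *audio_path,
        int32_t *num_segments, pv_segment_t **segments);
typedef void (*pv_falcon_segments_delete_fn)(pv_segment_t *segments);

// Step 2 of the pattern: resolve each symbol by name and cast it.
auto pv_falcon_init_f = reinterpret_cast<pv_falcon_init_fn>(
        load_symbol(falcon_lib, "pv_falcon_init"));
auto pv_falcon_delete_f = reinterpret_cast<pv_falcon_delete_fn>(
        load_symbol(falcon_lib, "pv_falcon_delete"));
auto pv_falcon_process_file_f = reinterpret_cast<pv_falcon_process_file_fn>(
        load_symbol(falcon_lib, "pv_falcon_process_file"));
auto pv_falcon_segments_delete_f = reinterpret_cast<pv_falcon_segments_delete_fn>(
        load_symbol(falcon_lib, "pv_falcon_segments_delete"));

if (!pv_falcon_init_f || !pv_falcon_delete_f ||
    !pv_falcon_process_file_f || !pv_falcon_segments_delete_f) {
    print_dl_error("Failed to load a Falcon symbol");
    return 1;
}
```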

Each function is loaded in two steps:

  1. use typedef to define a function pointer type
  2. use load_symbol to find the function by name in the library and cast it to the correct type

These pointers can then be called like regular functions throughout your code.

Step 8: Initialize Falcon Speaker Diarization

Create a Falcon instance using your AccessKey and model file (falcon_params.pv).

If you haven't already, sign up for a free account on Picovoice Console and copy your AccessKey from the dashboard.

The device parameter lets you choose what hardware the engine runs on.

You can set it to 'best' to automatically pick the most suitable option, or specify a device yourself. For example, 'gpu' uses the first available GPU, while 'gpu:0' or 'gpu:1' targets a specific GPU. If you want to run on the CPU, use 'cpu', or control the number of CPU threads with something like 'cpu:4'.
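Putting it together, a sketch of the initialization call (as noted in Step 7b, confirm the argument order against your pv_falcon.h):

```cpp
pv_falcon_t *falcon = nullptr;
// "best" lets the engine pick the most suitable device automatically.
pv_status_t status = pv_falcon_init_f(ACCESS_KEY, PV_FALCON_MODEL_PATH, "best", &falcon);
if (status != PV_STATUS_SUCCESS) {
    fprintf(stderr, "Failed to init Falcon\n");
    return 1;
}
```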

See available devices (optional)

To see the list of available hardware devices, use pv_falcon_list_hardware_devices:
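A hypothetical sketch: we assume the function fills an array of device-name strings plus a count, mirroring other Picovoice engines. Check pv_falcon.h for the real signature and for the matching function that frees the list:

```cpp
// Assumed signature -- verify against pv_falcon.h before use.
typedef pv_status_t (*pv_falcon_list_hw_fn)(char ***devices, int32_t *num_devices);
auto list_hw_f = reinterpret_cast<pv_falcon_list_hw_fn>(
        load_symbol(falcon_lib, "pv_falcon_list_hardware_devices"));

char **devices = nullptr;
int32_t num_devices = 0;
if (list_hw_f && list_hw_f(&devices, &num_devices) == PV_STATUS_SUCCESS) {
    for (int32_t i = 0; i < num_devices; i++) {
        printf("Available device: %s\n", devices[i]);
    }
}
```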

Step 9: Label Speakers in the Audio

Process the audio file through Falcon. Falcon returns time-stamped segments, each labeled with a unique speaker ID.

Each Falcon segment contains:

  • start_sec: When the speaker starts talking
  • end_sec: When the speaker stops talking
  • speaker_tag: A unique identifier for this speaker (0, 1, 2, etc.)
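A sketch of the call and of iterating over the returned segments (the field names are those listed above):

```cpp
int32_t num_segments = 0;
pv_segment_t *segments = nullptr;
status = pv_falcon_process_file_f(falcon, AUDIO_FILE_PATH, &num_segments, &segments);
if (status != PV_STATUS_SUCCESS) {
    fprintf(stderr, "Falcon failed to process the audio file.\n");
    return 1;
}

for (int32_t i = 0; i < num_segments; i++) {
    printf("Speaker %d: %.2f -> %.2f\n",
           segments[i].speaker_tag, segments[i].start_sec, segments[i].end_sec);
}
```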

Part 3: Merge Transcript with Speaker Labels

Step 10: Match Speakers to Transcript Segments

Match each transcript segment with the corresponding speaker by comparing timestamps. When a Whisper segment overlaps with a Falcon segment, assign that speaker tag to the transcribed text.
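A sketch that picks, for each Whisper segment, the Falcon speaker whose segment overlaps it the most in time:

```cpp
for (int i = 0; i < n_segments; i++) {
    const double t0 = whisper_full_get_segment_t0(ctx, i) / 100.0;
    const double t1 = whisper_full_get_segment_t1(ctx, i) / 100.0;

    // Find the Falcon segment with the largest temporal overlap.
    int32_t best_tag = -1;
    double best_overlap = 0.0;
    for (int32_t j = 0; j < num_segments; j++) {
        const double overlap =
                std::min(t1, static_cast<double>(segments[j].end_sec)) -
                std::max(t0, static_cast<double>(segments[j].start_sec));
        if (overlap > best_overlap) {
            best_overlap = overlap;
            best_tag = segments[j].speaker_tag;
        }
    }

    printf("Speaker %d: %s\n", best_tag, whisper_full_get_segment_text(ctx, i));
}
```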

Cleanup Resources

When done, clean up resources to free memory:
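```cpp
// Free Falcon's segment array and engine, unload the library, free Whisper.
pv_falcon_segments_delete_f(segments);
pv_falcon_delete_f(falcon);
close_dl(falcon_lib);
whisper_free(ctx);
```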


Complete Code: Audio Transcription with Speaker Labels in C++

Here's the complete whisper_falcon.cpp file bringing all the steps together:

whisper_falcon.cpp

Build & Run the Application

Before building, verify:

  • The Whisper model exists at whisper.cpp/models/ggml-base.en.bin
  • falcon_params.pv and the Falcon library file are in the falcon/ directory
  • The paths and AccessKey placeholders in whisper_falcon.cpp match your setup

Build

In the project root containing CMakeLists.txt:
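A typical out-of-source CMake build:

```sh
cmake -B build
cmake --build build --config Release
```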

Run
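The binary location depends on your CMake generator; with the single-config Makefile or Ninja generators it lands in build/:

```sh
./build/whisper_falcon
```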

Expected Output Example
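The exact text depends on your recording. With the printf format used in Step 10, the output might look like this (illustrative only):

```
Speaker 0: Hi, thanks for joining the call today.
Speaker 1: Happy to be here, let's get started.
```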


Troubleshooting Common Issues

whisper_init_from_file_with_params_no_state: failed to open

  • Run the Whisper model download script: sh ./models/download-ggml-model.sh base.en
  • Verify the model exists at: whisper.cpp/models/ggml-base.en.bin
  • Update WHISPER_MODEL_PATH to match your actual model location

Failed to init Falcon

  • Confirm falcon_params.pv is in the falcon/ directory
  • Check that PV_FALCON_MODEL_PATH points to the correct location
  • Download the model file from: Falcon model on GitHub

Failed to open WAV file

  • Verify the WAV file exists at the path specified in AUDIO_FILE_PATH
  • Ensure the audio file is a valid WAV format (Falcon supports other formats, but this example expects WAV)

Application runs but produces no output

  • Ensure your audio file contains speech (test with a known recording)
  • Check that the audio is not corrupted by playing it in a media player

Frequently Asked Questions

What audio formats can I use with whisper.cpp and Falcon Speaker Diarization?

whisper.cpp requires audio in PCM float format, which this tutorial handles through dr_wav for WAV files. Falcon itself supports many formats including 3gp (AMR), FLAC, MP3, MP4/m4a (AAC), Ogg, WAV, and WebM. To use other formats, you'd need to add an appropriate audio decoder (like FFmpeg) to convert audio to PCM float for Whisper.

What's the maximum number of speakers Falcon Speaker Diarization can handle?

Falcon can diarize unlimited speakers in a conversation. It automatically assigns unique speaker tags (0, 1, 2, etc.) as it detects different voices.

Can I use my Whisper CPP Speaker Diarization app for all languages?

Yes. Use the multilingual Whisper models (without .en suffix) like base, small, or large. Falcon performs speaker diarization independent of language—it analyzes voice characteristics, not speech content.

Can I add Speaker Diarization to Whisper using Python?

Yes. Picovoice provides Python bindings for Falcon, and OpenAI Whisper has an official Python package. Check out our guide Adding Speaker Diarization to OpenAI Whisper in Python for guidance.