OpenAI Whisper delivers highly accurate speech-to-text transcription, but it does not track speaker changes. Applications that rely on Whisper cannot determine who is speaking in a conversation. To add speaker labels such as "Speaker 1" and "Speaker 2," you can integrate Falcon Speaker Diarization with whisper.cpp, creating a fully local, offline, multi-speaker transcription system.

This tutorial explains how to add speaker segmentation to Whisper so you can determine "who spoke when" and produce time-stamped, speaker-labeled Whisper transcripts for multi-speaker recordings. This approach is ideal for use cases such as podcast transcription, meeting transcription and summarization, or call center analytics, where speaker identification is essential for readability and downstream analysis.

Add Speaker Diarization to OpenAI Whisper CPP in 3 Steps:

  1. Transcribe audio using whisper.cpp, producing text with timestamps.
  2. Run Falcon Speaker Diarization on the same audio file to generate speaker-labeled segments.
  3. Merge the results into a single transcript with accurate speaker labels.

We'll use modern C++ with whisper.cpp for the transcription pipeline, dr_wav for audio file handling, and the Falcon C API for speaker diarization. The implementation works cross-platform on Linux, macOS, Windows, and Raspberry Pi, requires no cloud services, processes audio files locally for privacy, and runs efficiently on standard hardware.

An alternative solution to Whisper Speech-to-Text is Leopard Speech-to-Text, a highly accurate and efficient transcription engine that provides built-in speaker diarization through its word metadata feature.

Equivalent Implementation in Python: To see the Python version of this implementation, check out Whisper Falcon Integration in Python.

Prerequisites

  • C++ compiler (for Whisper)
  • C99-compatible compiler (for Falcon)
  • CMake 3.16+ (for the build in Step 2)
  • On Windows: MinGW
  • A Picovoice AccessKey (Get your free AccessKey)

Supported Platforms

  • Linux: x86_64
  • macOS: x86_64 and arm64
  • Windows: x86_64 and arm64
  • Raspberry Pi: Models 3, 4, and 5

Getting Started

This is the folder structure used in this tutorial. You can organize your files differently if you like, but make sure to update the paths in the examples accordingly:
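One possible layout, reconstructed from the paths used in this tutorial (names such as dr_wav/ and sample.wav are our own placeholders):

```
project-root/
├── CMakeLists.txt
├── whisper_falcon.cpp
├── sample.wav
├── dr_wav/
│   └── dr_wav.h
├── falcon/
│   ├── pv_falcon.h            (plus any other Falcon headers)
│   ├── falcon_params.pv
│   └── libpv_falcon.so        (.dylib on macOS, .dll on Windows)
└── whisper.cpp/               (cloned repository)
    └── models/
        └── ggml-base.en.bin
```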

Step 1: Add Dependencies

1a. Download Whisper and the Speech Recognition Model

whisper.cpp is the C/C++ port of OpenAI Whisper. It handles audio-to-text transcription.

From your project root, download a Whisper model (we'll use base.en):
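If you haven't cloned whisper.cpp yet, the commands below clone the repository and fetch the model using the download script that ships with it (the repository URL is the commonly used upstream; adjust if you use a fork):

```sh
git clone https://github.com/ggerganov/whisper.cpp.git
cd whisper.cpp
sh ./models/download-ggml-model.sh base.en
cd ..
```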

After downloading, confirm the model file is located at whisper.cpp/models/ggml-base.en.bin.

1b. Add the WAV Audio File Handler

dr_wav gives us a WAV loader compatible with Whisper's required PCM float format.
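dr_wav is a single-header library. One way to fetch it, assuming the mackron/dr_libs repository layout, is:

```sh
mkdir -p dr_wav
curl -L -o dr_wav/dr_wav.h https://raw.githubusercontent.com/mackron/dr_libs/master/dr_wav.h
```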

1c. Add Falcon Speaker Diarization

  1. Create a folder named falcon/ at your project root.
  2. Copy the header files from the Falcon GitHub repository into the falcon/ folder.
  3. Copy the Falcon model file (falcon_params.pv) and the correct library file for your platform (.so, .dylib, or .dll) into the same falcon/ folder.

Step 2: Configure the Build System

Create a CMakeLists.txt file that links all components together:
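A minimal sketch, assuming the folder layout shown in Getting Started. whisper.cpp can be added with add_subdirectory, which provides a whisper library target (and carries its own include paths); the Falcon library itself is loaded at runtime, so only the dynamic-loading support library is linked:

```cmake
cmake_minimum_required(VERSION 3.16)
project(whisper_falcon CXX)

set(CMAKE_CXX_STANDARD 17)
set(CMAKE_CXX_STANDARD_REQUIRED ON)

# Build whisper.cpp (provides the `whisper` library target).
add_subdirectory(whisper.cpp)

add_executable(whisper_falcon whisper_falcon.cpp)

# Header locations for dr_wav.h and pv_falcon.h.
target_include_directories(whisper_falcon PRIVATE
    ${CMAKE_SOURCE_DIR}/dr_wav
    ${CMAKE_SOURCE_DIR}/falcon)

target_link_libraries(whisper_falcon PRIVATE whisper)

# Falcon is opened at runtime via dlopen/LoadLibrary, so link only the
# dynamic-loading support library on Unix-like systems.
if(UNIX)
    target_link_libraries(whisper_falcon PRIVATE ${CMAKE_DL_LIBS})
endif()
```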

Step 3: Set Up Your C++ Application

3a. Import Headers

Create the main application file (whisper_falcon.cpp) and import the required headers. Update the placeholder paths as needed:
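A sketch of the top of the file, assuming the folder layout shown in Getting Started. The constants WHISPER_MODEL_PATH, PV_FALCON_MODEL_PATH, and AUDIO_FILE_PATH are the placeholders referenced throughout this tutorial; pv_falcon.h is the header copied in Step 1c:

```cpp
// whisper_falcon.cpp -- single-file demo combining Whisper and Falcon.

#define DR_WAV_IMPLEMENTATION   // compile the dr_wav implementation into this file
#include "dr_wav.h"

#include "whisper.h"
#include "pv_falcon.h"          // Falcon C API types (pv_status_t, pv_falcon_t, pv_segment_t)

#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <vector>

#if defined(_WIN32)
#include <windows.h>            // LoadLibrary / GetProcAddress
#else
#include <dlfcn.h>              // dlopen / dlsym
#endif

// Placeholder paths -- update these to match your setup.
static const char *WHISPER_MODEL_PATH = "whisper.cpp/models/ggml-base.en.bin";
static const char *PV_FALCON_MODEL_PATH = "falcon/falcon_params.pv";
static const char *AUDIO_FILE_PATH = "sample.wav";
static const char *ACCESS_KEY = "YOUR_ACCESS_KEY";  // from Picovoice Console
```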

  • On Windows systems, windows.h provides the LoadLibrary function to load a shared library and GetProcAddress to retrieve individual function pointers.
  • On Unix-based systems, dlopen and dlsym from the dlfcn.h header provide the same functionality.

The platform-specific headers enable dynamic loading of the Falcon library. Unlike static linking where libraries are bundled at compile time, dynamic loading loads shared libraries at runtime.

3b. Define dynamic loading helper functions

Define helper functions to open the shared library, load function symbols, close the library, and print platform-correct errors. We'll use these later when implementing Falcon Speaker Diarization.
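A minimal sketch of the four helpers. Only load_symbol is referenced by name later in the tutorial; the other names (open_dl, close_dl, print_dl_error) are our own:

```cpp
// Open a shared library; returns an opaque handle or nullptr on failure.
static void *open_dl(const char *path) {
#if defined(_WIN32)
    return LoadLibraryA(path);
#else
    return dlopen(path, RTLD_NOW);
#endif
}

// Look up a function by name inside an opened library.
static void *load_symbol(void *handle, const char *name) {
#if defined(_WIN32)
    return reinterpret_cast<void *>(GetProcAddress(static_cast<HMODULE>(handle), name));
#else
    return dlsym(handle, name);
#endif
}

// Release the library handle.
static void close_dl(void *handle) {
#if defined(_WIN32)
    FreeLibrary(static_cast<HMODULE>(handle));
#else
    dlclose(handle);
#endif
}

// Print the platform-specific error for the last dynamic-loading call.
static void print_dl_error(const char *message) {
#if defined(_WIN32)
    fprintf(stderr, "%s (error code %lu).\n", message, GetLastError());
#else
    fprintf(stderr, "%s (%s).\n", message, dlerror());
#endif
}
```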


Part 1: Speech Transcription with Whisper

Step 4: Initialize the Whisper Model
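Load the Whisper model from the file downloaded in Step 1a, using the whisper.cpp C API:

```cpp
// Load the Whisper model from disk.
whisper_context_params cparams = whisper_context_default_params();
whisper_context *ctx = whisper_init_from_file_with_params(WHISPER_MODEL_PATH, cparams);
if (ctx == nullptr) {
    fprintf(stderr, "Failed to load Whisper model from '%s'.\n", WHISPER_MODEL_PATH);
    return 1;
}
```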

Step 5: Load and Prepare Audio Data

Whisper requires audio in a specific format: 32-bit floating-point PCM samples. Use dr_wav to load a WAV file and convert it automatically to the correct format:
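A sketch using dr_wav's one-shot loader. Note that Whisper also expects 16 kHz mono input; this sketch downmixes to mono and warns on other sample rates (resampling is out of scope):

```cpp
// Decode the entire WAV file into interleaved 32-bit float samples.
unsigned int channels = 0;
unsigned int sample_rate = 0;
drwav_uint64 frame_count = 0;
float *samples = drwav_open_file_and_read_pcm_frames_f32(
        AUDIO_FILE_PATH, &channels, &sample_rate, &frame_count, nullptr);
if (samples == nullptr) {
    fprintf(stderr, "Failed to open WAV file '%s'.\n", AUDIO_FILE_PATH);
    return 1;
}
if (sample_rate != WHISPER_SAMPLE_RATE) {
    // Whisper expects 16 kHz input; resampling is out of scope for this sketch.
    fprintf(stderr, "Warning: expected %d Hz audio, got %u Hz.\n",
            WHISPER_SAMPLE_RATE, sample_rate);
}

// Downmix to mono, since Whisper operates on a single channel.
std::vector<float> pcmf32(frame_count);
for (drwav_uint64 i = 0; i < frame_count; i++) {
    float sum = 0.0f;
    for (unsigned int c = 0; c < channels; c++) {
        sum += samples[i * channels + c];
    }
    pcmf32[i] = sum / static_cast<float>(channels);
}
drwav_free(samples, nullptr);
```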

You can use other audio file formats, as long as you convert the audio to pcmf32 for Whisper.

Step 6: Generate the Transcript

Process the audio through Whisper using beam search:
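```cpp
// Use beam-search decoding rather than greedy sampling.
whisper_full_params wparams = whisper_full_default_params(WHISPER_SAMPLING_BEAM_SEARCH);
wparams.print_progress = false;

// Run the full Whisper pipeline (encoder + decoder) over the samples.
if (whisper_full(ctx, wparams, pcmf32.data(), static_cast<int>(pcmf32.size())) != 0) {
    fprintf(stderr, "Whisper failed to process the audio.\n");
    return 1;
}
```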

After processing, Whisper returns time-stamped segments, each containing transcribed text. Iterate through each segment to see the transcribed text:
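```cpp
// Each segment carries text plus start/end timestamps in centiseconds.
const int n_segments = whisper_full_n_segments(ctx);
for (int i = 0; i < n_segments; i++) {
    const double t0 = whisper_full_get_segment_t0(ctx, i) / 100.0;  // seconds
    const double t1 = whisper_full_get_segment_t1(ctx, i) / 100.0;  // seconds
    printf("[%6.2f -> %6.2f] %s\n", t0, t1, whisper_full_get_segment_text(ctx, i));
}
```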

These segments will later be matched with speaker labels.


Part 2: Speaker Diarization with Falcon

With the transcription step complete, let's add speaker labels.

Step 7: Load Dynamic Library

7a. Open the Shared Library

Load the Falcon Speaker Diarization library:
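A sketch using the helpers from Step 3b. The library file name is an assumption; match it to the file you copied in Step 1c:

```cpp
// Adjust for your platform: libpv_falcon.so (Linux), libpv_falcon.dylib (macOS),
// or libpv_falcon.dll (Windows). The exact file name is an assumption.
const char *FALCON_LIBRARY_PATH = "falcon/libpv_falcon.so";

void *falcon_lib = open_dl(FALCON_LIBRARY_PATH);
if (falcon_lib == nullptr) {
    print_dl_error("Failed to open the Falcon library");
    return 1;
}
```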

7b. Load Required Functions

Load the Falcon API functions from the shared library by retrieving their addresses and storing them as function pointers:
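An illustrative sketch of the pattern. The function names come from the Falcon C API, but the parameter lists below are assumptions; take the authoritative signatures from the pv_falcon.h header you copied in Step 1c:

```cpp
// Step 1 of the pattern: typedefs for the function-pointer types.
// NOTE: verify every signature against pv_falcon.h -- the parameter lists
// here (in particular the device argument's position) are assumptions.
typedef pv_status_t (*pv_falcon_init_fn)(
        const char *access_key, const char *model_path, const char *device,
        pv_falcon_t **object);
typedef void (*pv_falcon_delete_fn)(pv_falcon_t *object);
typedef pv_status_t (*pv_falcon_process_file_fn)(
        pv_falcon_t *object, const char *audio_path,
        int32_t *num_segments, pv_segment_t **segments);
typedef void (*pv_falcon_segments_delete_fn)(pv_segment_t *segments);

// Step 2 of the pattern: resolve each symbol by name and cast it.
auto pv_falcon_init_f = reinterpret_cast<pv_falcon_init_fn>(
        load_symbol(falcon_lib, "pv_falcon_init"));
auto pv_falcon_delete_f = reinterpret_cast<pv_falcon_delete_fn>(
        load_symbol(falcon_lib, "pv_falcon_delete"));
auto pv_falcon_process_file_f = reinterpret_cast<pv_falcon_process_file_fn>(
        load_symbol(falcon_lib, "pv_falcon_process_file"));
auto pv_falcon_segments_delete_f = reinterpret_cast<pv_falcon_segments_delete_fn>(
        load_symbol(falcon_lib, "pv_falcon_segments_delete"));

if (!pv_falcon_init_f || !pv_falcon_delete_f ||
    !pv_falcon_process_file_f || !pv_falcon_segments_delete_f) {
    print_dl_error("Failed to load a Falcon symbol");
    return 1;
}
```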

Each function is loaded in two steps:

  1. use typedef to define a function pointer type
  2. use load_symbol to find the function by name in the library and cast it to the correct type

These pointers can then be called like regular functions throughout your code.

Step 8: Initialize Falcon Speaker Diarization

Create a Falcon instance using your AccessKey and model file (falcon_params.pv).

If you haven't already, sign up for a free account on Picovoice Console and copy your AccessKey from the dashboard.

The device parameter lets you choose what hardware the engine runs on.

You can set it to 'best' to automatically pick the most suitable option, or specify a device yourself. For example, 'gpu' uses the first available GPU, while 'gpu:0' or 'gpu:1' targets a specific GPU. If you want to run on the CPU, use 'cpu', or control the number of CPU threads with something like 'cpu:4'.
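Putting it together, a sketch of the initialization call (as noted in Step 7b, confirm the argument order against your pv_falcon.h):

```cpp
pv_falcon_t *falcon = nullptr;
// "best" lets the engine pick the most suitable device automatically.
pv_status_t status = pv_falcon_init_f(ACCESS_KEY, PV_FALCON_MODEL_PATH, "best", &falcon);
if (status != PV_STATUS_SUCCESS) {
    fprintf(stderr, "Failed to init Falcon\n");
    return 1;
}
```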

See available devices (optional)

To see the list of available hardware devices, use pv_falcon_list_hardware_devices:
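A hypothetical sketch: we assume the function fills an array of device-name strings plus a count, mirroring other Picovoice engines. Check pv_falcon.h for the real signature and for the matching function that frees the list:

```cpp
// Assumed signature -- verify against pv_falcon.h before use.
typedef pv_status_t (*pv_falcon_list_hw_fn)(char ***devices, int32_t *num_devices);
auto list_hw_f = reinterpret_cast<pv_falcon_list_hw_fn>(
        load_symbol(falcon_lib, "pv_falcon_list_hardware_devices"));

char **devices = nullptr;
int32_t num_devices = 0;
if (list_hw_f && list_hw_f(&devices, &num_devices) == PV_STATUS_SUCCESS) {
    for (int32_t i = 0; i < num_devices; i++) {
        printf("Available device: %s\n", devices[i]);
    }
}
```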

Step 9: Label Speakers in the Audio

Process the audio file through Falcon. Falcon returns time-stamped segments, each labeled with a unique speaker ID.

Each Falcon segment contains:

  • start_sec: When the speaker starts talking
  • end_sec: When the speaker stops talking
  • speaker_tag: A unique identifier for this speaker (0, 1, 2, etc.)
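A sketch of the call and of iterating over the returned segments (the field names are those listed above):

```cpp
int32_t num_segments = 0;
pv_segment_t *segments = nullptr;
status = pv_falcon_process_file_f(falcon, AUDIO_FILE_PATH, &num_segments, &segments);
if (status != PV_STATUS_SUCCESS) {
    fprintf(stderr, "Falcon failed to process the audio file.\n");
    return 1;
}

for (int32_t i = 0; i < num_segments; i++) {
    printf("Speaker %d: %.2f -> %.2f\n",
           segments[i].speaker_tag, segments[i].start_sec, segments[i].end_sec);
}
```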

Part 3: Merge Transcript with Speaker Labels

Step 10: Match Speakers to Transcript Segments

Match each transcript segment with the corresponding speaker by comparing timestamps. When a Whisper segment overlaps with a Falcon segment, assign that speaker tag to the transcribed text.
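A sketch that picks, for each Whisper segment, the Falcon speaker whose segment overlaps it the most in time:

```cpp
for (int i = 0; i < n_segments; i++) {
    const double t0 = whisper_full_get_segment_t0(ctx, i) / 100.0;
    const double t1 = whisper_full_get_segment_t1(ctx, i) / 100.0;

    // Find the Falcon segment with the largest temporal overlap.
    int32_t best_tag = -1;
    double best_overlap = 0.0;
    for (int32_t j = 0; j < num_segments; j++) {
        const double overlap =
                std::min(t1, static_cast<double>(segments[j].end_sec)) -
                std::max(t0, static_cast<double>(segments[j].start_sec));
        if (overlap > best_overlap) {
            best_overlap = overlap;
            best_tag = segments[j].speaker_tag;
        }
    }

    printf("Speaker %d: %s\n", best_tag, whisper_full_get_segment_text(ctx, i));
}
```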

Cleanup Resources

When done, clean up resources to free memory:
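```cpp
// Free Falcon's segment array and engine, unload the library, free Whisper.
pv_falcon_segments_delete_f(segments);
pv_falcon_delete_f(falcon);
close_dl(falcon_lib);
whisper_free(ctx);
```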


Complete Code: Audio Transcription with Speaker Labels in C++

Here's the complete whisper_falcon.cpp file bringing all the steps together:

whisper_falcon.cpp

Build & Run the Application

Before building, verify:

  • The Whisper model exists at whisper.cpp/models/ggml-base.en.bin
  • falcon_params.pv and the Falcon library file are in the falcon/ directory
  • The paths and AccessKey placeholders in whisper_falcon.cpp match your setup

Build

In the project root containing CMakeLists.txt:
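A typical out-of-source CMake build:

```sh
cmake -B build
cmake --build build --config Release
```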

Run
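The binary location depends on your CMake generator; with the single-config Makefile or Ninja generators it lands in build/:

```sh
./build/whisper_falcon
```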

Expected Output Example
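The exact text depends on your recording. With the printf format used in Step 10, the output might look like this (illustrative only):

```
Speaker 0: Hi, thanks for joining the call today.
Speaker 1: Happy to be here, let's get started.
```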


Troubleshooting Common Issues

whisper_init_from_file_with_params_no_state: failed to open

  • Run the Whisper model download script: sh ./models/download-ggml-model.sh base.en
  • Verify the model exists at: whisper.cpp/models/ggml-base.en.bin
  • Update WHISPER_MODEL_PATH to match your actual model location

Failed to init Falcon

  • Confirm falcon_params.pv is in the falcon/ directory
  • Check that PV_FALCON_MODEL_PATH points to the correct location
  • Download the model file from: Falcon model on GitHub

Failed to open WAV file

  • Verify the WAV file exists at the path specified in AUDIO_FILE_PATH
  • Ensure the audio file is a valid WAV format (Falcon supports other formats, but this example expects WAV)

Application runs but produces no output

  • Ensure your audio file contains speech (test with a known recording)
  • Check that the audio is not corrupted by playing it in a media player

Frequently Asked Questions

What audio formats can I use with whisper.cpp and Falcon Speaker Diarization?

whisper.cpp requires audio in PCM float format, which this tutorial handles through dr_wav for WAV files. Falcon itself supports many formats including 3gp (AMR), FLAC, MP3, MP4/m4a (AAC), Ogg, WAV, and WebM. To use other formats, you'd need to add an appropriate audio decoder (like FFmpeg) to convert audio to PCM float for Whisper.

What's the maximum number of speakers Falcon Speaker Diarization can handle?

Falcon can diarize unlimited speakers in a conversation. It automatically assigns unique speaker tags (0, 1, 2, etc.) as it detects different voices.

Can I use my Whisper CPP Speaker Diarization app for all languages?

Yes. Use the multilingual Whisper models (without .en suffix) like base, small, or large. Falcon performs speaker diarization independent of language—it analyzes voice characteristics, not speech content.

Can I add Speaker Diarization to Whisper using Python?

Yes. Picovoice provides Python bindings for Falcon, and OpenAI Whisper has an official Python package. Check out our guide Adding Speaker Diarization to OpenAI Whisper in Python for guidance.