Speaker Diarization in Python

Speaker diarization is the process of dividing an audio stream into distinct segments based on speaker identity. In simpler terms, it answers the question, "Who spoke when?"
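To make the "who spoke when" idea concrete, a diarization result can be thought of as a list of time-stamped segments, each tagged with a speaker label. The sketch below is only an illustration: the `Segment` tuple, the `talk_time_per_speaker` helper, and the timings are made up for this example and do not come from any of the libraries covered in this article.

```python
from collections import defaultdict
from typing import NamedTuple


class Segment(NamedTuple):
    """Hypothetical container for one diarized segment (illustration only)."""
    start_sec: float
    end_sec: float
    speaker: str


def talk_time_per_speaker(segments):
    """Sum the duration of all segments for each speaker label."""
    totals = defaultdict(float)
    for seg in segments:
        totals[seg.speaker] += seg.end_sec - seg.start_sec
    return dict(totals)


# Made-up example: two speakers taking turns.
example = [
    Segment(0.0, 4.0, "A"),
    Segment(4.0, 6.5, "B"),
    Segment(6.5, 10.0, "A"),
]

print(talk_time_per_speaker(example))
```

Each framework below produces this kind of information in its own format; only the container differs.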

Previously, we introduced some of the Top Speaker Diarization APIs and SDKs currently available on the market. In this article, we walk through practical demonstrations of four Python-based speaker diarization frameworks, showing how each one handles a straightforward speaker diarization task.

pyannote.audio

Getting started with pyannote.audio for speaker diarization is straightforward. Follow these steps:

Install the pyannote.audio package using pip:

pip3 install pyannote.audio

Obtain an authentication token for downloading the pretrained models by visiting the models' Hugging Face pages.
Use the following Python code to perform speaker diarization on an audio file:

from pyannote.audio import Pipeline

# Replace "${ACCESS_TOKEN_GOES_HERE}" with your authentication token
pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization",
    use_auth_token="${ACCESS_TOKEN_GOES_HERE}")

# Replace "${AUDIO_FILE_PATH}" with the path to your audio file
diarization = pipeline("${AUDIO_FILE_PATH}")

for segment, _, speaker in diarization.itertracks(yield_label=True):
    print(f'Speaker "{speaker}" - "{segment}"')

This code will perform speaker diarization and print out the identified speakers along with their corresponding segments in the audio file.
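Diarization output often contains several short consecutive segments from the same speaker. One possible post-processing step, sketched below, merges them into longer turns. The `merge_turns` helper and its `max_gap_sec` parameter are hypothetical and not part of pyannote.audio; the function works on plain `(start_sec, end_sec, speaker)` tuples, and the sample turns are made up.

```python
def merge_turns(turns, max_gap_sec=0.5):
    """Merge consecutive turns by the same speaker separated by a short gap.

    `turns` is an iterable of (start_sec, end_sec, speaker) tuples sorted by
    start time. `max_gap_sec` is an illustrative default, not a pyannote setting.
    """
    merged = []
    for start, end, speaker in turns:
        same_speaker = bool(merged) and merged[-1][2] == speaker
        if same_speaker and start - merged[-1][1] <= max_gap_sec:
            prev_start, prev_end, _ = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end), speaker)
        else:
            merged.append((start, end, speaker))
    return merged


# Made-up turns standing in for real pipeline output.
turns = [
    (0.0, 2.0, "SPEAKER_00"),
    (2.2, 5.0, "SPEAKER_00"),
    (5.5, 8.0, "SPEAKER_01"),
    (9.0, 10.0, "SPEAKER_00"),
]

for start, end, speaker in merge_turns(turns):
    print(f"{speaker}: {start:.1f}s - {end:.1f}s")
```

With pyannote.audio, the input tuples could be built inside the `itertracks` loop as `(segment.start, segment.end, speaker)`.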

NVIDIA NeMo

To perform speaker diarization using NVIDIA NeMo, follow these steps:

Install dependencies:

apt-get update && apt-get install -y libsndfile1 ffmpeg
pip3 install Cython

Install NeMo:

pip install git+https://github.com/NVIDIA/NeMo.git@${NEMO_BRANCH}#egg=nemo_toolkit[all]

Replace ${NEMO_BRANCH} with the NeMo release branch or tag you want to install.

Download the config file for the inference from the NeMo GitHub repository.
Generate and store the manifest file by running the following code:

import json
import os

from nemo.collections.asr.models import ClusteringDiarizer
from omegaconf import OmegaConf

INPUT_FILE = '/PATH/TO/AUDIO_FILE.wav'
MANIFEST_FILE = '/PATH/TO/MANIFEST_FILE.json'

meta = {
    'audio_filepath': INPUT_FILE,
    'offset': 0,
    'duration': None,
    'label': 'infer',
    'text': '-',
    'num_speakers': None,
    'rttm_filepath': None,
    'uem_filepath': None
}
with open(MANIFEST_FILE, 'w') as fp:
    json.dump(meta, fp)
    fp.write('\n')

Replace /PATH/TO/AUDIO_FILE.wav with the path to your audio file and /PATH/TO/MANIFEST_FILE.json with the desired path for your manifest file.
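A manifest is a JSON Lines file, so the same approach should extend to several recordings by writing one JSON object per line. The sketch below assumes that layout and reuses the fields shown above; the `write_manifest` helper and the two file names are hypothetical.

```python
import json
import os
import tempfile


def write_manifest(audio_paths, manifest_path):
    """Write one JSON object per line, one line per audio file."""
    with open(manifest_path, "w") as fp:
        for audio_path in audio_paths:
            entry = {
                "audio_filepath": audio_path,
                "offset": 0,
                "duration": None,
                "label": "infer",
                "text": "-",
                "num_speakers": None,
                "rttm_filepath": None,
                "uem_filepath": None,
            }
            fp.write(json.dumps(entry) + "\n")


# Hypothetical file names, used only to show the resulting layout.
manifest_path = os.path.join(tempfile.mkdtemp(), "manifest.json")
write_manifest(["/PATH/TO/FIRST.wav", "/PATH/TO/SECOND.wav"], manifest_path)

# Read the manifest back to confirm there is one entry per audio file.
with open(manifest_path) as fp:
    entries = [json.loads(line) for line in fp]
print(len(entries), entries[0]["audio_filepath"])
```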

Load the config file and define a ClusteringDiarizer object:

OUTPUT_DIR = '/PATH/TO/OUTPUT_DIR'
MODEL_CONFIG = '/PATH/TO/CONFIG_FILE.yaml'

config = OmegaConf.load(MODEL_CONFIG)
config.diarizer.manifest_filepath = MANIFEST_FILE
config.diarizer.out_dir = OUTPUT_DIR
config.diarizer.oracle_vad = False
config.diarizer.clustering.parameters.oracle_num_speakers = False

sd_model = ClusteringDiarizer(cfg=config)

Replace /PATH/TO/OUTPUT_DIR and /PATH/TO/CONFIG_FILE.yaml with the desired paths for your output directory and config file, respectively.

Perform speaker diarization on the audio file:

sd_model.diarize()

The output of the speaker diarization will be stored in the OUTPUT_DIR directory as a Rich Transcription Time Marked (RTTM) file.
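RTTM is a plain-text format with one space-separated record per speaker turn, so it can be read back without NeMo. The following is a minimal sketch of a parser, assuming standard `SPEAKER` records in which the fourth field is the turn onset, the fifth is the duration, and the eighth is the speaker name. The `parse_rttm` helper and the sample content are made up for illustration.

```python
def parse_rttm(text):
    """Parse RTTM text into (start_sec, end_sec, speaker) tuples.

    Only SPEAKER records are read; the fields used are the
    turn onset (4th), duration (5th), and speaker name (8th).
    """
    segments = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 8 or fields[0] != "SPEAKER":
            continue
        start = float(fields[3])
        duration = float(fields[4])
        segments.append((start, start + duration, fields[7]))
    return segments


# Made-up RTTM content for illustration.
sample_rttm = """\
SPEAKER meeting 1 0.50 3.25 <NA> <NA> speaker_0 <NA> <NA>
SPEAKER meeting 1 4.00 2.00 <NA> <NA> speaker_1 <NA> <NA>
"""

for start, end, speaker in parse_rttm(sample_rttm):
    print(f"{speaker}: {start:.2f}s - {end:.2f}s")
```

To use it on real output, read the generated RTTM file from OUTPUT_DIR and pass its contents to `parse_rttm`.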

Simple Diarizer

Simple Diarizer is a speaker diarization library that utilizes pretrained models from SpeechBrain. To get started with simple_diarizer, follow these steps:

Install the package using pip:

pip install simple_diarizer

Define a Diarizer object:

from simple_diarizer.diarizer import Diarizer

diarization = Diarizer(embed_model='xvec', cluster_method='sc')

Perform speaker diarization on an audio file by either passing the number of speakers:

# Replace "${AUDIO_FILE_PATH}" with the path to your audio file
# and NUM_SPEAKERS with the known number of speakers
segments = diarization.diarize("${AUDIO_FILE_PATH}", num_speakers=NUM_SPEAKERS)

Or by passing a threshold value:

# Replace THRESHOLD with your threshold value
segments = diarization.diarize("${AUDIO_FILE_PATH}", threshold=THRESHOLD)

The speaker information and timing details, including the start and end times of each segment, are stored in the segments variable.
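The result can then be printed as a readable timeline. The sketch below rests on an assumption that should be checked against the installed version of simple_diarizer: that each returned segment is a dict with 'start', 'end', and 'label' keys. The `format_segments` helper is hypothetical, and the sample data is made up rather than real library output.

```python
def format_segments(segments):
    """Render segments as 'speaker N: start - end' lines.

    Assumes each segment is a dict with 'start', 'end', and 'label' keys.
    """
    lines = []
    for seg in segments:
        lines.append(
            f"speaker {seg['label']}: {seg['start']:.2f}s - {seg['end']:.2f}s"
        )
    return lines


# Made-up segments standing in for the output of diarize().
sample_segments = [
    {"start": 0.0, "end": 3.2, "label": 0},
    {"start": 3.2, "end": 7.9, "label": 1},
]

for line in format_segments(sample_segments):
    print(line)
```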

Falcon Speaker Diarization

Falcon Speaker Diarization is an on-device speaker diarization engine powered by deep learning. To get started with Falcon Speaker Diarization, follow these steps:

Install the package using pip:

pip install pvfalcon

Sign up for Picovoice Console for free and copy your AccessKey, which handles authentication and authorization.
Create an instance of the engine:

import pvfalcon

# Replace "${ACCESS_KEY}" with your Picovoice Console AccessKey
falcon = pvfalcon.create(access_key="${ACCESS_KEY}")

Perform speaker diarization on an audio file:

# Replace "${AUDIO_FILE_PATH}" with the path to your audio file
segments = falcon.process_file("${AUDIO_FILE_PATH}")
for segment in segments:
    print(
        "{speaker_tag=%d start_sec=%.2f end_sec=%.2f}"
        % (segment.speaker_tag, segment.start_sec, segment.end_sec)
    )

The segments variable represents an array of segments, each of which includes the segment's timing and speaker information.
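To compare results across frameworks, it can help to export segments to the RTTM format mentioned in the NeMo section. The sketch below is one way to do that, not part of the Falcon SDK: `to_rttm_lines` and its `file_id` parameter are hypothetical, and `FakeSegment` is a stand-in that only mirrors the three attributes used in the loop above (`speaker_tag`, `start_sec`, `end_sec`).

```python
from collections import namedtuple

# Stand-in for the engine's segment objects; it only mirrors the three
# attributes printed in the loop above.
FakeSegment = namedtuple("FakeSegment", ["speaker_tag", "start_sec", "end_sec"])


def to_rttm_lines(segments, file_id="audio"):
    """Format segments as RTTM SPEAKER records (onset and duration in seconds)."""
    lines = []
    for seg in segments:
        duration = seg.end_sec - seg.start_sec
        lines.append(
            f"SPEAKER {file_id} 1 {seg.start_sec:.3f} {duration:.3f} "
            f"<NA> <NA> speaker_{seg.speaker_tag} <NA> <NA>"
        )
    return lines


# Made-up segments for illustration.
fake_segments = [
    FakeSegment(speaker_tag=1, start_sec=0.0, end_sec=2.5),
    FakeSegment(speaker_tag=2, start_sec=2.5, end_sec=6.0),
]

for line in to_rttm_lines(fake_segments):
    print(line)
```

Because the helper only reads attributes, the segments returned by `process_file` could be passed to it directly.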

For more information about Falcon Speaker Diarization, check out the Falcon Speaker Diarization product page or refer to the Falcon Speaker Diarization Python SDK quick start guide. You can analyze calls, transcribe podcasts, identify speakers across meeting recordings, and more!
