OpenAI Whisper Speech-to-Text is a locally executable speech recognition model that comes in various sizes, allowing users to choose a model that suits their device's specifications. Unfortunately, Whisper lacks speaker diarization, a crucial feature for applications that require speaker identification (e.g. discerning speakers in a meeting scenario).

This article guides you through the process of integrating Picovoice Falcon Speaker Diarization with OpenAI Whisper in Python. Adding speaker diarization will result in a more user-friendly, dialogue-style transcription.


Start by installing the necessary packages, openai-whisper and pvfalcon, with pip (for example: pip3 install -U openai-whisper pvfalcon).

Both Falcon Speaker Diarization and Whisper Speech-to-Text run on CPU and do not require a GPU. Whisper can be slow on CPU, particularly with the larger models, so running it on a GPU can shorten its runtime.

Speech Recognition with Whisper

Let's begin by utilizing Whisper for speech recognition. The code snippet below demonstrates how to transcribe speech using Whisper:
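A minimal sketch of this step, assuming the openai-whisper package and its load_model and transcribe calls. The helper names transcribe_audio and format_segment are hypothetical, introduced here for illustration, and the import sits inside the function only so that the formatting helper can be used without Whisper installed.

```python
def transcribe_audio(model_name, audio_path):
    """Transcribe an audio file with Whisper and return its timestamped segments."""
    # Imported lazily; requires the openai-whisper package.
    import whisper

    model = whisper.load_model(model_name)
    result = model.transcribe(audio_path)
    # Each segment is a dict that includes "start" and "end" (seconds) and "text".
    return result["segments"]


def format_segment(segment):
    """Render one Whisper segment as '[start - end] text' (illustrative helper)."""
    return "[%.2f - %.2f] %s" % (
        segment["start"],
        segment["end"],
        segment["text"].strip(),
    )


# Example call; substitute the placeholders before running:
# transcript_segments = transcribe_audio("${WHISPER_MODEL}", "${AUDIO_FILE_PATH}")
# for segment in transcript_segments:
#     print(format_segment(segment))
```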

Here, ${WHISPER_MODEL} refers to one of the available Whisper models, and ${AUDIO_FILE_PATH} is the path to the audio file. Since our goal is a dialogue-style transcription, we'll focus on extracting segments from the result, each representing a part of the transcript with its corresponding timestamp.

Speaker Diarization with Falcon

Next, let's perform speaker diarization using Falcon. The following code snippet illustrates how to apply Falcon for this purpose:
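A minimal sketch of this step, assuming the pvfalcon package exposes create, process_file and delete as used here, and that each returned segment carries start_sec, end_sec and speaker_tag fields. The helper names diarize and format_speaker_segment are hypothetical and exist only for this illustration.

```python
def diarize(access_key, audio_path):
    """Run Falcon on an audio file and return its list of speaker segments."""
    # Imported lazily; requires the pvfalcon package.
    import pvfalcon

    falcon = pvfalcon.create(access_key=access_key)
    try:
        # Assumed to return segments with start_sec, end_sec and speaker_tag.
        return falcon.process_file(audio_path)
    finally:
        # Release the native resources held by the engine.
        falcon.delete()


def format_speaker_segment(segment):
    """Render one speaker segment as 'Speaker N: start - end' (illustrative helper)."""
    return "Speaker %d: %.2f - %.2f" % (
        segment.speaker_tag,
        segment.start_sec,
        segment.end_sec,
    )


# Example call; substitute the placeholders before running:
# diarization_segments = diarize("${ACCESS_KEY}", "${AUDIO_FILE_PATH}")
# for segment in diarization_segments:
#     print(format_speaker_segment(segment))
```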

Here, ${ACCESS_KEY} is your access key obtained from the Picovoice Console. The result is a list of speaker segments, similar to Whisper's segments but with a speaker_tag field identifying the speaker.

Integrating Whisper and Falcon Speaker Diarization

By combining OpenAI Whisper for speech recognition and Picovoice Falcon Speaker Diarization for speaker diarization, we aim to create a dialogue-style transcription. To achieve this, we'll define a simple score to measure the overlap between Whisper and Falcon Speaker Diarization segments. The following code snippet demonstrates how to calculate this score:
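One possible definition of such a score is sketched below: the fraction of a Whisper segment's duration that a given Falcon segment covers. This particular formula is an assumption made for illustration. The sketch also assumes Whisper segments are dicts with "start" and "end" keys and Falcon segments expose start_sec and end_sec attributes.

```python
def segment_score(transcript_segment, diarization_segment):
    """Return the fraction (0.0 to 1.0) of the Whisper segment covered by the Falcon segment."""
    t_start = transcript_segment["start"]
    t_end = transcript_segment["end"]
    d_start = diarization_segment.start_sec
    d_end = diarization_segment.end_sec

    duration = t_end - t_start
    if duration <= 0:
        # Guard against zero-length segments to avoid dividing by zero.
        return 0.0

    # Length of the intersection of the two time intervals; negative if disjoint.
    overlap = min(t_end, d_end) - max(t_start, d_start)
    return max(0.0, overlap) / duration
```

Dividing by the Whisper segment's duration keeps the score between 0 and 1, so scores stay comparable across segments of different lengths.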

Utilizing this score, we can find the best-matching Falcon Speaker Diarization segment for each Whisper segment. The code snippet below demonstrates this process:
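A minimal sketch of the matching step. The function name match_segments is hypothetical, and the score is passed in as a callable so the sketch works with any overlap measure of the kind described above. A Whisper segment that overlaps no speaker segment is paired with None.

```python
def match_segments(transcript_segments, diarization_segments, score):
    """Pair each transcript segment with its highest-scoring diarization segment.

    `score` is a callable taking (transcript_segment, diarization_segment)
    and returning a number; higher means a better match.
    """
    matches = []
    for t_segment in transcript_segments:
        best_score = 0.0
        best_d_segment = None
        for d_segment in diarization_segments:
            current = score(t_segment, d_segment)
            # Only strictly positive scores count as a match.
            if current > best_score:
                best_score = current
                best_d_segment = d_segment
        matches.append((t_segment, best_d_segment))
    return matches
```

This is a brute-force search that compares every Whisper segment against every speaker segment, which is acceptable for a demonstration but grows with the product of the two list lengths.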

This is a basic approach for merging the two segment lists, intended for demonstration purposes. Results can be further enhanced with a more sophisticated matching algorithm.

Putting everything together would result in the script below:
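The sketch below is one way to assemble the steps, not necessarily the exact script this article was written around. It assumes the openai-whisper and pvfalcon calls used here (load_model, transcribe, create, process_file, delete), and the names build_dialogue and main, the "Speaker N: text" line format, and the "unknown" label for unmatched segments are all illustrative choices.

```python
def segment_score(transcript_segment, diarization_segment):
    """Fraction of the Whisper segment's duration covered by the Falcon segment."""
    t_start, t_end = transcript_segment["start"], transcript_segment["end"]
    duration = t_end - t_start
    if duration <= 0:
        return 0.0
    overlap = min(t_end, diarization_segment.end_sec) - max(
        t_start, diarization_segment.start_sec
    )
    return max(0.0, overlap) / duration


def build_dialogue(transcript_segments, diarization_segments):
    """Return one 'Speaker N: text' line per Whisper segment."""
    lines = []
    for t_segment in transcript_segments:
        best_score, best_d_segment = 0.0, None
        for d_segment in diarization_segments:
            score = segment_score(t_segment, d_segment)
            if score > best_score:
                best_score, best_d_segment = score, d_segment
        # Segments with no overlapping speaker segment are labeled "unknown".
        speaker = "unknown" if best_d_segment is None else best_d_segment.speaker_tag
        lines.append("Speaker %s: %s" % (speaker, t_segment["text"].strip()))
    return lines


def main(whisper_model, audio_path, access_key):
    # Imported lazily; requires the openai-whisper and pvfalcon packages.
    import pvfalcon
    import whisper

    model = whisper.load_model(whisper_model)
    transcript_segments = model.transcribe(audio_path)["segments"]

    falcon = pvfalcon.create(access_key=access_key)
    try:
        diarization_segments = falcon.process_file(audio_path)
    finally:
        falcon.delete()

    for line in build_dialogue(transcript_segments, diarization_segments):
        print(line)


# Example call; substitute the placeholders before running:
# main("${WHISPER_MODEL}", "${AUDIO_FILE_PATH}", "${ACCESS_KEY}")
```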

The expected result is a dialogue-style transcript, with each transcribed segment on its own line and attributed to a speaker.

For more information on the Falcon Speaker Diarization Python SDK, see the documentation. If you want speech recognition and speaker diarization in a single engine, consider Picovoice Leopard Speech-to-Text. Leopard is lightweight and fast, incorporates Falcon Speaker Diarization internally, and returns speaker information from a single function call.
