OpenAI Whisper Speech-to-Text is a locally executable speech recognition model that comes in various sizes, allowing
users to choose a model that suits their device's specifications. Unfortunately, Whisper lacks speaker diarization, a crucial feature
for applications that need to attribute speech to individual speakers (e.g., telling who said what in a meeting recording).
This article guides you through the process of integrating Picovoice Falcon Speaker Diarization with OpenAI Whisper in Python. Adding speaker diarization will result in a more user-friendly, dialogue-style transcription.
Both Falcon Speaker Diarization and Whisper Speech-to-Text run on CPU; no GPU is required. Whisper can be slow on CPU, however, so running it on a GPU can shorten transcription times.
Speech Recognition with Whisper
Let's begin by using Whisper for speech recognition. The code snippet below transcribes an audio file and collects the
timestamped segments:
import whisper

model = whisper.load_model("${WHISPER_MODEL}")
result = model.transcribe("${AUDIO_FILE_PATH}")
transcript_segments = result["segments"]
Here, ${WHISPER_MODEL} refers to one of
the available Whisper models (e.g., tiny, base, small, medium, or large),
and ${AUDIO_FILE_PATH} is the path to the audio file. Since our goal is a dialogue-style transcription, we'll focus on
the segments in the result, each of which is a portion of the transcript with its start and end timestamps.
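For example, the segments can be printed with their timestamps (field names as in Whisper's output):

for segment in transcript_segments:
    print("[%.2fs -> %.2fs]%s" % (segment["start"], segment["end"], segment["text"]))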
Speaker Diarization with Falcon
Next, let's perform speaker diarization using Falcon. The following code snippet illustrates how to apply Falcon for
this purpose:
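A minimal sketch using the pvfalcon package follows; ${ACCESS_KEY} and, as before, ${AUDIO_FILE_PATH} are placeholders:

import pvfalcon

# Falcon runs fully on-device; the AccessKey is only used for authentication
falcon = pvfalcon.create(access_key="${ACCESS_KEY}")
speaker_segments = falcon.process_file("${AUDIO_FILE_PATH}")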
Here, ${ACCESS_KEY} is your AccessKey obtained from the Picovoice Console.
The process_file method returns a list of speaker segments, similar to Whisper's segments but with a speaker_tag field
identifying the speaker.
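For instance, the speaker segments can be inspected as follows (field names as in the Falcon Python SDK):

for segment in speaker_segments:
    print("Speaker %d: %.2fs -> %.2fs" % (segment.speaker_tag, segment.start_sec, segment.end_sec))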
Integrating Whisper and Falcon Speaker Diarization
By combining OpenAI Whisper for speech recognition and Picovoice Falcon Speaker Diarization for speaker diarization, we aim to create a dialogue-style
transcription. To achieve this, we'll define a simple score to measure the overlap between Whisper and Falcon Speaker Diarization segments.
The following code snippet demonstrates how to calculate this score:
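One simple choice, sketched below, is the length in seconds of the intersection of the two segments' time intervals (overlap_score is an illustrative name, not part of either SDK):

def overlap_score(whisper_segment, falcon_segment):
    # Length of the intersection of the two time intervals, in seconds;
    # zero if the segments do not overlap at all.
    overlap_start = max(whisper_segment["start"], falcon_segment.start_sec)
    overlap_end = min(whisper_segment["end"], falcon_segment.end_sec)
    return max(0.0, overlap_end - overlap_start)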
Utilizing this score, we can find the best-matching Falcon Speaker Diarization segment for each Whisper segment. The code snippet below
demonstrates this process:
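One way to do this, sketched below under the same naming assumptions as the previous snippets, is to pick, for each Whisper segment, the Falcon segment with the highest score:

labeled_segments = []
for whisper_segment in transcript_segments:
    # Pick the Falcon segment with the largest time overlap
    best_match = max(
        speaker_segments,
        key=lambda falcon_segment: overlap_score(whisper_segment, falcon_segment))
    labeled_segments.append((best_match.speaker_tag, whisper_segment["text"].strip()))

for speaker_tag, text in labeled_segments:
    print("Speaker %d: %s" % (speaker_tag, text))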
This is a basic approach for merging the two segment lists, intended for demonstration purposes. Results can be further enhanced with a more sophisticated matching algorithm.
Putting everything together would result in the script below:
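Here is a sketch of the full script, again with ${WHISPER_MODEL}, ${AUDIO_FILE_PATH}, and ${ACCESS_KEY} as placeholders:

import pvfalcon
import whisper


def overlap_score(whisper_segment, falcon_segment):
    # Length of the intersection of the two time intervals, in seconds
    overlap_start = max(whisper_segment["start"], falcon_segment.start_sec)
    overlap_end = min(whisper_segment["end"], falcon_segment.end_sec)
    return max(0.0, overlap_end - overlap_start)


def main():
    # Speech recognition with Whisper
    model = whisper.load_model("${WHISPER_MODEL}")
    result = model.transcribe("${AUDIO_FILE_PATH}")
    transcript_segments = result["segments"]

    # Speaker diarization with Falcon
    falcon = pvfalcon.create(access_key="${ACCESS_KEY}")
    try:
        speaker_segments = falcon.process_file("${AUDIO_FILE_PATH}")
    finally:
        falcon.delete()

    # Label each Whisper segment with the best-overlapping speaker
    for whisper_segment in transcript_segments:
        best_match = max(
            speaker_segments,
            key=lambda falcon_segment: overlap_score(whisper_segment, falcon_segment))
        print("Speaker %d: %s" % (best_match.speaker_tag, whisper_segment["text"].strip()))


if __name__ == "__main__":
    main()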
And the expected result follows a format similar to the below output:
Speaker 1: Hey, has the task been completed?
Speaker 2: I don't know anything about it.
Speaker 3: Well, we're in the process of working on it.
Speaker 3: There's a bit of a delay because we're waiting on someone else to complete their part.
Speaker 1: Waiting again? This is taking longer than expected.
Speaker 1: Can we get an update on the timeline?
Speaker 3: I understand the urgency.
Speaker 3: I've followed up with the person responsible, and they've assured me they're working on it.
Speaker 3: We should have a clearer timeline by the end of the day.
As the script above shows, it only takes a minute to add speaker diarization to Whisper using Falcon.
For more in-depth information on the Falcon Speaker Diarization Python SDK, see
the documentation. If you're looking for a single engine that
combines speech recognition and speaker diarization, consider
Picovoice Leopard Speech-to-Text. Leopard is lightweight and fast,
incorporates Falcon Speaker Diarization internally, and returns the transcript
along with speaker information through a single function call.