Falcon Speaker Diarization

Accurate and lightweight diarization to find who spoke when in Whisper transcripts

The only modular on-device speaker diarization SDK. Add accurate speaker labels to any STT transcript — Whisper, Amazon, Google, Azure, and more. 221× more efficient than pyannote with near-identical accuracy.

20% Jaccard Error Rate vs. pyannote's 27%
117 MB Total Memory Usage vs. pyannote's 1.5 GB
0.02 Core-Hour Ratio vs. pyannote's 4.42
What is Falcon Speaker Diarization?

The only modular on-device speaker diarization SDK

Falcon Speaker Diarization is an enterprise-ready on-device speaker diarization engine built for batch audio processing at scale. It identifies who spoke when in completed recordings, runs entirely offline across platforms, and is private by architecture.

Most cloud STT APIs embed speaker diarization inside their own transcription pipeline and treat it as a feature, while some STT engines, such as Whisper, offer no speaker diarization at all. Falcon is decoupled from transcription entirely. It works alongside any STT engine (Whisper, Amazon Transcribe, Google Speech-to-Text, Azure Speech, or any other) and outputs timestamped speaker segments that combine with the transcript in your pipeline. No cloud round-trip. No audio transmitted to any server.

For teams considering pyannote, Falcon Speaker Diarization delivers a lower Jaccard Error Rate and near-identical DER at 221× lower compute cost and 15× less memory, making it viable on any hardware rather than requiring dedicated infrastructure.

Developer Experience

Add speaker labels to any transcript in minutes

Falcon Speaker Diarization takes an audio file and returns an array of timestamped speaker segments. Each segment contains a start time, end time, and speaker tag. Use Falcon Speaker Diarization with its native SDKs for Python, C, iOS, Android, and Web.
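As a sketch of that result shape, the segment fields below (`start_sec`, `end_sec`, `speaker_tag`) are hypothetical names modeled on the description above (start time, end time, speaker tag), not the documented SDK types:

```python
from dataclasses import dataclass

# Hypothetical segment shape modeled on the prose above: each segment
# carries a start time, an end time, and a speaker tag.
@dataclass
class Segment:
    start_sec: float
    end_sec: float
    speaker_tag: int

def format_segments(segments):
    """Render diarization output as 'Speaker N: start-end' lines."""
    return [
        f"Speaker {s.speaker_tag}: {s.start_sec:.2f}s-{s.end_sec:.2f}s"
        for s in segments
    ]

segments = [Segment(0.0, 4.2, 1), Segment(4.2, 9.8, 2)]
for line in format_segments(segments):
    print(line)  # e.g. "Speaker 1: 0.00s-4.20s"
```

Whatever the actual field names, this array of labeled time ranges is the entire contract between Falcon and your STT pipeline.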

OPEN-SOURCE FALCON SPEAKER DIARIZATION BENCHMARK

Proven accuracy and efficiency. Open-source and reproducible.

Falcon Speaker Diarization is benchmarked against Amazon Transcribe, Azure Speech-to-Text, Google Speech-to-Text, and pyannote on the VoxConverse dataset. Falcon Speaker Diarization outperforms every cloud STT API on both DER and JER, and matches pyannote's accuracy at 221× lower compute cost and 15× less memory.

Jaccard Error Rate (Accuracy), lower is better:
- Falcon: 20%
- pyannote: 27%
- Amazon Transcribe: 30%
- Azure Speech: 30%
- Google Enhanced: 58%

Diarization Error Rate (Accuracy), lower is better:
- pyannote: 9%
- Falcon: 10%
- Amazon Transcribe: 11%
- Azure Speech: 16%
- Google Enhanced: 24%

Diarization Memory Usage, lower is better:
- Falcon: 116.8 MB
- pyannote: 1.5 GB

Core-Hour Ratio, lower is better:
- Falcon: 0.02x
- pyannote: 4.42x
Ready to integrate? Check our docs to start building or talk to the sales team about enterprise deployment.
Capabilities

Why enterprises choose Falcon Speaker Diarization

Falcon Speaker Diarization is an enterprise-ready on-device speaker diarization engine built for batch audio processing at scale. It processes completed recordings offline, runs across platforms without GPU, and is private by architecture.

01 STT-agnostic: Cloud diarization APIs bundle speaker identification into their own STT pipeline — use their engine or nothing. Falcon is decoupled from transcription entirely. It works alongside any STT — Whisper, Amazon Transcribe, Google Speech-to-Text, Azure Speech, IBM Watson, or any other. Speaker segments arrive separately and combine with the transcript in your pipeline.
02 Accurate: Cloud STT APIs treat speaker diarization as a feature bundled into transcription — not a product they invest in independently. You cannot use their diarization with a different STT engine, accuracy figures are rarely published, and audio must leave the device on every inference. Falcon Speaker Diarization is decoupled from transcription entirely, works with any STT engine, including Whisper, and outperforms Amazon, Azure, and Google on both Jaccard Error Rate and Diarization Error Rate.
03 Efficient: On an AMD Ryzen 7 5700X, processing 100 hours of audio requires 2 core-hours with Falcon Speaker Diarization versus 442 core-hours with pyannote, a 221× difference. This gap, along with Falcon's peak memory consumption of 0.1 GB (vs. pyannote's 1.5 GB), makes Falcon a great fit for any platform: embedded, mobile, web browsers, laptops, and servers. No GPU or dedicated AI accelerator required.
04 No speaker limit: Unlike cloud STT APIs, which require the number of speakers to be specified in advance or impose hard caps, Falcon Speaker Diarization automatically detects and labels speakers without prior knowledge of speaker count. There is no limit on the number of speakers it can identify.
05 Cross-Platform: Falcon Speaker Diarization runs on every platform your product ships — Android, Chrome, Edge, Firefox, iOS, Linux, macOS, Raspberry Pi, Safari, and Windows — across AMD, Intel, NVIDIA, and Qualcomm hardware.
06 Language Agnostic: Unlike cloud STT APIs, where diarization is limited to the languages supported by the bundled transcription engine, Falcon's speaker identification is language-agnostic. It correctly identifies and separates speakers regardless of language, even when speakers switch languages mid-conversation.
07 Private by design: Falcon Speaker Diarization processes audio entirely on-device without transmitting audio data to any server, making it GDPR, HIPAA, CCPA, and CJIS compliant by architecture — not policy.
08 Enterprise Ready: Falcon Speaker Diarization is production-grade and enterprise-ready. Picovoice offers flexible licensing, dedicated engineering support, NDA-protected custom model training, and SLA-backed response times for teams shipping at scale.
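The 221× efficiency figure from the Efficient capability above is straightforward to verify; a quick sketch using the core-hour numbers quoted in this section:

```python
# Figures quoted above: a core-hour ratio is core-hours of compute
# consumed per hour of audio processed.
FALCON_RATIO = 0.02    # 2 core-hours per 100 audio-hours
PYANNOTE_RATIO = 4.42  # 442 core-hours per 100 audio-hours

audio_hours = 100
falcon_core_hours = audio_hours * FALCON_RATIO      # 2.0
pyannote_core_hours = audio_hours * PYANNOTE_RATIO  # 442.0

speedup = pyannote_core_hours / falcon_core_hours
print(f"{speedup:.0f}x")  # 221x
```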

Ship it.
On device.

Speaker diarization that works with any STT. On-device. No GPU.

FAQ

Common questions about speaker diarization

+
What is Speaker Diarization?

Speaker Diarization identifies "who spoke when." It splits an audio stream containing human speech into homogeneous segments based on speakers' voice characteristics, then associates each segment with an individual speaker.

+
What are the steps in Speaker Diarization?

Speaker Diarization consists of two main steps: speaker segmentation and speaker clustering. Speaker segmentation focuses on finding speaker change points in an audio stream. Clustering groups speech segments together based on speakers' voice characteristics.
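To make the clustering step concrete, here is a toy sketch: given one voice embedding per speech segment (fabricated 2-D vectors here, purely for illustration; real systems use learned, high-dimensional speaker embeddings and more robust clustering), a greedy threshold-based pass groups segments with similar embeddings under one speaker label:

```python
import math

def cosine_sim(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def cluster_segments(embeddings, threshold=0.9):
    """Assign each segment to the first existing cluster whose
    representative embedding is similar enough; otherwise open a
    new cluster, i.e., a new speaker label."""
    labels, representatives = [], []
    for emb in embeddings:
        for idx, rep in enumerate(representatives):
            if cosine_sim(emb, rep) >= threshold:
                labels.append(idx + 1)
                break
        else:
            representatives.append(emb)
            labels.append(len(representatives))
    return labels

# Toy embeddings: segments 1 and 3 share a voice, segment 2 differs.
embs = [[1.0, 0.0], [0.0, 1.0], [0.95, 0.05]]
print(cluster_segments(embs))  # [1, 2, 1]
```

Note there is no fixed speaker count anywhere: a new label appears whenever a segment matches no existing cluster, which is why diarization can handle an unknown number of speakers.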

+
How does Speaker Diarization differ from Speech-to-Text?

Speech-to-Text deals with "what is said." It converts speech into text without distinguishing speakers, i.e., "who." Speech-to-Text with timestamps also includes timing information, i.e., "when."

Speaker Diarization differentiates speakers, answering "who spoke, when" without analyzing "what's said." Thus, developers use Speech-to-Text and Speaker Diarization together to identify "who said what and when."

In short, Speaker Diarization and Speech-to-Text are complementary speech-processing technologies. Speaker Diarization enhances Speech-to-Text transcripts of conversations involving multiple speakers. The transcription result tags each word with a number assigned to an individual speaker, with as many distinct numbers as Speaker Diarization can uniquely identify in the audio sample.

Leopard Speech-to-Text and Cheetah Streaming Speech-to-Text are Picovoice's Speech-to-Text engines. Leopard Speech-to-Text is ideal for batch audio transcription, while Cheetah Streaming Speech-to-Text is for real-time transcription.

+
How does Speaker Diarization differ from Speaker Recognition?

Speaker Diarization and Speaker Recognition are similar but different technologies enabling different use cases. Both identify speakers by analyzing the voice characteristics of speakers. Speaker recognition identifies "known" speakers, whereas Speaker Diarization differentiates speakers without knowing who they are. Speaker Recognition returns recorded names of the enrolled speakers, such as Jane and Joe. Speaker Recognition cannot identify speakers without enrolled voice prints. Speaker Diarization, on the other hand, returns labels such as Speaker 1 and Speaker 2 without requiring speakers' voice prints. Speaker Diarization does not transfer information between audio files, meaning a speaker can be Speaker 1 in one file and Speaker 2 in another.

In short, Speaker Recognition can verify speakers, whereas Speaker Diarization does not match voice characteristics to verify speakers. Check out Eagle Speaker Recognition and its web demo to learn more about speaker recognition.

+
What can I build with Speaker Diarization?

Enterprises across industries use speaker diarization to make audio data structured, searchable, and actionable. Call centers use it to automatically label agent and customer turns for QA scoring, coaching workflows, and compliance monitoring. Legal teams use it to separate parties in depositions, hearings, and recorded interviews for evidence discovery. Healthcare practices use it to distinguish clinician and patient speech in consultation recordings for reporting and EHR integration. Media and podcast teams use it to attribute speech to named guests for searchable transcripts and accessibility captions. Enterprise meeting intelligence platforms use it to assign action items, summaries, and sentiment scores to the right speaker in recorded calls and video conferences.

+
What is engine-agnostic Speaker Diarization?

Most vendors offer Speaker Diarization embedded in their Speech-to-Text software, since developers use Speaker Diarization to identify speakers within a transcript produced by Speech-to-Text. Offering them jointly simplifies development, but it limits developers' ability to choose what works best for them. Engine-agnostic Speaker Diarization works with any Speech-to-Text software. Developers who are unsatisfied with the performance of an embedded Speaker Diarization, or who prefer Speech-to-Text software that doesn't offer embedded Speaker Diarization, can use Falcon Speaker Diarization with the Speech-to-Text of their choice.

+
Can I use Falcon Speaker Diarization with OpenAI Whisper Speech-to-Text?

Yes, you can use Falcon Speaker Diarization with OpenAI's Whisper Speech-to-Text or any other automatic speech recognition engine, including but not limited to Amazon Transcribe, Google Speech-to-Text, and Microsoft Azure Speech-to-Text. Falcon processes the same audio file independently of the transcription engine and returns speaker-timestamped segments that you combine with the Whisper transcript in your pipeline.

Check out the tutorials for Adding Speaker Diarization to OpenAI Whisper using Picovoice Falcon and Adding Speaker Diarization to OpenAI Whisper in C++.
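The combining step is interval matching between word timestamps and speaker segments. A minimal sketch, assuming Whisper-style word timings and (start, end, speaker) tuples for the diarization output (these tuple shapes are illustrative, not the actual SDK types):

```python
def assign_speakers(words, segments):
    """Label each timestamped word with the speaker segment that
    contains its midpoint; words outside every segment get None.

    words:    list of (word, start_sec, end_sec)
    segments: list of (start_sec, end_sec, speaker_tag)
    """
    labeled = []
    for word, w_start, w_end in words:
        midpoint = (w_start + w_end) / 2
        speaker = None
        for s_start, s_end, tag in segments:
            if s_start <= midpoint < s_end:
                speaker = tag
                break
        labeled.append((word, speaker))
    return labeled

words = [("hello", 0.1, 0.4), ("there", 0.5, 0.8), ("hi", 1.1, 1.3)]
segments = [(0.0, 1.0, 1), (1.0, 2.0, 2)]
print(assign_speakers(words, segments))
# [('hello', 1), ('there', 1), ('hi', 2)]
```

Matching on the word midpoint rather than the word's full span is a simple way to tolerate small timestamp disagreements between the two engines.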

+
Does Falcon Speaker Diarization require knowing the number of speakers ahead of time?

No. Falcon automatically detects and labels speakers without any prior knowledge of speaker count and has no limit on the number it can identify, making it suitable for panel discussions, large meetings, and any recording where the speaker count is unknown or variable.

+
What's the maximum number of speakers Falcon Speaker Diarization supports?

There is no limit on the number of speakers that Falcon Speaker Diarization supports. In other words, Falcon Speaker Diarization works with an unlimited number of speakers.

+
Does Falcon Speaker Diarization support real-time Speaker Diarization?

Falcon Speaker Diarization is a batch engine — it processes completed audio files. For real-time streaming speaker diarization, see Bluebird Streaming Speaker Diarization, Picovoice's on-device streaming diarization SDK that assigns speaker labels to live audio streams in under 250 ms.

+
How does Falcon Speaker Diarization differ from speaker diarization in cloud STT APIs?

Cloud STT APIs — Amazon Transcribe, Azure Speech, Google Speech-to-Text — embed speaker diarization as a feature inside their own transcription pipeline. Accuracy figures for diarization are rarely published, and audio must leave the device on every inference. Falcon is decoupled from transcription entirely, works with any STT engine, publishes accuracy in an open-source benchmark, and processes all audio on-device with no audio transmitted to any server.

+
How does Falcon Speaker Diarization compare to pyannote?

pyannote, a leading open-source speaker diarization library, achieves a marginally lower DER (9.0% vs. Falcon's 10.3%) but a higher JER (27.4% vs. Falcon's 19.9%) and requires 221× more compute (442 core-hours to process 100 hours of audio versus Falcon's 2 core-hours). pyannote's peak memory requirement of 1.5 GB, versus Falcon's 0.1 GB, makes it impractical for embedded and mobile deployments.

For production deployments, these resource requirements make pyannote impractical on cost-sensitive hardware and expensive at scale. To process 1,000,000 hours of audio on AWS EC2, an enterprise would spend about $187,850/month on compute with pyannote, versus about $850/month with Falcon.

In short, Falcon delivers near-identical accuracy with a fraction of the compute and memory footprint of pyannote and offers a production-grade SDK with enterprise support.
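The cost comparison above follows directly from the core-hour ratios; a quick sketch (the per-core-hour rate is back-calculated from the quoted totals and used only for illustration):

```python
RATE = 0.0425  # assumed USD per core-hour, inferred from the quoted totals

def monthly_compute_cost(audio_hours, core_hour_ratio, rate=RATE):
    """cost = audio hours x (core-hours per audio hour) x ($ per core-hour)"""
    return audio_hours * core_hour_ratio * rate

print(round(monthly_compute_cost(1_000_000, 4.42)))  # pyannote: 187850
print(round(monthly_compute_cost(1_000_000, 0.02)))  # Falcon: 850
```

Because both engines are billed at the same per-core-hour rate, the 221× core-hour gap carries over directly as a 221× gap in compute spend.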

+
How accurate is Falcon Speaker Diarization?

Falcon achieves 19.9% JER and 10.3% DER on the VoxConverse dataset, proven by the open-source speaker diarization benchmark.

On JER, Falcon Speaker Diarization (19.9%) outperforms pyannote (27.4%), Amazon Transcribe Speaker Diarization (29.8%), Azure STT Speaker Diarization (30.1%), Google Enhanced STT Speaker Diarization (57.6%), and Google Cloud STT Speaker Diarization (83.4%).

On DER, Falcon Speaker Diarization (10.3%) outperforms Amazon Transcribe Speaker Diarization (11.1%), Azure STT Speaker Diarization (15.7%), Google Enhanced STT Speaker Diarization (24.0%), and Google Cloud STT Speaker Diarization (50.2%). pyannote edges Falcon on DER by 1.3 percentage points (9.0% vs. 10.3%) — the only metric where it leads — while requiring 221× more compute and 15× more memory.

+
Which platforms does Falcon Speaker Diarization support?

Falcon Speaker Diarization runs on Android, iOS, Linux, macOS, Windows, and Raspberry Pi, and in modern web browsers (Chrome, Edge, Firefox, and Safari), across AMD, Intel, NVIDIA, and Qualcomm hardware.

+
How do I get technical support for Falcon Speaker Diarization?

Picovoice docs, blog, Medium posts, and GitHub are great resources to learn about voice AI, Picovoice technology, and how to perform speaker diarization. Enterprise customers get dedicated support specific to their applications from Picovoice Product & Engineering teams. Reach out to your Picovoice contact or contact sales to discuss support options.

+
How can I get informed about updates and upgrades?

Version changes are announced on LinkedIn. Subscribing to the GitHub repository is the best way to get notified of patch releases. If you enjoy building with Falcon Speaker Diarization, show your support by giving it a GitHub star!