On-device speaker diarization, enabling machines and humans to read and analyze transcripts without sacrificing privacy
Falcon Speaker Diarization identifies speakers in an audio stream by finding speaker change points and grouping speech segments based on speaker voice characteristics.
Powered by deep learning, Falcon Speaker Diarization enables machines and humans to read and analyze conversation transcripts created by Speech-to-Text APIs or SDKs.
import pvfalcon
f = pvfalcon.create(access_key)
segments = f.process_file(path)
Speaker Diarization is often tied to a specific Speech-to-Text API or limited to certain platforms, which restricts developers' options.
Falcon Speaker Diarization is the only modular and cross-platform Speaker Diarization software that works with any Speech-to-Text engine. Falcon Speaker Diarization processes speech data locally without sending it to remote servers, respecting privacy.
Identify speakers in conversations
Most speech-to-text APIs position Speaker Diarization as a feature and do not even mention its accuracy. Picovoice published an Open-source Speaker Diarization Benchmark to enable informed decisions.
Most Speaker Diarization solutions only work with the transcription engines they are integrated with. Falcon Speaker Diarization runs with any transcription engine, including OpenAI Whisper, Google Speech-to-Text, Amazon Transcribe, Azure Speech-to-Text, and IBM Watson Speech-to-Text, giving developers flexibility.
Falcon Speaker Diarization requires far fewer compute resources than the alternatives to achieve state-of-the-art (SOTA) accuracy. Utilize existing hardware, minimize compute costs, and save the environment!
Speaker Diarization identifies “who spoke when”. It splits an audio stream that contains human speech into homogeneous segments using speaker voice characteristics, then associates each segment with an individual speaker.
Speaker Diarization consists of two main steps: speaker segmentation and speaker clustering. Speaker segmentation focuses on finding speaker change points in an audio stream. Clustering groups speech segments together based on speakers’ voice characteristics.
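The two steps can be illustrated with a toy sketch. This is not how Falcon works internally; it assumes each short audio window has already been turned into a non-zero embedding vector (here, hand-made 2-D lists), and the function names `segment` and `cluster`, the cosine-distance measure, and the 0.5 threshold are all hypothetical choices made for illustration.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity; assumes both vectors are non-zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def segment(frames, threshold=0.5):
    """Speaker segmentation: cut wherever consecutive frames differ sharply.

    Returns a list of (start_index, end_index_exclusive) pairs.
    """
    if not frames:
        return []
    boundaries = [0]
    for i in range(1, len(frames)):
        if cosine_distance(frames[i - 1], frames[i]) > threshold:
            boundaries.append(i)  # speaker change point
    boundaries.append(len(frames))
    return list(zip(boundaries[:-1], boundaries[1:]))


def mean_vector(vectors):
    """Average a list of equal-length vectors component by component."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def cluster(frames, segments, threshold=0.5):
    """Speaker clustering: greedily assign each segment a 1-based speaker tag.

    A segment joins the first known speaker whose reference embedding is
    within the threshold; otherwise it starts a new speaker.
    """
    references = []  # one reference embedding per speaker found so far
    tags = []
    for start, end in segments:
        embedding = mean_vector(frames[start:end])
        for index, reference in enumerate(references):
            if cosine_distance(embedding, reference) <= threshold:
                tags.append(index + 1)
                break
        else:
            references.append(embedding)
            tags.append(len(references))
    return tags


# Hand-made embeddings: two windows of speaker A, two of speaker B, one of A.
frames = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [1.0, 0.05]]
segments = segment(frames)
print(segments)
print(cluster(frames, segments))
```

Real systems derive embeddings from a neural network and use more robust clustering, but the division of labor is the same: first find the change points, then decide which segments belong to the same voice.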
Speech-to-Text deals with “what is said.” It converts speech into text without distinguishing speakers, i.e., “who”. Speech-to-Text with timestamps also includes timing information, i.e., “when”.
Speaker Diarization differentiates speakers, answering “who spoke when” without analyzing “what is said.” Thus, developers use Speech-to-Text and Speaker Diarization together to identify “who said what and when.”
In short, Speaker Diarization and Speech-to-Text are complementary speech-processing technologies. Speaker Diarization enhances Speech-to-Text transcripts of conversations that involve multiple speakers. In the combined transcription result, each word is tagged with a number that identifies its speaker. The result contains as many distinct numbers as there are speakers that Speaker Diarization can uniquely identify in the audio sample.
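One way to combine the two outputs is to give each transcribed word the speaker number of the diarization segment it overlaps most in time. The sketch below is an assumption-laden illustration, not a Picovoice API: the `Word` and `Segment` classes, their field names, and the `tag_words` helper are hypothetical stand-ins for whatever a Speech-to-Text engine with word timestamps and a diarization engine actually return.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """A transcribed word with timestamps (hypothetical shape)."""
    text: str
    start_sec: float
    end_sec: float


@dataclass
class Segment:
    """A diarization segment with a speaker number (hypothetical shape)."""
    start_sec: float
    end_sec: float
    speaker_tag: int


def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def tag_words(words, segments):
    """Pair each word with the speaker whose segment overlaps it the most.

    A tag of 0 means no segment overlaps the word at all.
    """
    tagged = []
    for word in words:
        best_tag, best_overlap = 0, 0.0
        for seg in segments:
            shared = overlap(word.start_sec, word.end_sec,
                             seg.start_sec, seg.end_sec)
            if shared > best_overlap:
                best_tag, best_overlap = seg.speaker_tag, shared
        tagged.append((word.text, best_tag))
    return tagged


words = [Word("hello", 0.0, 0.4), Word("hi", 1.1, 1.3), Word("there", 1.3, 1.7)]
segments = [Segment(0.0, 1.0, 1), Segment(1.0, 2.0, 2)]
for text, speaker in tag_words(words, segments):
    print(f"Speaker {speaker}: {text}")
```

Matching on maximum overlap rather than on a word's start time keeps the result stable when a word straddles a speaker change point.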
Leopard Speech-to-Text and Cheetah Streaming Speech-to-Text are Picovoice’s Speech-to-Text engines. Leopard Speech-to-Text is ideal for batch audio transcription, while Cheetah Streaming Speech-to-Text is for real-time transcription.