Speaker Diarization is figuring out “who spoke when?” In practice, it’s about “who spoke when, and what did they say?” Let me expand. In academia, Speaker Diarization is the problem of partitioning a stream of audio into segments, each spoken by a single speaker. In practice, Speaker Diarization is usually a subcomponent of speech-to-text systems, where it improves the readability of transcripts. The terms Speaker Recognition, Speaker Identification, and Speaker Clustering can loosely refer to the same technology.

If you need Speaker Diarization, you have to make a strategic decision between building, using open source, and buying. This decision has long-term implications for your product, resource allocation, and budget.


Depending on who you talk to, Speaker Diarization is either a solved problem or impossible to solve.

There are two dominant paradigms for building a diarization system. One extracts speaker embeddings with a (possibly deep) model and then applies clustering on top to create the partitions. “Speaker Diarization with LSTM” is an example of this method. Alternatively, one can recast diarization as a classification problem and solve it end-to-end. “End-To-End Speaker Diarization Conditioned on Speech Activity and Overlap Detection” is an example of an end-to-end approach. Although the latter has the potential to outperform clustering-based systems, it is much harder to train and requires significantly more labelled data.
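The clustering paradigm can be sketched in a few lines. This is a toy illustration, not the implementation from either paper: the “embeddings” are synthetic vectors standing in for the output of a real speaker-embedding model, and the greedy cosine clustering is a deliberately simplified substitute for the agglomerative or spectral clustering used in practice.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for speaker embeddings. In a real pipeline each vector
# would come from a trained speaker-embedding model (e.g. an LSTM or
# x-vector network) applied to a short audio segment; here we fake two
# hypothetical speakers as noisy points around two fixed centers.
speaker_a = rng.normal(loc=1.0, size=16)
speaker_b = rng.normal(loc=-1.0, size=16)
turns = [0, 0, 1, 0, 1, 1]  # ground-truth speaker of each segment
embeddings = [
    [speaker_a, speaker_b][s] + 0.05 * rng.normal(size=16) for s in turns
]

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def greedy_cluster(embs, threshold=0.5):
    """Assign each segment to the most similar existing cluster, or open
    a new one; each resulting cluster is treated as one speaker. A toy
    substitute for the clustering step of real systems."""
    centroids, labels = [], []
    for e in embs:
        sims = [cosine(e, c) for c in centroids]
        if sims and max(sims) >= threshold:
            labels.append(int(np.argmax(sims)))
        else:
            centroids.append(e)
            labels.append(len(centroids) - 1)
    return labels

labels = greedy_cluster(embeddings)
print(labels)  # one cluster id per segment, i.e. "who spoke when"
```

An end-to-end system would instead emit per-frame speaker activity labels directly from the audio, which is also what lets it handle overlapping speech that hard clustering cannot represent.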


pyannote.audio is an open-source Speaker Diarization project. In addition, speech-to-text toolkits such as Kaldi and SpeechBrain include diarization components within their frameworks.

The upside of free and open-source software (FOSS) is that it is free. Right? FOSS is free to use, but its total cost of ownership is not zero. When it comes to FOSS, what you see is what you get. There is no dedicated support. You have no say in the roadmap, and maintainers are free to abandon the project whenever they wish. Treat FOSS as a starting point for the build option rather than a plug-and-play solution for long-term use.


There is no standalone commercial Speaker Diarization API. You need to use one of the speech-to-text (STT) APIs and enable Speaker Diarization, which can cost extra on top of the STT price. The downside of API-based offerings is that they become unbearably expensive as your business scales.

Azure Speech-to-Text, AWS Transcribe, Google Speech-to-Text, and IBM Watson Speech-to-Text all offer Speaker Diarization. The cost is between $1 and $3 per hour of audio. There is also a wave of SaaS startups in speech recognition: Speechmatics, AssemblyAI, Deepgram, and Rev, to name a few. They might have better technology, or you might be able to negotiate a better deal.
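To make the scaling concern concrete, here is a back-of-envelope calculation using the $1–$3 per hour range above. The monthly volumes are hypothetical; plug in your own.

```python
# Rough annual cost of STT-with-diarization APIs at the quoted
# $1-$3 per audio-hour. Volumes below are hypothetical examples.
def annual_cost(hours_per_month: float, price_per_hour: float) -> float:
    """Yearly API spend for a given monthly audio volume and unit price."""
    return hours_per_month * price_per_hour * 12

for hours in (100, 1_000, 10_000):
    low, high = annual_cost(hours, 1.0), annual_cost(hours, 3.0)
    print(f"{hours:>6} h/month -> ${low:,.0f} to ${high:,.0f} per year")
```

At a few hundred hours per month the API bill is modest; at tens of thousands of hours per month it reaches the cost of an in-house team, which is when build or open source starts to pay off.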