Speaker Recognition answers the question “who is speaking?”. Speaker Verification is the subfield of Speaker Recognition used for authentication; it is also known as Speaker Authentication, Voice Biometrics, and Voice ID. In verification the target speaker is known in advance, and the system only checks whether the voice matches that speaker. Speaker Identification, by contrast, attributes a spoken utterance to one of the (possibly many) known speakers.
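The difference between the two tasks can be sketched in a few lines. The example below assumes each utterance has already been turned into a fixed-length embedding vector by some model (not shown), and that voices are compared with cosine similarity. The function names, the toy vectors, and the 0.7 threshold are illustrative assumptions, not part of any particular product.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def verify(utterance, claimed_voiceprint, threshold=0.7):
    """Verification (1:1): does the utterance match the one claimed speaker?
    The threshold is a hypothetical value; real systems tune it on data."""
    return cosine_similarity(utterance, claimed_voiceprint) >= threshold

def identify(utterance, voiceprints):
    """Identification (1:N): which enrolled speaker is the closest match?"""
    return max(
        voiceprints,
        key=lambda name: cosine_similarity(utterance, voiceprints[name]),
    )

# Toy 3-dimensional embeddings; real embeddings have hundreds of dimensions.
voiceprints = {"alice": [0.9, 0.1, 0.0], "bob": [0.1, 0.9, 0.2]}
utterance = [0.8, 0.2, 0.1]

print(verify(utterance, voiceprints["alice"]))  # claim "I am alice"
print(verify(utterance, voiceprints["bob"]))    # claim "I am bob"
print(identify(utterance, voiceprints))         # no claim, pick best match
```

Note that this `identify` always returns some enrolled speaker; a practical system would also need a rejection threshold for voices that match nobody.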

The first use of Speaker Verification and Speaker Identification is security and authentication. When you call a service provider, it can verify your identity if it has your voiceprint. This identification is passive and accomplished using a Speaker Identification system. Alternatively, Speaker Verification is used to actively unlock protected applications and devices. The user first needs to enroll by providing a few samples of their voice, uttering either a fixed text (for text-dependent Speaker Verification) or an arbitrary sentence (for text-independent Speaker Verification). Speaker Identification also has the potential to personalize voice user interfaces (VUI). For example, when you ask your smart speaker to play your favourite album, you should get a different result than when your child does.
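As a rough illustration of the enrollment step mentioned above, the sketch below builds a voiceprint by averaging the embeddings of a few enrollment utterances. Averaging length-normalized embeddings is one common approach and is an assumption here, as are the function names; the embedding model that turns audio into vectors is not shown.

```python
import math

def l2_normalize(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

def enroll(embeddings):
    """Build a voiceprint from a few enrollment embeddings.

    Each embedding is normalized, the results are averaged per dimension,
    and the mean is normalized again so the voiceprint has unit length.
    """
    if not embeddings:
        raise ValueError("at least one enrollment utterance is required")
    normalized = [l2_normalize(e) for e in embeddings]
    dim = len(normalized[0])
    mean = [sum(e[i] for e in normalized) / len(normalized) for i in range(dim)]
    return l2_normalize(mean)

# Three toy embeddings from the same (hypothetical) speaker.
samples = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1], [1.0, 0.0, 0.1]]
voiceprint = enroll(samples)
print(voiceprint)
```

For text-dependent verification the enrollment utterances would all be the same fixed phrase; for text-independent verification they can be arbitrary sentences. The averaging step is the same in both cases in this sketch.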

Below we look into the options available for adding Speaker Recognition to an application in 2022.

There is not much available out of the box. Does this mean there is a disruption opportunity for newcomers?

Azure Speaker Recognition API

It offers both verification and identification. Speaker Verification costs $5 per 1,000 transactions, while Speaker Identification costs $10 per 1,000. The pricing is transparent, but access is not open: you need to go through an approval process.
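To get a feel for these rates, here is a small back-of-the-envelope calculation. It uses only the two per-1,000-transaction prices quoted above; the function name and the monthly volume are hypothetical, and any free tier or volume discount is ignored.

```python
# Prices in USD per 1,000 transactions, as quoted in the text above.
PRICE_PER_1000 = {"verification": 5.00, "identification": 10.00}

def estimated_cost(transactions, mode):
    """Estimated cost in USD for a number of transactions in the given mode."""
    return transactions / 1000 * PRICE_PER_1000[mode]

# Hypothetical volume: 250,000 transactions per month.
print(estimated_cost(250_000, "verification"))    # 1250.0
print(estimated_cost(250_000, "identification"))  # 2500.0
```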

Open-Source

There is no ready-to-deploy open-source project that you can use, unlike in ASR, where there are open-source projects such as Kaldi or Vosk. However, implementations of widely known papers in the field, along with a free dataset, are available. They are not production quality but are useful as a starting point.