Speaker Recognition engines focus on identifying the speaker rather than the content of their speech. Thus, people assume Speaker Recognition engines are Text-Independent. However, this is not always the case. In particular, models trained with old-school methodologies are text-dependent because, technically, it’s easier to build models that recognize a limited set of phrases.
What’s Text-Dependent Speaker Recognition?
Text-Dependent Speaker Recognition expects users to say a given phrase or a set of phrases, similar to a password or a secret code. Users have to pay active attention to remember them. Thus, it’s also known as Active Speaker Recognition. Although Text-Dependency is inconvenient for end users, limiting the number of phrases reduces phonetic variability, making it easier for the Speaker Recognition software to recognize speakers.
What’s Text-Independent Speaker Recognition?
Text-Independent Speaker Recognition, as the name suggests, does not require users to repeat pre-determined passphrases, allowing for natural conversations without a constraint on content. Users do not need to pay active attention; they can speak freely. Therefore, it’s known as Passive Speaker Recognition. Text-Independent Speaker Recognition is more versatile, enabling a wider range of applications, but is more challenging to build.
Pros and Cons of Text-Dependent and Text-Independent Speaker Recognition
Traditionally, Text-Dependent and Text-Independent Speaker Recognition algorithms had natural advantages in certain aspects. For example, Text-Dependent Speaker Recognition algorithms were more resilient to noise, with a smaller footprint and faster processing speed. However, with the variety of modern training and compression techniques, such general statements are no longer valid. Nowadays, Text-Dependency is more of a design decision than a performance measure.
Note that both Text-Dependent and Text-Independent Speaker Recognition algorithms can be language-dependent.
Text-Dependent Speaker Recognition works only for use cases that do not need to recognize speakers continuously. For example, some call center applications verify users at the beginning of the call. Similarly, voice assistants can use Text-Dependent Speaker Recognition: Alexa voiceID matches voiceprints when users say “Alexa” and personalizes users’ experience. However, applications that recognize speakers throughout the interaction require Text-Independent Speaker Recognition, since recognition has to happen as part of the natural conversation. Thus, a call center application keeps verifying the caller’s identity to flag if another person starts speaking, and a virtual meeting application recognizes speakers as they talk with Text-Independent Speaker Recognition.
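As a rough illustration of why continuous recognition calls for a text-independent engine, the hypothetical sketch below labels each short audio frame with whichever enrolled speaker scores highest and flags when a different person takes over. The per-frame similarity scores, speaker names, and 0.5 threshold are made up for the example; they only mimic the shape of output a streaming, text-independent engine might expose.

```python
from typing import Dict, List, Optional


def track_active_speaker(frame_scores: List[Dict[str, float]],
                         threshold: float = 0.5) -> List[Optional[str]]:
    # For each audio frame, pick the enrolled speaker with the highest
    # similarity score, or None if nobody clears the threshold.
    labels: List[Optional[str]] = []
    for scores in frame_scores:
        speaker, score = max(scores.items(), key=lambda kv: kv[1])
        labels.append(speaker if score >= threshold else None)
    return labels


# Illustrative per-frame scores while a call is in progress (numbers are made up).
frames = [
    {"agent": 0.81, "caller": 0.12},
    {"agent": 0.77, "caller": 0.18},
    {"agent": 0.09, "caller": 0.72},  # another person starts speaking
]
print(track_active_speaker(frames))  # ['agent', 'agent', 'caller']
```

Because the speech in this scenario is unconstrained conversation, there is no passphrase to anchor on, which is exactly the situation a text-dependent engine cannot handle.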
Try Picovoice’s Text-Independent Speaker Recognition engine, Eagle!
What’s Next?
Picovoice’s Free Plan is ideal for individuals exploring, experimenting, and evaluating. No credit card required, no strings attached. Start now, scale later!
Start Building