Speaker Recognition engines focus on identifying the speaker rather than the content of their speech. Thus, people often assume Speaker Recognition engines are Text-Independent. However, that is not always the case. In particular, models trained with old-school methodologies tend to be text-dependent because, technically, it's easier to build models that recognize a limited set of phrases.
What’s Text-Dependent Speaker Recognition?
Text-Dependent Speaker Recognition expects users to say a given phrase or a set of them, similar to a password or a secret code. Users have to pay active attention to remember them; thus, it's also known as Active Speaker Recognition. Although Text-Dependency is inconvenient for end users, limiting the number of phrases reduces phonetic variability, making it easier for the Speaker Recognition software to recognize speakers.
What’s Text-Independent Speaker Recognition?
Text-Independent Speaker Recognition, as the name suggests, does not require users to repeat pre-determined passphrases, allowing for natural conversations without a constraint on content. Users do not need to pay active attention; they can speak freely. Therefore, it's known as Passive Speaker Recognition.
Text-Independent Speaker Recognition is more versatile, enabling a wider range of applications, but is more challenging to build.
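To make the distinction concrete, here is a minimal sketch of how text-independent verification typically works under the hood: a model maps each utterance to a fixed-size speaker embedding, enrollment averages a few embeddings into a voiceprint, and verification thresholds the cosine similarity. The embedding extractor, the 0.7 threshold, and all function names below are illustrative assumptions, not any particular engine's API; synthetic vectors stand in for real embeddings.

```python
# Conceptual sketch of text-independent speaker verification using
# fixed-size speaker embeddings. In a real system, a trained neural
# network computes the embeddings from audio, regardless of what
# was said; here, synthetic vectors stand in for that output.
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def enroll(utterance_embeddings: list) -> np.ndarray:
    # A voiceprint is commonly the length-normalized mean of the
    # embeddings from several enrollment utterances.
    profile = np.mean(utterance_embeddings, axis=0)
    return profile / np.linalg.norm(profile)


def verify(profile: np.ndarray, test_embedding: np.ndarray,
           threshold: float = 0.7) -> bool:
    # The 0.7 threshold is an illustrative assumption; real systems
    # tune it to balance false accepts against false rejects.
    return cosine_similarity(profile, test_embedding) >= threshold


# Synthetic embeddings standing in for a real extractor's output.
rng = np.random.default_rng(0)
speaker_a = rng.normal(size=256)
profile = enroll([speaker_a + 0.05 * rng.normal(size=256) for _ in range(3)])

same_speaker = speaker_a + 0.05 * rng.normal(size=256)
impostor = rng.normal(size=256)
print(verify(profile, same_speaker))  # True: accepted
print(verify(profile, impostor))      # False: rejected
```

Because the voiceprint is built from whatever the user happened to say, no passphrase is needed; that content-independence is exactly what makes the model harder to train than a text-dependent one.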
Pros and Cons of Text-Independent Speaker Recognition
Text-Dependent Speaker Recognition algorithms had natural advantages in certain aspects. For example, they were more resilient to noise and offered a smaller footprint and faster processing speed. However, with the variety of modern training and compression techniques, such general statements are no longer valid. Nowadays, Text-Dependency is more of a design decision than a performance measure.
Note, however, that Text-Independent Speaker Recognition algorithms can still be language-dependent.
Text-Dependent Speaker Recognition works only for use cases that do not require recognizing speakers continuously. For example, some call center applications verify users at the beginning of a call. Similarly, voice assistants can use Text-Dependent Speaker Recognition: Alexa Voice ID matches voiceprints when users say "Alexa" and personalizes their experience. However, applications that recognize speakers throughout the interaction require Text-Independent Speaker Recognition, since recognition has to happen as part of the natural conversation. Thus, a call center application can keep verifying the caller's identity to flag when another person starts speaking, and a virtual meeting application can recognize speakers as they talk.
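The continuous use case above can be sketched as a streaming loop: each short audio window yields an embedding that is scored against the enrolled voiceprint, and a sustained drop in score flags a possible speaker change. Everything below is an illustrative assumption (synthetic embeddings, a 0.7 threshold, a two-window patience), not a specific product's behavior.

```python
# Conceptual sketch of continuous (passive) verification: score each
# audio window against the enrolled voiceprint and flag a speaker
# change after the score stays low for a few consecutive windows.
# Synthetic vectors stand in for embeddings a real engine would
# compute from streaming audio.
import numpy as np


def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def detect_speaker_change(profile, window_embeddings,
                          threshold=0.7, patience=2):
    """Return the index of the first window of a run of `patience`
    consecutive windows scoring below `threshold`, else None."""
    below = 0
    for i, emb in enumerate(window_embeddings):
        if cosine_similarity(profile, emb) < threshold:
            below += 1
            if below >= patience:
                return i - patience + 1
        else:
            below = 0  # smoothing: a single noisy window is ignored
    return None


rng = np.random.default_rng(1)
enrolled = rng.normal(size=256)
profile = enrolled / np.linalg.norm(enrolled)
other = rng.normal(size=256)

# Five windows of the enrolled caller, then another person takes over.
stream = [enrolled + 0.05 * rng.normal(size=256) for _ in range(5)]
stream += [other + 0.05 * rng.normal(size=256) for _ in range(3)]
print(detect_speaker_change(profile, stream))  # 5
```

Requiring a short run of low scores before flagging, rather than reacting to a single window, is a common way to trade detection latency for fewer false alarms.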
Try Picovoice’s text-independent Speaker Recognition, Eagle!