Speaker Recognition engines focus on identifying who is speaking rather than what is being said. Thus, people often assume that all Speaker Recognition engines are Text-Independent. However, this is not always the case. Models built with older methodologies, in particular, tend to be Text-Dependent because, technically, it’s easier to build models that recognize a limited set of phrases.

What’s Text-Dependent Speaker Recognition?

Text-Dependent Speaker Recognition expects users to say a given phrase, or one of a fixed set of phrases, similar to a password or a secret code. Users have to pay active attention to remember the phrases, so it’s also known as Active Speaker Recognition. Although Text-Dependency is inconvenient for end users, limiting the number of phrases reduces phonetic variability, making it easier for the Speaker Recognition software to recognize speakers.

What’s Text-Independent Speaker Recognition?

Text-Independent Speaker Recognition, as the name suggests, does not require users to repeat pre-determined passphrases, allowing for natural conversations without a constraint on content. Users do not need to pay active attention and can speak freely, so it’s also known as Passive Speaker Recognition. Text-Independent Speaker Recognition is more versatile, enabling a wider range of applications, but is more challenging to build.

Pros and Cons of Text-Independent Speaker Recognition

Traditionally, Text-Dependent and Text-Independent Speaker Recognition algorithms each had natural advantages in certain aspects. For example, Text-Dependent Speaker Recognition algorithms were more resilient to noise, had a smaller footprint, and ran faster. However, with the variety of training and compression techniques now available, such general statements are no longer valid. Nowadays, Text-Dependency is more of a design decision than a performance measure.

Note that both Text-Dependent and Text-Independent Speaker Recognition algorithms can be language-dependent.

Text-Dependent Speaker Recognition works only for use cases that do not need to recognize speakers continuously. For example, some call center applications verify users only at the beginning of a call. Similarly, voice assistants can use Text-Dependent Speaker Recognition: Alexa Voice ID matches voiceprints when users say “Alexa” and personalizes their experience. However, applications that recognize speakers throughout the interaction require Text-Independent Speaker Recognition, since recognition has to happen as a part of the natural conversation. For example, a call center application that keeps verifying the caller’s identity to flag when another person starts speaking, or a virtual meeting application that recognizes speakers as they talk, needs Text-Independent Speaker Recognition.
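The difference between verifying once and verifying continuously can be illustrated with a minimal sketch. It assumes that some speaker model (not shown here) turns each chunk of audio into a fixed-length embedding vector and that the enrolled speaker has a stored embedding. The function names, the cosine-similarity scoring, the 0.7 threshold, and the `patience` parameter are all hypothetical choices made for illustration; this is not Eagle’s API or any specific engine’s method.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def verify_once(
    enrolled: Sequence[float],
    embedding: Sequence[float],
    threshold: float = 0.7,  # hypothetical value, tune per model
) -> bool:
    """One-shot check, e.g. on a passphrase at the start of a call."""
    return cosine_similarity(enrolled, embedding) >= threshold


def flag_speaker_changes(
    enrolled: Sequence[float],
    embeddings: Sequence[Sequence[float]],
    threshold: float = 0.7,  # hypothetical value, tune per model
    patience: int = 2,       # consecutive low scores required before flagging
) -> List[int]:
    """Continuous check: return indices of chunks where the score has stayed
    below the threshold for at least `patience` consecutive chunks."""
    flagged: List[int] = []
    misses = 0
    for index, embedding in enumerate(embeddings):
        if cosine_similarity(enrolled, embedding) >= threshold:
            misses = 0  # enrolled speaker heard again, reset the counter
        else:
            misses += 1
            if misses >= patience:
                flagged.append(index)
    return flagged


if __name__ == "__main__":
    # Toy 2-D embeddings standing in for real model output.
    enrolled_speaker = [1.0, 0.0]
    chunks = [[0.9, 0.1], [1.0, 0.0], [0.0, 1.0], [0.1, 0.9], [0.0, 1.0], [1.0, 0.0]]
    print(verify_once(enrolled_speaker, chunks[0]))
    print(flag_speaker_changes(enrolled_speaker, chunks))
```

Requiring several consecutive low scores before flagging is one simple way to avoid reacting to a single noisy chunk; a real application would choose the threshold and window based on the engine it uses.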

Try Picovoice’s text-independent Speaker Recognition, Eagle!


What’s Next?

Picovoice’s Free Plan is ideal for individuals exploring, experimenting, and evaluating. No credit card required, no strings attached. Start now, scale later!
