Speaker Recognition
Speaker Recognition is the technology used to identify and verify speakers based on their distinguishable voice characteristics, such as pitch, frequency, and duration of sounds. Speaker Recognition focuses on “Who is speaking?” rather than “What is said?”. Speaker Recognition enables several applications, such as:
- Voice verification in banking, e.g., Wells Fargo Voice Verification,
- User-level settings and access, e.g., Alexa Voice ID,
- Device assignment, e.g., Apple Personalized Hey Siri (PHS),
- Speaker search in forensics and security, e.g., NSA Voice in Real Time (Voice RT).
How does Speaker Recognition work?
Speaker Recognition has two phases: Enrollment and Matching. In the Enrollment phase, Speaker Recognition engines capture users’ voice samples and extract voice characteristics. Then, they create Speaker IDs, known as Voiceprints or Voice IDs, using the extracted voice characteristics.
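In modern systems, a voiceprint is typically a fixed-size embedding vector. The sketch below illustrates one common way to build it, assuming a hypothetical `embed_frames` function (not part of any specific SDK) that maps raw audio to frame-level speaker embeddings; embeddings from several enrollment utterances are pooled and normalized into a single voiceprint.

```python
import numpy as np

def create_voiceprint(enrollment_utterances, embed_frames):
    # `enrollment_utterances`: list of 1-D float arrays (raw audio, one per utterance)
    # `embed_frames`: hypothetical model call returning a (num_frames, embedding_dim) array
    utterance_embeddings = []
    for pcm in enrollment_utterances:
        frame_embeddings = embed_frames(pcm)                         # frame-level voice characteristics
        utterance_embeddings.append(frame_embeddings.mean(axis=0))   # one vector per utterance
    voiceprint = np.mean(utterance_embeddings, axis=0)               # pool across enrollment utterances
    return voiceprint / np.linalg.norm(voiceprint)                   # L2-normalize for later scoring
```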
In the Matching phase, Speaker Recognition engines compare new voice samples to the Speaker IDs and determine the likelihood of a match. Like the enrollment process, this step involves extracting the unique characteristics from the voice data. By the end of this step, the Speaker Recognition engine returns a score. Researchers use different techniques to calculate the score. While Gaussian Mixture Models (GMMs) and Support Vector Machines (SVMs) were popular techniques in the last decade, next-generation Speaker Recognition models leverage deep learning.
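As an illustration of the scoring step, the sketch below compares a new sample to an enrolled voiceprint with cosine similarity, one common choice in embedding-based systems; it reuses the hypothetical `embed_frames` function and the normalized voiceprint from the enrollment sketch above.

```python
import numpy as np

def match_score(voiceprint, test_pcm, embed_frames):
    # Embed the new sample the same way enrollment did, then pool to one vector.
    test_embedding = embed_frames(test_pcm).mean(axis=0)
    test_embedding = test_embedding / np.linalg.norm(test_embedding)
    # Cosine similarity in [-1, 1]; higher means the voices are more likely the same speaker.
    return float(np.dot(voiceprint, test_embedding))
```

A deployed system then compares this score against a threshold tuned on held-out data to balance false accepts against false rejects.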
There are three types of Matching; the sketch after this list illustrates the first two:
- One-to-One: Matching a voice sample with the original voiceprint (speaker ID) claimed by the speaker, used in verification.
- One-to-Many: Matching a voice sample against multiple voiceprints stored in the database, used in identification.
- Many-to-Many: Matching multiple voice samples against multiple voiceprints stored in the database, used in speaker clustering.
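Continuing the same illustrative setup and reusing `match_score` from the previous sketch, verification (one-to-one) checks a single claimed voiceprint against a threshold, while identification (one-to-many) scores the sample against every enrolled voiceprint and returns the best match; the threshold value below is an arbitrary placeholder.

```python
def verify(claimed_voiceprint, test_pcm, embed_frames, threshold=0.7):
    # One-to-one: accept or reject the identity the speaker claims.
    return match_score(claimed_voiceprint, test_pcm, embed_frames) >= threshold

def identify(voiceprint_db, test_pcm, embed_frames, threshold=0.7):
    # One-to-many: `voiceprint_db` maps speaker names to enrolled voiceprints.
    scores = {name: match_score(vp, test_pcm, embed_frames) for name, vp in voiceprint_db.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else None   # None -> no enrolled speaker matched
```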
Challenges of speaker recognition
Speaker Recognition is a complex technology due to the nature of the human voice. Physiological, behavioral, and environmental factors such as accent, age, gender, emotion, and acoustic environment can affect the characteristics of a person's voice. Similarly, phonetic variability across phrases and languages affects the performance of Speaker Recognition engines, making it challenging to build or choose Speaker Recognition software that works.
Acknowledging the challenges of building and finding a good Speaker Recognition engine, Picovoice decided to release its internal Speaker Recognition engine, Eagle. Eagle Speaker Recognition and Identification, powered by deep learning, is highly accurate, lightweight, and cross-platform. Don't just take our word for it - try it yourself or, even better, get started with the Free Plan!