Speaker Recognition is the technology used to identify and verify speakers based on distinguishing voice characteristics, such as pitch, frequency, and the duration of sounds. Speaker Recognition focuses on "Who is speaking?" rather than "What is being said?". It enables several applications, such as:
- Voice verification in banking, e.g., Wells Fargo Voice Verification,
- User-level settings and access, e.g., Alexa Voice ID,
- Device assignment, e.g., Apple Personalized Hey Siri (PHS),
- Speaker search in forensics and security, e.g., NSA Voice in Real Time (Voice RT).
How does Speaker Recognition work?
Speaker Recognition has two phases: Enrollment and Matching. In the Enrollment phase, Speaker Recognition engines capture users’ voice samples and extract voice characteristics. They then create Speaker IDs, also known as voiceprints or Voice IDs, from the extracted characteristics.
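To make the enrollment idea concrete, here is a minimal, purely illustrative Python sketch: `extract_embedding` is a toy stand-in for the feature extractor or deep speaker-embedding model a real engine would use, and averaging plus normalization is one common way to turn several utterances into a single voiceprint.

```python
import numpy as np

def extract_embedding(pcm: np.ndarray, dim: int = 64) -> np.ndarray:
    # Toy stand-in for a speaker-embedding model; a real engine would run a
    # trained deep network here. We take the first `dim` magnitude-spectrum
    # bins just so the example is self-contained and runnable.
    spectrum = np.abs(np.fft.rfft(pcm))[:dim]
    return spectrum / (np.linalg.norm(spectrum) + 1e-9)

def enroll(utterances: list[np.ndarray]) -> np.ndarray:
    # Average the per-utterance embeddings and L2-normalize the result to get
    # a single voiceprint (Speaker ID) for the enrolled user.
    embeddings = np.stack([extract_embedding(pcm) for pcm in utterances])
    voiceprint = embeddings.mean(axis=0)
    return voiceprint / np.linalg.norm(voiceprint)
```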
In the Matching phase, Speaker Recognition engines compare new voice samples to the stored Speaker IDs and determine the likelihood of a match. As in enrollment, this step involves extracting the unique characteristics from the voice data. At the end of this step, the Speaker Recognition engine returns a score. Researchers use different techniques to calculate the score. While Gaussian Mixture Models (GMMs) and Support Vector Machines (SVMs) were popular techniques in the last decade, next-generation Speaker Recognition models leverage deep learning.
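Continuing the toy sketch above (reusing `numpy` and `extract_embedding` from the previous snippet), the score in an embedding-based system is often the cosine similarity between the new sample's embedding and the stored voiceprint; the threshold below is illustrative, and GMM-, SVM-, and deep-learning-based engines each compute their scores differently.

```python
def verify(pcm: np.ndarray, voiceprint: np.ndarray, threshold: float = 0.7) -> bool:
    # Cosine similarity between the new sample's embedding and the enrolled
    # voiceprint; both vectors are L2-normalized, so the dot product is the
    # cosine score. Accept the claimed identity if the score clears the threshold.
    score = float(np.dot(extract_embedding(pcm), voiceprint))
    return score >= threshold
```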
There are three types of Matching:
- One-to-One: Matching a voice sample with the voiceprint (Speaker ID) claimed by the speaker; used in verification.
- One-to-Many: Matching a voice sample against multiple voiceprints stored in the database; used in identification.
- Many-to-Many: Matching multiple voice samples against multiple voiceprints stored in the database; used in speaker clustering.
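In terms of the toy snippets above, One-to-One matching is the `verify` call already shown, while One-to-Many identification scores a probe against every voiceprint in a database and picks the best match (Many-to-Many clustering is omitted for brevity); the 0.7 threshold is again only illustrative.

```python
def identify(pcm: np.ndarray, database: dict[str, np.ndarray], threshold: float = 0.7) -> str | None:
    # One-to-Many: score the probe embedding against every enrolled voiceprint
    # and return the best-matching speaker, or None if no score is high enough.
    probe = extract_embedding(pcm)
    scores = {speaker: float(np.dot(probe, vp)) for speaker, vp in database.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else None
```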
Challenges of Speaker Recognition
Speaker Recognition is a complex technology due to the nature of the human voice. Physiological, behavioral, and environmental factors such as accent, age, gender, emotion, and the acoustic environment can affect the characteristics of a person's voice. Similarly, phonetic variability across phrases and languages affects the performance of Speaker Recognition engines, making it challenging to build or choose Speaker Recognition software that works reliably.
Acknowledging the challenges of building and finding a good Speaker Recognition engine, Picovoice decided to release its internal Speaker Recognition engine, Eagle. Eagle Speaker Recognition and Identification, powered by deep learning, is highly accurate, lightweight, and cross-platform. Don't just take our word for it - evaluate it for free!