Speaker Recognition is the technology used to identify and verify speakers based on their distinguishing voice characteristics, such as pitch, frequency, and the duration of sounds.
Speaker recognition focuses on “Who is speaking?” rather than “What is said?”.
Speaker Recognition enables several applications, such as:
- Voice verification in banking, e.g., Wells Fargo Voice Verification,
- User-level settings and access, e.g., Alexa Voice ID,
- Device assignment, e.g., Apple Personalized Hey Siri (PHS),
- Speaker search in forensics and security, e.g., NSA Voice in Real Time (Voice RT).
How does Speaker Recognition work?
Speaker Recognition has two phases: Enrollment and Matching. In the Enrollment phase, Speaker Recognition engines capture users’ voice samples and extract voice characteristics. Then, they create Speaker IDs, also known as voiceprints or Voice IDs, from the extracted voice characteristics.
In the Matching phase, Speaker Recognition engines compare new voice samples to the enrolled Speaker IDs and determine the likelihood of a match. Like the enrollment process, this step involves extracting the unique characteristics from the voice data. By the end of this step, the Speaker Recognition engine returns a score. Researchers use different techniques to calculate the score. While Gaussian Mixture Models (GMMs) and Support Vector Machines (SVMs) were popular techniques in the last decade, next-generation Speaker Recognition models leverage deep learning.
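The two phases can be sketched in a few lines of Python. This is a minimal illustration, not how any particular engine works: it assumes a feature extractor (not shown) has already turned each voice sample into a fixed-length vector, and it uses cosine similarity as one possible scoring technique. The function names `enroll` and `match_score` and the three-dimensional vectors are made up for the example.

```python
import math
from typing import List, Sequence


def l2_normalize(vector: Sequence[float]) -> List[float]:
    """Scale a feature vector to unit length so only its direction matters."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]


def enroll(feature_vectors: Sequence[Sequence[float]]) -> List[float]:
    """Enrollment: average several normalized samples into one Speaker ID."""
    if not feature_vectors:
        raise ValueError("need at least one enrollment sample")
    normalized = [l2_normalize(v) for v in feature_vectors]
    dim = len(normalized[0])
    mean = [sum(v[i] for v in normalized) / len(normalized) for i in range(dim)]
    return l2_normalize(mean)


def match_score(speaker_id: Sequence[float], feature_vector: Sequence[float]) -> float:
    """Matching: cosine similarity between a unit-length Speaker ID and a new sample."""
    probe = l2_normalize(feature_vector)
    return sum(a * b for a, b in zip(speaker_id, probe))


# Hypothetical feature vectors; a real engine would extract these from audio.
alice_id = enroll([[0.9, 0.1, 0.2], [1.0, 0.0, 0.3]])
same_speaker_score = match_score(alice_id, [0.95, 0.05, 0.25])
other_speaker_score = match_score(alice_id, [0.1, 0.9, 0.4])
print(f"same speaker: {same_speaker_score:.3f}, other speaker: {other_speaker_score:.3f}")
```

A higher score means the new sample points in nearly the same direction as the enrolled Speaker ID. Where to draw the line between a match and a non-match is an application decision.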
There are three types of Matching:
- Matching a voice sample with the original voiceprint (Speaker ID) claimed by the speaker, used in verification.
- Matching a voice sample against multiple voiceprints stored in the database, used in identification.
- Matching multiple voice samples against multiple voiceprints stored in the database, used in speaker clustering.
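The three matching types differ mainly in how many samples and voiceprints take part in the comparison. The sketch below illustrates this under stated assumptions: voiceprints and samples are plain feature vectors, the score is cosine similarity, and the `0.8` threshold is an arbitrary value chosen for the example. The names `verify`, `identify`, and `group_by_speaker` are hypothetical and do not come from any specific library.

```python
import math
from typing import Dict, List, Optional, Sequence

UNKNOWN = "unknown"  # label for samples that match no stored voiceprint


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cannot compare a zero vector")
    return dot / (norm_a * norm_b)


def verify(sample: Sequence[float], claimed: Sequence[float],
           threshold: float = 0.8) -> bool:
    """Verification: one sample against the one voiceprint the speaker claims."""
    return cosine(sample, claimed) >= threshold


def identify(sample: Sequence[float], voiceprints: Dict[str, Sequence[float]],
             threshold: float = 0.8) -> Optional[str]:
    """Identification: one sample against every stored voiceprint."""
    if not voiceprints:
        return None
    scores = {name: cosine(sample, vp) for name, vp in voiceprints.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else None


def group_by_speaker(samples: Sequence[Sequence[float]],
                     voiceprints: Dict[str, Sequence[float]],
                     threshold: float = 0.8) -> Dict[str, List[int]]:
    """Clustering: many samples against many voiceprints, grouped by best match."""
    groups: Dict[str, List[int]] = {}
    for index, sample in enumerate(samples):
        name = identify(sample, voiceprints, threshold) or UNKNOWN
        groups.setdefault(name, []).append(index)
    return groups


# Hypothetical database and samples, for illustration only.
database = {"alice": [1.0, 0.0, 0.2], "bob": [0.0, 1.0, 0.3]}
samples = [[0.9, 0.1, 0.2], [0.1, 0.95, 0.3], [0.6, 0.6, -0.5]]

print(verify(samples[0], database["alice"]))
print(identify(samples[1], database))
print(group_by_speaker(samples, database))
```

Verification needs a single comparison, while identification and clustering scale with the number of stored voiceprints, which is one reason large deployments care about the cost of each score.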
Challenges of speaker recognition
Speaker Recognition is a complex technology due to the nature of the human voice. Physiological, behavioral, and environmental factors such as accent, age, gender, emotion, and acoustic environment can affect the characteristics of a person's voice. Similarly, phonetic variability across phrases and languages affects the performance of Speaker Recognition engines, making it challenging to build or choose Speaker Recognition software that works.
Acknowledging the challenges of building and finding a good Speaker Recognition engine, Picovoice decided to release its internal Speaker Recognition engine, Eagle. Eagle Speaker Recognition and Identification, powered by deep learning, is highly accurate, lightweight, and cross-platform. Don't just take our word for it: try it yourself or, even better, get started with the Free Plan!