One should never use Automatic Speech Recognition (ASR) to detect Wake Words or always-listening commands.
Imagine you’re in a crowded room, perhaps in a meeting without knowing why you’re there. You’d probably stop paying attention to what others say after a while. Yet the moment you hear your name, you immediately start paying attention again.
Wake Word-powered voice products work similarly: applications stay idle until they hear their names. Using Automatic Speech Recognition for Wake Word recognition is like listening to every conversation, whether it’s relevant or not. It requires significant brainpower, even for machines.
We’ve discussed why Wake Word recognition should not run in the cloud. Let’s unpack why one should never use Automatic Speech Recognition solutions, even those that run on the device, for Wake Word recognition.
Automatic Speech Recognition solutions are not a good fit because they
- are computationally heavy, requiring significant resources,
- have known accuracy challenges with proper nouns and homophones,
- may wait until the end of the recording to decode.
A Wake Word recognition engine should run continuously to trigger an action, and it should run across platforms, including low-power devices such as wearables and IoT hardware. The computational complexity of Automatic Speech Recognition demands significant CPU and memory to process data. Running large models with an unquenchable thirst for compute 24/7 limits the resources available to other applications and drains the battery of mobile devices. If an application requires technicians to recharge devices between visits, it creates a frustrating experience and causes productivity losses, i.e., costs to enterprises.
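To see why an always-on engine must be tiny, a back-of-the-envelope duty-cycle estimate helps. The frame size and per-frame processing times below are illustrative assumptions, not measured figures for any particular engine:

```python
# Back-of-the-envelope duty-cycle estimate for an always-listening engine.
# All numbers are illustrative assumptions, not measurements.
SAMPLE_RATE = 16_000   # Hz, typical for speech engines
FRAME_LENGTH = 512     # samples per audio frame (assumed)

frames_per_second = SAMPLE_RATE / FRAME_LENGTH  # 31.25 frames every second

def cpu_duty_cycle(ms_per_frame: float) -> float:
    """Fraction of one CPU core consumed by processing every frame, 24/7."""
    return frames_per_second * ms_per_frame / 1000.0

# A tiny wake-word model needing ~1 ms per frame occupies ~3% of a core:
wake_word_load = cpu_duty_cycle(1.0)    # 0.03125
# A full ASR model needing ~25 ms per frame occupies ~78% of a core:
asr_load = cpu_duty_cycle(25.0)         # 0.78125
```

Even under these rough assumptions, the gap explains why a full ASR pipeline running around the clock starves other applications and drains batteries, while a dedicated wake word engine barely registers.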
Picovoice’s Porcupine model is less than 1 MB. Picovoice’s on-device ASR models are 20 MB, and the recently launched Whisper by OpenAI ranges from 75 MB to 3 GB. As expected, memory requirements also grow with model size. Running such models 24/7 is not feasible, and turning ASR on and off for efficiency conflicts with the definition of a hands-free experience, which is the promise of Wake Words.
Known Accuracy Challenges
Correct recognition of proper nouns and homophones is a known challenge in Automatic Speech Recognition. “Whether”, “weather”, and “wether” are an example of homophones. Wake word choice affects accuracy regardless of the software: wake words should have distinct sounds. However, general Automatic Speech Recognition models struggle more than specialized engines to differentiate similar sounds. If an engine transcribes “Alexa” as “all extra” or “affects a”, it won’t work. When choosing software, it’s crucial to consider False Acceptance and False Rejection rates, and hence accuracy and user experience.
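The two rates are straightforward to compute from benchmark counts. A minimal sketch, with hypothetical numbers chosen purely for illustration:

```python
def false_acceptance_rate(false_accepts: int, negatives: int) -> float:
    """Share of non-wake-word audio that wrongly triggered the engine."""
    return false_accepts / negatives

def false_rejection_rate(false_rejects: int, positives: int) -> float:
    """Share of genuine wake word utterances the engine missed."""
    return false_rejects / positives

# Hypothetical benchmark counts, for illustration only:
far = false_acceptance_rate(3, 1000)   # 0.003 -> spurious triggers annoy users
frr = false_rejection_rate(25, 500)    # 0.05  -> missed triggers feel broken
```

A low False Acceptance rate keeps the device from waking up on unrelated speech, while a low False Rejection rate keeps users from repeating themselves; a good benchmark reports both.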
Wake Word recognition serves as a trigger for subsequent interactions between users and applications, so the software should trigger an action immediately. If an ASR waits for the entire utterance “Alexa, play the next song” before transcribing it, it introduces delays, which again conflicts with the promise of Wake Words. Offline decoding may be admissible in some ASR applications. For Wake Words, however, decoding should start as soon as the audio stream is available.
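The difference between streaming and batch decoding can be sketched with a toy example. The boolean frames below are a hypothetical stand-in for real audio and a real per-frame detector:

```python
# Toy illustration of streaming vs. batch decoding. Each boolean marks a
# hypothetical audio frame in which the wake word is (or is not) detected.
from typing import Iterable, Optional

def streaming_frames_consumed(frames: Iterable[bool]) -> Optional[int]:
    """Process frame by frame; stop the moment the wake word is detected."""
    for consumed, has_wake_word in enumerate(frames, start=1):
        if has_wake_word:
            return consumed  # fire immediately, mid-stream
    return None

def batch_frames_consumed(frames: Iterable[bool]) -> int:
    """Batch-style decoding must buffer the entire recording first."""
    return len(list(frames))

# The wake word ends at frame 3 of a 100-frame utterance: a streaming engine
# consumes 3 frames before triggering, a batch decoder consumes all 100.
frames = [False, False, True] + [False] * 97
```

The streaming loop is what lets a wake word engine respond the instant “Alexa” ends, instead of after “…play the next song” has been fully recorded.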
Automatic Speech Recognition is a phenomenal technology, yet unfit for some problems. You should choose the right engine for the job. For example, one can easily find footprints of Kaldi in Alexa, but not in its Wake Word detection module. If you’re still not convinced, test it! Use Picovoice’s open-source wake word benchmark, train a custom Wake Word in seconds, and compare Porcupine Wake Word with any Automatic Speech Recognition software.