One should never use Automatic Speech Recognition (ASR) to detect Wake Words or always-listening commands.
Imagine you’re in a crowded room, maybe in a meeting where you don’t know why you’re there. You’d probably stop paying attention to what others say after some time. Yet, when you hear your name, you immediately start paying attention. Wake Word-powered voice products work similarly. Applications stay idle until they hear their names. Using Automatic Speech Recognition for Wake Word recognition is like listening to every conversation, whether it’s relevant or not. It requires significant brain power, even for machines.
We discussed why Wake Word recognition should not be in the cloud. Let’s unwrap why one should never use Automatic Speech Recognition solutions, even those that run on-device, for Wake Word recognition.
Automatic Speech Recognition solutions are not a good fit because they
- are computationally demanding, requiring significant resources,
- have known accuracy challenges when it comes to proper nouns and homophones,
- may delay decoding until the end of the recording.
Computational Complexity
A Wake Word recognition engine should run continuously to trigger an action, and it should run across platforms, including low-power devices such as wearables or IoT hardware. The computational complexity of Automatic Speech Recognition requires significant CPU and memory to process data. Running large models with an unquenchable thirst for compute 24/7 limits the resources available to other applications and drains the battery of mobile devices. If an application requires technicians to recharge devices between visits, it creates a frustrating experience and causes productivity losses, i.e., costs to enterprises.
Picovoice’s Porcupine model size is less than 1 MB. The model size of Picovoice’s on-device ASRs is 20 MB, and that of the recently launched Whisper by OpenAI varies from 75 MB to 3 GB. As expected, memory requirements also increase with model size. Running these models 24/7 is not feasible. Obviously, turning ASR on and off for efficiency conflicts with the definition of a hands-free experience, which is the promise of Wake Words.
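To make the contrast concrete, below is a minimal sketch of an always-listening loop, assuming the pvporcupine and pvrecorder Python packages and an AccessKey from Picovoice Console. A sub-1 MB model scores one short audio frame per iteration, so the work per iteration stays small and constant:

```python
import pvporcupine
from pvrecorder import PvRecorder

# Always-listening loop: Porcupine scores one short frame per iteration,
# so the per-frame compute stays small and constant.
porcupine = pvporcupine.create(
    access_key="${ACCESS_KEY}",   # AccessKey from Picovoice Console
    keywords=["porcupine"],       # one of the built-in keywords
)
recorder = PvRecorder(frame_length=porcupine.frame_length, device_index=-1)

try:
    recorder.start()
    while True:
        pcm = recorder.read()             # 512 samples at 16 kHz (~32 ms)
        if porcupine.process(pcm) >= 0:   # index >= 0 means the keyword fired
            print("Wake word detected")   # hand off to the next stage here
finally:
    recorder.delete()
    porcupine.delete()
```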
Known Accuracy Challenges
Correct recognition of proper nouns and homophones is a known challenge in Automatic Speech Recognition. “Whether,” “weather,” and “wether” are examples of homophones. Wake word choices affect accuracy regardless of the software: they should have distinct sounds. However, general Automatic Speech Recognition models struggle to differentiate similar sounds more than purpose-built engines do. If an engine transcribes “Alexa” as “all extra” or “affects a,” then the wake word won’t work. While choosing software, it’s crucial to consider False Acceptance and False Rejection rates, as they determine the accuracy and user experience.
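Both rates are straightforward to compute from benchmark tallies. The sketch below uses made-up numbers purely for illustration; real figures should come from a corpus such as the open-source wake word benchmark mentioned later. False Acceptance is typically normalized per hour of ambient audio, while False Rejection is a ratio over attempted activations:

```python
# Hypothetical tallies; real numbers should come from a benchmark corpus.
ambient_audio_hours = 24.0   # background speech containing no wake word
false_alarms = 3             # times the engine fired anyway
attempts = 1000              # utterances that actually contain the wake word
misses = 42                  # attempts the engine failed to detect

false_acceptances_per_hour = false_alarms / ambient_audio_hours
false_rejection_rate = misses / attempts

print(f"False Acceptances per hour: {false_acceptances_per_hour:.3f}")  # 0.125
print(f"False Rejection rate: {false_rejection_rate:.1%}")              # 4.2%
```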
Decoding Time
Wake Word recognition serves as a trigger for later interactions between users and applications. Thus, the software should trigger an action immediately. If an ASR waits for the entire utterance “Alexa, play the next song” before transcribing it, it will cause delays, which again conflicts with the promise of Wake Words. Offline decoding may be admissible in some ASR applications. However, for Wake Word recognition, decoding should start as soon as the audio stream is available.
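A back-of-the-envelope calculation shows how much a streaming decision saves; every number below is an illustrative assumption rather than a measurement:

```python
# Illustrative timings for the utterance "Alexa, play the next song".
utterance_s = 2.5        # duration of the full command
wake_word_s = 0.6        # time until "Alexa" has been fully spoken
frame_s = 512 / 16000    # one 512-sample frame at 16 kHz = 32 ms

# Offline ASR: decoding cannot begin before the recording ends.
batch_trigger_s = utterance_s                  # 2.50 s at the earliest

# Streaming Wake Word engine: a decision is made on every frame,
# so the trigger can fire one frame after the wake word ends.
streaming_trigger_s = wake_word_s + frame_s    # ~0.63 s

print(f"batch: {batch_trigger_s:.2f} s, streaming: {streaming_trigger_s:.2f} s")
```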
Automatic Speech Recognition is a phenomenal technology, yet it is unfit for some problems. You should choose the right engine. For example, one can easily find footprints of Kaldi in Alexa, but not in its Wake Word detection module. If you’re still not convinced, test it! Use Picovoice’s open-source wake word benchmark, train a custom Wake Word in seconds, and compare Porcupine Wake Word with any Automatic Speech Recognition software.
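As a starting point, here is a minimal sketch of what swapping in a custom keyword looks like with the pvporcupine Python package; the .ppn path is a placeholder for a file exported from Picovoice Console:

```python
import pvporcupine

# Same frame loop as the earlier sketch; only the keyword source changes.
porcupine = pvporcupine.create(
    access_key="${ACCESS_KEY}",                    # your Picovoice AccessKey
    keyword_paths=["path/to/your_wake_word.ppn"],  # placeholder for the exported file
)
```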