First, let’s start with what Trigger Word Detection means.
Trigger Word Detection is an application of keyword spotting.
Trigger Words, as the name suggests, trigger an action when detected. The action could be waking up a dormant device, such as a smart speaker, or activating a mobile application, e.g. medical dictation software. That’s why Trigger Words are also known as wake words or always-listening commands. Hey Google, Alexa and Hey Siri are the best-known examples.
Big Tech owns and uses Trigger Words for its own products. However, it does not sell this technology the way it sells Spoken Language Understanding and Speech-to-Text solutions. That’s why developers sometimes ask us why Big Tech does not sell Trigger Word Detection (or Wake Word Detection).
The most common assumption behind this question is that, when the grammar is small, recognizing voice commands should be easier than open-domain dictation. We cannot answer the question on behalf of Big Tech, but we can unpack the rationale behind it.
Recognizing voice commands “efficiently and accurately” is still difficult, despite the small grammar size. Let’s start with the things you should know about Trigger Word Detection:
Trigger Word Detection should be always-listening, so it should run on the device, not in the cloud.
Trigger Word Detection should be efficient, hence specific to the use case. [That’s why generic models such as automatic speech recognition are not a fit.]
Trigger Word Detection should be accurate, as measured by the False Rejection Rate (FRR) and the False Acceptance Rate (FAR).
To achieve these three conditions, the models used for Trigger Word Detection and the software used to detect the Trigger Word should be small and power-efficient, and should work across non-homogeneous platforms.
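To make the accuracy condition concrete, here is a minimal Python sketch of how FRR and FAR can be computed from test results. It assumes the common convention of reporting false acceptances per hour of audio that does not contain the trigger word. The class, the function names and the counts are hypothetical illustrations, not part of any Picovoice API.

```python
from dataclasses import dataclass


@dataclass
class DetectionResults:
    """Hypothetical tallies from one evaluation run of a trigger word detector."""
    spoken_trigger_words: int   # times the trigger word was actually spoken
    missed_detections: int      # spoken trigger words the detector did not catch
    false_alarms: int           # detections fired when no trigger word was spoken
    background_hours: float     # hours of audio containing no trigger word


def false_rejection_rate(results: DetectionResults) -> float:
    """Fraction of spoken trigger words that were missed (FRR)."""
    return results.missed_detections / results.spoken_trigger_words


def false_alarms_per_hour(results: DetectionResults) -> float:
    """False acceptances per hour of trigger-free audio (one way to report FAR)."""
    return results.false_alarms / results.background_hours


# Made-up numbers, for illustration only.
run = DetectionResults(
    spoken_trigger_words=200,
    missed_detections=10,
    false_alarms=3,
    background_hours=24.0,
)
print(f"FRR: {false_rejection_rate(run):.1%}")
print(f"FAR: {false_alarms_per_hour(run):.3f} false alarms per hour")
```

The two metrics trade off against each other: making a detector more sensitive usually lowers FRR and raises FAR, so they are best reported together.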
Trigger Words should be small.
The standard approach to training speech models is to provide them with a significant amount of data to achieve high accuracy, which results in large models. Storing large models on small devices such as MCUs, or within web browsers, is either impossible or inefficient. Thus, developing accurate and efficient Trigger Word Detection software requires a different approach.
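The size constraint is simple arithmetic, sketched below in Python. The parameter count and the flash budget are hypothetical values chosen for illustration; they do not describe any particular model or device.

```python
def model_size_bytes(num_parameters: int, bits_per_parameter: int) -> int:
    """Storage needed for the model weights alone, rounded up to whole bytes."""
    return (num_parameters * bits_per_parameter + 7) // 8


def fits_in_budget(num_parameters: int, bits_per_parameter: int, budget_bytes: int) -> bool:
    """True if the weights fit in the storage budget (ignores code and runtime buffers)."""
    return model_size_bytes(num_parameters, bits_per_parameter) <= budget_bytes


# Hypothetical numbers: a 250,000-parameter model and a 512 KiB storage budget.
NUM_PARAMETERS = 250_000
FLASH_BUDGET_BYTES = 512 * 1024

for bits in (32, 8):
    size = model_size_bytes(NUM_PARAMETERS, bits)
    verdict = "fits" if fits_in_budget(NUM_PARAMETERS, bits, FLASH_BUDGET_BYTES) else "does not fit"
    print(f"{bits}-bit weights: {size} bytes, {verdict}")
```

Under these assumptions, the same model fits only when its weights are stored at lower precision, which is one reason small-footprint models are designed differently from large server-side ones.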
Trigger Words should be power efficient.
Speech models, like any software, use memory (i.e. storage) and CPU (i.e. power) to perform tasks. Trigger Word Detection runs on the device and shares its computational power with other software. Every platform has finite resources. Using large and poorly optimized solutions for Trigger Word Detection consumes more of those resources, which limits the device’s ability to perform other tasks and drains the battery.
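One rough way to quantify this cost is the real-time factor: the time spent processing audio divided by the duration of that audio. The Python sketch below measures wall-clock time only, which is a proxy for CPU load and not a power measurement. The `process_frame` callable, the frame length and the sample rate are hypothetical placeholders for whatever detector is being evaluated.

```python
import time
from typing import Callable, Sequence


def real_time_factor(
    process_frame: Callable[[Sequence[int]], object],
    frames: Sequence[Sequence[int]],
    frame_seconds: float,
) -> float:
    """Processing time divided by audio duration; lower means more headroom."""
    start = time.perf_counter()
    for frame in frames:
        process_frame(frame)
    elapsed = time.perf_counter() - start
    return elapsed / (len(frames) * frame_seconds)


# Placeholder "detector" that does trivial work, standing in for a real engine.
def dummy_process_frame(frame: Sequence[int]) -> bool:
    return sum(frame) > 0


FRAME_LENGTH = 512     # samples per frame (assumed)
SAMPLE_RATE = 16_000   # samples per second (assumed)
silence = [[0] * FRAME_LENGTH for _ in range(100)]

rtf = real_time_factor(dummy_process_frame, silence, FRAME_LENGTH / SAMPLE_RATE)
print(f"Real-time factor: {rtf:.6f}")
```

A real-time factor well below 1 leaves most of the processor available to other software. A value near or above 1 means the detector alone would saturate a core and could not keep up with live audio.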
Trigger Words should run across non-homogeneous platforms.
Processing data in the cloud, on an iPhone, or on an Echo device requires optimization for only one platform. Processing data on-device across various platforms requires optimization for each platform separately. That’s why Picovoice optimizes each engine, including Porcupine Wake Word Detection, separately for MCUs, web browsers, mobile applications and desktop/server applications. Doing so takes expertise on each platform to optimize efficiently, integrate smoothly and document clearly, which adds up to a significant investment.
If you’re ready to train your Trigger Word, read our tips on choosing trigger words and start building with the Free Tier!