First, let’s start with what Trigger Word Detection means. Trigger Word Detection is an application of keyword spotting. Trigger Words, as the name suggests, trigger an action when detected. The action could be waking up a dormant device, such as a smart speaker, or activating a mobile application, e.g. medical dictation software. That’s why Trigger Words are also known as wake words or always-listening commands. Hey Google, Alexa and Hey Siri are the best-known Trigger Words.
Big Tech owns and uses Trigger Words for its products. However, it does not sell this technology the way it sells Spoken Language Understanding and Speech-to-Text solutions. That’s why developers sometimes ask us why Big Tech does not sell Trigger Word Detection (or Wake Word Detection).
The most common assumption behind this question is that when the grammar is small, recognizing voice commands should be easier than open-domain dictation. We cannot answer the question on behalf of Big Tech, but we can unpack the rationale behind it.
Recognizing voice commands “efficiently and accurately” is still difficult, despite the small grammar size. Let’s start with the things you should know about Trigger Word Detection:
Trigger Word Detection should be always-listening, so it should run on the device, not in the cloud.
Trigger Word Detection should be efficient, hence specific to the use case. (That’s why generic models such as automatic speech recognition are not a fit.)
Trigger Word Detection should be accurate, as measured by the false rejection rate (FRR) and the false acceptance rate (FAR).
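To make the two accuracy metrics concrete, here is a minimal sketch of how they could be computed from a test run. It assumes FRR is reported as the fraction of spoken trigger words that were missed and FAR as the number of false acceptances per hour of background audio, which is a common convention but not the only one. The function names and the sample counts are hypothetical and chosen only for illustration.

```python
def false_rejection_rate(missed_detections: int, total_utterances: int) -> float:
    """Fraction of spoken trigger words that the detector failed to catch."""
    if total_utterances <= 0:
        raise ValueError("total_utterances must be positive")
    return missed_detections / total_utterances


def false_acceptances_per_hour(false_acceptances: int, background_seconds: float) -> float:
    """Detections per hour on audio that does not contain the trigger word."""
    if background_seconds <= 0:
        raise ValueError("background_seconds must be positive")
    return false_acceptances / (background_seconds / 3600.0)


if __name__ == "__main__":
    # Hypothetical test run: 100 trigger word utterances with 5 misses,
    # plus 10 hours of background audio with 2 spurious detections.
    frr = false_rejection_rate(missed_detections=5, total_utterances=100)
    far = false_acceptances_per_hour(false_acceptances=2, background_seconds=10 * 3600)
    print(f"FRR: {frr:.1%}")
    print(f"FAR: {far:.2f} per hour")
```

The two numbers trade off against each other: a more sensitive detector usually lowers FRR and raises FAR, so they are best reported together.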
To meet these three requirements, the models and software used for Trigger Word Detection should be small, power-efficient and able to run across non-homogeneous platforms.
Trigger Words should be small.
The standard approach to training speech models is to provide them with a significant amount of data to achieve high accuracy, which typically results in large models. Storing large models on small devices such as microcontrollers (MCUs) or within web browsers is either impossible or inefficient. Thus, developing accurate and efficient Trigger Word Detection software requires a different approach.
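A rough back-of-the-envelope calculation shows why model size matters on small devices. The sketch below estimates the storage needed for a model's weights from its parameter count and numeric precision, then compares it with a storage budget. The parameter count and the 1 MiB budget are hypothetical figures for illustration, not measurements of any particular model or device, and the estimate ignores runtime buffers and the application code itself.

```python
def model_size_bytes(num_parameters: int, bits_per_parameter: int) -> int:
    """Storage needed for the weights alone, rounded up to whole bytes."""
    total_bits = num_parameters * bits_per_parameter
    return (total_bits + 7) // 8


def fits_in_budget(model_bytes: int, budget_bytes: int) -> bool:
    """True if the weights fit within the given storage budget."""
    return model_bytes <= budget_bytes


if __name__ == "__main__":
    # Hypothetical figures: a 1M-parameter model and a 1 MiB storage budget.
    budget = 1024 * 1024
    for bits in (32, 8):
        size = model_size_bytes(1_000_000, bits)
        print(f"{bits}-bit weights: {size} bytes, fits: {fits_in_budget(size, budget)}")
```

Under these assumed numbers, the same model fits only once its weights are stored at lower precision, which hints at why on-device models are designed and compressed differently from cloud models.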
Trigger Words should be power efficient.
Speech models, like any software, use memory and storage to hold data and CPU cycles, which consume power, to perform tasks. Trigger Word Detection runs on the device and shares its computational resources with other software. Every platform has finite resources. Large, poorly optimized Trigger Word Detection solutions consume more of those resources, leaving the device less capacity for other tasks and draining its battery.
Trigger Words should run across non-homogeneous platforms.
Processing data in the cloud, on an iPhone or on an Echo device requires optimization for only one platform. Processing data on-device across various platforms requires optimization for each platform separately. That’s why Picovoice optimizes each engine, including Porcupine Wake Word Detection, separately for MCUs, web browsers, mobile applications and desktop/server applications. Doing so takes expertise on each platform to optimize efficiently, integrate smoothly and document clearly, which adds up to a significant investment.
If you’re ready to train your Trigger Word, read our tips on choosing trigger words and start building with the Free Plan!