Why Google Doesn't Offer Trigger Word Detection

First, let's define Trigger Word Detection. It is an application of keyword spotting. Trigger Words, as the name suggests, trigger an action when detected. The action could be waking up a dormant device, such as a smart speaker, or activating a mobile application, e.g. medical dictation software. That's why Trigger Words are also known as wake words or always-listening commands. Hey Google, Alexa and Hey Siri are the best-known Trigger Words.

Big Tech owns and uses Trigger Words for its own products. However, it does not sell this technology the way it sells Spoken Language Understanding and Speech-to-Text solutions. That's why developers sometimes ask us why Big Tech does not sell Trigger Word Detection (or Wake Word Detection).

The most common assumption behind this question is that when the grammar is small, recognizing voice commands should be easier than open-domain dictation. We cannot answer the question on behalf of Big Tech, but we can unpack the rationale behind it.

Recognizing voice commands "efficiently and accurately" is still difficult, despite the small grammar size. Let's start with the things you should know about Trigger Word Detection:

Trigger Word Detection should be always-listening, so it should run on the device, not in the cloud.
Trigger Word Detection should be efficient, hence tailored to the use case. That's why generic models, such as automatic speech recognition models, are not a good fit.
Trigger Word Detection should be accurate, as measured by the false rejection rate (FRR) and the false acceptance rate (FAR).

To meet these three conditions, the models and the software used for Trigger Word Detection should be small, power-efficient and able to run across non-homogeneous platforms.
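The two accuracy metrics named above can be made concrete with a little arithmetic. The sketch below is only an illustration, not Picovoice's benchmark code: the `EvalResult` type and its field names are hypothetical, and it assumes the common convention of reporting false acceptances as false alarms per hour of audio that contains no Trigger Word.

```python
from dataclasses import dataclass


@dataclass
class EvalResult:
    """Hypothetical container for the counts collected during a test run."""
    trigger_utterances: int   # utterances that really contain the Trigger Word
    missed_detections: int    # of those, how many the engine failed to detect
    false_alarms: int         # detections fired on audio without the Trigger Word
    background_hours: float   # hours of Trigger-Word-free audio that was played


def false_rejection_rate(result: EvalResult) -> float:
    """Fraction of real Trigger Word utterances that were missed."""
    return result.missed_detections / result.trigger_utterances


def false_alarms_per_hour(result: EvalResult) -> float:
    """False acceptances, normalized by hours of background audio (assumed convention)."""
    return result.false_alarms / result.background_hours


# Made-up numbers, purely for illustration.
example = EvalResult(
    trigger_utterances=1000,
    missed_detections=50,
    false_alarms=2,
    background_hours=20.0,
)
print(f"FRR: {false_rejection_rate(example):.1%}")
print(f"False alarms per hour: {false_alarms_per_hour(example):.2f}")
```

The two numbers pull against each other: making the detector more sensitive lowers the FRR but usually raises the number of false alarms, so they only make sense when reported together.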

Trigger Words should be small.

The standard approach to training speech models is to provide them with a significant amount of data to achieve high accuracy, resulting in large models. Storing large models on small devices such as MCUs or within web browsers is either impossible or inefficient. Thus, developing accurate and efficient Trigger Word Detection software requires a different approach.
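A back-of-the-envelope calculation shows why size matters. The sketch below is a rough illustration under stated assumptions: the parameter counts, the bytes per parameter and the 1 MB storage budget are all hypothetical numbers chosen for the example, not measurements of any real model or device.

```python
def model_size_bytes(num_parameters: int, bytes_per_parameter: int) -> int:
    """Approximate storage needed for the model weights alone."""
    return num_parameters * bytes_per_parameter


def fits(model_bytes: int, budget_bytes: int) -> bool:
    """True when the weights fit into the storage budget."""
    return model_bytes <= budget_bytes


MB = 1_000_000

# Hypothetical budget: an MCU with 1 MB of storage available for the model.
mcu_budget = 1 * MB

# Hypothetical large speech model: 100 million parameters stored as 4-byte floats.
large_model = model_size_bytes(100_000_000, 4)

# Hypothetical compact keyword model: 200 thousand parameters stored as 1-byte integers.
small_model = model_size_bytes(200_000, 1)

print(f"Large model: {large_model / MB:.0f} MB, fits: {fits(large_model, mcu_budget)}")
print(f"Small model: {small_model / MB:.1f} MB, fits: {fits(small_model, mcu_budget)}")
```

The estimate covers weights only. Runtime memory for audio buffers and intermediate results comes on top, which tightens the budget further.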

Trigger Words should be power efficient.

Speech models, like any software, use memory and CPU to perform tasks. Trigger Word Detection runs on the device and shares its computational resources with other software, and every platform has finite resources. Large, poorly optimized Trigger Word Detection solutions consume more of them, which leaves the device less capacity for other tasks and drains its battery.

Trigger Words should run across non-homogeneous platforms.

Processing data in the cloud, or only on an iPhone or an Echo device, requires optimization for just one platform. Processing data on-device across various platforms requires optimization for each platform separately. That's why Picovoice optimizes each engine, including Porcupine Wake Word Detection, separately for MCUs, web browsers, mobile applications and desktop/server applications. Optimizing efficiently, integrating smoothly and documenting clearly require expertise on each platform, which adds up to a significant investment.

If you're ready, read our tips on choosing trigger words and create a Picovoice Console account for free to train your own Trigger Word!
