When people talk about
Speech Recognition on
Raspberry Pi, they mean
STT). Why? Because
is the most known (used) form of
Speech Recognition. But it's almost always NOT the right tool! Specifically, if you are
thinking about building something on a
Raspberry Pi. Let's look at use cases and what is the best tool to do it.
Wake Word Detection
Wake Word Engine detects occurrences of a given phrase in audio.
Hey Google are examples of
wake words. Picovoice Porcupine Wake Word Engine empowers you to train custom
wake words (e.g.
Raspberry Pi. It is so efficient that it can even run on
Raspberry Pi Zero in real time.
Wake Word Detection is also known as
Trigger Word Detection,
Wake up Word Detection,
You can use a
Wake Word Detection if you don't care that you are using much more power or,
even worse, sending voice data out of your device 24/7. The accuracy is significantly less than the proper solution!
Wake Word Engine is optimized to do one thing well, hence is smaller and more accurate.
Speech-to-Text when all you need is a
Wake Word Engine!
Voice Commands & Domain-Specific NLU
Intent inference (
Intent Detection) from
Voice Commands is at the core of any modern
Voice User Interface (
Typically, the spoken commands are within a well-defined domain of interest, i.e.
Context. For example:
- Turn on the lights in the living room.
- Play Tom Sawyer album by Rush.
- Call John Smith on his cell phone.
The dominant approach for inferring users' intent from spoken commands is to use a
Speech-to-Text engine. Then parse
the text output using Grammar-Based
Natural Language Understanding (
NLU) or an ML-based
NLU engine. The first shortcoming
of this approach is low accuracy.
Speech-to-Text introduces transcription errors that adversely affect subsequent intent inference steps.
Picovoice Rhino Speech-to-Intent Engine fuses
NLU to minimize these adverse
errors and optimizes transcription and inference steps. What is the result? It is even more accurate than cloud-based
In terms of runtime, Rhino only takes a few MBs of FLASH. It runs in real-time, consuming only a single-digit percentage
of one CPU core. It even runs on
Raspberry Pi Zero.
You want to transcribe speech to text in an open domain. i.e. users can say whatever they want. Then you need a
Picovoice Leopard and Cheetah Speech-to-Text engines run on
Raspberry Pi 3 and
Raspberry Pi 4 and match
the accuracy of cloud-based APIs (
IBM Watson Speech-to-Text,
They only take about 20 MB of FLASH and can run on a single CPU.
Voice Activity Detection (VAD)
Do you want to know if someone is talking? You need a
Voice Activity Detection (
VAD) Engine such as Picovoice Cobra VAD.