When people talk about Speech Recognition on Raspberry Pi, they usually mean Speech-to-Text (STT). Why? Because Speech-to-Text is the best-known and most widely used form of Speech Recognition. But it is almost always NOT the right tool, especially if you are building something on a Raspberry Pi. Let's look at the common use cases and the best tool for each.
Wake Word Detection
A Wake Word Engine detects occurrences of a given phrase in audio. Alexa and Hey Google are examples of wake words. Picovoice Porcupine Wake Word Engine empowers you to train custom wake words (e.g. Jarvis) for Raspberry Pi. It is so efficient that it can even run on Raspberry Pi Zero in real time.
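To give a sense of scale, here is a minimal wake word loop using Picovoice's Python SDK (pvporcupine for detection, pvrecorder for microphone capture). The AccessKey placeholder, the built-in "jarvis" keyword, and the default audio device are illustrative assumptions; a custom wake word would be passed via keyword_paths instead.

```python
import pvporcupine
from pvrecorder import PvRecorder

# '${ACCESS_KEY}' is a placeholder for a Picovoice AccessKey.
porcupine = pvporcupine.create(access_key='${ACCESS_KEY}', keywords=['jarvis'])

# Capture 16 kHz, 16-bit audio in frames sized to the engine's frame length.
recorder = PvRecorder(frame_length=porcupine.frame_length, device_index=-1)
recorder.start()

try:
    while True:
        pcm = recorder.read()            # one frame of audio samples
        if porcupine.process(pcm) >= 0:  # keyword index on detection, -1 otherwise
            print("Wake word detected")
finally:
    recorder.delete()
    porcupine.delete()
```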
Wake Word Detection is also known as Trigger Word Detection, Keyword Spotting, Hotword Detection, Wake-up Word Detection, and Voice Activation.
You can use a Speech-to-Text engine for Wake Word Detection if you don't mind using much more power or, even worse, sending voice data out of your device 24/7. Even then, the accuracy is significantly lower than a purpose-built solution. Why? A Wake Word Engine is optimized to do one thing well, and hence is smaller and more accurate. Don't use Speech-to-Text when all you need is a Wake Word Engine!
Voice Commands & Domain-Specific NLU
Intent inference (Intent Detection) from Voice Commands is at the core of any modern Voice User Interface (VUI). Typically, the spoken commands are within a well-defined domain of interest, i.e. a Context. For example:
- Turn on the lights in the living room.
- Play Tom Sawyer album by Rush.
- Call John Smith on his cell phone.
The dominant approach for inferring users' intent from spoken commands is to use a Speech-to-Text engine and then parse the text output with a Grammar-Based Natural Language Understanding (NLU) or an ML-based NLU engine. The main shortcoming of this approach is low accuracy: Speech-to-Text introduces transcription errors that adversely affect the subsequent intent inference steps.
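To make that failure mode concrete, here is a toy sketch of grammar-based parsing over STT output, using plain regular expressions rather than any particular NLU library; the pattern, intent name, and misrecognized transcript are made up for illustration. A single transcription error is enough to make the command unparseable.

```python
import re

# Toy grammar for one command family: "turn on/off the light(s) in the <room>".
PATTERN = re.compile(r"turn (on|off) the lights? in the (\w+(?: \w+)?)", re.IGNORECASE)

def parse(transcript):
    match = PATTERN.search(transcript)
    if match is None:
        return None  # grammar did not match -> intent is lost
    return {"intent": "changeLightState", "state": match.group(1), "location": match.group(2)}

print(parse("turn on the lights in the living room"))  # parsed as expected
print(parse("turn on the lice in the living room"))    # one STT error -> None
```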
Picovoice Rhino Speech-to-Intent Engine fuses Speech-to-Text and NLU, jointly optimizing the transcription and inference steps to minimize these errors. The result? It is even more accurate than cloud-based APIs.
Rhino takes only a few MB of flash and runs in real time, consuming a single-digit percentage of a single CPU core. It even runs on Raspberry Pi Zero.
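For reference, a minimal Speech-to-Intent loop with the pvrhino Python package might look like the sketch below; the AccessKey placeholder and the context file (a .rhn model exported from Picovoice Console for your domain) are assumptions.

```python
import pvrhino
from pvrecorder import PvRecorder

# Placeholders: a Picovoice AccessKey and a context file trained for your domain.
rhino = pvrhino.create(access_key='${ACCESS_KEY}',
                       context_path='smart_lighting_raspberry-pi.rhn')

recorder = PvRecorder(frame_length=rhino.frame_length, device_index=-1)
recorder.start()

try:
    while True:
        if rhino.process(recorder.read()):  # True once an utterance is finalized
            inference = rhino.get_inference()
            if inference.is_understood:
                # e.g. intent='changeLightState', slots={'state': 'on', 'location': 'living room'}
                print(inference.intent, inference.slots)
finally:
    recorder.delete()
    rhino.delete()
```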
Open-Domain Transcription
Do you want to transcribe speech to text in an open domain, i.e. where users can say whatever they want? Then you need a Speech-to-Text engine.
Picovoice Leopard and Cheetah Speech-to-Text engines run on Raspberry Pi 3 and Raspberry Pi 4 and match the accuracy of cloud-based APIs (Google Speech-to-Text, Amazon Transcribe, IBM Watson Speech-to-Text, Azure Speech-to-Text). They take only about 20 MB of flash and can run on a single CPU.
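As a sketch, file-based transcription with the pvleopard Python package takes only a few lines; the AccessKey placeholder and the recording path are assumptions, and recent SDK versions return word-level metadata alongside the transcript.

```python
import pvleopard

leopard = pvleopard.create(access_key='${ACCESS_KEY}')  # placeholder AccessKey

try:
    transcript, words = leopard.process_file('recording.wav')
    print(transcript)
    for word in words:
        print(word.word, word.start_sec, word.end_sec, word.confidence)
finally:
    leopard.delete()
```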
Voice Activity Detection (VAD)
Do you want to know if someone is talking? You need a Voice Activity Detection (VAD) Engine such as Picovoice Cobra VAD.
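A minimal sketch with the pvcobra Python package: Cobra returns a per-frame probability that voice is present, which you can threshold to suit your application (the AccessKey placeholder and the 0.5 threshold are illustrative).

```python
import pvcobra
from pvrecorder import PvRecorder

cobra = pvcobra.create(access_key='${ACCESS_KEY}')  # placeholder AccessKey

recorder = PvRecorder(frame_length=cobra.frame_length, device_index=-1)
recorder.start()

try:
    while True:
        voice_probability = cobra.process(recorder.read())  # value between 0.0 and 1.0
        if voice_probability > 0.5:                          # illustrative threshold
            print("Voice detected (p=%.2f)" % voice_probability)
finally:
    recorder.delete()
    cobra.delete()
```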