AI Assistants and AI Agents are changing our lives. A Large Language Model (LLM) is the standard component of modern GenAI assistants. A voice-based LLM assistant can provide a more natural, efficient, and convenient user experience. Voice Agents also unlock use cases in call centers and customer support. However, Voice Assistants require additional voice AI models to work harmoniously with the LLM brain to create a satisfactory end-to-end user experience.
The recent GPT-4o audio demo proved that an LLM-powered voice assistant, done correctly, can create an awe-inspiring interactive experience. Unfortunately, GPT-4o's audio feature is not publicly available. Even if it were, none of the inference would run on device when used from something like a mobile phone: OpenAI streams the prompts to a large, server-powered, cloud-based model for inference, which in turn streams its response back. This raises a few concerns, such as privacy, latency, and connectivity, that are inherent to a cloud-based approach.
That's where Picovoice's tech stack comes in. Picovoice is on-device, meaning voice processing and LLM inference are performed locally on your mobile device without the user's data traveling to a third-party API. Products built with Picovoice are private by design, compliant (GDPR, HIPAA, ...), and real-time, free of unreliable network (API) latency. You can build a voice assistant of the same caliber right now in just a few hundred lines of Swift using Picovoice's on-device voice AI and local LLM stacks.
First, Let's Play!
Before building an LLM-powered voice assistant, let's check out the experience. Picovoice iOS SDKs will run on any iPhone with iOS version 16.0+.
Picovoice is cross-platform and can run on CPU and GPU across Linux, macOS, Windows, and Raspberry Pi. It also supports mobile devices (Android and iOS) and all modern web browsers (Chrome, Safari, Edge, and Firefox).
Phi-2 on an iPhone 13
The video below shows Picovoice's LLM-powered voice assistant running Microsoft's Phi-2 model on an iPhone 13. As you can see, picoLLM performs all of the inference on device, so use cases where privacy or mobile data usage is a concern are an excellent fit for Picovoice.
Anatomy of an LLM Voice Assistant
There are four things an LLM-powered AI assistant needs to do:
- Detect when the user utters the wake word.
- Recognize the request (question or command) the user is uttering.
- Generate a response to the request using an LLM.
- Synthesize speech from the LLM's text response.
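Put together, the steps above form a simple loop. Here is a sketch with hypothetical helper names (each helper corresponds to one of the engines covered in the numbered sections that follow):

```swift
// The assistant's main loop, with hypothetical helpers for each stage.
func runAssistant() throws {
    while true {
        waitForWakeWord()                          // 1. wake word detection
        let request = transcribeUntilSilence()     // 2. streaming speech-to-text
        try generateResponse(to: request) { token in
            speak(token)                           // 3 + 4. LLM inference streamed,
        }                                          //        token by token, into
    }                                              //        streaming text-to-speech
}
```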
1. Wake Word
A Wake Word Engine is voice AI software that understands a single phrase. Every time you use Alexa, Siri, or Hey Google, you activate their wake word engine. Wake Word Detection is also known as Keyword Spotting, Hotword Detection, and Voice Activation.
2. Streaming Speech-to-Text
Once we know the user is talking to us, we must understand what they say. This is done using Speech-to-Text. For latency-sensitive applications, we use the real-time variant, known as Streaming Speech-to-Text, which transcribes speech as the user talks. In contrast, a conventional speech-to-text engine waits for the user to finish before processing (e.g., OpenAI's Whisper). Speech-to-Text (STT) is also known as Automatic Speech Recognition (ASR), and Streaming Speech-to-Text is also known as Real-Time Speech-to-Text.
3. LLM Inference
Once the user's request is available in text form, we run the LLM to generate the completion. Once inference starts, the LLM generates its response piece by piece (token by token). We use this property to run the LLM in parallel with speech synthesis and reduce latency (more on this later). LLM inference is very compute-intensive, and running it on device requires techniques that reduce memory and compute requirements. A standard method is quantization (compression).
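Back-of-the-envelope numbers show why quantization matters on a phone (the figures below are illustrative and not picoLLM specifics):

```swift
// Approximate weight footprint for a ~2.7B-parameter model (e.g., Phi-2).
let parameters = 2.7e9
let fp16GB = parameters * 2.0 / 1e9   // 16 bits (2 bytes) per weight ≈ 5.4 GB
let int4GB = parameters * 0.5 / 1e9   // 4 bits (0.5 bytes) per weight ≈ 1.35 GB
```

At 16 bits per weight, the model alone would strain a phone's memory; at 4 bits, it fits comfortably.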
Are you a deep learning researcher? Learn how picoLLM Compression deeply quantizes LLMs while minimizing loss by optimally allocating bits across and within weights.
Are you a software engineer? Learn how the picoLLM Inference Engine runs x-bit quantized Transformers on CPU and GPU across Linux, macOS, Windows, iOS, Android, Raspberry Pi, and Web.
4. Streaming Text-to-Speech
A Text-to-Speech (TTS) engine accepts text and synthesizes the corresponding speech signal. Since LLMs generate responses token by token as a stream, we prefer a TTS engine that can accept a stream of text inputs to lower the latency. We call this Streaming Text-to-Speech.
Soup to Nuts
This section explains how to code a local LLM-powered voice assistant in Swift. You can check the entire script in the LLM-powered voice assistant recipe in the Picovoice Cookbook GitHub repository.
1. Voice Activation
Add Picovoice Porcupine Wake Word Engine to your Podfile:
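The Podfile entry might look like the following (the pod name below is an assumption; check Picovoice's docs for the current name):

```ruby
# Podfile
pod 'Porcupine-iOS'
```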
Import the module, initialize an instance of the wake word engine, and start processing audio in real time:
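A minimal sketch of this step, assuming the `PorcupineManager` API from Picovoice's iOS SDK (names and signatures are approximate; consult the SDK docs):

```swift
import Porcupine

// Initialize the wake word engine with a custom keyword model.
let porcupineManager = try PorcupineManager(
    accessKey: "$ACCESS_KEY",
    keywordPath: "$PORCUPINE_MODEL_FILE",
    onDetection: { _ in
        // Wake word detected: hand the microphone over to speech-to-text.
    })

// Start listening to the microphone in real time.
try porcupineManager.start()
```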
Replace $ACCESS_KEY with yours, obtained from Picovoice Console, and $PORCUPINE_MODEL_FILE with the name/path of the file containing the parameters of the keyword model you trained on Picovoice Console. You can also use one of the built-in keywords (BuiltInKeyword).
A remarkable feature of Porcupine is that it lets you train your model by just providing the text!
2. Speech Recognition
Add Picovoice Cheetah Streaming Speech-to-Text to your Podfile:
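The Podfile entry might look like this (pod name assumed; check Picovoice's docs for the current name):

```ruby
# Podfile
pod 'Cheetah-iOS'
```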
Import the module, initialize an instance of the streaming speech-to-text engine, and start transcribing audio in real time:
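A sketch of this step, assuming the `Cheetah` API from Picovoice's iOS SDK (names and signatures are approximate; consult the SDK docs):

```swift
import Cheetah

// Initialize the streaming speech-to-text engine.
let cheetah = try Cheetah(
    accessKey: "$ACCESS_KEY",
    modelPath: "$CHEETAH_MODEL_FILE",
    endpointDuration: $ENDPOINT_DURATION_SEC)

var transcript = ""

// Feed frames of 16-bit PCM audio as they arrive from the microphone.
func onAudioFrame(_ pcm: [Int16]) throws {
    let (partial, isEndpoint) = try cheetah.process(pcm)
    transcript += partial
    if isEndpoint {
        // The user stopped talking: collect any remaining text.
        transcript += try cheetah.flush()
        // `transcript` now holds the full request for the LLM.
    }
}
```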
Replace $ACCESS_KEY with yours, obtained from Picovoice Console, and $CHEETAH_MODEL_FILE with the name/path of the file containing the parameters of the language model you trained on Picovoice Console. You can also use the default model found in the Cheetah GitHub repository.
Also replace $ENDPOINT_DURATION_SEC with the duration of silence at the end of the user's utterance that confirms they are done talking. The longer it is, the more time the user has to stutter or think in the middle of their request, but it also increases the perceived delay.
3. Response Generation
Add Picovoice picoLLM Inference Engine to your Podfile:
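The Podfile entry might look like this (pod name assumed; check Picovoice's docs for the current name):

```ruby
# Podfile
pod 'picoLLM-iOS'
```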
Import the module, initialize an instance of the LLM inference engine, create a dialog helper object, and start responding to the user's prompts:
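A sketch of this step, assuming the `PicoLLM` API from Picovoice's iOS SDK (the dialog-helper and generation method names below are approximate; consult the SDK docs):

```swift
import PicoLLM

// Load the quantized LLM.
let picollm = try PicoLLM(
    accessKey: "$ACCESS_KEY",
    modelPath: "$LLM_MODEL_FILE")

// A dialog helper tracks the conversation history and formats it
// into the model's prompt template.
let dialog = try picollm.getDialog()

func respond(to request: String) throws {
    try dialog.addHumanRequest(content: request)

    let result = try picollm.generate(
        prompt: try dialog.prompt(),
        streamCallback: { token in
            // Forward each token to the streaming TTS engine as it is produced.
        })

    try dialog.addLLMResponse(content: result.completion)
}
```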
Replace $ACCESS_KEY with yours, obtained from Picovoice Console, and $LLM_MODEL_FILE with the name/path of the picoLLM model downloaded from Picovoice Console.
Note that the LLM's .generate function delivers the response in pieces (i.e., token by token) via its streamCallback input argument. We pass every token to the Streaming Text-to-Speech engine as it becomes available, and when .generate returns, we notify the engine that there is no more text and flush any remaining synthesized speech.
4. Speech Synthesis
Add Picovoice Orca Streaming Text-to-Speech to your Podfile:
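The Podfile entry might look like this (pod name assumed; check Picovoice's docs for the current name):

```ruby
# Podfile
pod 'Orca-iOS'
```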
Import the module, initialize an instance of Orca, and start synthesizing audio in real time:
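A sketch of this step, assuming Orca's streaming API from Picovoice's iOS SDK (names and signatures are approximate; consult the SDK docs):

```swift
import Orca

// Initialize the streaming text-to-speech engine.
let orca = try Orca(
    accessKey: "$ACCESS_KEY",
    modelPath: "$ORCA_MODEL_FILE")

// Open a stream so text can be synthesized piece by piece.
let orcaStream = try orca.streamOpen()

// Called for every token emitted by the LLM.
func onLLMToken(_ token: String) throws {
    if let pcm = try orcaStream.synthesize(text: token) {
        // Play `pcm` through the speaker as soon as it is ready.
    }
}

// Called once the LLM has finished generating.
func onLLMComplete() throws {
    if let pcm = try orcaStream.flush() {
        // Play the last chunk of synthesized speech.
    }
    orcaStream.close()
}
```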
Replace $ACCESS_KEY with yours, obtained from Picovoice Console, and $ORCA_MODEL_FILE with the name/path of the Orca model downloaded from the Orca GitHub repository.
What's Next?
The voice assistant we've built above works, but it's basic. The Picovoice platform lets you create more complex, multi-dimensional AI software products by mixing and matching our tools. For instance, we could add:
Personalization: We could let the AI assistant know not only what is being said, but also who is saying it. This would allow us to create a profile for each speaker, with a database of past interactions that informs how we respond to future ones. We can achieve this with the Picovoice Eagle Speaker Recognition Engine.
Multi-Turn Conversations: While saying the wake word is a good activation mechanism for long-running, always-on systems, it becomes cumbersome when we have to go back and forth with the assistant during a prolonged multi-turn conversation. We could switch to a different form of activation after the initial interaction to smooth conversations out. Using the Picovoice Cobra Voice Activity Detection Engine, we could simply detect when the speaker is speaking and when they are waiting for a response.