AI Assistants and AI Agents are changing our lives. A Large Language Model (LLM) is the standard component of modern GenAI assistants. A voice-based LLM assistant can provide a more natural, efficient, and convenient user experience. Voice Agents also unlock use cases in call centers and customer support. However, Voice Assistants require additional voice AI models to work harmoniously with the core LLM "brain" to create a satisfactory end-to-end user experience.
The recent GPT-4o audio demo proved that an LLM-powered voice assistant done correctly could create an awe-inspiring experience. Unfortunately, GPT-4o's audio feature does not run locally. Prompts and audio data are sent to a server where cloud-based inference is performed, and a response is sent back.
What if you need to build an LLM-powered web application and privacy regulations block you from making API requests with voice data? Picovoice runs locally, meaning voice processing and LLM inference are performed on the device without the user's data traveling to a third-party API. Products built with Picovoice are private by design, compliant (GDPR, HIPAA, ...), and real-time, without unreliable network (API) latency.
First, Let's Play!
Before building an LLM-powered voice assistant, let's check out the experience. Picovoice Web SDKs can be run in:
- Chrome / Edge
- Firefox
- Safari
Picovoice also supports Linux, macOS, Windows, Raspberry Pi, and mobile devices (Android and iOS).
Phi-2 on Browsers
The video below shows Picovoice's LLM-powered voice assistant running Microsoft's Phi-2 model on Safari, Firefox, and Chrome-based browsers.
Running an LLM locally on browsers enables use cases that can't afford to rely on connectivity, such as healthcare, finance, and legal, where privacy and compliance are paramount.
Anatomy of an LLM Voice Assistant
There are four things an LLM-powered AI assistant needs to do:
- Pay attention to when the user utters the wake word.
- Recognize the request (question or command) the user is uttering.
- Generate a response to the request using an LLM.
- Synthesize speech from the LLM's text response.
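The four steps above can be read as a small state machine that loops back to waiting for the wake word. The sketch below is only an illustration of that control flow: the state and event names are hypothetical and are not part of any Picovoice SDK. Response generation and speech synthesis share one state because they run in parallel.

```javascript
// Hypothetical states for the assistant loop (names are illustrative only).
const State = Object.freeze({
  WAITING_FOR_WAKE_WORD: "WAITING_FOR_WAKE_WORD", // step 1
  LISTENING: "LISTENING", // step 2
  RESPONDING: "RESPONDING", // steps 3 and 4, running in parallel
});

// For each state, the events it reacts to and the state they lead to.
const TRANSITIONS = {
  [State.WAITING_FOR_WAKE_WORD]: { wakeWordDetected: State.LISTENING },
  [State.LISTENING]: { endpointDetected: State.RESPONDING },
  [State.RESPONDING]: { playbackFinished: State.WAITING_FOR_WAKE_WORD },
};

// Return the next state, ignoring events that do not apply to the current state.
function nextState(current, event) {
  const next = (TRANSITIONS[current] || {})[event];
  return next === undefined ? current : next;
}
```

Ignoring out-of-place events keeps the loop robust: for example, a wake word detection that arrives while the assistant is already responding leaves the state unchanged.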
1. Wake Word
A Wake Word Engine is voice AI software that recognizes a single phrase. Every time you use Alexa, Siri, or Hey Google, you activate their wake word engine.
Wake Word Detection is also known as Keyword Spotting, Hotword Detection, and Voice Activation.
2. Streaming Speech-to-Text
Once we know the user is talking to us, we must understand what they say. This is done using Speech-to-Text. For latency-sensitive applications, we use the real-time variant of speech-to-text, also known as Streaming Speech-to-Text. The difference is that a Streaming Speech-to-Text engine transcribes speech as the user talks. In contrast, a normal speech-to-text engine waits for the user to finish before processing (e.g., OpenAI's Whisper).
Speech-to-Text (STT) is also known as Automatic Speech Recognition (ASR). Streaming Speech-to-Text is also known as Real-Time Speech-to-Text.
3. LLM Inference
Once the user's request is available in text format, we need to run the LLM to generate the completion. Once the LLM inference starts, it generates the response piece-by-piece (token-by-token). We use this property of LLMs to run them in parallel with speech synthesis to reduce latency (more on this later). LLM inference is very compute intensive, and running it on the device requires techniques to reduce memory and compute requirements. A standard method is quantization (compression).
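To make the idea of quantization concrete, here is a minimal sketch of plain uniform quantization, which stores each weight as a small integer code plus a shared scale and offset. This is a deliberately simple stand-in and not the picoLLM Compression algorithm; the function names `quantize` and `dequantize` are hypothetical.

```javascript
// Uniformly quantize an array of weights to `bits`-bit codes (1 to 8 bits).
// Illustrative only: real LLM quantizers are considerably more sophisticated.
function quantize(weights, bits) {
  if (!Number.isInteger(bits) || bits < 1 || bits > 8) {
    throw new RangeError("bits must be an integer between 1 and 8");
  }
  const levels = 2 ** bits - 1; // highest code value
  let min = Infinity;
  let max = -Infinity;
  for (const w of weights) {
    if (w < min) min = w;
    if (w > max) max = w;
  }
  // Guard against a zero range (all weights equal, or no weights at all).
  const scale = max > min ? (max - min) / levels : 1;
  const codes = Uint8Array.from(weights, (w) => Math.round((w - min) / scale));
  return { codes, scale, min };
}

// Reconstruct approximate weights from the codes.
function dequantize({ codes, scale, min }) {
  return Float32Array.from(codes, (c) => c * scale + min);
}
```

With this scheme each weight is reconstructed to within about half a quantization step (`scale / 2`), so fewer bits mean less memory but a larger reconstruction error.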
Are you a deep learning researcher? Learn how picoLLM Compression deeply quantizes LLMs while minimizing loss by optimally allocating bits across and within weights.
Are you a software engineer? Learn how picoLLM Inference Engine runs x-bit quantized Transformers on CPU and GPU across Linux, macOS, Windows, iOS, Android, Raspberry Pi, and Web.
4. Streaming Text-to-Speech
A Text-to-Speech (TTS) engine accepts text and synthesizes the corresponding speech signal. Since LLMs can generate responses token-by-token as a stream, we prefer a TTS engine that can accept a stream of text inputs to lower the latency. We call this Streaming Text-to-Speech.
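One reason a streaming engine needs some internal buffering is that LLM tokens are often word fragments, which cannot be pronounced until the rest of the word arrives. The sketch below illustrates that idea by holding back text until a word boundary is seen. It is an assumption-laden illustration, not a description of how any particular TTS engine works internally; `createTextChunker` and `onChunk` are hypothetical names.

```javascript
// Accumulate streamed tokens and hand over only complete words.
function createTextChunker(onChunk) {
  let buffer = "";
  return {
    // Add one token; emit everything up to and including the last space.
    add(token) {
      buffer += token;
      const cut = buffer.lastIndexOf(" ");
      if (cut >= 0) {
        onChunk(buffer.slice(0, cut + 1));
        buffer = buffer.slice(cut + 1);
      }
    },
    // Emit whatever is left once the text stream has ended.
    flush() {
      if (buffer.length > 0) {
        onChunk(buffer);
        buffer = "";
      }
    },
  };
}
```

For example, feeding the tokens "Hel", "lo", " wor", "ld", "!" produces the chunk "Hello " as soon as the space arrives, and "world!" only after `flush()` is called.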
Soup to Nuts
This section explains how to code a local LLM-powered voice assistant in JavaScript. You can find the entire script in the LLM-powered voice assistant recipe in the Picovoice Cookbook GitHub repository.
1. Voice Activation
Install Picovoice Porcupine Wake Word Engine:
or using npm:
Import the package, initialize an instance of the wake word engine, and start processing audio in real time:
Replace ${ACCESS_KEY} with yours obtained from Picovoice Console.
Replace ${MODEL_RELATIVE_PATH} with the model path relative to the public directory or ${MODEL_BASE64_STRING} with the base64 string of the model file.
Replace ${KEYWORD_RELATIVE_PATH} with the keyword file path relative to the public directory or ${KEYWORD_BASE64_STRING} with the base64 string of the keyword file. Replace ${KEYWORD_LABEL} with a label to identify the keyword.
To learn more about Porcupine, check out the Porcupine docs.
A remarkable feature of Porcupine is that it lets you train your own wake word model just by providing the text of the phrase!
2. Speech Recognition
Install Picovoice Cheetah Streaming Speech-to-Text Engine:
or using npm:
Import the package, initialize an instance of the streaming speech-to-text engine, and start transcribing audio in real time:
Replace ${ACCESS_KEY} with yours obtained from Picovoice Console and ${ENDPOINT_DURATION_SEC} with the duration of silence at the end of the user's utterance to make sure they are done talking. The longer it is, the more time the user has to stutter or think in the middle of their request, but it also increases the perceived delay.
Replace ${MODEL_RELATIVE_PATH} with the model path relative to the public directory or ${MODEL_BASE64_STRING} with the base64 string of the model file.
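The trade-off behind the endpoint duration can be illustrated with a simplified model that counts consecutive silent audio frames after speech has been heard. This is a sketch under stated assumptions, not Cheetah's internal logic: the per-frame `isSpeech` flag, the frame duration, and the name `createEndpointDetector` are all hypothetical.

```javascript
// Report an endpoint once `endpointDurationSec` of silence follows speech.
function createEndpointDetector(endpointDurationSec, frameDurationSec) {
  const framesNeeded = Math.ceil(endpointDurationSec / frameDurationSec);
  let silentFrames = 0;
  let heardSpeech = false;

  // Call once per audio frame; returns true when the utterance has ended.
  return function onFrame(isSpeech) {
    if (isSpeech) {
      heardSpeech = true;
      silentFrames = 0; // a pause shorter than the threshold is forgiven
      return false;
    }
    if (!heardSpeech) return false; // silence before any speech is ignored
    silentFrames += 1;
    if (silentFrames >= framesNeeded) {
      heardSpeech = false;
      silentFrames = 0;
      return true;
    }
    return false;
  };
}
```

A larger endpoint duration raises `framesNeeded`, so mid-sentence pauses are less likely to cut the user off, but the assistant also waits longer before it starts responding.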
To learn more about Cheetah, check out the Cheetah docs.
3. Response Generation
Install picoLLM Inference Engine:
or using npm:
Import the package, initialize an instance of the LLM inference engine, create a dialog helper object, and start responding to the user's prompts:
Replace ${ACCESS_KEY} with yours obtained from Picovoice Console.
Replace ${MODEL_FILE_CONTENT} with the picoLLM model, given as either a path to the resource, a file object, or a blob.
To learn more about picoLLM, check out the picoLLM docs.
Note that the LLM's .generate function provides the response in pieces (i.e., token by token) using its streamCallback input argument. We pass every token to the Streaming Text-to-Speech engine as it becomes available, and when the .generate function returns, we notify the Streaming Text-to-Speech engine that there is no more text and flush any remaining synthesized speech.
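The hand-off described above can be sketched as follows. Only `.generate` and `streamCallback` come from the text; everything else is an assumption made for illustration: `llm`, `ttsStream` (with `synthesize` and `flush` methods that return audio or `null`), `playAudio`, and the options-object shape of the `generate` call are hypothetical stand-ins, so check the picoLLM and Orca docs for the real signatures.

```javascript
// Stream LLM tokens into a text-to-speech stream, then flush the remainder.
async function respond(llm, ttsStream, playAudio, prompt) {
  // Chain the synthesis calls so audio is produced in token order,
  // whether `synthesize` is synchronous or returns a promise.
  let pending = Promise.resolve();
  const speak = (pcm) => {
    if (pcm !== null) playAudio(pcm);
  };

  const completion = await llm.generate(prompt, {
    streamCallback: (token) => {
      pending = pending.then(() => ttsStream.synthesize(token)).then(speak);
    },
  });

  // Generation is done: wait for queued tokens, then flush leftover audio.
  await pending;
  speak(await ttsStream.flush());
  return completion;
}
```

Because synthesis starts with the first token rather than after the full response, the user hears speech while the LLM is still generating.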
4. Speech Synthesis
Install Picovoice Orca Streaming Text-to-Speech Engine:
or using npm:
Import the package, initialize an instance of Orca, and start synthesizing audio in real time:
Replace ${ACCESS_KEY} with yours obtained from Picovoice Console.
Replace ${MODEL_RELATIVE_PATH} with the model path relative to the public directory or ${MODEL_BASE64_STRING} with the base64 string of the model file.
To learn more about Orca, check out the Orca docs.
What's Next?
The voice assistant we've created above is functional but basic. The Picovoice platform allows you to create more complex, multi-dimensional AI software products by mixing and matching our tools. For instance, we could add:
Personalization: We could let the AI assistant know not only what is being said, but also who is saying it. This would allow us to keep a profile for each speaker, with a history of past interactions that informs how the assistant responds in the future. We can achieve this with the Picovoice Eagle Speaker Recognition Engine.
Multi-Turn Conversations: While saying the wake word is a good voice activation mechanism for long-running, always-on systems, it becomes cumbersome when we have to go back and forth with the assistant during a prolonged multi-turn conversation. We could switch to a different form of activation after the initial interaction to make conversations smoother. Using the Picovoice Cobra Voice Activity Detection Engine, we could simply detect when the speaker is speaking and when they are waiting for a response.
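One way to picture this wake-word-free follow-up is a short listening window that opens after the assistant finishes speaking. The sketch below assumes a voice activity detector that yields a voice probability between 0 and 1 for each audio frame; the threshold, the window length, and the name `createFollowUpGate` are hypothetical choices for illustration, not part of the Cobra API.

```javascript
// After the assistant responds, allow a follow-up without the wake word
// if voice is detected within `windowFrames` audio frames.
function createFollowUpGate(threshold, windowFrames) {
  let framesLeft = 0;
  return {
    // Call when the assistant finishes speaking.
    open() {
      framesLeft = windowFrames;
    },
    // Call once per frame; returns "listen", "wait", or "closed".
    onFrame(voiceProbability) {
      if (framesLeft <= 0) return "closed"; // the wake word is required again
      if (voiceProbability >= threshold) {
        framesLeft = 0;
        return "listen"; // hand the audio to speech-to-text
      }
      framesLeft -= 1;
      return framesLeft > 0 ? "wait" : "closed";
    },
  };
}
```

When the gate reports "closed", the assistant would fall back to waiting for the wake word, so an idle always-on device does not keep transcribing background speech.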