AI Assistants and AI Agents are changing the modern world in ways that once seemed like science fiction. Powered by Large Language Model (LLM) technology, the new generation of generative AI assistants has awed the public and put AI in the tool belt of even the techno-skeptic. By integrating voice capabilities, LLM-based assistants can offer a more intuitive, efficient, and user-friendly experience. With increasingly natural voice-to-voice interactions, Voice Agents are opening new possibilities in call centers and customer support. However, these Voice Assistants require voice AI to work in harmony with the LLM's intelligence to provide a satisfactory customer experience.

The recent GPT-4o mobile demo proved that an LLM-powered voice assistant done correctly could create a truly futuristic experience. Unfortunately, we must remember that OpenAI's ChatGPT app is not running inference on the mobile device. The mobile device streams the prompts to a large, server-powered, cloud-based model for inference, which in turn streams its response back. Using a phone for the demo is a handy illusion of optimization though, isn't it?

What if we could run a similar experience locally, without the privacy, latency, and connectivity concerns of a cloud-based approach? Picovoice's AI runs 100% on-device, which means that voice processing and LLM inference are performed locally without the user's request or response moving through a third-party API. Products built with Picovoice are private by design, compliant (GDPR, HIPAA, ...), and real-time, free of unreliable network (API) latency.

First, Let's See It In Action

Before building an LLM-powered voice assistant, let's check out the experience. Picovoice SDKs can run on both Android and iOS, allowing for cross-platform mobile experiences.

Picovoice also supports Linux, macOS, Windows, Raspberry Pi, and all modern web browsers (Chrome, Safari, Edge, and Firefox).

Running Phi-2 on Pixel 6a Android Phone

The video below shows Picovoice's LLM-powered voice assistant running Microsoft's Phi-2 model with voice-to-voice communication on a modest Pixel 6a Android phone. All AI processing is done on the device, but there is still enough CPU available for background phone activity.

Anatomy of an LLM Voice Assistant

An AI voice assistant powered by a local LLM must accomplish four key tasks:

  1. Detect when the user speaks the wake word.
  2. Understand the user's request, whether it’s a question or a command.
  3. Generate a response to the request using the LLM.
  4. Convert the LLM's text response into synthesized speech that can be played back to the user.
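
The four tasks above can be read as a small loop: wait for the wake word, transcribe the request, then generate and speak the response before returning to waiting. The sketch below is one possible way to model that loop as a state machine. The names `AssistantStage`, `AssistantEvent`, and `AssistantFlow` are hypothetical and not part of any Picovoice SDK, and generation and synthesis share a single stage on the assumption that they run concurrently.

```java
/** Hypothetical stages of the assistant loop; generation and speech overlap, so they share one stage. */
enum AssistantStage { WAKE_WORD, SPEECH_TO_TEXT, GENERATE_AND_SPEAK }

/** Hypothetical events raised by whichever engine drives the current stage. */
enum AssistantEvent { WAKE_WORD_DETECTED, REQUEST_TRANSCRIBED, RESPONSE_SPOKEN }

final class AssistantFlow {
    private AssistantFlow() {}

    /** Returns the next stage, or the same stage if the event does not apply to it. */
    static AssistantStage next(AssistantStage stage, AssistantEvent event) {
        switch (stage) {
            case WAKE_WORD:
                return event == AssistantEvent.WAKE_WORD_DETECTED
                        ? AssistantStage.SPEECH_TO_TEXT : stage;
            case SPEECH_TO_TEXT:
                return event == AssistantEvent.REQUEST_TRANSCRIBED
                        ? AssistantStage.GENERATE_AND_SPEAK : stage;
            case GENERATE_AND_SPEAK:
                return event == AssistantEvent.RESPONSE_SPOKEN
                        ? AssistantStage.WAKE_WORD : stage;
            default:
                return stage;
        }
    }
}
```

Ignoring events that do not belong to the current stage keeps the loop robust if, say, a late audio callback arrives after the assistant has already moved on.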

1. Wake Word

A Wake Word Engine is voice AI software designed to recognize a specific phrase. Each time you say "Alexa," "Siri," or "Hey Google," you activate their wake word engine.

Wake Word Detection is also known as Keyword Spotting, Hotword Detection, and Voice Activation.

2. Streaming Speech-to-Text

Once the wake word is detected, we need to understand the user's request. This is achieved using a Speech-to-Text engine. For latency-sensitive applications, we use real-time Streaming Speech-to-Text. Unlike traditional speech-to-text, which processes the entire utterance after the user finishes speaking (e.g., OpenAI's Whisper), Streaming Speech-to-Text transcribes speech in real time as the user speaks.

Speech-to-Text (STT) is also referred to as Automatic Speech Recognition (ASR), and Streaming Speech-to-Text can also be called Real-Time Speech-to-Text.

3. LLM Inference

After converting the user's speech to text, we prompt the local LLM with the text of the request and let it generate the appropriate response. The LLM produces the response incrementally, token by token, which allows us to run speech synthesis simultaneously, reducing latency (more on this later). LLM inference is computationally intensive, requiring techniques to minimize memory and compute usage, such as quantization (compression) and platform-specific hardware acceleration.
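
To make the idea of quantization concrete, here is a minimal sketch of the simplest textbook variant: symmetric 8-bit quantization with a single scale per weight array. This is an illustration only and is not how picoLLM Compression works. The class name `Int8Quantizer` and its methods are hypothetical. Each 32-bit float weight is stored as one signed byte plus a shared scale, cutting memory roughly fourfold at the cost of a rounding error of at most about half the scale per weight.

```java
/** Illustrative symmetric int8 quantizer; hypothetical, not the picoLLM algorithm. */
final class Int8Quantizer {
    private Int8Quantizer() {}

    /** Chooses a scale so the largest-magnitude weight maps to +/-127. */
    static float scaleFor(float[] weights) {
        float maxAbs = 0f;
        for (float w : weights) {
            maxAbs = Math.max(maxAbs, Math.abs(w));
        }
        return maxAbs == 0f ? 1f : maxAbs / 127f;
    }

    /** Rounds each weight to the nearest multiple of the scale and stores the multiplier as a byte. */
    static byte[] quantize(float[] weights, float scale) {
        byte[] codes = new byte[weights.length];
        for (int i = 0; i < weights.length; i++) {
            int rounded = Math.round(weights[i] / scale);
            codes[i] = (byte) Math.max(-127, Math.min(127, rounded));
        }
        return codes;
    }

    /** Recovers approximate weights from the byte codes. */
    static float[] dequantize(byte[] codes, float scale) {
        float[] weights = new float[codes.length];
        for (int i = 0; i < codes.length; i++) {
            weights[i] = codes[i] * scale;
        }
        return weights;
    }
}
```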

Are you a deep learning researcher? Learn how picoLLM Compression deeply quantizes LLMs while minimizing loss by optimally allocating bits across and within weights.

Are you a software engineer? Learn how picoLLM Inference Engine runs x-bit quantized Transformers on CPU and GPU across Linux, macOS, Windows, iOS, Android, Raspberry Pi, and Web.

4. Streaming Text-to-Speech

A Text-to-Speech (TTS) engine takes text input and synthesizes corresponding speech audio. Since LLMs produce responses token by token in a stream, we utilize a TTS engine capable of handling a continuous stream of text inputs to minimize latency. This is known as Streaming Text-to-Speech.
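
One reason streaming text input is harder than it sounds is that LLM tokens are often word fragments, so text cannot simply be voiced one token at a time. The sketch below shows, under simplifying assumptions, the kind of buffering involved: it holds text back until a whitespace boundary is seen and releases only complete words. `TokenChunker` is a hypothetical helper written for illustration; a real streaming TTS engine does this internally and with far more linguistic context.

```java
/** Hypothetical buffer that releases streamed text only at whitespace boundaries. */
final class TokenChunker {
    private final StringBuilder pending = new StringBuilder();

    /** Appends a token and returns the text ending at the last whitespace seen, or "" if no full word is ready. */
    String push(String token) {
        pending.append(token);
        int cut = -1;
        for (int i = pending.length() - 1; i >= 0; i--) {
            if (Character.isWhitespace(pending.charAt(i))) {
                cut = i + 1; // keep the whitespace with the released text
                break;
            }
        }
        if (cut <= 0) {
            return "";
        }
        String ready = pending.substring(0, cut);
        pending.delete(0, cut);
        return ready;
    }

    /** Returns whatever text is still buffered once the LLM has finished. */
    String flush() {
        String rest = pending.toString();
        pending.setLength(0);
        return rest;
    }
}
```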

Make Your Own Local Voice Assistant for Android

The following guide will walk you through integrating each element of the end-to-end AI pipeline into your Android project. If you just want to skip right to the full app and start experimenting, check out the pico-cookbook repository on GitHub.

1. Voice Activation

Add the Picovoice Porcupine Wake Word Engine to your build.gradle dependencies:

Then, you can create an instance of the wake word engine, and start processing audio in real time:

Replace $ACCESS_KEY with your AccessKey obtained from Picovoice Console and $KEYWORD_PATH with the absolute path to the keyword model file (.ppn) that you trained on Picovoice Console.
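
Processing audio in real time usually means slicing the microphone stream into fixed-length frames before handing each one to the wake word engine, since recorder reads rarely line up with the engine's frame size. The sketch below shows that slicing step on its own. `WakeWordFramer` is a hypothetical helper, and the detector is passed in as a plain function that returns a keyword index, or a negative number when nothing was detected; that convention and the frame length are assumptions to be checked against the engine you wire in.

```java
import java.util.function.ToIntFunction;

/** Hypothetical helper that regroups arbitrary-sized audio reads into fixed-length frames. */
final class WakeWordFramer {
    private final short[] frame;
    private final ToIntFunction<short[]> detector;
    private int filled = 0;

    /**
     * @param frameLength samples per frame expected by the detector (assumed fixed)
     * @param detector    stand-in for the engine call: keyword index, or negative if none
     */
    WakeWordFramer(int frameLength, ToIntFunction<short[]> detector) {
        this.frame = new short[frameLength];
        this.detector = detector;
    }

    /** Feeds raw samples; returns true if the detector fired on any frame completed by this call. */
    boolean feed(short[] samples) {
        boolean detected = false;
        for (short sample : samples) {
            frame[filled++] = sample;
            if (filled == frame.length) {
                filled = 0; // the frame buffer is reused for the next frame
                if (detector.applyAsInt(frame) >= 0) {
                    detected = true;
                }
            }
        }
        return detected;
    }
}
```

Leftover samples that do not fill a frame stay buffered until the next call, so no audio is dropped between reads.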

A remarkable feature of Porcupine is that it lets you train a custom wake word by just typing it in!

2. Speech Recognition

Add the Picovoice Cheetah Streaming Speech-to-Text Engine to your dependencies:

Then you can initialize an instance of the streaming speech-to-text engine, and start transcribing audio in real time:

Replace $ACCESS_KEY with your AccessKey, which you can obtain from Picovoice Console, and $MODEL_PATH with the path to the language model file (.pv). You can either customize and train a model on Picovoice Console or use the default one from the Cheetah GitHub Repository.
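
With a streaming engine, the transcript arrives in pieces, and the application needs to know when the user has finished speaking. The sketch below shows one way to accumulate partial transcripts until the engine signals an endpoint, then flush any remaining text. `StreamingStt`, `SttPartial`, and `RequestListener` are hypothetical stand-ins written for this example; they are loosely modeled on a process-then-flush style of API and should not be taken as the Cheetah SDK's actual types.

```java
/** Hypothetical partial result: newly transcribed text plus an end-of-speech flag. */
final class SttPartial {
    final String text;
    final boolean endpoint;

    SttPartial(String text, boolean endpoint) {
        this.text = text;
        this.endpoint = endpoint;
    }
}

/** Hypothetical stand-in for a streaming speech-to-text engine. */
interface StreamingStt {
    SttPartial process(short[] frame);

    /** Returns any text the engine was still holding back. */
    String flush();
}

/** Accumulates partial transcripts into one complete user request. */
final class RequestListener {
    private final StreamingStt stt;
    private final StringBuilder transcript = new StringBuilder();

    RequestListener(StreamingStt stt) {
        this.stt = stt;
    }

    /** Returns the full request once an endpoint is detected, or null while the user is still speaking. */
    String onFrame(short[] frame) {
        SttPartial partial = stt.process(frame);
        transcript.append(partial.text);
        if (!partial.endpoint) {
            return null;
        }
        transcript.append(stt.flush());
        String request = transcript.toString();
        transcript.setLength(0); // ready for the next request
        return request;
    }
}
```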

3. Response Generation

Add the picoLLM Inference Engine to your dependencies:

Initialize an instance of the LLM inference engine, create a dialog helper object, and start responding to the user's prompts:

Replace $ACCESS_KEY with your AccessKey obtained from Picovoice Console and $LLM_MODEL_PATH with the path to the picoLLM model file (.pllm) downloaded from Picovoice Console.

picoLLM supports a variety of open-weight models such as Llama, Gemma, Mistral and Mixtral.

Note that the LLM's .generate function delivers the response in pieces (i.e., token by token) through the callback registered with setStreamCallback. We pass every token to the Streaming Text-to-Speech engine as soon as it becomes available. When the .generate function returns, we notify the Streaming Text-to-Speech engine that there is no more text and finalize the speech synthesis.
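
The hand-off described above can be sketched independently of any SDK. In the example below, `StreamingLlm` and `StreamingTtsSink` are hypothetical interfaces standing in for the LLM and the streaming TTS engine, and `ResponseSpeaker` is a hypothetical helper. It assumes the generate call blocks until the response is complete and that the TTS stand-in may return null when it has not yet buffered enough text to produce audio.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical LLM stand-in: calls onToken for each token and returns when generation is complete. */
interface StreamingLlm {
    String generate(String prompt, Consumer<String> onToken);
}

/** Hypothetical streaming TTS stand-in: may return null until enough text is buffered. */
interface StreamingTtsSink {
    short[] synthesize(String text);

    /** Signals that no more text is coming and returns any remaining audio (or null). */
    short[] flush();
}

final class ResponseSpeaker {
    private ResponseSpeaker() {}

    /** Streams tokens into the TTS as they arrive and collects the resulting PCM chunks in order. */
    static List<short[]> respond(StreamingLlm llm, StreamingTtsSink tts, String prompt) {
        List<short[]> audio = new ArrayList<>();
        llm.generate(prompt, token -> {
            short[] pcm = tts.synthesize(token);
            if (pcm != null) {
                audio.add(pcm);
            }
        });
        // generate() has returned, so the response text is complete: finalize synthesis.
        short[] tail = tts.flush();
        if (tail != null) {
            audio.add(tail);
        }
        return audio;
    }
}
```

In a real app the chunks would be played as they are produced rather than collected in a list; the list only keeps the sketch easy to inspect.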

4. Speech Synthesis

Add the Picovoice Orca Streaming Text-to-Speech Engine to your build.gradle dependencies:

Initialize an instance of Orca, and start synthesizing audio in real time:

Replace $ACCESS_KEY with your AccessKey obtained from Picovoice Console and $MODEL_PATH with the path to the voice model file (.pv), which you can pick from the Orca GitHub Repository.
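
Synthesized audio arrives in chunks while the LLM is still generating, so playback is typically decoupled from synthesis to keep either side from stalling the other. The sketch below shows one common pattern for this, a queue drained by a dedicated playback thread. `PcmPlaybackQueue` is a hypothetical helper, and the audio sink is passed in as a plain callback; on Android it would presumably wrap a write to the platform's audio output, which is left out here.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Consumer;

/** Hypothetical queue that plays PCM chunks on a background thread in arrival order. */
final class PcmPlaybackQueue {
    private static final short[] END = new short[0]; // sentinel marking the end of the response

    private final BlockingQueue<short[]> queue = new LinkedBlockingQueue<>();
    private final Thread player;

    /** @param audioSink stand-in for the platform audio output; called once per chunk */
    PcmPlaybackQueue(Consumer<short[]> audioSink) {
        player = new Thread(() -> {
            try {
                while (true) {
                    short[] pcm = queue.take();
                    if (pcm == END) {
                        return;
                    }
                    audioSink.accept(pcm);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        player.start();
    }

    /** Queues a chunk for playback; null or empty chunks are ignored. */
    void enqueue(short[] pcm) {
        if (pcm != null && pcm.length > 0) {
            queue.add(pcm);
        }
    }

    /** Marks the end of the response and waits until everything queued has been played. */
    void finish() {
        queue.add(END);
        try {
            player.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```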

What's Next?

The voice assistant we've created above is functional but basic. The Picovoice platform allows you to create more complex, multi-dimensional AI software products by mixing and matching our tools. For instance, we could add:

  • Personalization: We could let the AI assistant know not only what is being said, but also who is saying it. This would allow us to build a personal profile for each speaker, with a database of past interactions that informs how the assistant responds in the future. We can achieve this with the Picovoice Eagle Speaker Recognition Engine.

  • Multi-Turn Conversations: While a wake word is a good voice activation mechanism for long-running, always-on systems, it becomes cumbersome when we have to go back and forth with the assistant during a prolonged multi-turn conversation. We could switch to a different form of activation after the initial interaction to smooth conversations out. Using the Picovoice Cobra Voice Activity Detection Engine, we could simply detect when the speaker is speaking and when they are waiting for a response.
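
As a rough sketch of the multi-turn idea, the helper below decides when a speaker's turn has ended from a stream of per-frame voice probabilities. It assumes the voice activity detector reports a value between 0 and 1 for each audio frame. `TurnDetector`, the threshold, and the silence window are hypothetical choices made for illustration and would need tuning against real audio.

```java
/** Hypothetical end-of-turn detector driven by per-frame voice probabilities in [0, 1]. */
final class TurnDetector {
    private final float threshold;
    private final int silenceFramesToEnd;
    private boolean speaking = false;
    private int silentRun = 0;

    /**
     * @param threshold          probability at or above which a frame counts as speech
     * @param silenceFramesToEnd consecutive non-speech frames that end the turn
     */
    TurnDetector(float threshold, int silenceFramesToEnd) {
        this.threshold = threshold;
        this.silenceFramesToEnd = silenceFramesToEnd;
    }

    /** Returns true once per turn: after speech was heard and then enough silence followed. */
    boolean onVoiceProbability(float probability) {
        if (probability >= threshold) {
            speaking = true;
            silentRun = 0;
            return false;
        }
        if (!speaking) {
            return false; // silence before any speech is not a turn
        }
        silentRun++;
        if (silentRun < silenceFramesToEnd) {
            return false;
        }
        speaking = false;
        silentRun = 0;
        return true;
    }
}
```

Requiring several silent frames in a row keeps short pauses between words from being mistaken for the end of the user's turn.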