As Large Language Models (LLMs) such as Llama 2 and Llama 3 continue to push the boundaries of artificial intelligence, they are changing how we interact with the technology around us. Voice assistants have long been integrated into our mobile phones, but until now their ability to understand and process requests has been quite limited. LLMs, by contrast, can understand and generate human-like text, which makes them ideal candidates for augmenting voice assistants, chatbots and other natural language processing applications.

However, a major limitation of these models is that they require significant compute and memory to run. Whilst desktop applications can make use of powerful CPUs and GPUs, mobile phones have much more limited hardware. To make things more difficult, since our mobile devices are with us almost all the time, privacy is also a much larger concern. Network connectivity is an issue too, as a fast, reliable signal is not guaranteed. Therefore, to make the most of something like Llama 3 on an Android device, we have to run it offline on the device.

Luckily, Picovoice's picoLLM Inference Engine makes it easy to perform offline LLM inference. picoLLM Inference is a lightweight inference engine that operates locally, so prompts and completions stay on the device. This supports compliance with GDPR and HIPAA regulations and keeps the application usable where network connection is a concern. Llama models compressed by picoLLM Compression are small enough to run on most Android devices.

picoLLM Inference Engine also runs on iOS, Linux, Windows, macOS, Raspberry Pi and modern web browsers.

In just a few lines of code, you can start performing LLM inference using the picoLLM Inference Android SDK. Let’s get started!

Before Running Llama on Android

Install picoLLM Packages

The picollm-android package is hosted on the Maven Central Repository. To include the package in your Android project, ensure you have included mavenCentral() in your top-level build.gradle file, and then add ai.picovoice:picollm-android as an implementation dependency in your app's build.gradle.
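As a rough sketch, the app-level dependency entry would look something like the following in the Groovy DSL. `${LATEST_VERSION}` is a placeholder, not a real version string; substitute the current release listed on Maven Central.

```groovy
// App-level build.gradle
dependencies {
    // ${LATEST_VERSION} is a placeholder: replace it with the current picollm-android release
    implementation 'ai.picovoice:picollm-android:${LATEST_VERSION}'
}
```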

Sign up for Picovoice Console

Next, create a Picovoice Console account, and copy your AccessKey from the main dashboard. Creating an account is free, and no credit card is required.

Download the picoLLM Compressed Llama 2 or 3 Model File

Download any of the Llama 2 or Llama 3 picoLLM model files (.pllm) from the picoLLM page on Picovoice Console.

Model files are also available for other open-weight models, such as Gemma, Mistral, Mixtral and Phi-2.

The model file needs to be transferred to the device. There are several ways to do this, depending on the application use case. For testing, the simplest option is the Android Debug Bridge (ADB) command adb push, which takes the local path of the model file followed by a destination path on the connected device.
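Outside of testing, an app that receives the model some other way (for example through a file picker or a download) could copy the incoming stream into its own storage and pass the resulting path to picoLLM. The following is a minimal sketch in plain Java. `ModelInstaller` and `install` are hypothetical names and not part of the picoLLM SDK; on Android the target directory would typically come from something like `Context.getFilesDir()`.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Hypothetical helper (not part of the picoLLM SDK): copies a model stream to local storage.
public class ModelInstaller {

    /**
     * Copies the bytes of `source` into `targetDir/fileName`, overwriting any
     * existing file, and returns the absolute path of the copied file.
     * The source stream is closed when the copy finishes.
     */
    public static String install(InputStream source, File targetDir, String fileName)
            throws IOException {
        // Create the destination directory if it does not exist yet
        if (!targetDir.isDirectory() && !targetDir.mkdirs()) {
            throw new IOException("Could not create directory " + targetDir);
        }

        File target = new File(targetDir, fileName);
        try (InputStream in = source;
             OutputStream out = new FileOutputStream(target)) {
            byte[] buffer = new byte[8192];
            int read;
            // Copy in chunks so a large model file is never held in memory at once
            while ((read = in.read(buffer)) != -1) {
                out.write(buffer, 0, read);
            }
        }
        return target.getAbsolutePath();
    }
}
```

The returned path is what would then be supplied as the model file path when creating the picoLLM instance.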

Using picoLLM in an Android Application

Create an instance of picoLLM with your AccessKey and the path to the model file (.pllm) on the device.
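A minimal sketch of this step, assuming the builder-style API of the picoLLM Inference Android SDK (`PicoLLM.Builder`, `setAccessKey`, `setModelPath`). The wrapper class `LlamaEngine` is a hypothetical name used only for illustration, and the exact signatures should be checked against the API docs for the SDK version you install. It compiles only inside an Android project that includes the picollm-android dependency.

```java
import ai.picovoice.picollm.PicoLLM;
import ai.picovoice.picollm.PicoLLMException;

// Hypothetical wrapper class, used here only to give the snippet a home
public class LlamaEngine {

    /**
     * Builds a picoLLM instance.
     * accessKey: the AccessKey copied from Picovoice Console
     * modelPath: absolute path of the .pllm file on the device
     */
    public static PicoLLM create(String accessKey, String modelPath) throws PicoLLMException {
        // Assumed builder API; verify the names against the Android API docs
        return new PicoLLM.Builder()
                .setAccessKey(accessKey)
                .setModelPath(modelPath)
                .build();
    }
}
```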

Pass your text prompt to the generate function to produce an LLM completion. You can also use .setStreamCallback() to provide a function that handles response tokens as soon as they are available.
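A sketch of a streaming completion, assuming the SDK exposes `generateBuilder()`, that the stream callback is a single-method interface receiving each token as a `String`, and that the result object offers `getCompletion()`. `LlamaPrompt` is a hypothetical class name. As with any SDK snippet, it needs the picollm-android dependency to compile.

```java
import ai.picovoice.picollm.PicoLLM;
import ai.picovoice.picollm.PicoLLMCompletion;
import ai.picovoice.picollm.PicoLLMException;

// Hypothetical helper class for illustration
public class LlamaPrompt {

    /** Runs one completion, printing tokens as they arrive, and returns the full text. */
    public static String complete(PicoLLM picollm, String prompt) throws PicoLLMException {
        PicoLLMCompletion result = picollm
                .generateBuilder()
                // Assumed: the callback receives each new token as a String
                .setStreamCallback(token -> System.out.print(token))
                .generate(prompt);

        // Assumed accessor for the complete generated text
        return result.getCompletion();
    }
}
```

Generation is a long-running call, so in a real app it would normally run off the main thread, with the callback posting tokens back to the UI.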

There are many configuration options in addition to .setStreamCallback(). For the full list of options, check out the picoLLM Inference Android API docs.

When done, be sure to release the resources explicitly.
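A sketch of the cleanup step, assuming the engine is released through a `delete()` method on the `PicoLLM` instance. `LlamaCleanup` is a hypothetical class name, and the snippet requires the picollm-android dependency.

```java
import ai.picovoice.picollm.PicoLLM;

// Hypothetical helper class for illustration
public class LlamaCleanup {

    /** Frees the native resources held by the engine, if one was created. */
    public static void release(PicoLLM picollm) {
        if (picollm != null) {
            // Assumed release method; the instance must not be used afterwards
            picollm.delete();
        }
    }
}
```

In an Activity, a natural place to call this would be onDestroy().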

For a complete working project, take a look at the picoLLM Completion Android Demo or the picoLLM Chat Android Demo. You can also view the picoLLM Inference Android API docs for complete details on the Android SDK.