Large Language Models (LLMs), such as Llama 2 and Llama 3, represent significant advancements in technology, improving how AI understands and generates human-like text with increased accuracy and context sensitivity. Therefore, these models are useful for creating voice assistance, chatbots, and natural language processing tasks.

Most of these use cases are particularly helpful when running on handheld devices since the device is always next to us. However, unlike desktop applications that can take advantage of powerful CPUs and GPUs, mobile phones have hardware limitations preventing fast and accurate responses. Server-side solutions may also provide issues as network connectivity is not reliable and privacy is a concern regarding personal information leaving the handheld device.

Luckily, Picovoice's picoLLM Inference engine makes it easy to perform offline LLM inference. picoLLM Inference is a lightweight inference engine that operates locally, ensuring privacy compliance with GDPR and HIPAA regulations, and usability where network connection is a concern. Llama models compressed by picoLLM Compression are small enough that they are able to run on most iOS devices.

picoLLM Inference also runs on Android, Linux, Windows, macOS, Raspberry Pi, and Web Browsers. If you want to run Llama across platforms, check out other tutorials: Llama on Android, Llama for Desktop Apps and Llama within Web Browsers.

In just a few lines of code, you can start performing LLM inference using the picoLLM Inference iOS SDK. Let’s get started!

Before Running Llama on iOS


The following tools are required before setting up an iOS app:

Install picoLLM Packages

Create a project and install picoLLM-iOS using CocoaPods. Add the following to your app’s Podfile:

Replace ${PROJECT_TARGET} with your app’s target name. Then run the following:

Sign up for Picovoice Console

Next, create a Picovoice Console account, and copy your AccessKey from the main dashboard. Creating an account is free, and no credit card is required!

Downloading the picoLLM compressed Llama 2 or 3 Model File

Download any of the Llama 2 or Llama 3 picoLLM model files (.pllm) from the picoLLM page on Picovoice Console.

Model files are also available for other open weight models such as Gemma, Mistral, Mixtral and Phi 2.

The model needs to be transferred to the device, there are several ways to do this depending on the application use case. For testing, it is best to host the model file externally for download or copy the model file directly into the phone using AirDrop or USB.

Building a Simple iOS Application

Create an instance of picoLLM with your AccessKey and model file (.pllm):

Pass in your prompt to the generate function. You may also use streamCallback to provide a function that handles response tokens as soon as they are available:

There are many configuration options in addition to streamCallback. For the full list of available options, check out the picoLLM Inference API docs.

For a complete working project, take a look at the picoLLM Completion iOS Demo or the picoLLM Chat iOS Demo. You can also view the picoLLM Inference iOS API docs for complete details on the iOS SDK.