In recent years, Artificial Intelligence (AI) has advanced rapidly, driven largely by progress in Large Language Models (LLMs).

LLMs such as Llama 2 and Llama 3 can process and generate human-like language, which is changing how we interact with machines and access information. Because they can hold context-specific conversations, respond to nuanced queries, and even exhibit creativity, LLMs can augment a wide range of applications, from chatbots and virtual assistants to content generation and summarization.

To achieve their impressive results, LLMs require large amounts of memory, storage, and compute, which makes them impractical to run locally on most consumer devices. Most services sidestep this issue by running the model in the cloud, but that introduces problems of its own, such as network latency and privacy concerns.

Picovoice’s picoLLM Inference Engine addresses these problems by enabling fast, flexible, and efficient LLM inference that runs offline, on the device itself. It supports both CPU and GPU acceleration on Windows, macOS, and Linux, so it can run LLMs on a wide range of hardware, from resource-constrained edge devices to powerful workstations, without relying on cloud infrastructure. With the help of picoLLM Compression, compressed Llama 2 and Llama 3 models are small enough to run even on a Raspberry Pi.

The picoLLM Inference Engine also runs on Android, iOS, and web browsers.

This tutorial shows how to run LLM inference with Llama 2 and Llama 3 in just a few lines of code, using the picoLLM Inference Engine Python SDK.

Before Running Llama with Python

Install Python and picoLLM Package

Install Python (version 3.8 or higher) and confirm that the installation succeeded, for example by running python3 --version in a terminal.

Install the picollm Python package using pip, for example by running pip3 install picollm in a terminal.
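Both prerequisites can also be checked from Python itself. The following is a minimal sketch rather than part of the SDK: the helper name check_prerequisites is hypothetical, and it only tests that the interpreter meets the version stated above and that a package named picollm can be found on the import path.

```python
import importlib.util
import sys

MIN_PYTHON = (3, 8)  # minimum version stated in this tutorial


def check_prerequisites():
    """Return (python_ok, picollm_installed) without importing picollm."""
    python_ok = sys.version_info >= MIN_PYTHON
    # find_spec looks the package up on the import path without executing it.
    picollm_installed = importlib.util.find_spec("picollm") is not None
    return python_ok, picollm_installed


python_ok, picollm_installed = check_prerequisites()
version = f"{sys.version_info.major}.{sys.version_info.minor}"
print(f"Python {version} meets the minimum version: {python_ok}")
print(f"picollm package is installed: {picollm_installed}")
```

If the second line reports False, the pip installation did not target the interpreter that is running the script.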

Sign up for Picovoice Console

Create a Picovoice Console account and copy your AccessKey from the dashboard. Creating an account is free, and no credit card is required.

Download a picoLLM Compressed Llama Model File

From the picoLLM console page, download any Llama 2 or Llama 3 picoLLM model file (.pllm) and place the file in your project directory.

Building a Simple Python Application with Llama

Create an instance of the picoLLM Inference Engine by calling picollm.create with your AccessKey and the path to the model file (.pllm).
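As a minimal sketch, the creation step could be wrapped in a small helper. The helper name create_engine and the file-existence check are additions for illustration, and the values in the example call are placeholders to replace with your own AccessKey and model path.

```python
import os


def create_engine(access_key, model_path):
    """Create a picoLLM engine after checking that the model file exists."""
    if not os.path.isfile(model_path):
        raise FileNotFoundError(f"picoLLM model file not found: {model_path}")

    # Imported here so the helper can be defined before picollm is installed.
    import picollm

    return picollm.create(access_key=access_key, model_path=model_path)


# Example call (both arguments are placeholders):
# pllm = create_engine("${ACCESS_KEY}", "${MODEL_PATH}")
```

Checking the path first gives a clearer error than a failure inside the engine when the .pllm file was not copied into the project directory.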

The picoLLM Python SDK supports running on both CPU and GPU. By default, the most suitable device is selected; to choose a device manually, pass the device argument when creating the engine.
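A sketch of manual device selection follows. The helper name create_engine_on is hypothetical, and the device strings listed in the comments reflect the format as we understand it from the API docs; treat them as assumptions and confirm them there before relying on them.

```python
def create_engine_on(access_key, model_path, device="best"):
    """Create a picoLLM engine on an explicitly chosen device."""
    # Imported here so the helper can be defined before picollm is installed.
    import picollm

    return picollm.create(
        access_key=access_key,
        model_path=model_path,
        device=device,
    )


# Device strings (assumed format, verify against the API docs):
#   "best"   let picoLLM pick the most suitable device
#   "gpu"    first available GPU, or "gpu:1" for a specific GPU index
#   "cpu"    CPU with default threads, or "cpu:4" for four threads
#
# Example call (first two arguments are placeholders):
# pllm = create_engine_on("${ACCESS_KEY}", "${MODEL_PATH}", device="cpu:4")
```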

Pass in your text prompt to the generate function and print out Llama’s response.
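In its simplest form, this step might look like the sketch below. The helper name ask is hypothetical; pllm is assumed to be an engine instance created as described earlier, and the generated text is read from the completion field of the returned result.

```python
def ask(pllm, prompt):
    """Run one prompt through an existing picoLLM engine and return the text."""
    res = pllm.generate(prompt)
    return res.completion


# Example usage with an engine created earlier (the prompt is a placeholder):
# print(ask(pllm, "${PROMPT}"))
```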

You can also use the stream_callback argument to provide a function that handles response tokens as soon as they are available.
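The sketch below shows one way to use it, assuming the callback is called with each piece of generated text as a string. The helper names print_token and stream_answer are hypothetical.

```python
def print_token(token):
    """Print each token immediately, without adding a newline."""
    print(token, end="", flush=True)


def stream_answer(pllm, prompt):
    """Stream the response to the terminal and return the full completion."""
    res = pllm.generate(prompt, stream_callback=print_token)
    print()  # finish the line once generation is complete
    return res.completion


# Example usage with an engine created earlier (the prompt is a placeholder):
# full_text = stream_answer(pllm, "${PROMPT}")
```

Streaming makes the application feel responsive on slower devices, because the user sees the first words while the rest of the response is still being generated.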

The generate function accepts many other configuration arguments to help tailor responses for specific use cases. For the full list of arguments, check out the picoLLM Inference API docs.
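As an illustration only, a call that limits the response length and adjusts sampling might look like the following. The argument names completion_token_limit, stop_phrases, temperature, and top_p are assumptions based on our reading of the API docs, the values are arbitrary, and the helper name generate_short_answer is hypothetical; verify each name against the docs before use.

```python
def generate_short_answer(pllm, prompt):
    """Generate a bounded, mildly randomized response (argument names assumed)."""
    res = pllm.generate(
        prompt,
        completion_token_limit=128,  # assumed: cap on generated tokens
        stop_phrases={"\n\n"},       # assumed: stop at the first blank line
        temperature=0.7,             # assumed: sampling randomness
        top_p=0.9,                   # assumed: nucleus sampling cutoff
    )
    return res.completion
```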

When done, make sure to release the engine instance by calling its release function.
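One way to make sure the release always happens, even when generation raises an error, is a try/finally block. This is a sketch of that pattern rather than something the SDK requires, and the helper name generate_and_release is hypothetical.

```python
def generate_and_release(pllm, prompt):
    """Generate one completion, then release the engine even on failure."""
    try:
        return pllm.generate(prompt).completion
    finally:
        # Runs whether generate succeeded or raised.
        pllm.release()


# Example usage with an engine created earlier (the prompt is a placeholder):
# print(generate_and_release(pllm, "${PROMPT}"))
```

After this helper returns, the engine should not be used again; create a new instance for further prompts.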

For a complete working project, check out the picoLLM Python Demo. You can also view the picoLLM Inference Python API docs for complete details on the Python SDK.