In recent years, the field of Artificial Intelligence (AI) has witnessed unprecedented breakthroughs, spearheaded by the advancements of Large Language Models (LLMs).
LLMs, such as Llama 2 and Llama 3, are capable of processing and generating human-like language, which is revolutionizing the way we interact with machines and access information.
With their ability to engage in context-specific conversations, respond to nuanced queries, and even exhibit creativity, LLMs can augment a wide range of applications, from chatbots and virtual assistants to content generation and summarization, giving developers a powerful tool for natural language processing.
To achieve their impressive results, LLMs require massive amounts of memory, storage, and compute, making them impractical to run locally on most consumer devices. Most services sidestep this issue by performing the computation off-device in the cloud, but that introduces a new set of problems, such as network latency and privacy concerns.
However, there is a solution.
Picovoice’s picoLLM Inference enables offline LLM inference that is fast, flexible and efficient.
Supporting both CPU and GPU acceleration on Windows, macOS, and Linux, the picoLLM Inference Engine can run LLMs on a wide range of devices, from resource-constrained edge devices to powerful workstations, without relying on cloud infrastructure.
With the help of picoLLM Compression, compressed Llama 2 and Llama 3 models are small enough to run even on a Raspberry Pi.
The picoLLM Inference Engine also runs on Android, iOS, and web browsers.
In just a few lines of code, we will show you how to run LLM inference with Llama 2 and Llama 3 using the picoLLM Inference Engine Python SDK.
Before Running Llama with Python
Install Python and picoLLM Package
Install Python (version 3.8 or higher) and ensure it is successfully installed:
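You can confirm the installed version from the command line (the exact command may be python or python3 depending on your setup):

```console
python3 --version
```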
Install the picollm Python package using PIP:
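Depending on your environment, pip may be invoked as pip or pip3:

```console
pip3 install picollm
```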
Sign up for Picovoice Console
Create a Picovoice Console account and copy your AccessKey from the dashboard. Creating an account is free, and no credit card is required.
Download a picoLLM Compressed Llama Model File
From the picoLLM console page, download any Llama 2 or Llama 3 picoLLM model file (.pllm) and place the file in your project directory.
Building a Simple Python Application with Llama
Create an instance of the picoLLM Inference Engine with your AccessKey and model file path (.pllm):
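A minimal sketch, assuming the picollm package is installed; the AccessKey and model path below are placeholders:

```python
import picollm

# Replace the placeholders with your AccessKey from Picovoice Console
# and the path to the downloaded .pllm model file.
pllm = picollm.create(
    access_key='${ACCESS_KEY}',
    model_path='${MODEL_PATH}')
```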
The picoLLM Python SDK supports running on both CPU and GPU. By default, the most suitable device is selected; however, you can manually select a device using the device argument:
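For example, to force inference on the CPU (device strings such as 'best', 'cpu', and 'gpu' are typical values; verify the accepted options in the picoLLM docs):

```python
# 'best' (the default) lets picoLLM pick the most suitable device;
# 'cpu' and 'gpu' select a device explicitly.
pllm = picollm.create(
    access_key='${ACCESS_KEY}',
    model_path='${MODEL_PATH}',
    device='cpu')
```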
Pass in your text prompt to the generate function and print out Llama's response.
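A minimal sketch; the prompt below is a placeholder, and the generated text is read from the result's completion field:

```python
# Generate a completion for the given prompt and print it.
res = pllm.generate('${PROMPT}')
print(res.completion)
```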
You can also use the stream_callback argument to provide a function that handles response tokens as soon as they are available.
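For instance, printing tokens as they arrive (a sketch; it assumes the callback receives each generated token as a string):

```python
def on_token(token):
    # Print each token as soon as it is generated, without buffering.
    print(token, end='', flush=True)

res = pllm.generate('${PROMPT}', stream_callback=on_token)
```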
generate includes many other configuration arguments to help tailor responses for specific use cases. For the full list of arguments, check out the picoLLM Inference API docs.
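For example, arguments that bound the length of the output or adjust sampling can be passed alongside the prompt (the argument names below are assumptions; confirm them against the API docs):

```python
# Example only: verify these argument names in the picoLLM Inference API docs.
res = pllm.generate(
    '${PROMPT}',
    completion_token_limit=128,
    temperature=0.7)
print(res.completion)
```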
When done, make sure to release the engine instance:
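```python
# Free resources held by the engine.
pllm.release()
```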
For a complete working project, check out the picoLLM Python Demo. You can also view the picoLLM Inference Python API docs for complete details on the Python SDK.