Large Language Models (LLMs), such as Llama 2 and Llama 3, represent significant advancements in artificial intelligence, fundamentally changing how we interact with technology. These models can understand and generate human-like text, making them useful for applications like chatbots, voice assistants, and natural language processing.

One major drawback of LLMs is their high resource requirements. However, running offline LLMs cross-platform (Windows, macOS, Linux) lets the models take advantage of the capable CPUs and GPUs already available on users' machines. Using local hardware eliminates network latency and also addresses privacy concerns, as data stays on the user's device. By running models like Llama 2 or Llama 3 locally, users gain enhanced privacy, reliability, and efficiency without needing an internet connection.

Picovoice's picoLLM Inference engine makes it easy to perform LLM inference offline. It is a lightweight engine that runs entirely locally, which supports privacy compliance with regulations such as GDPR and HIPAA. Llama models compressed by picoLLM Compression have a smaller footprint, making them ideal for real-time applications.

In just a few lines of code, you can start performing LLM inference using the picoLLM Inference Node.js SDK. Let’s get started!

Before Running Llama with Node.js

Install Packages

Create a project and install @picovoice/picollm-node.
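For example, from your project directory (assuming npm as the package manager):

```bash
npm install @picovoice/picollm-node
```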

Sign up for Picovoice Console

Next, create a Picovoice Console account, and copy your AccessKey from the main dashboard. Creating an account is free, and no credit card is required!

Model File

Download any Llama 2 or Llama 3 model file (.pllm) from the picoLLM page on Picovoice Console and place the file in your project.

Building a Simple Application with Llama

Create an instance of picoLLM with your AccessKey and model file (.pllm):
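A minimal sketch, assuming the PicoLLM class exported by @picovoice/picollm-node and placeholder strings for your AccessKey and the downloaded .pllm model path:

```javascript
const { PicoLLM } = require("@picovoice/picollm-node");

// Replace with your AccessKey from Picovoice Console and the path to the .pllm file
const accessKey = "${ACCESS_KEY}";
const modelPath = "${MODEL_PATH}";

// Loads the Llama model and prepares it for local inference
const pllm = new PicoLLM(accessKey, modelPath);
```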

Pass in your prompt to the generate function. You may also pass in a streamCallback function to handle response tokens as soon as they are available:
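A sketch of the call, assuming generate returns a Promise and accepts an options object containing streamCallback; the prompt text and the token-printing callback are illustrative:

```javascript
(async () => {
  const prompt = "Suggest three names for a hiking club.";

  // streamCallback is invoked with each token as soon as it is generated
  const res = await pllm.generate(prompt, {
    streamCallback: (token) => process.stdout.write(token),
  });

  // The full response text is also available once generation completes
  console.log(`\n\nCompletion:\n${res.completion}`);
})();
```

Streaming the tokens as they arrive is what makes responsive, chat-style output possible instead of waiting for the entire completion.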

There are many configuration options in addition to streamCallback. For the full list of options, check out the picoLLM Inference API docs.

When done, be sure to release the resources explicitly:
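For example, once you no longer need the instance:

```javascript
// Frees the memory held by the model and the inference engine
pllm.release();
```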

For a complete working project, take a look at the picoLLM Inference Node.js Demo. You can also view the picoLLM Inference Node.js API docs for details on the package.