It's an exciting time for large language models (LLMs). The recent advances are astounding, with models like GPT-4o and Llama3 demonstrating high proficiency in natural language understanding and generation, making the Turing test a quaint thing of the past. However, for developers using Node.js, there are still limited options available for integrating LLMs into Node applications. Cloud services like the OpenAI API offer access to powerful models but come with online service dependencies, privacy concerns and unbounded usage costs. For projects that want to avoid external cloud providers, packages like node-llama-cpp provide tools for running models on local hardware, but comes with convoluted setup requirements, significant size vs. accuracy trade-offs, and minimal platform support.

picoLLM aims to address all the issues of its online and offline LLM predecessors with its novel x-bit LLM quantization and cross-platform local LLM inference engine. Offering hyper-compressed versions of Llama3, Gemini, Phi-2, Mixtral, and Mistral, picoLLM enables developers to deploy these popular open-weight models on nearly any consumer device. Now, let's see what it takes to run a local LLM using Node.js!

The picoLLM Inference Engine is a cross-platform library that supports Windows, macOS, Linux, Raspberry Pi, Android, iOS and Web Browsers. picoLLM has SDKs for Python, Node.js, Android, iOS, and JavaScript.

The following guide will walk you through the steps required to run a local LLM using Node.js.


  1. Install Node.js.

  2. Make a new directory and init a new Node project in it:

  1. Add the @picovoice/picollm-node npm package to the project:
  1. Go to Picovoice Console, retrieve your AccessKey, and choose a picoLLM model file (.pllm) to download.

Running an LLM Inference with Node.js

To get started we'll need to create a new Javascript file, index.js. Open it up in the editor of your choice, import the picollm package and initialize the inference engine using your AccessKey and the path to your downloaded .pllm model file.

With the inference engine initialized, we can now use it to generate a basic prompt completion and print it to the console:

It will take a few seconds to generate the full completion. IF we want to stream the results, we can use the streamCallback and output to process.stdout in order to not write a new console line each print:

When done, be sure to release the resources explicitly:

The Node.js API documentation details additional options for generation as well as chat templates that can be used to have back-and-forth chats with certain models.

Node.js LLM Demos

The picoLLM GitHub Repository contains two demos to get you started: completion and chat. The completion demo accepts a prompt and a set of optional parameters and generates a single completion. It can run all models, whether instruction-tuned or not. The chat demo can run instruction-tuned (chat) models. The chat demo enables a back-and-forth conversation with the LLM, similar to ChatGPT.

To try out either of the demos, you can simply install the picoLLM demo package:

This will install command-line utilities for running the demos. To run the chat demo: