End-to-end local LLM platform empowering enterprises to deploy language models on any device, compress LLMs without sacrificing accuracy, and train purpose-built models
picoLLM is the end-to-end local large language model (LLM) platform that enables enterprises to build AI assistants running on-device, on-premises, and in the private cloud without sacrificing accuracy.
picoLLM has three components: picoLLM Inference, for deploying compressed LLMs across platforms with unlimited inference; picoLLM Compression, for quantizing custom LLMs for local deployment; and picoLLM GYM, for compression-aware training of small language models (SLMs).
The Convenience of Cloud LLM APIs and the Control of Local LLMs, Under One Roof
| | Free & Open-Source LLM Tools | Cloud-Dependent LLM APIs | picoLLM |
|---|---|---|---|
| Data | Owned & controlled by enterprises | Shared with third parties | Owned & controlled by enterprises |
| Latency | Minimal | Unreliable, unbounded | Minimal |
| Inference Cost | None (unlimited inference) | Pay-per-use | None (unlimited inference) |
| Ease of Use | Tailored for ML researchers and engineers | Tailored for developers | Tailored for developers |
| Enterprise Support | N/A | Available | Available |