picoLLM Inference

LLM Inference on embedded, mobile, web, desktop, on-prem & cloud

The only cross-platform local LLM inference engine that supports all LLM architectures.

Hello, Llama!

Hello! Start the demo to begin a conversation.

Loved by developers, trusted by enterprises

What is picoLLM Inference?

picoLLM Inference is the cross-platform local LLM inference engine that runs large language models created on the picoLLM platform across Linux, macOS, Windows, Android, iOS, Chrome, Safari, Edge, Firefox, Raspberry Pi, and other embedded platforms, supporting both CPU and GPU.

Cross-platform LLM SDKs

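import picollm

# access_key: your AccessKey from Picovoice Console
# model_path: path to a picoLLM model file downloaded from Picovoice Console
o = picollm.create(
    access_key,
    model_path)

# Run the model on the prompt
res = o.generate(prompt)
Build with Python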

Why picoLLM Inference?

Finding a local LLM inference engine that fits an enterprise's large product portfolio and performance expectations is challenging, if not impossible. NVIDIA supports NVIDIA GPUs, and Intel optimizes for Intel hardware. ExecuTorch and TorchServe are not production-ready. picoLLM Inference is the only local inference engine that:

runs any LLM architecture
is built for X-bit quantized LLMs
brings cloud API convenience to local deployment with ready-to-use, intuitive SDKs that require no machine learning expertise
runs across Linux, macOS, Windows, Android, iOS, Chrome, Safari, Edge, Firefox, and Raspberry Pi or other embedded systems
supports CPU and GPU out of the box and has the architecture to tap into other forms of accelerated computing

The Picovoice engineering team explains in detail what makes picoLLM Inference unique.

Run LLMs locally on CPU

Looking for recommended hardware for running LLMs locally on a CPU? Look no further. picoLLM Inference runs LLMs on any CPU.

Run LLMs locally on Consumer GPU

Leading LLMs demand data center GPU clusters with hundreds of gigabytes of VRAM, but not with picoLLM.

Run LLMs locally on Mobile

Mobile apps using cloud-dependent LLM APIs are at the mercy of internet service and LLM providers. Inefficient local models drain resources. picoLLM Inference does neither.

Run LLMs locally within Web Browsers

Finally found an inference engine that runs LLMs in Chrome, but it doesn’t support Safari? The one that supports Safari doesn’t support Edge? picoLLM Inference supports all modern browsers.

Model used: Llama 3.2

Run LLMs locally on Embedded

Interested in leveraging generative AI and LLMs in IoT, but unable to find an inference engine efficient enough to run LLMs on embedded devices? picoLLM Inference runs LLMs locally on single-board computers.

Run LLMs in Serverless

picoLLM enables serverless LLM inference for scalable and low-ops deployment on any cloud provider, including private clouds.

picoLLM Inference is free to use without any usage limits - whether you work on a PoC or serve millions of users.

Free LLM Inference

Start Free

Run any Language Model

picoLLM Inference runs all LLM architectures and seamlessly integrates any language model created on the picoLLM Platform. Bring your own language model or deploy ready-to-use open-weight LLMs.

👧
pico.Gemma
🦙
pico.Llama
φ
pico.Phi

⛵
pico.Mistral
🍸
pico.Mixtral
🃏
Custom

FAQ

Feature

Usage

Technical Questions

Custom Models & Support

Data Security & Privacy

Building with picoLLM

Feature

What does the picoLLM Platform offer?

With more capabilities coming soon, the initial release of picoLLM offers:

picoLLM Inference: Cross-platform inference engine
picoLLM Compression: Next-gen compression algorithm engine
picoLLM GYM: Compression-aware small language models training

What does picoLLM Compression do?

picoLLM Compression is a novel large language model (LLM) quantization algorithm developed within Picovoice. Existing techniques require a fixed bit allocation scheme, which is suboptimal. Given a task-specific cost function, picoLLM Compression automatically learns the optimal bit allocation strategy across and within an LLM's weights.
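To make the idea of a non-fixed bit allocation concrete, here is a toy sketch in Python. It is not the picoLLM Compression algorithm, whose internals are not described on this page. It only illustrates the general notion of spending a fixed bit budget unevenly across weight groups by greedily giving the next bit to the group where a cost function improves the most. The names `allocate_bits` and `quantization_cost`, the group sensitivities, and the error model are all assumptions invented for this illustration.

```python
from typing import Dict


def quantization_cost(sensitivity: float, bits: int) -> float:
    """Hypothetical cost model: error shrinks 4x per extra bit, scaled by sensitivity."""
    return sensitivity / (4 ** bits)


def allocate_bits(sensitivities: Dict[str, float], total_bits: int,
                  min_bits: int = 1, max_bits: int = 8) -> Dict[str, int]:
    """Greedily spend a bit budget where it reduces the total cost the most."""
    allocation = {name: min_bits for name in sensitivities}
    remaining = total_bits - min_bits * len(sensitivities)
    if remaining < 0:
        raise ValueError("budget too small for the minimum bit width")
    for _ in range(remaining):
        best_name, best_gain = None, 0.0
        for name, s in sensitivities.items():
            b = allocation[name]
            if b >= max_bits:
                continue  # this group is already at the widest allowed format
            gain = quantization_cost(s, b) - quantization_cost(s, b + 1)
            if gain > best_gain:
                best_name, best_gain = name, gain
        if best_name is None:
            break  # every group is at max_bits; leftover budget stays unused
        allocation[best_name] += 1
    return allocation


# Made-up sensitivities for three weight groups sharing a 12-bit budget.
groups = {"attention": 8.0, "mlp": 2.0, "embedding": 0.5}
allocation = allocate_bits(groups, total_bits=12)
print(allocation)
```

In this toy setup the more sensitive groups end up with more bits than the less sensitive ones, which is the behavior a fixed scheme (for example, 4 bits everywhere) cannot express.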

What does picoLLM Inference do?

picoLLM Inference runs X-bit quantized LLMs, simplifying the process of adding LLMs to any software. picoLLM Inference is the only local LLM inference engine that:

runs across Linux, macOS, Windows, Android, iOS, Raspberry Pi, Chrome, Safari, Edge, and Firefox
supports CPU and GPU out of the box and has the architecture to tap into other forms of accelerated computing
runs any LLM architecture

Does picoLLM offer local Llama models?

Yes, picoLLM offers ready-to-use quantized Llama models. Quantized Llama models can be downloaded from Picovoice Console within your plan limits, and deployed locally across platforms.

Does picoLLM offer local Mistral models?

Yes, picoLLM offers ready-to-use quantized Mistral models. Quantized Mistral models can be downloaded from Picovoice Console within your plan limits, and deployed locally across platforms.

Does picoLLM offer local Mixtral models?

Yes, picoLLM offers ready-to-use quantized Mixtral models. Quantized Mixtral models can be downloaded from Picovoice Console within your plan limits, and deployed locally across platforms.

Does picoLLM offer local Microsoft Phi models?

Yes, picoLLM offers ready-to-use quantized Microsoft Phi-2 models. Quantized Microsoft Phi-2 models can be downloaded from Picovoice Console within your plan limits, and deployed locally across platforms.

Does picoLLM offer local Gemma models?

Yes, picoLLM offers ready-to-use quantized Gemma models. Quantized Gemma models can be downloaded from Picovoice Console within your plan limits, and deployed locally across platforms.

How can I get access to picoLLM GYM to train small language models?

Currently, picoLLM GYM is only open to selected enterprise customers. Please engage with your account manager if you're already a Picovoice customer.

Usage

What are the platforms supported by picoLLM Inference?

Desktop & Server: Linux, Windows & macOS
Mobile: Android & iOS
Web Browsers: Chrome, Safari, Edge and Firefox
Single Board Computers: Raspberry Pi
Cloud Providers: AWS, Azure, Google, IBM, Oracle and others.

Does picoLLM Inference run LLMs in the public or private cloud, including a VPC (Virtual Private Cloud)?

Yes, picoLLM is cloud-agnostic and interoperable. You can deploy LLMs in the cloud, work with the cloud provider of your choice, and easily move from one to another.

Does picoLLM Inference run LLMs in serverless environments?

Yes, you can deploy LLMs in serverless environments, work with the cloud provider of your choice, and easily move from one to another.

Does picoLLM Inference run LLMs on-prem?

Yes, you can run LLMs on-prem with picoLLM.

Does picoLLM Inference run LLMs on mobile devices?

Yes, you can run LLMs on mobile devices. picoLLM supports both Android and iOS.

Does picoLLM Inference run LLMs within web browsers?

Yes, you can run LLMs within web browsers. picoLLM supports all modern web browsers - Chrome, Safari, Firefox, and Edge.

Does picoLLM Inference run LLMs on embedded devices?

Yes, you can run LLMs on embedded devices.

Where's user data stored?

picoLLM doesn't track, access, or store user data.

Why does picoLLM Inference require an AccessKey (i.e., internet connectivity) if the engine processes data locally?

All Picovoice engines, including picoLLM Inference, use AccessKey to serve you within your plan limits.

Technical Questions

What are the advantages of using quantized models over non-quantized models?

There are several advantages of running quantized models:

Reduced Model Size: Quantization decreases the size of large language models, resulting in:
- Smaller download size: Quantized LLMs require less time and bandwidth to download. For example, a mobile app that bundles a very large language model may not be approved for the App Store.
- Smaller storage size: Quantized LLMs occupy less storage space. For example, an Android app using a small language model will take up less storage space, improving the usability of your application and the experience of users.
- Less memory usage: Quantized LLMs use less RAM, which speeds up LLM inference and frees up memory for other parts of your application, resulting in better performance and stability.
Reduced Latency: Total latency consists of compute latency and network latency.
- Reduced Compute Latency: Compute latency is the time between a machine receiving a request and returning a response. LLMs require powerful infrastructure to run with minimal compute latency; otherwise, a response may take minutes, hours, or even days. Reduced computational requirements allow quantized LLMs to respond faster given the same resources or to achieve the same latency using fewer resources.
- Zero Network Latency: Network latency (also called delay or lag) is the time data takes to travel across the network. Since quantized LLMs can run where the data is generated rather than requiring data to be sent to a third-party cloud, no data transfer is needed, hence zero network latency.

Quantization can be used to reduce the size of models and latency potentially at the expense of some accuracy. Choosing the right quantized model is important to ensure small to no accuracy loss. Our Deep Learning Researchers explain why picoLLM Compression is different from other quantization techniques.
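The size reduction described above can be estimated with simple arithmetic: weight storage is roughly the number of parameters times the bits per weight. The sketch below assumes a hypothetical 7-billion-parameter model and ignores metadata and runtime memory such as activations, so treat the numbers as rough lower bounds rather than measurements of any picoLLM model.

```python
def model_size_gb(num_parameters: float, bits_per_weight: float) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes), ignoring metadata."""
    return num_parameters * bits_per_weight / 8 / 1e9


params = 7e9  # hypothetical 7-billion-parameter model
for bits in (16, 8, 4, 2):
    print(f"{bits:>2}-bit: {model_size_gb(params, bits):.2f} GB")
```

Under these assumptions, going from 16-bit to 4-bit weights cuts storage from about 14 GB to about 3.5 GB, which is the difference between a model that cannot be shipped in a mobile app and one that plausibly can.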

Is picoLLM open-source?

picoLLM SDKs are open-source and available via Picovoice's GitHub and SDK-specific package managers.

We're currently working on open-sourcing picoLLM Inference, making the picoLLM Compression algorithm available on the Picovoice Console, and adding new capabilities to the picoLLM platform to improve the developer experience.

How accurate is picoLLM Compression?

We compare the accuracy of the picoLLM Compression algorithm against popular quantization techniques. Ceteris paribus, i.e., at a given size and model, picoLLM offers better accuracy than popular quantization techniques such as AWQ, GPTQ, LLM.int8(), and SqueezeLLM. You can check the open-source compression benchmark to compare the performance of picoLLM Compression against GPTQ.

Please note that there is no single widely used framework for evaluating LLM accuracy, as LLMs are relatively new and capable of performing various tasks. A metric can be important for one task and irrelevant to others. Taking “accuracy” metrics at face value and comparing two figures calculated in different settings may lead to wrong conclusions. Also, picoLLM Compression's value lies in retaining the original quality while making LLMs available across platforms, i.e., offering the most efficient models without sacrificing accuracy, not offering the most accurate model. We highly encourage enterprises to compare the accuracy against the original models, e.g., llama-2 70B vs. pico.llama-2 70B at different sizes.

How does picoLLM Compression differ from other compression techniques such as AWQ, GPTQ, LLM.int8(), and SqueezeLLM?

Quantization techniques, such as AWQ, GPTQ, LLM.int8(), and SqueezeLLM are developed by researchers for research. picoLLM is developed by researchers for production to enable enterprise-grade applications.

At any given size, picoLLM retains more of the original quality. In other words, picoLLM compresses models more efficiently than the others, offering efficient models without sacrificing accuracy compared to these techniques. Read more from our deep learning research team about our approach to LLM quantization.

How fast is picoLLM?

The smaller the model and the more powerful the system, the faster a language model runs.

Speed tests (tokens/second) are generally done in a controlled environment and, unsurprisingly, favor the model or vendor being promoted. Several factors affect speed, including hardware (GPU, CPU, RAM, motherboard), software (background processes and programs), the original size of the model, and the language model itself.

At Picovoice, our communication has always been fact-based and scientific. Since speed tests are easy to manipulate and it's impossible to create a reproducible framework, we cannot publish any metrics. We strongly suggest that everyone run their own tests in their own environment.
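If you want to run such a test yourself, a small timing harness is enough to get started. The sketch below is one possible approach, not an official Picovoice benchmark. `measure_tokens_per_second` is a hypothetical helper, and it assumes you wrap your engine in a function that takes a prompt and returns the number of tokens it generated. `fake_generate` is a stand-in so the sketch runs without any model.

```python
import statistics
import time
from typing import Callable, List


def measure_tokens_per_second(generate: Callable[[str], int], prompt: str,
                              runs: int = 5, warmup: int = 1) -> List[float]:
    """Time `generate` and return tokens/second for each measured run.

    `generate` is a wrapper you write around your engine: it takes a prompt
    and returns the number of tokens it produced.
    """
    for _ in range(warmup):
        generate(prompt)  # warm caches before measuring
    rates = []
    for _ in range(runs):
        start = time.perf_counter()
        num_tokens = generate(prompt)
        elapsed = time.perf_counter() - start
        rates.append(num_tokens / elapsed)
    return rates


def fake_generate(prompt: str) -> int:
    """Stand-in for a real engine call so the sketch runs on its own."""
    time.sleep(0.01)
    return 32


rates = measure_tokens_per_second(fake_generate, "Hello", runs=3)
print(f"median: {statistics.median(rates):.1f} tokens/s over {len(rates)} runs")
```

Reporting the median over several runs, on the same hardware and with the same prompt and output length, makes results less sensitive to the background processes and other factors listed above.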

How does picoLLM Inference differ from other inference engines?

picoLLM Inference is specifically developed for the picoLLM platform.

Existing inference engines can handle models with a known bit distribution (4-bit or 8-bit) across model weights. picoLLM-compressed weights contain 1-, 2-, 3-, 4-, 5-, 6-, 7-, and 8-bit quantized parameters to retain intelligence while minimizing the model size. Hence, existing inference engines built for a predefined bit distribution cannot match the dynamic nature of picoLLM.
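To see why mixed bit widths need dedicated handling, consider a minimal bit-packing sketch. It is an illustration only and does not represent the picoLLM storage format, which is not documented on this page. The helper names `pack` and `unpack` are hypothetical. The point is that a reader must know each value's width to decode the stream, which a decoder hard-wired to a single width cannot do.

```python
from typing import List, Sequence, Tuple


def pack(values: Sequence[int], widths: Sequence[int]) -> Tuple[bytes, int]:
    """Pack each value into its own bit width; return the bytes and total bit count."""
    acc, total_bits = 0, 0
    for value, width in zip(values, widths):
        if not 1 <= width <= 8 or not 0 <= value < (1 << width):
            raise ValueError("value does not fit in its bit width")
        acc = (acc << width) | value
        total_bits += width
    num_bytes = (total_bits + 7) // 8
    acc <<= num_bytes * 8 - total_bits  # left-align, pad the last byte with zeros
    return acc.to_bytes(num_bytes, "big"), total_bits


def unpack(data: bytes, widths: Sequence[int]) -> List[int]:
    """Recover the values; the reader must know every width in advance."""
    acc = int.from_bytes(data, "big")
    position = len(data) * 8
    values = []
    for width in widths:
        position -= width
        values.append((acc >> position) & ((1 << width) - 1))
    return values


widths = [1, 3, 8, 2, 5]
values = [1, 5, 200, 2, 17]
packed, total_bits = pack(values, widths)
print(len(packed), "bytes for", total_bits, "bits")
```

Stored at a uniform 8 bits, these five values would take 5 bytes; packed at their individual widths they take 3, at the price of a decoder that tracks a per-value width.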

Read more from our engineering team, who explain why and how we developed the picoLLM Inference engine.

Can I use picoLLM offerings with another LLM Inference engine?

There are three major issues with the existing LLM inference engines.

They are not versatile. They only support certain platforms or model types.
They are not ready-to-use, requiring machine learning knowledge.
They cannot handle X-bit quantization, as this innovative approach is unique to picoLLM Compression.

Hugging Face Transformers works with transformer models only. TensorFlow Serving works with TensorFlow models only and has a steep learning curve. TorchServe is designed for PyTorch and integrates well with AWS. NVIDIA Triton Inference Server is designed for NVIDIA GPUs only. OpenVINO is optimized for Intel hardware. In reality, your software can and will run on different platforms. That's why we had to develop picoLLM Inference. It's the only ready-to-use and hardware-agnostic engine.

Custom Models & Support

Do you train custom LLM models? Can I fine-tune picoLLM models?

Yes. At the moment, custom training is available through picoLLM GYM for selected enterprise customers. Please engage with your account manager if you’re already a Picovoice customer.

How do custom large language models compare with general open LLMs?

Custom LLMs are created for specific tasks and specific use cases. General-purpose large language models are jacks-of-all-trades and masters-of-none. In other words, they can help a student with their homework but not a knowledge worker with company-specific information.

General-purpose LLMs are offered by foundation model providers, such as OpenAI, Google, Meta, Microsoft, Cohere, Anthropic, Mistral, Databricks, and so on. They're a good fit for building products such as chatbots, translation services, and content creation apps. Developers building hobby projects or one-size-fits-all applications, or those with no access to training datasets, can choose general-purpose LLMs.

Custom LLMs can offer distinctive feature sets and increased domain expertise, resulting in higher precision and relevance. Hence, custom LLMs have become popular in enterprise applications in several industries, including healthcare, law, and finance. They're used in various applications, such as medical diagnosis, legal document analysis, and financial risk assessment. Unlike general-purpose LLMs, custom LLMs are not ready to use; they require special training that leverages domain-specific data to perform better in certain use cases.

Why shouldn't we just use big vendors' closed-source models, such as GPT-4 or Claude?

If you think they're a better fit, you should. Especially in the beginning, when you are still learning what LLMs can achieve, using an API can be the better approach, as control over data, model, infrastructure, or inference cost is not yet a concern. The drawbacks of closed-source models surface when enterprises want control over their specific use case. If customizability, privacy, ownership, reliability, or inference cost at scale is a concern, then you should be more cautious about choosing a closed-source model.

Customizability: Each vendor has different criteria and processes to develop custom models. In order to send an inquiry to OpenAI, one has to acknowledge that it may take months to train custom models and that pricing starts at $2-3 million.
Privacy: The default business model for closed-source models is to run inference in the cloud. Hence it requires enterprises to send their user data and confidential information to the cloud.
Ownership: You never have ownership of a closed-source model. If your LLM is critical for the success of your product, or in other words, if you view your LLM as an asset rather than a simple tool, it should be owned and controlled by you.
Reliability: You are at the mercy of closed-source model providers. When their API goes down or has an increase in traffic, the performance of your software, hence user experience and productivity, is negatively affected.
Cost at scale: Cloud computing at scale is costly. That's why cloud repatriation has become popular among large enterprises. Large language model APIs are no different, and may be even costlier given the size of the models. If your growth estimation involves high-volume inference, do your math carefully.
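As a starting point for that math, the sketch below compares a usage-based API bill with a fixed monthly cost for running inference yourself. Every figure in it is a placeholder chosen for round numbers, not a real vendor price or a Picovoice quote, and the function names are hypothetical. Replace them with your own traffic forecast and pricing.

```python
def monthly_api_cost(requests_per_month: int, tokens_per_request: int,
                     price_per_million_tokens: float) -> float:
    """Usage-based cost: grows linearly with traffic."""
    return requests_per_month * tokens_per_request * price_per_million_tokens / 1_000_000


def break_even_requests(fixed_monthly_cost: float, tokens_per_request: int,
                        price_per_million_tokens: float) -> float:
    """Monthly request volume at which a fixed-cost deployment matches the API bill."""
    cost_per_request = tokens_per_request * price_per_million_tokens / 1_000_000
    return fixed_monthly_cost / cost_per_request


# All figures below are placeholders, not real vendor prices.
api_cost = monthly_api_cost(2_000_000, 1_000, 10.0)
threshold = break_even_requests(5_000.0, 1_000, 10.0)
print(f"API bill: ${api_cost:,.0f}/month; break-even at {threshold:,.0f} requests/month")
```

Above the break-even volume, the usage-based bill keeps growing with traffic while the fixed cost does not, which is the effect the point on cost at scale describes.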

We have a custom LLM. How can we use picoLLM Compression?

Picovoice Consulting works with Enterprise Plan customers to compress their custom or fine-tuned LLMs using the picoLLM Compression engine.

Do I own my custom models after getting them quantized?

Yes, models trained on your private data or developed by you will be 100% yours.

Can I use picoLLM models with RAG?

Yes. picoLLM models, similar to other LLMs, can be used for complex workflows, including retrieval-augmented generation (RAG). On their own, large language models may struggle to retrieve knowledge and to judge which information is most relevant to each query. Moreover, it may not be optimal to re-train a model every time your knowledge base is updated. RAG applications produce more nuanced and contextually relevant outputs, allowing enterprises to feed the model with information that's always permissions-aware, recent, and relevant.
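The RAG pattern can be sketched in a few lines: retrieve the passages most relevant to the query, then place them in the prompt. The toy example below uses plain word overlap for retrieval so that it stays self-contained. Production systems typically use embedding-based search instead, and the function names and sample knowledge base here are invented for illustration.

```python
from typing import List


def score(query: str, document: str) -> int:
    """Count query words that also appear in the document (toy relevance score)."""
    query_words = set(query.lower().split())
    return len(query_words & set(document.lower().split()))


def retrieve(query: str, documents: List[str], top_k: int = 2) -> List[str]:
    """Return the top_k documents with the highest word overlap."""
    return sorted(documents, key=lambda d: score(query, d), reverse=True)[:top_k]


def build_prompt(query: str, documents: List[str], top_k: int = 2) -> str:
    """Prepend the retrieved passages to the question before calling the model."""
    context = "\n".join(f"- {d}" for d in retrieve(query, documents, top_k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"


# Made-up knowledge base entries.
knowledge_base = [
    "The vacation policy grants 20 paid days per year.",
    "Expense reports are due on the fifth of each month.",
    "The office is closed on public holidays.",
]
prompt = build_prompt("How many vacation days per year?", knowledge_base, top_k=1)
print(prompt)
```

The resulting `prompt` string can then be passed to the model's `generate` call. Because only the knowledge base changes when information is updated, the model itself does not need to be re-trained.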

My platform is not currently supported by picoLLM or we're planning to launch new hardware. How can I get picoLLM to support it?

The picoLLM platform supports the most popular and widely used hardware and software out of the box, from web, mobile, desktop, and on-prem to private cloud. However, there may be certain chipsets we do not currently support. (There are so many of them, yet only so much time and money, making it impossible to support everything.) Thus, open-sourcing the inference engine is on our roadmap. Until then, you can engage with Picovoice Consulting and get the picoLLM inference engine ported to the platform of your choice.

It seems picoLLM doesn't offer the SDK we're using in production. How can I get a new SDK added to picoLLM?

picoLLM platform supports the most popular and widely used SDKs. If you need another SDK, you can check our open-source SDKs and build it yourself or contact Picovoice Consulting. Picovoice Consulting experts can create a public or private library for the SDK of your choice and maintain it.

I am using the official picoLLM demos; however, I get an error. How do I report bugs?

You can create a GitHub issue under the relevant repository/demo.

I'm a solo developer working on a hobby project and I do not have a budget to engage with Picovoice Consulting, what should I do?

Picovoice does not offer dedicated support to Forever-Free Plan users, given the number of developers building with Picovoice and the limited resources we have. You can create a GitHub issue under the relevant repository to provide us with feedback, add product enhancement ideas, or engage with other community members. We appreciate your understanding while we prioritize our enterprise customers for the continuity of our business.

Data Security & Privacy

Where does picoLLM process data?

picoLLM processes data in your environment, whether it's public or private cloud, on-prem, web, mobile, desktop, or embedded.

For how long does picoLLM retain user data?

picoLLM is private by design and has no access to user data. Thus, picoLLM doesn't retain user data, as it never tracks or stores it in the first place.

Is picoLLM HIPAA-compliant?

Yes. Enterprises using picoLLM don't need to share their user data with Picovoice or any other 3rd party to run LLMs, making picoLLM intrinsically HIPAA compliant.

Is picoLLM CCPA-compliant?

Yes. Enterprises using picoLLM don't need to share their user data with Picovoice or any other 3rd party to run LLMs, making picoLLM intrinsically CCPA compliant.

Building with picoLLM

Can I use picoLLM to build a generative AI assistant that uses my company's content with enterprise-grade permissions, data governance, and referable resources?

Yes! McKinsey & Co. estimates that we spend 20% of our time looking for internal information or tracking down colleagues who can help with specific tasks. You can save your company a significant amount of time with a generative assistant without breaking the bank or jeopardizing trade secrets and confidential information. Contact Picovoice Consulting if you need a jumpstart.

How can I deploy custom language models for production?

The answer is “it depends”. Deploying LLMs for production requires diligent work. It depends on your use case, other tools, and the tech stack used, along with hardware and software choice. Given the variables, it can be challenging. Experts from Picovoice Consulting work with Enterprise Plan customers to find the best approach to deploying language models for production.

What are the best practices to develop and deploy LLM applications?

Developers have a myriad of choices while building LLM applications. Choosing the best AI models that fit the use case is a big challenge, given that there are hundreds of open-source LLMs if not thousands to start with. (Although most are the fine-tuned versions of a few base LLMs.) Best practices depend on the use cases, other tools used, and tech stack.

Unfortunately, there is no standardization around this subject. We encourage Picovoice community members to share their projects with [email protected] to inspire fellow developers, and we encourage enterprises to contact Picovoice Consulting to work with AI experts.