picoLLM Compression

Unmatched Accuracy with Next-Gen LLM Quantization

The LLM compression algorithm with unmatched accuracy, reducing runtime and storage requirements of any LLM while retaining model performance

What is picoLLM Compression?

picoLLM Compression is a quantization algorithm that outperforms existing quantization techniques and speeds up LLM inference by shrinking the model.

picoLLM Compression comes with picoLLM Inference Engine, which runs on CPU and GPU across Linux, macOS, Windows, Android, iOS, Chrome, Safari, Edge, Firefox, and Raspberry Pi or other embedded systems in a few lines of code.
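As a rough idea of what "a few lines of code" can look like, here is a minimal sketch using the picoLLM Python SDK (`pip install picollm`). The helper name `run_prompt` is hypothetical, and the AccessKey and model file path are placeholders you would obtain from Picovoice Console. Check the SDK documentation for the current signatures before relying on it.

```python
def run_prompt(access_key, model_path, prompt):
    """Load a picoLLM-compressed model, answer one prompt, and free the engine.

    `run_prompt` is a hypothetical helper for illustration. `access_key` and
    `model_path` are placeholders for values from Picovoice Console.
    """
    import picollm  # imported lazily so this file loads without the SDK installed

    pllm = picollm.create(access_key=access_key, model_path=model_path)
    try:
        result = pllm.generate(prompt)
        return result.completion
    finally:
        # Release native resources held by the engine.
        pllm.release()
```

Calling `run_prompt("YOUR_ACCESS_KEY", "path/to/model.pllm", "Hello!")` would return the generated text, assuming both placeholders are replaced with real values.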

X-bit LLM Quantization

Never heard of X-bit LLM Quantization before? You’re not alone. It’s new and unique to Picovoice.

Existing quantization techniques require a fixed bit allocation scheme, mostly 8-bit or 4-bit. Picovoice researchers found this approach suboptimal and came up with X-bit quantization.

picoLLM Compression automatically learns the optimal bit allocation strategy and quantizes LLMs to minimize loss by allocating bits optimally across and within weights. Learn what makes picoLLM Compression unique from deep learning researchers.
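To make the idea of non-uniform bit allocation concrete, here is a toy sketch. It is not the picoLLM Compression algorithm, whose details are not given on this page. It assumes equally sized weight groups, plain uniform quantization, squared error as the cost, and a simple greedy rule. All function names are hypothetical.

```python
import random


def quantize(values, bits):
    """Uniformly quantize values to 2**bits levels over their own min/max range."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return list(values)
    step = (hi - lo) / ((1 << bits) - 1)
    return [lo + round((v - lo) / step) * step for v in values]


def squared_error(values, bits):
    """Toy cost function: total squared quantization error of one group."""
    return sum((v - q) ** 2 for v, q in zip(values, quantize(values, bits)))


def allocate_bits(groups, avg_bits, min_bits=1, max_bits=8):
    """Greedy allocation: hand the next bit to the group whose error drops most."""
    bits = [min_bits] * len(groups)
    budget = int(avg_bits * len(groups)) - sum(bits)
    for _ in range(budget):
        best, best_gain = None, 0.0
        for i, group in enumerate(groups):
            if bits[i] < max_bits:
                gain = squared_error(group, bits[i]) - squared_error(group, bits[i] + 1)
                if best is None or gain > best_gain:
                    best, best_gain = i, gain
        if best is None:  # every group is already at max_bits
            break
        bits[best] += 1
    return bits


# Synthetic "weights": four groups with very different value ranges.
random.seed(0)
groups = [[random.gauss(0.0, s) for _ in range(64)] for s in (0.01, 0.1, 1.0, 5.0)]
bits = allocate_bits(groups, avg_bits=4)
print(bits)
```

With the same average budget as a fixed 4-bit scheme, the greedy rule spends more bits on the groups where precision matters most for this toy cost and fewer elsewhere.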

Achieve Unmatched Accuracy

picoLLM Compression automatically learns the optimal bit allocation strategy and beats alternative quantization techniques, as shown by an open-source benchmark.

For example, when applied to Llama-3-8B, picoLLM Compression recovers 91%, 99%, and 100% of the MMLU score degradation of the widely adopted GPTQ at 2-, 3-, and 4-bit settings, respectively.
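One plausible reading of "recovering degradation" is the share of the score lost by the baseline quantizer that the improved quantizer wins back. The sketch below encodes that reading. The formula is an assumption rather than the benchmark's stated definition, and the scores in the example are made up purely to show the arithmetic.

```python
def degradation_recovered(full_precision, baseline_quantized, improved_quantized):
    """Fraction of the baseline's score loss that the improved method wins back."""
    lost = full_precision - baseline_quantized
    if lost <= 0:
        raise ValueError("baseline shows no degradation to recover")
    return (improved_quantized - baseline_quantized) / lost


# Hypothetical scores, NOT benchmark results:
# full-precision model 60, baseline quantizer 20, improved quantizer 56.
print(f"{degradation_recovered(60.0, 20.0, 56.0):.0%}")
```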

Speed up LLM Inference with a Smaller Footprint

LLM inference is both compute- and memory-bound. picoLLM shrinks the storage an LLM needs and reduces the memory required for inference, allowing cross-platform deployment and more efficient computation.
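A back-of-the-envelope estimate shows why fewer bits per weight matter. The sketch below counts weights only and assumes a model with 8 billion parameters. It ignores quantization metadata, activations, and the key-value cache, so real files and memory use will differ.

```python
def weight_footprint_gb(num_params, bits_per_weight):
    """Approximate size of the weights alone, in gigabytes (10**9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


# Assumed parameter count for illustration: 8 billion.
for bits in (16, 8, 4, 3, 2):
    print(f"{bits:>2}-bit: {weight_footprint_gb(8e9, bits):.1f} GB")
```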

Quantize any Large Language Model

The most popular open-weight models (Gemma, Llama, Mistral, Mixtral, and Phi) compressed by picoLLM Compression are available with picoLLM Inference. New models will be added as foundation model developers release new and improved open-weight models, so you always have access to the latest and greatest.
  • pico.Gemma
  • pico.Llama
  • pico.Phi
  • pico.Mistral
  • pico.Mixtral
  • Custom
picoLLM Compression quantizes any large language model regardless of its architecture, including custom models!

Compress your own Language Model

Consult an Expert
Get started with picoLLM Compression

The best way to learn about picoLLM is to use it!

Start Now
Forever Free
  • pico.Llama
  • pico.Gemma
  • pico.Mistral
  • pico.Mixtral
  • pico.Phi

FAQ

Feature

With more capabilities coming soon, the initial release of picoLLM offers:

picoLLM Compression is a novel large language model (LLM) quantization algorithm developed within Picovoice. Existing techniques require a fixed bit allocation scheme, which is subpar. Given a task-specific cost function, picoLLM Compression automatically learns the optimal bit allocation strategy across and within an LLM's weights.

picoLLM Inference runs X-bit quantized LLMs, simplifying the development process of adding LLMs to any software. picoLLM Inference is the only local LLM inference engine that
  • runs across Linux, macOS, Windows, Android, iOS, Raspberry Pi, Chrome, Safari, Edge, and Firefox
  • supports CPU and GPU out-of-the-box and has the architecture to tap into other forms of accelerated computing
  • runs any LLM architecture

Yes, picoLLM offers quantized Llama models for free. Quantized Llama models can be downloaded from Picovoice Console within your plan limits, deployed locally across platforms, and freely used with no usage limits.

Yes, picoLLM offers quantized Mistral for free. Quantized Mistral models can be downloaded from Picovoice Console within your plan limits, deployed locally across platforms, and freely used with no usage limits.

Yes, picoLLM offers quantized Mixtral models for free. Quantized Mixtral models can be downloaded from Picovoice Console within your plan limits, deployed locally across platforms, and freely used with no usage limits.

Yes, picoLLM offers quantized Microsoft Phi-2 for free. Quantized Microsoft Phi-2 models can be downloaded from Picovoice Console within your plan limits, deployed locally across platforms, and freely used with no usage limits.

Yes, picoLLM offers quantized Gemma models for free. Quantized Gemma models can be downloaded from Picovoice Console within your plan limits, deployed locally across platforms, and freely used with no usage limits.

Currently, picoLLM GYM is only open to selected enterprise customers. Please engage with your account manager if you’re already a Picovoice customer. If you’re not a customer, become one.

Usage

  • Desktop & Server: Linux, Windows & macOS
  • Mobile: Android & iOS
  • Web Browsers: Chrome, Safari, Edge and Firefox
  • Single Board Computers: Raspberry Pi
  • Cloud Providers: AWS, Azure, Google, IBM, Oracle and others.

Yes, picoLLM is cloud-agnostic and interoperable. You can deploy LLMs in the cloud, work with the cloud provider of your choice, and easily move from one to another.

Yes, you can deploy LLMs in serverless environments, work with the cloud provider of your choice, and easily move from one to another.

Yes, you can run LLMs on-prem with picoLLM.

Yes, you can run LLMs on mobile devices. picoLLM supports both Android and iOS.

Yes, you can run LLMs within web browsers. picoLLM supports all modern web browsers - Chrome, Safari, Firefox, and Edge.

Yes, you can run LLMs on embedded devices.

picoLLM doesn’t track, access, or store user data.

All Picovoice engines, including picoLLM Inference, use AccessKey to serve you within your plan limits. Forever-Free Plan account owners can use picoLLM Inference with no usage limit. However, picoLLM Inference shares the same infrastructure as the other engines, requiring internet connectivity.

Using Picovoice technology for free is not new to Picovoice community members. The Forever-Free Plan offers access to all Picovoice engines and SDKs, allowing anyone to deploy state-of-the-art AI on-device, on-prem, or in the cloud.

a16z predicts a shift toward open LLMs starting in 2024. In 2023, 80-90% of the LLM market was closed-source, dominated by OpenAI. Now ~60% of the market is open-source as enterprises can control their data and models. We decided to provide robust and efficient open LLMs and unlimited inference for free as they
  • bring the control back to enterprises
  • cut closed-source model dependency, expediting the shift toward open models
  • contribute to the competition in the market, fostering innovation

Technical Questions

There are several advantages of running quantized models:
  • Reduced Model Size: Quantization decreases the model sizes of large language models, resulting in
    • Smaller download size: Quantized LLMs require less time and bandwidth to download. For example, a mobile app bundling a large language model may not be approved for the Apple App Store.
    • Smaller storage size: Quantized LLMs occupy less storage space. For example, an Android app using a small language model will take up less storage space, improving the usability of your application, and the experience of users.
    • Less memory usage: Quantized LLMs use less RAM, which speeds up LLM inference and your application and frees up memory for other parts of your application to use, resulting in better performance and stability.
  • Reduced Latency: Total latency consists of compute latency and network latency.
    • Reduced Compute Latency: Compute latency is the time between a machine receiving a request and returning a response. LLMs require powerful infrastructure to run with minimal compute latency. Otherwise, it may take minutes, hours, or even days to respond. Reduced computational requirements allow quantized LLMs to respond faster given the same resources (lower latency) or to achieve the same latency using fewer resources.
    • Zero Network Latency: Network latency, also called delay or lag, is the time data takes to travel across the network. Since quantized LLMs can run where the data is generated rather than requiring data to be sent to a 3rd party cloud, there is no need for data transfer, hence zero network latency.
Quantization can be used to reduce model size and latency, potentially at the expense of some accuracy. Choosing the right quantized model is important to ensure little to no accuracy loss. Our Deep Learning Researchers explain why picoLLM Compression is different from other quantization techniques.

picoLLM SDKs are open-source and available via Picovoice’s GitHub and SDK-specific package managers.

We’re currently working on open-sourcing picoLLM Inference, making the picoLLM Compression algorithm available on the Picovoice Console, and adding new capabilities to the picoLLM platform to improve the developer experience.

We compare the accuracy of the picoLLM Compression algorithm against popular quantization techniques. Ceteris paribus (at a given size and model), picoLLM offers better accuracy than popular quantization techniques such as AWQ, GPTQ, LLM.int8(), and SqueezeLLM. You can check the open-source compression benchmark to compare the performance of picoLLM Compression against GPTQ.

Please note that there is no single widely used framework to evaluate LLM accuracy, as LLMs are relatively new and capable of performing various tasks. One metric can be important for a certain task and irrelevant to others. Taking “accuracy” metrics at face value and comparing two figures calculated in different settings may lead to wrong conclusions. Also, picoLLM Compression’s value add is retaining the original quality while making LLMs available across platforms, i.e., offering the most efficient models without sacrificing accuracy, not offering the most accurate model. We highly encourage enterprises to compare the accuracy against the original models, e.g., llama-2 70B vs. pico.llama-2 70B at different sizes.

Quantization techniques, such as AWQ, GPTQ, LLM.int8(), and SqueezeLLM are developed by researchers for research. picoLLM is developed by researchers for production to enable enterprise-grade applications.

At any given size, picoLLM retains more of the original quality. In other words, picoLLM compresses models more efficiently than the others, offering efficient models without sacrificing accuracy compared to these techniques. Read more from our deep learning research team about our approach to LLM quantization.

The smaller the models and the more powerful the systems are, the faster language models run.
Speed tests (tokens/second) are generally done in a controlled environment and, unsurprisingly, favor the model or vendor running them. Several factors affect speed, including hardware (GPU, CPU, RAM, motherboard), software (background processes and programs), the language model itself, and its original size.
At Picovoice, our communication has always been fact-based and scientific. Since speed tests are easy to manipulate and it’s impossible to create a reproducible framework, we cannot publish any metrics. We strongly suggest everyone run their own tests in their own environment.
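If you do run your own tests, a small harness like the sketch below can help keep them consistent. It is a generic timing loop, not a Picovoice tool. `measure_tokens_per_second` and `fake_generate` are hypothetical names, and `fake_generate` is a stand-in you would replace with a call into whatever inference engine you are evaluating.

```python
import statistics
import time


def measure_tokens_per_second(generate, prompt, runs=5, warmup=1):
    """Time generate(prompt), which must return the number of tokens it produced.

    Returns (median, slowest, fastest) throughput in tokens per second.
    """
    for _ in range(warmup):  # warm caches before timing anything
        generate(prompt)
    rates = []
    for _ in range(runs):
        start = time.perf_counter()
        n_tokens = generate(prompt)
        elapsed = time.perf_counter() - start
        rates.append(n_tokens / elapsed)
    return statistics.median(rates), min(rates), max(rates)


def fake_generate(prompt):
    """Stand-in for a real engine: pretends to emit 32 tokens in about 10 ms."""
    time.sleep(0.01)
    return 32


median, slowest, fastest = measure_tokens_per_second(fake_generate, "Hello")
print(f"median {median:.0f} tok/s (range {slowest:.0f} to {fastest:.0f})")
```

Reporting the median together with the range, on the hardware and with the background load you actually expect in production, gives a more honest picture than a single best-case figure.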

picoLLM Inference is specifically developed for the picoLLM platform.

Existing inference engines can handle models with a known bit distribution (4 or 8-bit) across model weights. picoLLM-compressed weights contain 1, 2, 3, 4, 5, 6, 7, and 8-bit quantized parameters to retain intelligence while minimizing the model size. Hence, existing inference engines built for a pre-defined bit distribution cannot match the dynamic nature of picoLLM.
Read more from our engineering team, who explain why and how we developed the picoLLM Inference engine.
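The sketch below illustrates why a reader that assumes one fixed bit width cannot decode values stored at mixed widths. It is a generic bit-packing example and not the picoLLM file format, which is not described here. The functions `pack` and `unpack` are hypothetical.

```python
def pack(values, widths):
    """Pack unsigned integers back to back, values[i] using widths[i] bits."""
    acc, nbits = 0, 0
    for value, width in zip(values, widths):
        if not 0 <= value < (1 << width):
            raise ValueError(f"{value} does not fit in {width} bits")
        acc = (acc << width) | value
        nbits += width
    pad = (-nbits) % 8  # zero bits appended to reach a whole number of bytes
    return (acc << pad).to_bytes((nbits + pad) // 8, "big")


def unpack(data, widths):
    """Read values back; the reader must know the width of every value."""
    total = len(data) * 8
    acc = int.from_bytes(data, "big")
    out, used = [], 0
    for width in widths:
        used += width
        out.append((acc >> (total - used)) & ((1 << width) - 1))
    return out


values = [1, 2, 5, 9]
widths = [1, 2, 3, 4]  # each value stored with a different number of bits
data = pack(values, widths)

print(unpack(data, widths))        # decoded with the correct per-value widths
print(unpack(data, [4, 4, 4, 4]))  # decoded as if everything were 4-bit
```

The first call round-trips the original values, while the fixed 4-bit reader produces different numbers from the same bytes, which is the mismatch described above.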

There are three major issues with the existing LLM inference engines.
  1. They are not versatile. They only support certain platforms or model types.
  2. They are not ready-to-use, requiring machine learning knowledge.
  3. They cannot handle X-bit quantization, as this innovative approach is unique to picoLLM Compression.
Hugging Face Transformers works with transformers only. TensorFlow Serving works with TensorFlow models only and has a steep learning curve. TorchServe is designed for PyTorch and integrates well with AWS. NVIDIA Triton Inference Server is designed for NVIDIA GPUs only. OpenVINO is optimized for Intel hardware. In reality, your software can and will be run on different platforms. That’s why we had to develop picoLLM Inference. It’s the only ready-to-use and hardware-agnostic engine.

Custom Models & Support

Yes, at the moment custom training is available through picoLLM GYM for selected enterprise customers. Please engage with your account manager if you’re already a Picovoice customer. If you’re not a customer, become one!

Custom LLMs are created for specific tasks and specific use cases. General-purpose large language models are jacks-of-all-trades and masters-of-none. In other words, they can help a student with their homework but not a knowledge worker with company-specific information.

General-purpose LLMs are offered by foundation model providers, such as OpenAI, Google, Meta, Microsoft, Cohere, Anthropic, Mistral, Databricks, and so on. They’re good at developing products such as chatbots, translation services, and content creation apps. Developers building hobby projects, one-size-fits-all applications, or with no access to training datasets can choose general-purpose LLMs.

Custom LLMs can offer distinctive feature sets and increased domain expertise, resulting in unmatched precision and relevance. Hence, custom LLMs have become popular in enterprise applications in several industries, including healthcare, law, and finance. They’re used in various applications, such as medical diagnosis, legal document analysis, and financial risk assessment. Unlike general-purpose LLMs, custom LLMs are not ready to use. They require special training that leverages domain-specific data to perform better in certain use cases.

If you think they’re a better fit, you should. Especially in the beginning, when you want to understand what LLMs can achieve, using an API can be a better approach if control over data, model, infrastructure, or inference cost is not yet a concern. Closed-source model drawbacks become a concern when enterprises want to have control over their specific use case. If customizability, privacy, ownership, reliability, or inference cost at scale is a concern, then you should be more cautious about choosing a closed-source model.
  1. Customizability: Each vendor has different criteria and processes to develop custom models. In order to send an inquiry to OpenAI, one has to acknowledge that it may take months to train custom models and that pricing starts at $2-3 million.
  2. Privacy: The default business model for closed-source models is to run inference in the cloud. Hence it requires enterprises to send their user data and confidential information to the cloud.
  3. Ownership: You never have ownership of a closed-source model. If your LLM is critical for the success of your product, or in other words, if you view your LLM as an asset rather than a simple tool, it should be owned and controlled by you.
  4. Reliability: You are at the mercy of closed-source model providers. When their API goes down or has an increase in traffic, the performance of your software, hence user experience and productivity, is negatively affected.
  5. Cost at scale: Cloud computing at scale is costly. That’s why cloud repatriation has become popular among large enterprises. Large Language Model APIs are no different, if not costlier, given the size of the models. If your growth estimation involves high-volume inference, do your math carefully.

Yes. Picovoice Consulting works with Enterprise Plan customers to compress their custom or fine-tuned LLMs using the picoLLM Compression engine.

Yes, models trained on your private data or developed by you will be 100% yours.

Yes. picoLLM models, similar to other LLMs, can be used for complex workflows, including retrieval augmented generation (RAG). Large Language Models may struggle with knowledge retrieval while understanding which information is most relevant to each query. Moreover, it may not be optimal to re-train a model every time your knowledge base is updated. RAG applications produce more nuanced and contextually relevant outputs, allowing enterprises to feed the model with information that's always permissions-aware, recent, and relevant.
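As a rough illustration of the retrieval step in RAG, the sketch below ranks a few documents against a query with bag-of-words cosine similarity and builds a prompt from the best match. It is a toy: production systems typically use embedding models and a vector store, and the documents, function names, and prompt wording here are all made up for the example. The resulting prompt would then be passed to whichever LLM you run.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query, documents, k=2):
    """Return the k documents most similar to the query."""
    q = Counter(tokenize(query))
    ranked = sorted(documents, key=lambda d: cosine(q, Counter(tokenize(d))), reverse=True)
    return ranked[:k]


def build_prompt(query, documents, k=2):
    """Put the retrieved passages in front of the question."""
    context = "\n".join(f"- {d}" for d in retrieve(query, documents, k))
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}\nAnswer:"


# Hypothetical knowledge base entries.
docs = [
    "Expense reports are due on the fifth business day of each month.",
    "The VPN client must be updated before connecting from home.",
    "New hires receive their laptop on the first day.",
]
query = "When are expense reports due?"
print(build_prompt(query, docs, k=1))
```

Because the knowledge base is consulted at query time, updating a document changes the answers without re-training the model.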

picoLLM platform supports the most popular and widely-used hardware and software out-of-the-box - from web, mobile, desktop, and on-prem to private cloud. However, there may be certain chipsets we do not currently support. (There are so many of them, yet only so much time and money, making it impossible to support everything.) Thus, open-sourcing the inference engine is on our roadmap. Till then, you can engage with Picovoice Consulting and get the picoLLM inference engine ported to the platform of your choice.

picoLLM platform supports the most popular and widely used SDKs. If you need another SDK, you can check our open-source SDKs and build it yourself or contact Picovoice Consulting. Picovoice Consulting experts can create a public or private library for the SDK of your choice and maintain it.

You can create a GitHub issue under the relevant repository/demo.

Picovoice does not offer dedicated support to Forever-Free Plan users, given the number of developers building with Picovoice and the limited resources we have. You can create a GitHub issue under the relevant repository to provide us with feedback, add product enhancement ideas, or engage with other community members. We appreciate your understanding while we prioritize our enterprise customers for the continuity of our business.

Data Security & Privacy

picoLLM processes data in your environment, whether it’s public or private cloud, on-prem, web, mobile, desktop, or embedded.

picoLLM is private by design and has no access to user data. Thus, picoLLM doesn’t retain user data, as it never tracks or stores it in the first place.

Yes. Enterprises using picoLLM don’t need to share their user data with Picovoice or any other 3rd party to run LLMs, making picoLLM intrinsically HIPAA compliant.

Yes. Enterprises using picoLLM don’t need to share their user data with Picovoice or any other 3rd party to run LLMs, making picoLLM intrinsically GDPR compliant.

Yes. Enterprises using picoLLM don’t need to share their user data with Picovoice or any other 3rd party to run LLMs, making picoLLM intrinsically CCPA compliant.

Building with picoLLM

Yes! McKinsey & Co. estimates that we spend 20% of our time looking for internal information or tracking down colleagues who can help with specific tasks. You can save your company a significant amount of time with a generative assistant without breaking the bank or jeopardizing trade secrets and confidential information. Contact Picovoice Consulting if you need a jumpstart.

The answer is “it depends”. Deploying LLMs for production requires diligent work. It depends on your use case, other tools, and the tech stack used, along with hardware and software choice. Given the variables, it can be challenging. Experts from Picovoice Consulting work with Enterprise Plan customers to find the best approach to deploying language models for production.

Developers have a myriad of choices while building LLM applications. Choosing the best AI models that fit the use case is a big challenge, given that there are hundreds of open-source LLMs if not thousands to start with. (Although most are the fine-tuned versions of a few base LLMs.) Best practices depend on the use cases, other tools used, and tech stack.

Unfortunately, there is no standardization around this subject. We encourage Picovoice community members to share their projects with [email protected] to inspire fellow developers, and we encourage enterprises to contact Picovoice Consulting to work with AI experts.

Enterprises face several challenges while building PoCs. Finding talent experienced in machine learning is one of the biggest challenges to start with. We learned this the hard way and experience it every day. On top of that, executives and clients may have unrealistic deadlines.

Experts at Picovoice Consulting help enterprises build PoCs, develop their AI strategy, and work with them hand-in-hand offering the guidance they need.