picoLLM Compression

Unmatched Accuracy with Next-Gen LLM Quantization

The LLM compression algorithm with unmatched accuracy, reducing runtime and storage requirements of any LLM

What is picoLLM Compression?

picoLLM Compression is a quantization algorithm that outperforms existing quantization techniques and speeds up LLM inference by shrinking the model.

picoLLM Compression comes with the picoLLM Inference Engine, which runs on CPU and GPU across Linux, macOS, Windows, Android, iOS, Chrome, Safari, Edge, Firefox, Raspberry Pi, and other embedded systems in a few lines of code.
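
For example, running a compressed model from Python looks roughly like the sketch below. It follows the published picoLLM Python SDK pattern, but the AccessKey placeholder and the model filename are illustrative; check the picoLLM docs for the exact API on your platform.

```python
import picollm

# AccessKey from Picovoice Console and a downloaded .pllm model file;
# both values below are placeholders.
pllm = picollm.create(
    access_key='${ACCESS_KEY}',
    model_path='./phi2.pllm')

res = pllm.generate(prompt='Hello, Phi-2!')
print(res.completion)

pllm.release()
```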

X-bit LLM Quantization

Never heard of X-bit LLM Quantization before? You’re not alone. It’s new and unique to Picovoice.

Existing quantization techniques require a fixed bit allocation scheme, typically 8-bit or 4-bit. Picovoice researchers found this approach suboptimal and developed X-bit quantization.

picoLLM Compression automatically learns the optimal bit allocation strategy and quantizes LLMs to minimize loss by allocating the optimal number of bits across and within weights. Learn from our deep learning researchers what makes picoLLM Compression unique.
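
The exact cost functions and search procedure are Picovoice's own; the toy sketch below only illustrates the general idea of learned, non-uniform bit allocation: estimate how sensitive each weight matrix is to quantization, then spend a fixed bit budget where it reduces error the most. The sensitivity model and greedy search here are illustrative assumptions, not the actual algorithm.

```python
def allocate_bits(sensitivities, sizes, budget_bits, choices=(2, 3, 4, 8)):
    """Toy bit allocation: start every weight matrix at the lowest bit-width,
    then greedily upgrade whichever matrix buys the largest drop in estimated
    quantization error per extra bit spent, until the size budget is exhausted."""
    bits = {name: min(choices) for name in sensitivities}
    spent = sum(bits[n] * sizes[n] for n in bits)

    def err(name, b):
        # Assumed error model: error shrinks ~4x per extra bit, scaled by sensitivity.
        return sensitivities[name] / (4.0 ** b)

    while True:
        best = None
        for name in bits:
            higher = [c for c in choices if c > bits[name]]
            if not higher:
                continue
            nxt = min(higher)
            extra_bits = (nxt - bits[name]) * sizes[name]
            if spent + extra_bits > budget_bits:
                continue
            gain = (err(name, bits[name]) - err(name, nxt)) / extra_bits
            if best is None or gain > best[0]:
                best = (gain, name, nxt, extra_bits)
        if best is None:
            return bits
        _, name, nxt, extra_bits = best
        bits[name] = nxt
        spent += extra_bits


# Example: three weight matrices with different sensitivities and sizes;
# the budget averages out to 4 bits per weight overall.
sens = {'attn.qkv': 9.0, 'mlp.up': 3.0, 'mlp.down': 1.0}
size = {'attn.qkv': 1_000_000, 'mlp.up': 4_000_000, 'mlp.down': 4_000_000}
print(allocate_bits(sens, size, budget_bits=4 * sum(size.values())))
# -> e.g., {'attn.qkv': 8, 'mlp.up': 4, 'mlp.down': 3}
```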

Achieve Unmatched Accuracy

picoLLM Compression automatically learns the optimal bit allocation strategy, outperforming alternative quantization techniques, as shown by an open-source benchmark.

For example, when applied to Llama-3-8B, picoLLM Compression recovers 91%, 99%, and 100% of the MMLU score degradation caused by the widely adopted GPTQ at 2-, 3-, and 4-bit settings, respectively.
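
To unpack what "recovers 91% of the degradation" means, here is the arithmetic with hypothetical scores (the measured numbers are in the open-source benchmark):

```python
fp16 = 65.0       # hypothetical full-precision MMLU score
gptq_2bit = 30.0  # hypothetical GPTQ 2-bit MMLU score
recovery = 0.91   # picoLLM recovers 91% of the degradation at 2-bit

degradation = fp16 - gptq_2bit                      # 35.0 points lost by GPTQ
picollm_2bit = fp16 - (1 - recovery) * degradation
print(round(picollm_2bit, 2))                       # 61.85, within ~3 points of full precision
```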

Speed up LLM Inference with a Smaller Footprint

LLM inference is both compute- and memory-bound. picoLLM shrinks LLMs' storage footprint and reduces the memory required for inference, enabling cross-platform deployment and more efficient computation.

Quantize any Large Language Model

The most popular open-weight models, Gemma, Llama, Mistral, Mixtral, and Phi, compressed by picoLLM Compression, are available with picoLLM Inference. New models will be added as foundation model developers release new and improved open-weight models, so you always have access to the latest and greatest.
  • 👧 pico.Gemma
  • 🦙 pico.Llama
  • φ pico.Phi
  • pico.Mistral
  • 🍸 pico.Mixtral
  • 🃏 Custom
picoLLM Compression quantizes any large language model regardless of its architecture, including custom models!

Compress your own Language Model

Consult an Expert
Get started with picoLLM Compression

The best way to learn about picoLLM is to use it!

Start Free
  • pico.Llama
  • pico.Gemma
  • pico.Mistral
  • pico.Mixtral
  • pico.Phi

FAQ

Features

picoLLM Compression is a novel large language model (LLM) quantization algorithm developed at Picovoice. Existing techniques require a fixed bit allocation scheme, which is subpar. Given a task-specific cost function, picoLLM Compression automatically learns the optimal bit allocation strategy across and within an LLM's weights.

Yes, picoLLM offers ready-to-use quantized Llama models. Supported Llama models can be downloaded from Picovoice Console within your plan limits and deployed locally across platforms.

Yes, picoLLM offers ready-to-use quantized Mistral 7B models. Supported Mistral models can be downloaded from Picovoice Console within your plan limits and deployed locally across platforms.

Note: Mistral 7B is the only open-source model offered under the Mistral name. Mistral Large, Mistral Medium, and Mistral Small are API-only, so developers do not have direct access to those models.

Yes, picoLLM offers ready-to-use quantized Mixtral models. Supported Mixtral models can be downloaded from Picovoice Console within your plan limits, and deployed locally across platforms.

Yes, picoLLM offers ready-to-use quantized Microsoft Phi models. Supported Microsoft Phi models can be downloaded from Picovoice Console within your plan limits and deployed locally across platforms.

Yes, picoLLM offers ready-to-use quantized Gemma models. Supported Gemma models can be downloaded from Picovoice Console within your plan limits and deployed locally across platforms.

Currently, picoLLM Compression is only open to selected enterprise customers. You can engage with your account manager when you become a Picovoice customer.

Technical Questions

There are several advantages of running quantized models:

  • Reduced Model Size: Quantization decreases the size of large language models (a rough size calculation is sketched after this list), resulting in:
    • Smaller download size: Quantized LLMs require less time and bandwidth to download. For example, a mobile app bundling a large language model may exceed size limits and not be approved for the Apple App Store.
    • Smaller storage size: Quantized LLMs occupy less storage space. For example, an Android app using a small language model takes up less storage, improving the usability of your application and the experience of your users.
    • Less memory usage: Quantized LLMs use less RAM, which speeds up LLM inference and your application, and frees up memory for other parts of your application, resulting in better performance and stability.
  • Reduced Latency: Total latency consists of compute latency and network latency.
    • Reduced Compute Latency: Compute latency is the time between a machine receiving a request and returning a response. LLMs require powerful infrastructure to run with minimal compute latency; otherwise, a response may take minutes, hours, or even days. Reduced computational requirements allow quantized LLMs to respond faster given the same resources (reducing latency) or to achieve the same latency using fewer resources.
    • Zero Network Latency: Network latency (delay or lag) is the time data takes to travel across the network. Since quantized LLMs can run where the data is generated rather than sending data to a 3rd-party cloud, there is no data transfer and hence zero network latency.
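
As a rough illustration of how bit-width drives model size, assuming weights dominate the footprint and ignoring embeddings, activations, and metadata:

```python
# Approximate model footprint at different bit-widths.
params = 8_000_000_000  # e.g., an 8B-parameter model such as Llama-3-8B

def size_gb(bits_per_weight):
    return params * bits_per_weight / 8 / 1e9  # bits -> bytes -> GB

for bits in (16, 8, 4, 2):
    print(f"{bits:>2}-bit: ~{size_gb(bits):.0f} GB")
# 16-bit: ~16 GB, 8-bit: ~8 GB, 4-bit: ~4 GB, 2-bit: ~2 GB
```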

Quantization reduces model size and latency, potentially at the expense of some accuracy. Choosing the right quantized model is important to ensure little to no accuracy loss. Our deep learning researchers explain why picoLLM Compression is different from other quantization techniques.

We compare the accuracy of the picoLLM Compression algorithm against popular quantization techniques, ceteris paribus, i.e., at a given model and size. picoLLM offers better accuracy than popular quantization techniques such as AWQ, GPTQ, LLM.int8(), and SqueezeLLM. You can check the open-source compression benchmark to compare the performance of picoLLM Compression against GPTQ.

Please note that there is no single widely accepted framework for evaluating LLM accuracy, as LLMs are relatively new and capable of performing various tasks. One metric can be important for a certain task and irrelevant to others. Taking "accuracy" metrics at face value and comparing two figures calculated in different settings may lead to wrong conclusions. Also, picoLLM Compression's value-add is retaining the original quality while making LLMs available across platforms, i.e., offering the most efficient models without sacrificing accuracy, not offering the most accurate model. We highly encourage enterprises to compare accuracy against the original models, e.g., Llama-2 70B vs. pico.Llama-2 70B at different sizes.

Quantization techniques, such as AWQ, GPTQ, LLM.int8(), and SqueezeLLM, are developed by researchers for research. picoLLM is developed by researchers for production to enable enterprise-grade applications.

At any given size, picoLLM retains more of the original quality. In other words, picoLLM compresses models more efficiently than the others, offering efficient models without sacrificing accuracy compared to these techniques.

Read more from our deep learning research team about our approach to LLM quantization.

The smaller the model and the more powerful the system, the faster a language model runs.

Speed tests (tokens/second) are generally done in a controlled environment and, unsurprisingly, favor the vendor running them. Many factors affect speed: hardware (GPU, CPU, RAM, motherboard), software (background processes and programs), the language model itself and its original size, and so on.

At Picovoice, our communication has always been fact-based and scientific. Since speed tests are easy to manipulate and it is impossible to create a reproducible framework, we do not publish any metrics. We strongly suggest everyone run their own tests in their own environment.
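
A minimal sketch of such a test, assuming a hypothetical `generate` callable that returns the number of tokens produced; swap in whichever inference engine you are evaluating:

```python
import time

def tokens_per_second(generate, prompt, runs=5):
    """Rough throughput measurement. `generate` is a hypothetical callable that
    runs your LLM on `prompt` and returns the number of tokens it produced."""
    generate(prompt)  # warm-up so model loading and caches don't skew results
    total_tokens, total_time = 0, 0.0
    for _ in range(runs):
        start = time.perf_counter()
        total_tokens += generate(prompt)
        total_time += time.perf_counter() - start
    return total_tokens / total_time
```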

There are three major issues with the existing LLM inference engines.

  1. They are not versatile. They only support certain platforms or model types.
  2. They are not ready-to-use, requiring machine learning knowledge.
  3. They cannot handle X-bit quantization, as this innovative approach is unique to picoLLM Compression.

Hugging Face Transformers works with Transformer models only. TensorFlow Serving works with TensorFlow models only and has a steep learning curve. TorchServe is designed for PyTorch and integrates well with AWS. NVIDIA Triton Inference Server is designed for NVIDIA GPUs only. OpenVINO is optimized for Intel hardware. In reality, your software can and will run on different platforms. That's why we had to develop picoLLM Inference. It's the only ready-to-use and hardware-agnostic engine.

Custom Models & Support

Yes. Picovoice Consulting works with Enterprise Plan customers to compress their custom or fine-tuned LLMs and run them with the picoLLM Inference Engine.

Yes, models trained on your private data or developed by you will be 100% yours.

Yes. picoLLM models, like other LLMs, can be used for complex workflows, including retrieval-augmented generation (RAG). Large language models may struggle to retrieve knowledge and to determine which information is most relevant to each query. Moreover, it may not be practical to re-train a model every time your knowledge base is updated. RAG applications produce more nuanced and contextually relevant outputs, allowing enterprises to feed the model information that is always permissions-aware, recent, and relevant.
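
A minimal RAG loop might look like the sketch below, using the picoLLM Python SDK together with a hypothetical `retrieve` function standing in for your vector database or search index; the prompt template and names are illustrative.

```python
import picollm

def retrieve(query, k=3):
    """Hypothetical retriever: return the k most relevant snippets from your
    permissions-aware knowledge base (e.g., via a vector database)."""
    raise NotImplementedError

def answer(pllm, question):
    context = "\n".join(retrieve(question))
    prompt = (
        "Answer the question using only the context below.\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
    return pllm.generate(prompt=prompt).completion

# pllm = picollm.create(access_key='${ACCESS_KEY}', model_path='./model.pllm')
# print(answer(pllm, "What is our refund policy?"))
```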

The picoLLM platform supports the most popular and widely used hardware and software out of the box, from web, mobile, desktop, and on-prem to private cloud. However, there may be certain chipsets we do not currently support. (There are so many of them, yet only so much time and money, making it impossible to support everything.) Thus, open-sourcing the inference engine is on our roadmap. Until then, you can engage with Picovoice Consulting to get the picoLLM Inference Engine ported to the platform of your choice.

You can create a GitHub issue under the relevant repository/demo.

Data Security & Privacy

picoLLM processes data in your environment, whether it's public or private cloud, on-prem, web, mobile, desktop, or embedded.

picoLLM is private by design and has no access to user data. Thus, picoLLM doesn't retain user data as it never tracks or stores them in the first place.

Yes. Enterprises using picoLLM don't need to share their user data with Picovoice or any other 3rd party to run LLMs, making picoLLM intrinsically HIPAA compliant.

Yes. Enterprises using picoLLM don't need to share their user data with Picovoice or any other 3rd party to run LLMs, making picoLLM intrinsically GDPR compliant.

Yes. Enterprises using picoLLM don't need to share their user data with Picovoice or any other 3rd party to run LLMs, making picoLLM intrinsically CCPA compliant.

Building with picoLLM

Yes! McKinsey & Co. estimates that we spend 20% of our time looking for internal information or tracking down colleagues who can help with specific tasks. You can save your company a significant amount of time with a generative assistant without breaking the bank or jeopardizing trade secrets and confidential information. Contact Picovoice Consulting if you need a jumpstart.

The answer is "it depends". Deploying LLMs for production requires diligent work. It depends on your use case, other tools, and the tech stack used, along with hardware and software choice. Given the variables, it can be challenging. Experts from Picovoice Consulting work with Enterprise Plan customers to find the best approach to deploying language models for production.

Developers have a myriad of choices when building LLM applications. Choosing the AI models that best fit the use case is a big challenge, given that there are hundreds, if not thousands, of open-source LLMs to start with. (Although most are fine-tuned versions of a few base LLMs.) Best practices depend on the use case, the other tools used, and the tech stack.

Unfortunately, there is no standardization around this subject. We encourage Picovoice community members to share their projects with [email protected] to inspire fellow developers, and enterprises to contact Picovoice Consulting to work with AI experts.