The LLM compression algorithm with unmatched accuracy, reducing runtime and storage requirements of any LLM
picoLLM Compression is a quantization algorithm that outperforms existing quantization techniques and speeds up LLM inference by shrinking the model.
picoLLM Compression comes with picoLLM Inference Engine, which runs on CPU and GPU across Linux, macOS, Windows, Android, iOS, Chrome, Safari, Edge, Firefox, and Raspberry Pi or other embedded systems in a few lines of code.
Never heard of X-bit LLM Quantization before? You’re not alone. It’s new and unique to Picovoice.
Existing quantization techniques require a fixed bit allocation scheme, mostly 8-bit or 4-bit. Picovoice researchers found this approach suboptimal and developed X-bit quantization.
picoLLM Compression automatically learns the optimal bit allocation strategy and quantizes LLMs to minimize loss by allocating the optimal number of bits across and within weights. Learn what makes picoLLM Compression unique from deep learning researchers.
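To make the difference between fixed and learned bit allocation concrete, here is a toy sketch. It is not the picoLLM Compression algorithm, which is not described on this page. It assumes a simple min/max uniform quantizer, a squared-error loss, and a greedy allocator, and every name in it (`quantize`, `squared_error`, `allocate_bits`) is hypothetical. It only illustrates why spending the same total bit budget unevenly across weight groups can lose less than giving every group 4 bits.

```python
import random


def quantize(values, bits):
    """Round each value to the nearest of 2**bits evenly spaced levels
    spanning the group's own [min, max] range."""
    if bits < 1:
        raise ValueError("bits must be at least 1")
    lo, hi = min(values), max(values)
    if hi == lo:
        return list(values)
    step = (hi - lo) / ((1 << bits) - 1)
    return [lo + round((v - lo) / step) * step for v in values]


def squared_error(values, bits):
    """Total squared reconstruction error of one group at a given bit depth."""
    return sum((v - q) ** 2 for v, q in zip(values, quantize(values, bits)))


def allocate_bits(groups, avg_bits, min_bits=1, max_bits=8):
    """Greedy allocation (an assumption for illustration): start every group
    at min_bits, then hand out the remaining budget one bit at a time to the
    group whose error drops the most from that extra bit."""
    bits = [min_bits] * len(groups)
    remaining = avg_bits * len(groups) - sum(bits)
    for _ in range(remaining):
        candidates = [i for i in range(len(groups)) if bits[i] < max_bits]
        if not candidates:
            break
        best = max(
            candidates,
            key=lambda i: squared_error(groups[i], bits[i])
            - squared_error(groups[i], bits[i] + 1),
        )
        bits[best] += 1
    return bits


# Three synthetic "weight groups" of equal size but very different spread.
rng = random.Random(0)
groups = [[rng.gauss(0.0, scale) for _ in range(64)] for scale in (0.01, 0.1, 1.0)]

fixed_bits = [4] * len(groups)                  # fixed 4-bit scheme
mixed_bits = allocate_bits(groups, avg_bits=4)  # same total budget, uneven split

fixed_error = sum(squared_error(g, b) for g, b in zip(groups, fixed_bits))
mixed_error = sum(squared_error(g, b) for g, b in zip(groups, mixed_bits))
print("fixed:", fixed_bits, fixed_error)
print("mixed:", mixed_bits, mixed_error)
```

Both schemes store the same number of bits in total. The greedy allocator simply moves bits away from groups that barely need them and toward groups where each extra bit removes more error.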
Yes, picoLLM offers ready-to-use quantized Mistral 7B. Supported Mistral models can be downloaded from Picovoice Console within your plan limits and deployed locally across platforms.
Note: Mistral 7B is the only open-source model offered under the Mistral name. Mistral Large, Mistral Medium, and Mistral Small are API only, hence developers do not have access to the models directly.
There are several advantages of running quantized models:
Quantization can be used to reduce model size and latency, potentially at the expense of some accuracy. Choosing the right quantized model is important to ensure little to no accuracy loss. Our Deep Learning Researchers explain why picoLLM Compression is different from other quantization techniques.
We compare the accuracy of the picoLLM Compression algorithm against popular quantization techniques. Ceteris paribus, i.e., at a given size and model, picoLLM offers better accuracy than popular quantization techniques such as AWQ, GPTQ, LLM.int8(), and SqueezeLLM. You can check the open-source compression benchmark to compare the performance of picoLLM Compression against GPTQ.
Please note that there is no single widely accepted framework for evaluating LLM accuracy, as LLMs are relatively new and capable of performing various tasks. One metric can be important for a certain task and irrelevant to others. Taking “accuracy” metrics at face value and comparing two figures calculated in different settings may lead to wrong conclusions. Also, picoLLM Compression’s value add is retaining the original quality while making LLMs available across platforms, i.e., offering the most efficient models without sacrificing accuracy, not offering the most accurate model. We highly encourage enterprises to compare the accuracy against the original models, e.g., llama-2 70B vs. pico.llama-2 70B at different sizes.
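One way to compare a compressed model against its original is to score both on the same held-out text and look at the relative change. The sketch below uses perplexity, which is only one of many possible metrics and may not be the right one for your task. The helper names and the log-probability values are hypothetical, and how you obtain per-token log-probabilities depends on the inference engine you use.

```python
import math


def perplexity(token_logprobs):
    """Perplexity is exp of the negative mean natural-log probability per token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def relative_degradation(original_ppl, compressed_ppl):
    """Fractional perplexity increase of the compressed model over the original."""
    return (compressed_ppl - original_ppl) / original_ppl


# Hypothetical per-token log-probabilities for the SAME evaluation text,
# scored once by the original model and once by a compressed variant.
original_logprobs = [-1.20, -0.35, -2.10, -0.80]
compressed_logprobs = [-1.25, -0.40, -2.20, -0.85]

original_ppl = perplexity(original_logprobs)
compressed_ppl = perplexity(compressed_logprobs)
print(f"original: {original_ppl:.3f}  compressed: {compressed_ppl:.3f}")
print(f"relative degradation: {relative_degradation(original_ppl, compressed_ppl):.1%}")
```

Keeping the evaluation text, tokenizer, and settings identical for both models is what makes the two figures comparable.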
Quantization techniques such as AWQ, GPTQ, LLM.int8(), and SqueezeLLM are developed by researchers for research. picoLLM is developed by researchers for production to enable enterprise-grade applications.
At any given size, picoLLM retains more of the original quality. In other words, picoLLM compresses models more efficiently than the others, offering efficient models without sacrificing accuracy compared to these techniques.
Read more from our deep learning research team about our approach to LLM quantization.
The smaller the model and the more powerful the system, the faster a language model runs.
Speed tests (tokens/second) are generally done in a controlled environment and, unsurprisingly, favor the model or vendor running them. Several factors affect speed: hardware (GPU, CPU, RAM, motherboard), software (background processes and programs), the language model itself, including its original size, and so on.
At Picovoice, our communication has always been fact-based and scientific. Since speed tests are easy to manipulate and it’s impossible to create a reproducible framework, we cannot publish any metrics. We strongly suggest everyone run their own tests in their own environment.
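If you want to run your own speed test, a minimal harness could look like the sketch below. It is an illustration under stated assumptions, not a Picovoice tool. `measure_tokens_per_second` is a hypothetical helper, and `generate` stands for any callable you write around your inference engine that runs one completion and returns the number of tokens it produced.

```python
import statistics
import time


def measure_tokens_per_second(generate, prompt, runs=5, warmup=1):
    """Time repeated calls to generate(prompt) and report tokens per second.

    generate is an assumed callable that returns the number of generated tokens.
    Warm-up runs are discarded so one-time costs such as model loading or
    cache warming do not skew the result.
    """
    for _ in range(warmup):
        generate(prompt)

    rates = []
    for _ in range(runs):
        start = time.perf_counter()
        num_tokens = generate(prompt)
        elapsed = max(time.perf_counter() - start, 1e-9)  # avoid division by zero
        rates.append(num_tokens / elapsed)

    return {
        "median": statistics.median(rates),
        "min": min(rates),
        "max": max(rates),
    }


def fake_generate(prompt):
    """Stand-in for a real model call: pretends to emit 50 tokens in about 10 ms."""
    time.sleep(0.01)
    return 50


result = measure_tokens_per_second(fake_generate, "Hello", runs=3)
print(result)
```

Replace `fake_generate` with a wrapper around your own engine, and run the test on the target device with the same background load you expect in production. Reporting the minimum and maximum alongside the median shows how much the numbers vary from run to run.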
There are three major issues with the existing LLM inference engines.
Hugging Face Transformers works with transformers only. TensorFlow Serving works with TensorFlow models only and has a steep learning curve. TorchServe is designed for PyTorch and integrates well with AWS. NVIDIA Triton Inference Server is designed for NVIDIA GPUs only. OpenVINO is optimized for Intel hardware. In reality, your software can and will be run on different platforms. That’s why we had to develop picoLLM Inference. It’s the only ready-to-use and hardware-agnostic engine.
Developers have a myriad of choices while building LLM applications. Choosing the best AI models that fit the use case is a big challenge, given that there are hundreds of open-source LLMs if not thousands to start with. (Although most are the fine-tuned versions of a few base LLMs.) Best practices depend on the use cases, other tools used, and tech stack.
Unfortunately, there is no standardization around this subject. We encourage Picovoice community members to share their projects with [email protected] to inspire fellow developers, and we encourage enterprises to contact Picovoice Consulting to work with AI experts.