The LLM compression algorithm with unmatched accuracy: it reduces the runtime and storage requirements of any LLM while retaining model performance.
picoLLM Compression is a quantization algorithm that outperforms existing quantization techniques, shrinking models to speed up LLM inference.
picoLLM Compression comes with the picoLLM Inference Engine, which runs on CPU and GPU across Linux, macOS, Windows, Android, iOS, Chrome, Safari, Edge, Firefox, and embedded systems such as Raspberry Pi, all in a few lines of code.
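For illustration, here is a minimal Python sketch of running a picoLLM-compressed model. The `picollm.create` and `generate` calls reflect the Python SDK as we understand it, but treat the exact names and parameters as assumptions and consult the official documentation for your platform.

```python
# Minimal sketch, assuming the picoLLM Python SDK's create/generate
# entry points; exact names and parameters may differ, so consult
# the official documentation.
import picollm

pllm = picollm.create(
    access_key='${ACCESS_KEY}',   # Picovoice AccessKey from the console
    model_path='${MODEL_PATH}')   # path to a picoLLM-compressed model file

res = pllm.generate(prompt='What is X-bit quantization?')
print(res.completion)

pllm.release()  # free native resources
```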
Never heard of X-bit LLM Quantization before? You’re not alone. It’s new and unique to Picovoice.
Existing quantization techniques require a fixed bit-allocation scheme, typically 8-bit or 4-bit. Picovoice researchers found this approach suboptimal and developed X-bit quantization.
picoLLM Compression automatically learns the optimal bit-allocation strategy, quantizing LLMs to minimize loss by allocating bits across and within weights, as sketched below. Learn from our deep learning researchers what makes picoLLM Compression unique.
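The algorithm itself is proprietary, but the core idea of variable bit allocation can be sketched. The toy NumPy code below is a hypothetical illustration, not Picovoice's method: it greedily spends a total bit budget on the weight channels where an extra bit reduces quantization error the most. The helper names `quant_error` and `allocate_bits` are made up for this example.

```python
# Hypothetical sketch of variable ("X-bit") allocation, NOT Picovoice's
# actual algorithm: greedily give extra bits to the channels whose
# quantization error drops the most.
import heapq
import numpy as np

def quant_error(w: np.ndarray, bits: int) -> float:
    """MSE of uniform (min-max) quantization of `w` at `bits` bits."""
    levels = 2 ** bits - 1
    lo, hi = w.min(), w.max()
    if hi == lo:
        return 0.0
    scale = (hi - lo) / levels
    q = np.round((w - lo) / scale) * scale + lo
    return float(np.mean((w - q) ** 2))

def allocate_bits(channels, budget_bits, min_bits=2, max_bits=8):
    """Assign a per-channel bit width under a total bit budget."""
    bits = [min_bits] * len(channels)
    spent = min_bits * len(channels)
    # Max-heap keyed by the error reduction gained from one extra bit.
    heap = [(-(quant_error(c, b) - quant_error(c, b + 1)), i)
            for i, (c, b) in enumerate(zip(channels, bits))]
    heapq.heapify(heap)
    while spent < budget_bits and heap:
        _, i = heapq.heappop(heap)
        if bits[i] >= max_bits:
            continue
        bits[i] += 1
        spent += 1
        if bits[i] < max_bits:
            reduction = (quant_error(channels[i], bits[i])
                         - quant_error(channels[i], bits[i] + 1))
            heapq.heappush(heap, (-reduction, i))
    return bits

# Example: high-variance channels receive more of the budget.
rng = np.random.default_rng(0)
chans = [rng.normal(scale=s, size=256) for s in (0.1, 1.0, 5.0)]
print(allocate_bits(chans, budget_bits=12))
```

Fixed 4-bit or 8-bit schemes correspond to forcing every channel to the same width; the point of the sketch is that a learned, non-uniform assignment can spend the same total number of bits where they reduce loss the most.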