SqueezeLLM is a method for compressing Large Language Models (LLMs) to reduce their memory and compute requirements at inference time. LLMs have impressive capabilities, but their high inference cost hinders their large-scale adoption. Today, the eye-watering cost of LLM inference is a pain only for the few enterprises that already run production-grade deployments. However, it becomes a much larger problem as every company at the proof-of-concept stage graduates into a production environment.
The manuscript is available at SqueezeLLM: Dense-and-Sparse Quantization.
There are three main ideas forming SqueezeLLM.
Sensitive Weights
Any trained deep neural network (such as an LLM) has weights whose values, when changed, alter the output significantly (much more than the rest of the parameters). Hence, perturbing these weights with a quantization operation to compress the model adversely affects the model's accuracy. Therefore, these sensitive weights need to remain untouched. Luckily, we can still compress the model while keeping the sensitive weights intact, as they are a small portion of the model's parameters.
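To make the idea concrete, here is a minimal sketch of how sensitive weights could be flagged, assuming a squared-gradient (diagonal-Fisher-style) sensitivity score and an illustrative 0.05% budget. The function name and the threshold are ours, not taken from the SqueezeLLM implementation.

```python
# Minimal sketch: flag the most "sensitive" weights of a layer using a
# squared-gradient proxy. The 0.05% fraction is illustrative only.
import torch

def sensitive_mask(weight: torch.Tensor, grad: torch.Tensor,
                   fraction: float = 0.0005) -> torch.Tensor:
    """Return a boolean mask marking the top `fraction` most sensitive weights."""
    # The squared gradient approximates how much a small perturbation of each
    # weight would change the loss (a diagonal-Fisher style sensitivity score).
    score = (grad ** 2).flatten()
    k = max(1, int(fraction * score.numel()))
    top_idx = torch.topk(score, k).indices
    mask = torch.zeros_like(score, dtype=torch.bool)
    mask[top_idx] = True
    return mask.view_as(weight)
```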
Dense and Sparse Decomposition
After finding the sensitive weights, we keep them in fp16 and store them in a sparse format. The rest of the parameters are then quantized to low bit widths.
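The sketch below illustrates the split, assuming the boolean mask from the previous snippet. The sparse layout shown (COO) and the reconstruction step are purely illustrative, not SqueezeLLM's actual kernel format.

```python
# Sketch of the dense-and-sparse decomposition: the few sensitive values stay
# exact in a sparse fp16 matrix; the dense remainder is what a low-bit
# quantizer would compress.
import torch

def dense_sparse_split(weight: torch.Tensor, mask: torch.Tensor):
    # Sparse part: sensitive weights kept exactly in fp16 (COO layout here,
    # for illustration only).
    sparse_part = (weight * mask).half().to_sparse()
    # Dense part: everything else, free to be quantized aggressively.
    dense_part = weight * ~mask
    return dense_part, sparse_part

def reconstruct(dense_dequant: torch.Tensor, sparse_part: torch.Tensor) -> torch.Tensor:
    # At inference, the dense low-bit matmul and the sparse fp16 matmul can run
    # separately and be summed; here we simply add the matrices for clarity.
    return dense_dequant + sparse_part.to_dense().float()
```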
Non-Uniform Quantization
Non-Uniform Quantization is more expensive than Uniform Quantization because it requires many lookup-table operations. However, the authors argue that the main bottleneck in LLM inference is memory access, not instruction execution. Hence, the extra compute cost of Non-Uniform Quantization is justified if it enables deeper quantization and less time spent loading weights from RAM.
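The following sketch shows what a lookup-table (non-uniform) quantizer can look like, using plain k-means to build a small codebook per weight matrix. SqueezeLLM itself weights the clustering by sensitivity, so treat this as an illustration of the general idea rather than the paper's exact algorithm.

```python
# Sketch of non-uniform (lookup-table) quantization: cluster a weight matrix's
# values into 2**bits centroids and store only the low-bit indices plus a
# small fp16 lookup table.
import torch

def kmeans_lut_quantize(weight: torch.Tensor, bits: int = 3, iters: int = 20):
    values = weight.flatten()
    n_centroids = 2 ** bits
    # Initialize centroids uniformly over the value range.
    lut = torch.linspace(values.min().item(), values.max().item(), n_centroids)
    for _ in range(iters):
        # Assign every weight to its nearest centroid.
        idx = torch.argmin((values[:, None] - lut[None, :]).abs(), dim=1)
        # Update each centroid to the mean of its assigned weights.
        for c in range(n_centroids):
            assigned = values[idx == c]
            if assigned.numel() > 0:
                lut[c] = assigned.mean()
    idx = torch.argmin((values[:, None] - lut[None, :]).abs(), dim=1)
    # Dequantization indexes the lookup table: this is the extra work a
    # non-uniform scheme pays at inference time.
    dequant = lut[idx].view_as(weight)
    return idx.view_as(weight), lut.half(), dequant
```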
Check our other articles on compression techniques, such as AWQ, GPTQ, and LLM.int8(), or engage with Picovoice Consulting to discuss your LLM strategy.
Consult an Expert