SqueezeLLM is a method for compressing Large Language Models (LLMs) to reduce their memory and compute requirements at inference time. LLMs have impressive capabilities, but their high inference cost will hinder large-scale adoption. Today that cost mainly hurts the few enterprises that already run production-grade deployments. It will become a much larger problem as companies now at the proof-of-concept stage move into production.

The manuscript is available at SqueezeLLM: Dense-and-Sparse Quantization.

SqueezeLLM is built on three main ideas.

Sensitive Weights

Any trained deep neural network (such as an LLM) has some weights whose values affect the output far more than the rest of the parameters do. Perturbing these weights with a quantization operation therefore degrades the model's performance disproportionately, so they need to remain untouched. Luckily, the sensitive weights are only a small portion of the model's parameters, so we can keep them as they are and still compress the model.
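To make the idea of "a small portion of sensitive weights" concrete, here is a minimal Python sketch that ranks weights by a sensitivity score and keeps the top fraction. The score used here (the squared gradient of the loss with respect to each weight), the 25% fraction, and the function name top_sensitive_indices are all assumptions made for illustration; this summary does not specify how sensitivity is measured.

```python
def top_sensitive_indices(scores, fraction):
    """Return the indices of the highest-scoring weights (at least one)."""
    k = max(1, round(len(scores) * fraction))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# Toy data: eight weights and hypothetical gradients of the loss w.r.t. each.
weights = [0.12, -0.50, 0.03, 0.91, -0.07, 0.33, -0.28, 0.05]
grads = [0.01, 0.40, 0.02, 0.05, 0.90, 0.03, 0.02, 0.01]

# Assumed sensitivity proxy: squared gradient (not taken from the text above).
scores = [g * g for g in grads]

sensitive = top_sensitive_indices(scores, fraction=0.25)
print(sensitive)  # [1, 4]
```

Note that the selected weights (indices 1 and 4) are not the largest in magnitude: in this toy setup, sensitivity depends on how much the output reacts to a weight, not on the weight's value.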

Dense and Sparse Decomposition

After finding the sensitive weights, we keep them in fp16 and store them in a sparse format. We then deeply quantize the rest of the parameters, which remain in a dense format.
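The decomposition can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: the sensitive indices are taken as given, the sparse part is a plain dictionary from index to fp16 value, and the dense part uses a simple uniform 3-bit grid as a stand-in quantizer. The helper names to_fp16, decompose, and reconstruct are hypothetical.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def decompose(weights, sensitive_idx, bits=3):
    """Split weights into a sparse fp16 part and a dense low-bit part."""
    keep = set(sensitive_idx)
    # Sparse part: only the sensitive weights, stored as index -> fp16 value.
    sparse = {i: to_fp16(weights[i]) for i in sorted(keep)}
    # Dense part: quantize the remaining weights on an evenly spaced grid
    # (a stand-in quantizer chosen for brevity).
    rest = [w for i, w in enumerate(weights) if i not in keep]
    lo, hi = min(rest), max(rest)
    levels = (1 << bits) - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    # Sensitive positions get a placeholder code of 0 in the dense part.
    codes = [0 if i in keep else round((w - lo) / scale)
             for i, w in enumerate(weights)]
    return sparse, codes, lo, scale


def reconstruct(sparse, codes, lo, scale):
    """Dequantize the dense part, then restore the sparse fp16 entries."""
    out = [lo + c * scale for c in codes]
    for i, v in sparse.items():
        out[i] = v
    return out


weights = [0.12, -0.50, 0.03, 0.91, -0.07, 0.33, -0.28, 0.05]
sparse, codes, lo, scale = decompose(weights, sensitive_idx=[1, 4])
restored = reconstruct(sparse, codes, lo, scale)
```

In this sketch the sensitive entries survive with only fp16 rounding error, while every other weight is off by at most half a grid step. Because the range of the dense part is computed without the sensitive entries, removing them can also narrow the grid the remaining weights are quantized on.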

Non-Uniform Quantization

Non-Uniform Quantization is more expensive than Uniform Quantization because it involves many lookup-table operations. However, the authors claim that the main bottleneck in LLM inference is memory access, not instruction execution. Hence, the extra compute cost of Non-Uniform Quantization is justified if it allows deeper quantization and less time spent waiting for data to load from memory.
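The lookup-table cost mentioned above is easiest to see in code. The following Python sketch assumes the quantization levels are chosen with plain 1-D k-means, which is one common way to build a non-uniform codebook and is used here only as an illustration. The function names fit_codebook, encode, and decode, and the toy values, are hypothetical.

```python
def fit_codebook(values, bits, iters=20):
    """Fit 2**bits quantization levels to the values with 1-D k-means."""
    k = 1 << bits
    ordered = sorted(values)
    # Deterministic init: evenly spaced samples of the sorted values.
    centroids = [ordered[(2 * j + 1) * len(ordered) // (2 * k)] for j in range(k)]
    for _ in range(iters):
        buckets = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda j: abs(v - centroids[j]))
            buckets[nearest].append(v)
        # Move each centroid to the mean of its bucket; keep it if empty.
        centroids = [sum(b) / len(b) if b else c
                     for b, c in zip(buckets, centroids)]
    return centroids


def encode(values, codebook):
    """Replace each value with the index of its nearest codebook entry."""
    return [min(range(len(codebook)), key=lambda j: abs(v - codebook[j]))
            for v in values]


def decode(codes, codebook):
    """Dequantize: every weight costs one table lookup."""
    return [codebook[c] for c in codes]


values = [-1.0, -0.98, -0.02, 0.0, 0.01, 0.5, 0.52, 2.0]
codebook = fit_codebook(values, bits=2)
codes = encode(values, codebook)
restored = decode(codes, codebook)
print(codes)  # [0, 0, 1, 1, 1, 2, 2, 3]
```

With 2 bits, a uniform grid over [-1, 2] would place its levels at -1, 0, 1, and 2, so the values near 0.5 would be off by roughly 0.5. The fitted codebook instead places a level near 0.51, at the price of an extra lookup in decode for each weight.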