SqueezeLLM is a method for compressing Large Language Models (LLMs) to reduce their memory and compute requirements at inference time.
LLMs have impressive capabilities, but their high inference cost hinders their large-scale adoption. Today, the eye-watering cost of LLM inference is a pain only for the few enterprises with production-grade deployments, but it becomes a much larger problem as companies at the proof-of-concept stage graduate into production environments.
The manuscript is available at SqueezeLLM: Dense-and-Sparse Quantization.
There are three main ideas forming SqueezeLLM.
Sensitivity-Based Quantization

In any trained deep neural network (such as an LLM), there is a small set of weights whose values, if changed, shift the output significantly (much more than the rest of the parameters). Perturbing these weights with a quantization operation therefore hurts the model's performance, so these sensitive weights need to remain untouched. Luckily, they make up only a small fraction of the model's parameters, so we can still compress the model while keeping them intact.
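To make the idea concrete, here is a minimal NumPy sketch of ranking weights by sensitivity. It uses squared gradients as a diagonal Fisher-information proxy for the Hessian-based sensitivity described in the paper; the random data, the gradient source, and the 0.5% cutoff are all illustrative assumptions.

```python
import numpy as np

# Illustrative stand-ins: real gradients would come from backprop on a
# small calibration set; these are random placeholders.
rng = np.random.default_rng(0)
weights = rng.normal(size=1000)
grads = rng.normal(size=1000)

# Diagonal Fisher approximation: sensitivity of a weight ~ its squared gradient.
sensitivity = grads ** 2

# Keep the top 0.5% most sensitive weights out of quantization (assumed fraction).
k = int(0.005 * weights.size)
sensitive_idx = np.argsort(sensitivity)[-k:]

mask = np.zeros(weights.size, dtype=bool)
mask[sensitive_idx] = True
print(mask.sum())  # number of weights excluded from quantization
```

The rest of the weights would then be quantized, while the masked positions stay in full precision.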
Dense-and-Sparse Decomposition
After finding the sensitive weights, we keep them in fp16 and store them in a sparse format. The remaining parameters are then aggressively quantized to low bit widths.
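The decomposition can be sketched as W ≈ D + S, where S holds the few sensitive weights exactly and D is a coarsely quantized dense matrix. In this toy sketch, the 5% fraction and magnitude-based selection are illustrative stand-ins for the paper's sensitivity criterion, and uniform levels stand in for its learned non-uniform codebooks.

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.normal(size=(8, 8)).astype(np.float32)

# Pick the weights to keep exact (assumed: top 5% by magnitude).
k = max(1, int(0.05 * W.size))
sparse_idx = np.argsort(np.abs(W).ravel())[-k:]
dense_mask = np.ones(W.size, dtype=bool)
dense_mask[sparse_idx] = False

# Sparse part S: exact values at the sensitive positions (fp16 in practice).
S = np.zeros(W.size, dtype=np.float32)
S[sparse_idx] = W.ravel()[sparse_idx]

# Dense part D: remaining weights snapped to 8 uniform levels (3 bits).
D = np.zeros(W.size, dtype=np.float32)
vals = W.ravel()[dense_mask]
levels = np.linspace(vals.min(), vals.max(), 8)
D[dense_mask] = levels[np.argmin(np.abs(vals[:, None] - levels[None, :]), axis=1)]

# Reconstruction: dense quantized matrix plus sparse exact corrections.
W_hat = (D + S).reshape(W.shape)
print(float(np.abs(W_hat - W).max()))
```

Note that the sensitive positions are reconstructed exactly, so all of the quantization error lives in the dense part.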
Non-Uniform Quantization

Non-uniform quantization is more expensive at runtime than uniform quantization because dequantization involves lookup-table operations. However, the authors observe that the main bottleneck in LLM inference is memory access, not instruction execution. The extra cost of non-uniform quantization is therefore justified if it enables deeper quantization and less time spent waiting on weights to load from memory.
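A minimal sketch of the non-uniform scheme: learn 2^bits centroids with a few k-means (Lloyd) iterations, store low-bit codes plus one small lookup table (LUT), and dequantize with a LUT gather. The quantile initialization, iteration count, and random weights here are simplifying assumptions, not the paper's exact procedure.

```python
import numpy as np

rng = np.random.default_rng(2)
w = rng.normal(size=4096).astype(np.float32)
bits = 3

# Initialize the 2**bits codebook at the quantiles of the weight distribution.
centroids = np.quantile(w, np.linspace(0, 1, 2 ** bits))

for _ in range(10):  # a few Lloyd iterations
    codes = np.argmin(np.abs(w[:, None] - centroids[None, :]), axis=1)
    for j in range(centroids.size):
        members = w[codes == j]
        if members.size:
            centroids[j] = members.mean()  # move centroid to its cluster mean

# Stored representation: 3-bit codes + tiny LUT of centroids.
codes = np.argmin(np.abs(w[:, None] - centroids[None, :]), axis=1)

# Dequantization is a single table lookup per weight: cheap in compute, and the
# 3-bit codes cut the memory traffic that dominates LLM inference.
w_hat = centroids[codes]
print(float(np.mean((w - w_hat) ** 2)))
```

Because the centroids adapt to the weight distribution, this non-uniform codebook typically yields lower error than uniform levels at the same bit width.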