GPTQ is a neural network compression technique that enables the efficient deployment of Generative Pretrained Transformers (GPT). GPTs are a specific type of Large Language Model (LLM) developed by OpenAI. The full manuscript of the paper is available at GPTQ: Accurate Post-Training Quantization for Generative Pre-Trained Transformers.
Why GPTQ?
Thanks to their breakthrough performance, LLMs set themselves apart from traditional Language Models (LMs). Yet, this comes at a massive inference cost. Most LLMs have billions, if not tens of billions, of parameters. Running these models requires hundreds of gigabytes of storage and multi-GPU servers, which can be prohibitive in terms of cost.

Two active research directions aim to reduce the inference cost of GPTs. One avenue is to train smaller and more efficient models. The other is to make existing models smaller after training. The second method has the advantage of not requiring any re-training, which is prohibitively expensive and time-consuming for LLMs. GPTQ falls into the second category.
How Does GPTQ Work?
GPTQ is a Layerwise Quantization algorithm: it quantizes the weights of the LLM layer by layer, in isolation, converting the floating-point parameters of each weight matrix into quantized integers such that the error at the layer's output is minimized.
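As a rough illustration of that conversion (a plain round-to-nearest scheme with a single shared scale, not the paper's exact method), mapping floating-point weights to 4-bit integers might look like this:

```python
import numpy as np

def to_int4(W):
    # Map float weights to 4-bit integers with one shared scale (symmetric,
    # round-to-nearest). GPTQ picks the rounded values more carefully so that
    # the layer's output error is minimized, but the storage idea is similar.
    scale = np.abs(W).max() / 7              # int4 symmetric range is [-8, 7]
    W_int = np.clip(np.round(W / scale), -8, 7).astype(np.int8)
    return W_int, scale

W = np.random.default_rng(0).normal(size=(4, 8))
W_int, scale = to_int4(W)
W_dequant = W_int * scale                    # dequantized weights used at inference
print(np.abs(W - W_dequant).max())           # worst-case rounding error
```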
Layerwise Quantization
Layerwise Quantization aims to find quantized weight values that minimize the error at the layer's output.
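For reference, the layerwise objective from the GPTQ paper can be written as follows, where $W$ is the original weight matrix of a layer, $\hat{W}$ its quantized counterpart, and $X$ a batch of that layer's inputs collected from calibration data:

$$\hat{W}^{*} = \arg\min_{\hat{W}} \lVert W X - \hat{W} X \rVert_2^2$$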
There are a few things to pay attention to when looking at the formula above:
- The formulation requires an understanding of the input's statistics. GPTQ is a One-Shot Quantization, not a Zero-Shot Quantization, as it relies on the distribution of the input features (see the sketch after this list).
- It assumes that the quantization steps are set before running the algorithm.
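To make the role of the input statistics concrete, here is a minimal NumPy sketch (names and shapes are illustrative, not taken from the paper's code): a round-to-nearest baseline on a fixed grid needs no data at all, but evaluating, and therefore minimizing, the layerwise objective requires a calibration batch X.

```python
import numpy as np

def layerwise_error(W, W_q, X):
    # Squared error at the layer's output, measured on calibration inputs X.
    # W, W_q: (rows, cols) original and quantized weights; X: (cols, samples).
    return np.linalg.norm(W @ X - W_q @ X) ** 2

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 16))        # toy weight matrix
X = rng.normal(size=(16, 32))       # hypothetical calibration activations

# Round-to-nearest on a fixed 4-bit symmetric grid (quantization steps set
# beforehand); this baseline ignores X, but its quality is judged against X.
scale = np.abs(W).max() / 7
W_rtn = np.clip(np.round(W / scale), -8, 7) * scale

print(layerwise_error(W, W_rtn, X))
```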
Optimal Brain Quantization
Optimal Brain Quantization is another paper by the same authors that presents a Greedy Algorithm for solving the equation above. In summary, it quantizes weights one by one, and every time a weight is quantized, it adjusts the remaining ones to minimize the induced quantization error.
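A minimal NumPy sketch of that greedy loop for a single row of weights is shown below, under simplifying assumptions: a uniform grid with a fixed scale, no weight grouping, and a small damping term to keep the Hessian invertible. The function and variable names are illustrative, not taken from the paper's code.

```python
import numpy as np

def obq_quantize_row(w, X, scale):
    # Greedy Optimal Brain Quantization of one weight row (simplified sketch).
    # w: (d,) weights of one output neuron; X: (d, n) calibration inputs;
    # scale: fixed quantization step of a uniform grid.
    w = w.astype(np.float64).copy()
    H_inv = np.linalg.inv(2.0 * X @ X.T + 1e-6 * np.eye(len(w)))  # damped Hessian inverse
    q = np.zeros_like(w)
    remaining = list(range(len(w)))

    while remaining:
        # 1. Pick the weight whose quantization induces the least output error.
        errs = [(np.round(w[i] / scale) * scale - w[i]) ** 2 / H_inv[i, i]
                for i in remaining]
        i = remaining[int(np.argmin(errs))]

        # 2. Quantize it.
        q[i] = np.round(w[i] / scale) * scale
        err = (w[i] - q[i]) / H_inv[i, i]

        # 3. Adjust the remaining weights to compensate for the induced error.
        w -= err * H_inv[:, i]

        # 4. Remove the quantized coordinate from the inverse Hessian.
        H_inv -= np.outer(H_inv[:, i], H_inv[i, :]) / H_inv[i, i]
        remaining.remove(i)

    return q
```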
The problem with the original algorithm is that it is too computationally expensive to run on a model the size of GPT. Remember that the weight matrices in these models each contain tens of millions of parameters. The rest of the paper presents math and engineering tricks that make this algorithm run fast enough for models of the size of LLMs.
Check our other articles on compression techniques, such as AWQ, LLM.int8() and SqueezeLLM, or engage with Picovoice Consulting to discuss your LLM strategy.