GPTQ is a post-training quantization technique, a form of neural network compression, that enables the efficient deployment of Generative Pre-trained Transformers (GPTs). GPTs are a specific type of Large Language Model (LLM) developed by OpenAI.

The full manuscript of the paper is available under the title "GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers".


Thanks to their breakthrough performance, LLMs set themselves apart from traditional Language Models (LMs). Yet this performance comes at a massive inference cost. Most LLMs have billions, if not tens of billions, of parameters. Running these models requires hundreds of gigabytes of storage and multi-GPU servers, which can be prohibitively expensive.
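To make the storage claim concrete, here is a back-of-the-envelope sketch. The 175-billion parameter count is an illustrative assumption, and the calculation covers the weights only (no activations or caches).

```python
def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Storage needed for the weights alone, in gigabytes (10**9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9

# Hypothetical 175-billion-parameter model stored at different precisions.
for bits in (32, 16, 8, 4):
    print(f"{bits:>2}-bit: {weight_memory_gb(175e9, bits):.1f} GB")
# 32-bit: 700.0 GB
# 16-bit: 350.0 GB
#  8-bit: 175.0 GB
#  4-bit: 87.5 GB
```

Going from 16-bit floats to 4-bit integers shrinks the weights by a factor of four, which is the kind of saving that quantization targets.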

Two active research directions aim to reduce the inference cost of GPTs. The first is to train smaller, more efficient models. The second is to make existing models smaller after training. The second approach has the advantage of not requiring any re-training, which is prohibitively expensive and time-consuming for LLMs. GPTQ falls into the second category.

How Does GPTQ Work?

GPTQ is a Layerwise Quantization algorithm. It quantizes the weight matrices of the LLM one layer at a time, each in isolation from the others. For each weight matrix, GPTQ converts the floating-point parameters into quantized integers such that the error at the layer's output is minimized.
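For intuition about what "converting floating-point parameters into quantized integers" means, the following is a minimal sketch of plain round-to-nearest quantization. This is the simple baseline, not GPTQ itself, and the function names and the 4-bit asymmetric scheme are assumptions chosen for illustration.

```python
import numpy as np

def quantize_rtn(w, bits=4):
    """Round-to-nearest quantization of a weight vector to unsigned integers."""
    qmax = 2**bits - 1
    scale = (w.max() - w.min()) / qmax
    if scale == 0:                      # constant vector: any positive scale works
        scale = 1.0
    zero = np.round(-w.min() / scale)   # integer that represents the value 0.0
    q = np.clip(np.round(w / scale) + zero, 0, qmax).astype(np.int64)
    return q, scale, zero

def dequantize(q, scale, zero):
    """Map the integers back to approximate floating-point weights."""
    return scale * (q - zero)

w = np.array([-0.9, -0.2, 0.3, 1.1])
q, scale, zero = quantize_rtn(w, bits=4)
w_approx = dequantize(q, scale, zero)
print(q, np.abs(w - w_approx).max())
```

Round-to-nearest minimizes the error of each weight individually. GPTQ instead chooses the integers so that the error at the layer output is small, which is a different and harder objective.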

Layerwise Quantization

Layerwise Quantization aims to find, for each layer, the quantized weights that minimize the error at that layer's output.
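In the notation commonly used for this problem (the symbols are assumptions here: W is the original weight matrix of a layer, X is a batch of inputs to that layer, and Ŵ is the quantized weight matrix), the objective can be written as:

```latex
\hat{W}^{*} = \operatorname*{arg\,min}_{\hat{W}} \; \left\lVert W X - \hat{W} X \right\rVert_2^2
```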

There are a few things to pay attention to when looking at the formula above:

  • The formulation requires an understanding of the input's statistics. GPTQ is a One-Shot Quantization method, not a Zero-Shot Quantization method, because it relies on the distribution of the input features to each layer.
  • It assumes that the quantization steps (the grid of allowed values) are fixed before the algorithm runs.
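Both points can be illustrated with a small sketch. The random matrices stand in for real weights and calibration inputs, and the grid step of 0.25 is an arbitrary assumption fixed in advance. The sketch also shows that the output error depends on the inputs only through the second-moment matrix H = X Xᵀ, which is the "input statistics" the method needs.

```python
import numpy as np

def layer_output_error(W, W_hat, X):
    """Squared error ||W X - W_hat X||^2 measured on calibration inputs X."""
    return float(np.sum((W @ X - W_hat @ X) ** 2))

def layer_output_error_from_stats(W, W_hat, H):
    """Same error, computed only from the input statistics H = X X^T."""
    D = W - W_hat
    return float(np.sum((D @ H) * D))   # equals trace(D H D^T)

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 8))      # stand-in for one layer's weights
X = rng.normal(size=(8, 128))    # stand-in for calibration inputs (features x samples)

grid_step = 0.25                 # quantization grid, fixed before optimizing
W_hat = np.round(W / grid_step) * grid_step
H = X @ X.T

print(layer_output_error(W, W_hat, X))
print(layer_output_error_from_stats(W, W_hat, H))
```

The two printed values agree up to floating-point error, so once H has been collected from a small calibration set, the raw inputs are no longer needed.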

Optimal Brain Quantization

Optimal Brain Quantization (OBQ) is a method from an earlier paper, sharing authors with GPTQ, that presents a greedy algorithm for solving the layerwise problem described above. In summary, it quantizes the weights one at a time, and every time a weight is quantized, it adjusts the remaining unquantized weights to compensate for the quantization error just introduced.
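The greedy procedure can be sketched for a single row of a weight matrix as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the uniform grid, the damping term added for numerical stability, and all function names are choices made for this example.

```python
import numpy as np

def quantize_to_grid(x, step):
    """Snap values to the nearest multiple of a fixed grid step."""
    return np.round(x / step) * step

def obq_row(w, X, step, damp=1e-2):
    """Greedily quantize one weight row, compensating on the remaining weights."""
    w = w.astype(np.float64).copy()
    H = X @ X.T
    H += damp * np.mean(np.diag(H)) * np.eye(len(w))  # assumption: damping keeps H invertible
    Hinv = np.linalg.inv(H)
    remaining = np.ones(len(w), dtype=bool)
    for _ in range(len(w)):
        idx = np.flatnonzero(remaining)
        q = quantize_to_grid(w[idx], step)
        # Pick the weight whose quantization increases the output error the least.
        scores = (w[idx] - q) ** 2 / np.diag(Hinv)[idx]
        k = int(np.argmin(scores))
        p = idx[k]
        err = (w[p] - q[k]) / Hinv[p, p]
        remaining[p] = False
        w[p] = q[k]
        # Adjust the still-unquantized weights to absorb the error.
        w[remaining] -= err * Hinv[remaining, p]
        # Remove weight p from the inverse Hessian (one Gaussian elimination step).
        Hinv -= np.outer(Hinv[:, p], Hinv[p, :]) / Hinv[p, p]
        Hinv[p, :] = 0.0
        Hinv[:, p] = 0.0
    return w

rng = np.random.default_rng(0)
d, n, step = 16, 256, 0.25
X = rng.normal(size=(d, d)) @ rng.normal(size=(d, n))  # correlated stand-in inputs
w = rng.normal(size=d)
w_obq = obq_row(w, X, step)
w_rtn = quantize_to_grid(w, step)
print("round-to-nearest error:", np.sum((w @ X - w_rtn @ X) ** 2))
print("greedy OBQ-style error:", np.sum((w @ X - w_obq @ X) ** 2))
```

Each step needs an update of the full inverse matrix, and the loop runs once per weight in the row, which hints at why the cost grows quickly with the size of the layer.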

The problem with the original algorithm is that it is too computationally expensive to run on a model the size of GPT. Remember that each weight matrix in these models can have tens of millions of parameters. The rest of the paper presents mathematical and engineering tricks that make this algorithm fast enough for models the size of LLMs.