`GPTQ` is a neural network compression technique that enables the efficient deployment of `Generative Pretrained Transformers` (`GPT`). `GPTs` are a specific type of `Large Language Model` (`LLM`) developed by OpenAI.

The full manuscript of the paper is available at GPTQ: Accurate Post-Training Quantization for Generative Pre-Trained Transformers.

## Why GPTQ?

Thanks to their breakthrough performance, `LLMs` set themselves apart from traditional `Language Models` (`LM`). Yet, this comes at a massive inference cost. Most `LLMs` have billions, if not tens of billions, of parameters. Running these models requires hundreds of gigabytes of storage and multi-GPU servers, which can be prohibitively expensive.
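To make the storage figures concrete, here is a back-of-the-envelope sketch of how weight storage scales with the number of bits per parameter. The helper `model_size_gb` and the 175-billion-parameter model size are illustrative assumptions, not figures from this article, and the estimate covers weights only (no activations or caches).

```python
def model_size_gb(num_params: float, bits_per_param: int) -> float:
    """Storage needed for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# 175e9 parameters is an assumed, illustrative model size.
NUM_PARAMS = 175e9

for bits in (16, 8, 4, 3):
    size = model_size_gb(NUM_PARAMS, bits)
    print(f"{bits:>2}-bit weights: {size:.1f} GB")
```

Halving the bit width halves the storage, which is why going from 16-bit floats to 3 or 4 bits can decide whether a model fits on a single GPU.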

Two active research directions aim to reduce the inference cost of `GPTs`. The first is to train smaller, more efficient models. The second is to make existing models smaller after training. The second approach has the advantage of not requiring any re-training, which is prohibitively expensive and time-consuming for `LLMs`. `GPTQ` falls into the second category.

## How Does GPTQ work?

`GPTQ` is a `Layerwise Quantization` algorithm. `GPTQ` quantizes the weights of the `LLM` one layer at a time, each in isolation. For each weight matrix, `GPTQ` converts the floating-point parameters into quantized integers such that the error at the layer's output is minimized.
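To illustrate what "converting floating-point parameters into quantized integers" means, here is a minimal sketch of plain round-to-nearest quantization of a single weight matrix. This is the simple baseline, not `GPTQ` itself: it ignores the layer's inputs entirely. The function names, the per-row asymmetric grid, and the 4-bit setting are assumptions chosen for illustration.

```python
import numpy as np


def quantize_rtn(W, bits=4):
    """Per-row asymmetric round-to-nearest quantization (baseline, not GPTQ)."""
    qmax = 2**bits - 1
    w_min = W.min(axis=1, keepdims=True)
    w_max = W.max(axis=1, keepdims=True)
    scale = (w_max - w_min) / qmax            # step size of the grid, per row
    scale = np.where(scale == 0, 1.0, scale)  # guard against constant rows
    zero = np.round(-w_min / scale)           # integer offset so w_min maps to 0
    q = np.clip(np.round(W / scale) + zero, 0, qmax).astype(np.uint8)
    return q, scale, zero


def dequantize(q, scale, zero):
    """Map integer codes back to approximate floating-point weights."""
    return scale * (q.astype(np.float32) - zero)


rng = np.random.default_rng(0)
W = rng.normal(size=(4, 16)).astype(np.float32)  # stand-in for one weight matrix

q, scale, zero = quantize_rtn(W, bits=4)
W_hat = dequantize(q, scale, zero)

print("integer codes dtype:", q.dtype, "range:", q.min(), "to", q.max())
print("max abs weight error:", np.abs(W - W_hat).max())
```

Only the integer codes plus one `scale` and `zero` per row need to be stored. `GPTQ` differs in how it chooses the integer codes, not in this storage format.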

### Layerwise Quantization

`Layerwise Quantization` aims to find quantized values that minimize the error at the output.
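For reference, the layerwise objective as it is stated in the GPTQ paper can be written as follows. The symbols are introduced here for illustration: $W$ is the original weight matrix of a layer, $X$ is a matrix of calibration inputs to that layer, and $\widehat{W}$ is the quantized weight matrix being searched for.

$$
\operatorname*{argmin}_{\widehat{W}} \; \lVert W X - \widehat{W} X \rVert_2^2
$$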

There are a few things to pay attention to when looking at the formula above:

- The formulation requires an understanding of the input's statistics. `GPTQ` is a `One-Shot Quantization` method, not a `Zero-Shot Quantization` method, as it relies on the distribution of the input features.
- It assumes that the quantization steps are set before running the algorithm.
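Both points can be seen in a small sketch. The quantization step is fixed before anything else happens, and the resulting output error depends on which inputs the layer sees. The function `layer_output_error`, the step size, and the two synthetic calibration sets are hypothetical choices made for illustration.

```python
import numpy as np


def layer_output_error(W, W_hat, X):
    """Squared Frobenius norm of (W X - W_hat X), the layerwise error."""
    return float(np.linalg.norm(W @ X - W_hat @ X) ** 2)


rng = np.random.default_rng(0)
W = rng.normal(size=(8, 32))  # stand-in for one layer's weights

# The grid is fixed up front: every quantized weight is a multiple of `step`.
step = 0.25
W_hat = step * np.round(W / step)

# Two hypothetical calibration sets with different input statistics.
X_small = rng.normal(scale=0.1, size=(32, 128))
X_large = rng.normal(scale=10.0, size=(32, 128))

err_small = layer_output_error(W, W_hat, X_small)
err_large = layer_output_error(W, W_hat, X_large)

print("output error on small-magnitude inputs:", err_small)
print("output error on large-magnitude inputs:", err_large)
```

The same quantized weights give very different output errors on the two input sets, which is why a method that minimizes output error needs a sample of realistic inputs.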

### Optimal Brain Quantization

`Optimal Brain Quantization` is a method from an earlier paper by some of the same authors that presents a `Greedy Algorithm` to solve the layerwise quantization problem above. In summary, it quantizes weights one by one, and every time a weight is quantized, it adjusts the remaining ones to minimize the induced quantization error.
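The quantize-then-compensate loop can be sketched for a single row of a weight matrix. This is a simplified illustration under stated assumptions, not the authors' implementation: the function name `obq_style_row`, the uniform grid with a fixed `step`, and the small damping term added for numerical stability are all choices made here. The update uses the inverse of the Hessian of the layerwise error, `H = 2 X Xᵀ`.

```python
import numpy as np


def obq_style_row(w, X, step, damp=0.01):
    """Greedily quantize one weight row, adjusting the not-yet-quantized weights.

    w: (d,) weights of one output row; X: (d, n) calibration inputs.
    """
    w = w.astype(np.float64).copy()
    d = w.shape[0]
    H = 2.0 * X @ X.T
    H += damp * np.mean(np.diag(H)) * np.eye(d)  # assumed damping for stability
    Hinv = np.linalg.inv(H)
    remaining = np.ones(d, dtype=bool)

    for _ in range(d):
        idx = np.flatnonzero(remaining)
        # Greedy choice: the weight whose quantization adds the least error.
        rounded = step * np.round(w[idx] / step)
        cost = (rounded - w[idx]) ** 2 / np.diag(Hinv)[idx]
        j = idx[np.argmin(cost)]

        q_j = step * np.round(w[j] / step)
        err = (w[j] - q_j) / Hinv[j, j]
        w -= err * Hinv[:, j]  # compensate using the remaining weights
        w[j] = q_j             # pin the quantized value exactly
        remaining[j] = False

        # Remove weight j from the inverse Hessian (rank-one downdate).
        Hinv -= np.outer(Hinv[:, j], Hinv[j, :]) / Hinv[j, j]
        Hinv[j, :] = 0.0
        Hinv[:, j] = 0.0
    return w


rng = np.random.default_rng(0)
d, n = 16, 256
X = rng.normal(size=(d, n))
w = rng.normal(size=d)
step = 0.5

w_greedy = obq_style_row(w, X, step)
w_rtn = step * np.round(w / step)  # plain round-to-nearest for comparison


def output_error(v):
    return float(np.sum(((w - v) @ X) ** 2))


print("round-to-nearest output error:", output_error(w_rtn))
print("greedy-with-updates output error:", output_error(w_greedy))
```

Inverting and then updating a `d x d` matrix for every row is the expensive part, which hints at why this approach does not scale directly to very large layers.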

The problem with the original algorithm is that it is too computationally expensive to run on a model the size of `GPT`. Remember that each weight matrix in these models can hold tens of millions of parameters. The rest of the paper presents math and engineering tricks that make this algorithm fast enough for models the size of `LLMs`.