## Problem 🤕

Today, `Large Language Models` (`LLMs`) are widespread, and their inference requirements are drawing increasing attention.

A distinguishing feature of an `LLM` compared to its predecessors (e.g., `BERT`) is the massive number of parameters, usually above a billion. Generating a single token therefore takes billions of operations, and the model needs enough memory to hold billions of parameters. These unforgiving requirements make `LLMs` expensive to run, often requiring multi-GPU setups.

## Solution 💡

`LLM` parameters are typically stored in a 16-bit floating-point format, that is, 2 bytes per parameter. If we can use fewer bytes per parameter, we can reduce the memory requirements for `LLM` inference. The authors of `LLM.int8()` propose a `Quantization` scheme that stores each parameter of an `LLM` in (just above) a single byte, cutting the memory requirements roughly in half.
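The saving is simple arithmetic. Here is a minimal sketch of it, assuming a hypothetical model with 7 billion parameters (the size and the helper name `weight_memory_gib` are made up for illustration) and counting only the memory needed to hold the weights:

```python
def weight_memory_gib(num_params: int, bytes_per_param: float) -> float:
    """Memory needed to hold the weights alone, in GiB."""
    return num_params * bytes_per_param / 2**30


num_params = 7_000_000_000  # hypothetical model size, chosen for illustration

fp16_gib = weight_memory_gib(num_params, 2)  # 16-bit floats: 2 bytes each
int8_gib = weight_memory_gib(num_params, 1)  # 8-bit integers: 1 byte each

print(f"fp16: {fp16_gib:.1f} GiB, int8: {int8_gib:.1f} GiB")
```

Activations and other runtime buffers come on top of this, so the real footprint is larger in both cases.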

Let's consider a matrix-vector multiplication. For simplicity, we take the input to be a one-dimensional vector $x$ of size $n$. $W$ is a matrix of size $m \times n$, and $y$ is the output vector of size $m$.

$$y = Wx$$

or equivalently

$$y_i = \sum_{j=1}^{n} W_{ij}\, x_j$$

`LLM.int8()` quantizes both activations and parameters

$$y \approx \hat{W} \hat{x}$$

There is an error (hence $\approx$) because the quantized values $\hat{W}$ and $\hat{x}$ only approximate the weights and inputs. The goal is to cap the noise to a level that does not affect the output of the final `Softmax` and, hence, the model's effectiveness.
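To make the approximation concrete, here is a minimal sketch of a quantized matrix-vector product. It assumes symmetric absmax scaling to the integer range [-127, 127] with one scale for the whole matrix and one for the input. The helper names and the numbers are invented for illustration and are not the authors' implementation:

```python
def absmax_scale(values):
    """Scale that maps the largest magnitude to 127 (1.0 if all values are zero)."""
    return max(abs(v) for v in values) / 127 or 1.0


def quantize(values, scale):
    """Round each float to the nearest integer step of size `scale`."""
    return [round(v / scale) for v in values]


W = [[0.12, -0.45, 0.30],
     [0.80, 0.05, -0.22]]
x = [1.5, -0.7, 0.9]

w_scale = absmax_scale([w for row in W for w in row])  # one scale for all weights
x_scale = absmax_scale(x)                               # one scale for the input

W_q = [quantize(row, w_scale) for row in W]
x_q = quantize(x, x_scale)

# Exact floating-point result.
exact = [sum(w * v for w, v in zip(row, x)) for row in W]

# Integer dot products, rescaled back to floating point at the end.
approx = [sum(wq * xq for wq, xq in zip(row_q, x_q)) * w_scale * x_scale
          for row_q in W_q]

print(exact)
print(approx)
```

The two printed vectors agree closely but not exactly. The gap between them is the noise that the scheme has to keep small.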

## Challenges 🤷

### Outlier Parameters

A naive `Quantization` approach, i.e., quantizing the whole weight matrix in one pass with a single scale, degrades model quality. If the weight matrix has even a single outlier (an element of extremely large magnitude), that element stretches the `Quantization` steps for the entire matrix, so every other parameter is represented more coarsely. `LLM.int8()` addresses this by quantizing each row separately, which limits the adverse effect of an outlier to a single row.
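The effect of a single outlier is easy to reproduce. The sketch below assumes absmax scaling to [-127, 127] and uses a made-up 2x4 matrix whose second row contains one outlier. It measures how well the ordinary first row survives a quantize-then-dequantize round trip under a matrix-wide scale versus its own row scale:

```python
def absmax_scale(values):
    """Scale that maps the largest magnitude to 127 (1.0 if all values are zero)."""
    return max(abs(v) for v in values) / 127 or 1.0


def roundtrip(values, scale):
    """Quantize to integer steps of size `scale`, then map back to floats."""
    return [round(v / scale) * scale for v in values]


def mean_abs_error(original, restored):
    return sum(abs(a - b) for a, b in zip(original, restored)) / len(original)


W = [
    [0.11, -0.23, 0.37, -0.05],  # ordinary row
    [0.19, 60.0, -0.31, 0.08],   # row containing one outlier
]

# One scale for the whole matrix: the outlier sets the step size for every row.
tensor_scale = absmax_scale([w for row in W for w in row])
err_tensor = mean_abs_error(W[0], roundtrip(W[0], tensor_scale))

# One scale per row: the outlier only affects the row it lives in.
err_row = mean_abs_error(W[0], roundtrip(W[0], absmax_scale(W[0])))

print(f"matrix-wide scale: {err_tensor:.4f}, row-wise scale: {err_row:.4f}")
```

With the matrix-wide scale, most entries of the ordinary row collapse to zero. With the row-wise scale, they are recovered almost exactly.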

### Emergent Features

If features were `Independent and Identically-Distributed` (`i.i.d.`) random variables, we could treat all weights as equally important. But they are not. The authors of `LLM.int8()` report that, in their empirical analysis, a small percentage of the features have a much higher magnitude than the rest. This dominance is problematic. Why? Because even tiny `Quantization` errors in the weights associated with these features cause significant changes in the output.

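Write the quantized weights and inputs as $\hat{W}_{ij} = W_{ij} + \delta W_{ij}$ and $\hat{x}_j = x_j + \delta x_j$, and let $y_i = \sum_j W_{ij}\, x_j$ be the exact output. Then

$$
\begin{aligned}
\hat{y}_i &= \sum_j (W_{ij} + \delta W_{ij})(x_j + \delta x_j) \\
&= \sum_j W_{ij}\, x_j + \sum_j \left( \delta W_{ij}\, x_j + W_{ij}\, \delta x_j + \delta W_{ij}\, \delta x_j \right) \\
&\approx y_i + \sum_j \left( \delta W_{ij}\, x_j + W_{ij}\, \delta x_j \right)
\end{aligned}
$$

The last line holds, assuming `Quantization` errors are minor compared to the floating-point originals, so the second-order term $\delta W_{ij}\, \delta x_j$ can be dropped. Now, suppose the $k$-th feature is dominant, i.e., $|x_k| \gg |x_j|$ for $j \neq k$. The part of the error caused by the weights then becomes

$$\sum_j \delta W_{ij}\, x_j \approx \delta W_{ik}\, x_k$$

The error is proportional to the magnitude of the input feature, which is problematic for these dominant features. Hence, weights associated with dominant features are precious and cannot be perturbed.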
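This proportionality can be checked numerically. The sketch below is a toy example with invented numbers. It applies the same small perturbation `delta` (a stand-in for a quantization error) to one weight and compares the resulting output error when the matching input feature is ordinary versus dominant:

```python
def dot(w, x):
    return sum(a * b for a, b in zip(w, x))


w = [0.5, -0.3, 0.8, 0.1]
k = 2           # index of the feature under study
delta = 0.004   # hypothetical quantization error on the weight w[k]

w_noisy = list(w)
w_noisy[k] += delta

x_plain = [1.0, -0.5, 0.7, 0.9]      # feature k has an ordinary magnitude
x_dominant = [1.0, -0.5, 40.0, 0.9]  # feature k is dominant

err_plain = abs(dot(w_noisy, x_plain) - dot(w, x_plain))
err_dominant = abs(dot(w_noisy, x_dominant) - dot(w, x_dominant))

print(f"ordinary feature: {err_plain:.4f}, dominant feature: {err_dominant:.4f}")
```

The weight error is identical in both cases. Only the magnitude of the feature it multiplies changes, and the output error grows with it.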

How does `LLM.int8()` address this? By not quantizing them! In summary, `LLM.int8()` quantizes all the weights that are not connected to dominant features and leaves the weights connected to dominant features in floating point. Since the percentage of dominant features is low, it still gives us close to a factor of 2 in memory savings.
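The split can be sketched as a matrix-vector product with two paths: an integer path for ordinary features and a floating-point path for dominant ones. This is a simplified illustration, not the authors' implementation. The cutoff `THRESHOLD`, the helper names, and the data are assumptions, and each row is quantized with its own absmax scale:

```python
THRESHOLD = 6.0  # illustrative cutoff for calling a feature dominant (assumption)


def absmax_scale(values):
    """Scale that maps the largest magnitude to 127 (1.0 if empty or all zero)."""
    return max((abs(v) for v in values), default=0.0) / 127 or 1.0


def mixed_matvec(W, x, threshold=THRESHOLD):
    # Split feature indices by the magnitude of the input feature.
    outliers = [j for j, v in enumerate(x) if abs(v) >= threshold]
    regular = [j for j in range(len(x)) if j not in outliers]

    x_scale = absmax_scale([x[j] for j in regular])
    x_q = {j: round(x[j] / x_scale) for j in regular}

    y = []
    for row in W:
        w_scale = absmax_scale([row[j] for j in regular])
        # Integer path: quantized weights times quantized ordinary features.
        int_acc = sum(round(row[j] / w_scale) * x_q[j] for j in regular)
        # Floating-point path: untouched weights times dominant features.
        float_acc = sum(row[j] * x[j] for j in outliers)
        y.append(int_acc * w_scale * x_scale + float_acc)
    return y


W = [[0.12, -0.45, 0.30, 0.07],
     [0.80, 0.05, -0.22, -0.41]]
x = [1.5, -0.7, 35.0, 0.9]  # the third feature is dominant

exact = [sum(w * v for w, v in zip(row, x)) for row in W]
mixed = mixed_matvec(W, x)

print(exact)
print(mixed)
```

Only the columns of `W` that meet a dominant feature stay in floating point, so when such features are rare, nearly all weights still take the 1-byte path.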

## Hot Takes 🔥

### Does this Scale?

Quantizing weights is a good approach, but it only goes so far: the floor is 1 bit per parameter. Considering the growth trend in `LLM` size, we need another method that scales with model size. Perhaps pruning?

### Why Quantize Activations?

Quantizing activations seems unnecessary. Why? Their memory footprint is insignificant compared to that of the weights. There is no need to add extra error, and it is arguably faster to compute dot products in floating point because GPUs support it well.

### Emergent Features and Overfitting

Dominant features are a significant finding and excellent detective work by the authors. The problem is that you need to find them empirically: you must run the network on some data and, using predefined criteria, decide which features are dominant. The downside? What if your data is limited? What if the result doesn't generalize well to other datasets?