Large Language Models (LLMs) are everywhere, so their inference requirements are becoming a point of attention.
A distinguishing feature of an LLM compared to its predecessors (e.g., BERT) is the massive number of parameters, usually above a billion. Hence, LLMs need a lot of computing power to perform billions of operations per generated token, and enough memory to hold billions of parameters. These unforgiving requirements make them expensive to run, often demanding multi-GPU setups.
LLM parameters are typically stored in a 16-bit floating-point format, i.e., 2 bytes per parameter. If we can use fewer bytes per parameter, we can reduce the memory requirements for LLM inference. The authors of LLM.int8() propose a quantization scheme that stores each parameter of an LLM in (just above) a single byte, reducing the memory requirements by roughly a factor of two.
Let's consider a matrix-vector multiplication operation $y = Wx$. For simplicity, $x$ is a one-dimensional vector of size $n$, $W$ is a matrix of size $m \times n$, and $y$ is the output vector of size $m$.
LLM.int8() quantizes both activations and parameters:

$$ y = Wx \approx s_W\, s_x\, (W_q\, x_q) + \epsilon $$

where $W_q$ and $x_q$ are the int8 representations and $s_W$, $s_x$ are the scaling factors. There is an error term (hence $\epsilon$) because the quantized values only approximate the weights and inputs. The goal is to cap this noise at a level that does not affect the output of the final softmax and, hence, the model's effectiveness.
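As a rough sketch of what this looks like in practice, here is a minimal absmax (symmetric) int8 quantization of a matrix-vector product in NumPy. The function and variable names are illustrative, not the paper's actual implementation:

```python
import numpy as np

def quantize_absmax(t):
    """Symmetric absmax quantization: scale so the largest |value| maps to 127."""
    scale = np.abs(t).max() / 127.0
    return np.round(t / scale).astype(np.int8), scale

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 8)).astype(np.float32)   # toy weight matrix (m=4, n=8)
x = rng.normal(size=8).astype(np.float32)        # toy input vector

Wq, sW = quantize_absmax(W)
xq, sx = quantize_absmax(x)

# int8 matmul accumulated in int32, then rescaled back to floating point
y_q = (Wq.astype(np.int32) @ xq.astype(np.int32)) * (sW * sx)
y = W @ x                                        # reference fp32 result
eps = y_q - y                                    # the quantization noise
```

Note the int8 products are accumulated in int32 before rescaling: the individual products fit in 16 bits, but their sum over a long row would overflow int8.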
A naive quantization approach, i.e., quantizing the whole weight matrix in one pass, will degrade performance: if the weight matrix has even a single outlier (an extremely large or small element), it dominates the quantization error, because this one element stretches the quantization step size significantly, putting all other parameters at a disadvantage.
LLM.int8() addresses this by quantizing each row separately, limiting the adverse effect of an outlier to its own row.
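A small numerical experiment (a sketch with synthetic data, not the paper's code) shows how a single outlier poisons a shared quantization scale, and how per-row scales contain the damage:

```python
import numpy as np

def q_dq(t, axis=None):
    """Quantize to int8 with an absmax scale, then dequantize back to float."""
    scale = np.abs(t).max(axis=axis, keepdims=axis is not None) / 127.0
    return np.round(t / scale).astype(np.int8) * scale

rng = np.random.default_rng(1)
W = rng.normal(size=(4, 8))
W[0, 0] = 100.0                      # a single outlier in row 0

# mean reconstruction error on the outlier-free rows under each scheme
err_global  = np.abs(q_dq(W) - W)[1:].mean()          # one scale for the matrix
err_per_row = np.abs(q_dq(W, axis=1) - W)[1:].mean()  # one scale per row
```

With a single global scale, the outlier forces a step size of about 100/127, so every normal-sized weight is rounded coarsely; with per-row scales, only the outlier's own row pays that price.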
If we assume features are Independent and Identically Distributed (i.i.d.) random variables, we should care about all weights equally. But features are not i.i.d.: the authors of LLM.int8() assert that their empirical analysis shows a small percentage of features have a much higher magnitude. This dominance is problematic. Why? Because even tiny quantization errors in the weights associated with these features cause significant changes in the output.
For a single output $y = \sum_j w_j x_j$, write the quantized weights and inputs as $\hat{w}_j = w_j + \delta_j$ and $\hat{x}_j = x_j + \gamma_j$. Then

$$ \hat{y} = \sum_j (w_j + \delta_j)(x_j + \gamma_j) \approx y + \sum_j (\delta_j x_j + \gamma_j w_j). $$

The last line holds assuming the quantization errors are minor compared to the floating-point originals, so the $\delta_j \gamma_j$ cross terms are negligible. Now, suppose the $i$-th feature is dominant, i.e., $|x_i| \gg |x_j|$ for $j \neq i$. The error then becomes

$$ \epsilon = \hat{y} - y \approx \delta_i x_i. $$

The error is proportional to the magnitude of the input feature, which is problematic for these dominant features. Hence, the weights associated with dominant features are precious and cannot be perturbed.
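We can check this numerically. The sketch below (illustrative names, synthetic data) perturbs one row of weights with small noise; the resulting output error is exactly the noise's dot product with the input, so a dominant input dimension amplifies the error of its associated weight:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 16
w = rng.normal(size=n)                   # one row of a weight matrix
x = rng.normal(size=n)
x[3] = 50.0                              # feature 3 dominates in magnitude

delta = rng.normal(scale=1e-3, size=n)   # small per-weight quantization noise
y_exact = w @ x
y_noisy = (w + delta) @ x
eps = y_noisy - y_exact                  # output error caused by the noise

# eps equals delta @ x exactly, so the single term delta[3] * x[3] tied to the
# dominant feature is typically 50x larger than any other contribution
```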
How does LLM.int8() address this? By not quantizing them! In summary, LLM.int8() quantizes all the weights not connected to dominant features, but it won't touch the rest. Since the percentage of dominant features is low, this still gives us close to a factor of 2 in memory savings.
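The mixed-precision decomposition can be sketched as follows: split the input dimensions by magnitude, run the few outlier dimensions in floating point and the rest through int8, then add the partial results. The threshold and names here are illustrative, assumed for the demo, not the paper's exact mechanism:

```python
import numpy as np

def quantize_absmax(t, axis):
    """Symmetric absmax int8 quantization with per-axis scales."""
    scale = np.abs(t).max(axis=axis, keepdims=True) / 127.0
    return np.round(t / scale).astype(np.int8), scale

rng = np.random.default_rng(3)
W = rng.normal(size=(4, 16)).astype(np.float32)
x = rng.normal(size=16).astype(np.float32)
x[5] = 80.0                                # dimension 5 is a dominant feature

outlier = np.abs(x) > 6.0                  # magnitude threshold (illustrative)

# high-precision path: outlier dimensions stay in floating point
y_fp = W[:, outlier] @ x[outlier]

# low-precision path: everything else goes through int8 with per-row scales
Wq, sW = quantize_absmax(W[:, ~outlier], axis=1)
xq, sx = quantize_absmax(x[~outlier], axis=0)
y_int8 = (Wq.astype(np.int32) @ xq.astype(np.int32)) * (sW.ravel() * sx)

y = y_fp + y_int8                          # combine the two paths

# for comparison: naively quantize everything, outlier included
Wq2, sW2 = quantize_absmax(W, axis=1)
xq2, sx2 = quantize_absmax(x, axis=0)
y_naive = (Wq2.astype(np.int32) @ xq2.astype(np.int32)) * (sW2.ravel() * sx2)
```

Keeping the outlier column in floating point costs almost nothing in memory (one column out of sixteen here, far fewer in a real model) while avoiding the coarse input scale the outlier would otherwise force on every dimension.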
Hot Takes 🔥
Does this Scale?
Quantizing weights is a good approach, but it goes only so far: the hard floor is 1 bit per parameter. Considering the growth trend in LLM size, we need another method that scales with model size. Perhaps pruning?
Why Quantize Activations?
Quantizing activations seems unnecessary. Why? Their memory footprint is insignificant compared to the weights. There is no need to add extra error, and the dot product is fast in floating point because GPUs support it well.
Emergent Features and Overfitting
Dominant features are a significant finding and excellent detective work by the authors. The problem is that you must find them empirically: you run the network on some data and decide, using predefined criteria, which features are dominant. The downside? What if your data is limited? What if it doesn't generalize to other datasets?