Activation-aware Weight Quantization (AWQ) is a simple yet powerful method for quantizing (compressing) Large Language Models (LLMs) to reduce their runtime and storage requirements for inference. The manuscript is available as AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration.
There are two main ideas forming the method.
Weight Activation Magnitude
Most quantization methods quantize weights based on their magnitude, i.e., they set smaller weights to zero and round the rest to the nearest quantization level. This strategy gives the optimal layerwise quantization error if the input is Independent and Identically Distributed (i.i.d.). This assumption holds in many models but certainly not in large Transformers, the building block of the Generative Pretrained Transformer (GPT), the dominant Large Language Model (LLM) architecture introduced by OpenAI. Previous work, such as LLM.int8(), found that a few features have much larger magnitudes and dominate the rest. These features are essential to the model's performance; hence, any parameter that takes them as input should remain intact.
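To make the baseline concrete, here is a minimal round-to-nearest (RTN) sketch of the magnitude-based quantization described above. This is my own toy code, not the authors' implementation; it quantizes each output channel (row) of a weight matrix to 4-bit integers with its own scale and zero point:

```python
import numpy as np

def rtn_quantize(w: np.ndarray, n_bits: int = 4):
    """Round-to-nearest asymmetric quantization, one scale and
    zero point per output channel (row) of the weight matrix."""
    qmax = 2 ** n_bits - 1
    w_min = w.min(axis=1, keepdims=True)
    w_max = w.max(axis=1, keepdims=True)
    scale = (w_max - w_min) / qmax               # step size per channel
    q = np.clip(np.round((w - w_min) / scale), 0, qmax)
    return q.astype(np.uint8), scale, w_min

def dequantize(q, scale, w_min):
    return q * scale + w_min

rng = np.random.default_rng(0)
w = rng.normal(size=(8, 16)).astype(np.float32)
q, scale, zero = rtn_quantize(w, n_bits=4)
w_hat = dequantize(q, scale, zero)
err = np.abs(w - w_hat).max()                    # bounded by half a step
```

Every weight lands within half a quantization step of its original value; the problem AWQ addresses is that this error is uniform across channels, regardless of how important each channel's input actually is.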
AWQ uses this insight and leaves these salient parameters almost untouched. Hence, it chooses how to quantize based on activation magnitude, not weight magnitude.
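A hypothetical sketch of how such salient channels could be identified (the function name, the calibration data, and the 1% fraction are my illustrative choices, not taken from the paper): rank input channels by their mean absolute activation over a calibration batch and protect the top fraction.

```python
import numpy as np

def salient_channels(acts: np.ndarray, frac: float = 0.01):
    """Rank input channels by mean |activation| over a calibration
    batch; return the indices of the top `frac` fraction, i.e. the
    channels whose corresponding weights should be protected."""
    importance = np.abs(acts).mean(axis=0)       # per-channel activation magnitude
    k = max(1, int(frac * acts.shape[1]))
    return np.argsort(importance)[-k:]           # indices of the k largest

rng = np.random.default_rng(1)
acts = rng.normal(size=(512, 100))               # calibration activations
acts[:, 7] *= 50.0                               # one dominant "outlier" feature
idx = salient_channels(acts, frac=0.01)          # recovers channel 7
```

Note that the ranking looks at the activations flowing into a layer, not at the layer's weights, which is exactly the shift in perspective described above.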
Mixed Precision Implementation
Keeping some parameters in fp16 and the rest as integers can make the implementation difficult and slow. Instead, the authors propose scaling the weights corresponding to dominant features per channel before quantization, so that all weights remain integers: scaling a salient channel up reduces its relative quantization error, and the inverse scale is absorbed into the activations. They show that this improves performance.
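A small numerical sketch of why the scaling trick works (my own toy example, not the authors' code; the shapes, scale factor, and per-column quantizer are illustrative assumptions). Mathematically, y = xW is unchanged if one weight channel is multiplied by s and the matching input channel divided by s, but the scaled-up weights lose less relative precision when rounded:

```python
import numpy as np

def rtn(w, n_bits=4):
    """Round-to-nearest quantization with one scale per output
    channel (column); returns the dequantized weights."""
    qmax = 2 ** n_bits - 1
    lo, hi = w.min(axis=0, keepdims=True), w.max(axis=0, keepdims=True)
    step = (hi - lo) / qmax
    return np.clip(np.round((w - lo) / step), 0, qmax) * step + lo

rng = np.random.default_rng(0)
x = rng.normal(size=(64, 32))
x[:, 3] *= 30.0                                  # input channel 3 is salient
w = rng.uniform(-1.0, 1.0, size=(32, 32))
w[3, :] = rng.uniform(-0.2, 0.2, size=32)        # its weights happen to be small

s = np.ones(32)
s[3] = 4.0                                       # scale up only the salient channel

y_ref = x @ w                                    # full-precision reference
y_plain = x @ rtn(w)                             # quantize weights as-is
y_awq = (x / s) @ rtn(w * s[:, None])            # scaled trick: still all-integer weights

err_plain = np.abs(y_ref - y_plain).mean()
err_awq = np.abs(y_ref - y_awq).mean()           # noticeably smaller than err_plain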