Activation-aware Weight Quantization (AWQ) is a simple yet powerful method for quantizing (compressing) Large Language Models (LLMs) to reduce their inference latency and storage requirements.

The manuscript is available at AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration.

AWQ Ideas

There are two main ideas behind the AWQ method.

Weight Activation Magnitude

Most quantization methods quantize weights based on their magnitude, i.e., they set the smallest weights to zero and round the rest to the nearest quantization level. This strategy gives the optimal layerwise quantization error if the inputs are independent and identically distributed (i.i.d.). That assumption holds in many models but certainly not in large Transformers, which are the building blocks of the Generative Pretrained Transformer (GPT), the dominant Large Language Model (LLM) architecture introduced by OpenAI. Previous work, such as LLM.int8(), found that a few features have a much larger magnitude than the rest and dominate them. These features are essential to the model's performance; hence, any parameter that takes them as input should remain intact.

AWQ builds on this insight: it treats the weights that multiply these dominant features as salient parameters and leaves them almost untouched. In other words, it decides which weights to protect based on activation magnitude, not weight magnitude.
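To make the selection criterion concrete, here is a minimal NumPy sketch that ranks input channels by their mean absolute activation on a calibration batch. The function name `salient_channels`, the `fraction` parameter, and the synthetic data are assumptions made for illustration; this is not the authors' code.

```python
import numpy as np

def salient_channels(activations, fraction=0.01):
    """Return indices of the input channels with the largest mean |activation|.

    activations: array of shape (num_tokens, in_features) from calibration data.
    fraction:    share of channels to treat as salient (hypothetical parameter).
    """
    importance = np.abs(activations).mean(axis=0)   # one score per input channel
    k = max(1, int(round(fraction * importance.size)))
    top = np.argsort(importance)[-k:]               # indices of the k largest scores
    return np.sort(top)

# Synthetic calibration batch: 64 features, two of which are made dominant.
rng = np.random.default_rng(0)
X = rng.normal(size=(256, 64))
X[:, [3, 17]] *= 50.0

salient = salient_channels(X, fraction=2 / 64)
```

Note that the ranking never looks at the weights: a column of the weight matrix is considered salient only because the feature it multiplies is large.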

Mixed Precision Implementation

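Keeping some parameters in fp16 and the rest as integers makes an efficient implementation difficult. Instead, the authors propose scaling the weights that correspond to dominant features, per input channel, before quantization, so that all weights can be stored as integers. They argue that this scaling reduces the quantization error on the salient weights and therefore preserves the model's performance.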
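The scaling idea can be sketched as follows, assuming symmetric round-to-nearest INT4 quantization with one step size per output row. The helper `quantize_rtn`, the fixed scale of 2 on a single channel, and the synthetic data are all assumptions chosen for illustration; how the scales are actually chosen is described in the paper and is not reproduced here.

```python
import numpy as np

def quantize_rtn(W, n_bits=4):
    """Symmetric round-to-nearest quantization with one step size per output row.

    Returns the dequantized weights (integer codes times step size) so the
    result can be compared directly against the full-precision weights.
    """
    qmax = 2 ** (n_bits - 1) - 1
    delta = np.abs(W).max(axis=1, keepdims=True) / qmax
    delta = np.maximum(delta, 1e-12)                 # guard against all-zero rows
    q = np.clip(np.round(W / delta), -qmax - 1, qmax)
    return q * delta

rng = np.random.default_rng(0)
W = rng.normal(size=(32, 64))        # (out_features, in_features)
X = rng.normal(size=(256, 64))       # (num_tokens, in_features)
X[:, 3] *= 50.0                      # channel 3 plays the dominant feature

# Per-input-channel scales: 1 everywhere, 2 on the salient channel (assumed value).
s = np.ones(64)
s[3] = 2.0

Y_ref = X @ W.T                                  # full-precision output
Y_plain = X @ quantize_rtn(W).T                  # quantize W directly
Y_scaled = (X / s) @ quantize_rtn(W * s).T       # quantize W * s, feed X / s

err_plain = np.abs(Y_plain - Y_ref).mean()
err_scaled = np.abs(Y_scaled - Y_ref).mean()
print(f"mean abs error, plain:  {err_plain:.4f}")
print(f"mean abs error, scaled: {err_scaled:.4f}")
```

In full precision, `(X / s) @ (W * s).T` equals `X @ W.T` exactly, so the scaling changes nothing until the weights are rounded. After rounding, the error on the scaled column is divided by its scale when the activations are divided by `s`, which is why the salient channel is expected to lose less precision while every weight is still stored as an integer.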