LLM Compression Benchmark
Quantizing large language models (LLMs) is crucial for reducing their size and memory usage while preserving quality, which makes it possible to deploy capable models on devices with limited compute and memory. This benchmark evaluates the performance of Picovoice's picoLLM Compression against GPTQ, a state-of-the-art LLM compression method, across several metrics.
Methodology
Algorithms
We use the following algorithms to compress LLMs:
- GPTQ is a popular one-shot quantization algorithm that quantizes weights layer by layer, adjusting the not-yet-quantized weights so that each layer's outputs closely match those of the full-precision model (a toy sketch of this idea appears after this list).
- picoLLM Compression is Picovoice's in-house LLM compression algorithm. Given a target size, picoLLM optimally distributes the available bits within and across an LLM's weights.
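To make the GPTQ bullet concrete, here is a toy NumPy sketch of the error-compensated, column-by-column quantization idea behind GPTQ. It omits the grouping, blocking, and Cholesky optimizations of the real algorithm, and the function names, damping factor, and 4-bit symmetric grid are illustrative choices rather than part of this benchmark's code.

```python
import numpy as np

def quantize_column(w, scale):
    # 4-bit symmetric round-to-nearest onto a uniform grid
    return np.clip(np.round(w / scale), -8, 7) * scale

def gptq_like_quantize(W, X, bits=4):
    """Toy GPTQ-style quantization of one linear layer.

    W: (out_features, in_features) full-precision weights
    X: (in_features, n_samples) calibration activations
    Returns a quantized W whose outputs W_q @ X stay close to W @ X.
    """
    W = W.astype(np.float64).copy()
    d = W.shape[1]
    scale = np.abs(W).max(axis=1) / (2 ** (bits - 1) - 1)  # per-output-channel scale
    H = X @ X.T
    H += 1e-2 * np.mean(np.diag(H)) * np.eye(d)            # dampened Hessian proxy
    Hinv = np.linalg.inv(H)
    Q = np.zeros_like(W)
    for j in range(d):
        q = quantize_column(W[:, j], scale)
        Q[:, j] = q
        # spread this column's quantization error over the not-yet-quantized columns
        err = (W[:, j] - q) / Hinv[j, j]
        W[:, j + 1:] -= np.outer(err, Hinv[j, j + 1:])
    return Q

# Sanity check on random data: compare layer-output reconstruction error
# against plain round-to-nearest quantization.
rng = np.random.default_rng(0)
W = rng.normal(size=(16, 64))
X = rng.normal(size=(64, 256))
Q = gptq_like_quantize(W, X)
scale = np.abs(W).max(axis=1, keepdims=True) / 7
R = np.clip(np.round(W / scale), -8, 7) * scale            # plain round-to-nearest
print(np.linalg.norm(W @ X - Q @ X), np.linalg.norm(W @ X - R @ X))
```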
Tasks
We evaluate the performance of GPTQ and picoLLM on the following tasks:
- MMLU (Massive Multitask Language Understanding) is a multiple-choice dataset that measures a model's ability to understand natural language across a broad range of subjects.
- ARC (AI2 Reasoning Challenge) is a multiple-choice dataset that measures a model's reasoning ability. Both multiple-choice tasks are scored by comparing answer log-likelihoods, as sketched after this list.
- Perplexity measures how well a language model predicts held-out text; lower is better. We use the C4 dataset to evaluate the perplexity of the models (see the sketch after this list).
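For context on how such task scores are typically produced, the sketch below shows the two standard recipes: multiple-choice questions (MMLU, ARC) are scored by asking the model for the log-likelihood of each candidate answer and picking the highest-scoring one, and perplexity is the exponentiated average negative log-likelihood per token. It uses Hugging Face transformers; the model name, prompt format, chunking, and helper functions are illustrative assumptions rather than this benchmark's exact implementation (see Usage for the actual scripts).

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "gpt2"  # placeholder; substitute the compressed model under test
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)
model.eval()

def choice_logprob(prompt: str, choice: str) -> float:
    """Sum of log-probabilities the model assigns to `choice` given `prompt`."""
    # Tokenizing prompt and prompt + choice separately is a simplification;
    # careful harnesses align token boundaries explicitly.
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.size(1)
    full_ids = tokenizer(prompt + choice, return_tensors="pt").input_ids
    with torch.no_grad():
        log_probs = torch.log_softmax(model(full_ids).logits, dim=-1)
    total = 0.0
    for pos in range(prompt_len, full_ids.size(1)):
        total += log_probs[0, pos - 1, full_ids[0, pos]].item()
    return total

def perplexity(text: str, max_len: int = 1024) -> float:
    """Exponentiated average negative log-likelihood per token of `text`."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    nll, n_tokens = 0.0, 0
    # Score the text in chunks so every token (except the first) is predicted once.
    for start in range(0, ids.size(1) - 1, max_len):
        chunk = ids[:, start:start + max_len + 1]
        if chunk.size(1) < 2:
            break
        with torch.no_grad():
            out = model(chunk, labels=chunk)  # loss = mean NLL of next-token predictions
        nll += out.loss.item() * (chunk.size(1) - 1)
        n_tokens += chunk.size(1) - 1
    return math.exp(nll / n_tokens)

# Multiple-choice scoring: pick the answer with the highest log-likelihood.
question = "Q: What is the boiling point of water at sea level?\nA:"
choices = [" 50 degrees Celsius", " 100 degrees Celsius", " 150 degrees Celsius"]
scores = [choice_logprob(question, c) for c in choices]
print(choices[scores.index(max(scores))])

# Perplexity over a toy text (the benchmark evaluates over C4 documents).
print(perplexity("The quick brown fox jumps over the lazy dog. " * 200))
```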
Models
We evaluate the performance of GPTQ and picoLLM on the following models:
Results
The figures below show the MMLU score, ARC score, and perplexity achieved by each compression engine across the evaluated models.
MMLU Score Comparison
ARC Score Comparison
Perplexity Comparison
Usage
The data and code used to create this benchmark are available on GitHub under the permissive Apache 2.0 license. Detailed instructions for benchmarking individual engines are provided in the following documents: