LLM Compression Benchmark
Quantizing large language models (LLMs) is crucial for reducing their size and memory usage while preserving quality, which makes it possible to deploy capable models on devices with limited compute and memory. This benchmark evaluates the performance of Picovoice's picoLLM Compression against GPTQ, a state-of-the-art LLM compression method, across several metrics.
Methodology
Algorithms
We use the following algorithms to compress LLMs:
- GPTQ is a popular one-shot quantization algorithm that quantizes weights layer by layer, adjusting the not-yet-quantized weights so that each layer's outputs closely match those of the full-precision model (a toy sketch of this idea appears after this list).
- picoLLM Compression is Picovoice's in-house LLM compression algorithm. Given a target size, picoLLM optimally distributes the available bits within and across an LLM's weights.
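To make the GPTQ bullet concrete, here is a toy NumPy sketch of the error-compensated, column-by-column quantization idea behind GPTQ. It omits the grouping, blocking, and Cholesky optimizations of the real algorithm, and the function names, damping factor, and 4-bit symmetric grid are illustrative choices rather than part of this benchmark's code.

```python
import numpy as np

def quantize_column(w, scale):
    # 4-bit symmetric round-to-nearest onto a uniform grid
    return np.clip(np.round(w / scale), -8, 7) * scale

def gptq_like_quantize(W, X, bits=4):
    """Toy GPTQ-style quantization of one linear layer.

    W: (out_features, in_features) full-precision weights
    X: (in_features, n_samples) calibration activations
    Returns a quantized W whose outputs W_q @ X stay close to W @ X.
    """
    W = W.astype(np.float64).copy()
    d = W.shape[1]
    scale = np.abs(W).max(axis=1) / (2 ** (bits - 1) - 1)  # per-output-channel scale
    H = X @ X.T
    H += 1e-2 * np.mean(np.diag(H)) * np.eye(d)            # dampened Hessian proxy
    Hinv = np.linalg.inv(H)
    Q = np.zeros_like(W)
    for j in range(d):
        q = quantize_column(W[:, j], scale)
        Q[:, j] = q
        # spread this column's quantization error over the not-yet-quantized columns
        err = (W[:, j] - q) / Hinv[j, j]
        W[:, j + 1:] -= np.outer(err, Hinv[j, j + 1:])
    return Q

# Sanity check on random data: compare layer-output reconstruction error
# against plain round-to-nearest quantization.
rng = np.random.default_rng(0)
W = rng.normal(size=(16, 64))
X = rng.normal(size=(64, 256))
Q = gptq_like_quantize(W, X)
scale = np.abs(W).max(axis=1, keepdims=True) / 7
R = np.clip(np.round(W / scale), -8, 7) * scale            # plain round-to-nearest
print(np.linalg.norm(W @ X - Q @ X), np.linalg.norm(W @ X - R @ X))
```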
Tasks
We evaluate the performance of GPTQ and picoLLM on the following tasks:
- MMLU (Massive Multitask Language Understanding) is a multiple-choice dataset that measures a model's ability to understand natural language across a broad range of subjects.
- ARC (AI2 Reasoning Challenge) is a multiple-choice dataset that measures a model's reasoning ability. Both multiple-choice tasks are scored by comparing answer log-likelihoods, as sketched after this list.
- Perplexity measures how well a language model predicts held-out text; lower is better. We use the C4 dataset to evaluate the perplexity of the models (see the sketch after this list).
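For context on how such task scores are typically produced, the sketch below shows the two standard recipes: multiple-choice questions (MMLU, ARC) are scored by asking the model for the log-likelihood of each candidate answer and picking the highest-scoring one, and perplexity is the exponentiated average negative log-likelihood per token. It uses Hugging Face transformers; the model name, prompt format, chunking, and helper functions are illustrative assumptions rather than this benchmark's exact implementation (see Usage for the actual scripts).

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "gpt2"  # placeholder; substitute the compressed model under test
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)
model.eval()

def choice_logprob(prompt: str, choice: str) -> float:
    """Sum of log-probabilities the model assigns to `choice` given `prompt`."""
    # Tokenizing prompt and prompt + choice separately is a simplification;
    # careful harnesses align token boundaries explicitly.
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.size(1)
    full_ids = tokenizer(prompt + choice, return_tensors="pt").input_ids
    with torch.no_grad():
        log_probs = torch.log_softmax(model(full_ids).logits, dim=-1)
    total = 0.0
    for pos in range(prompt_len, full_ids.size(1)):
        total += log_probs[0, pos - 1, full_ids[0, pos]].item()
    return total

def perplexity(text: str, max_len: int = 1024) -> float:
    """Exponentiated average negative log-likelihood per token of `text`."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    nll, n_tokens = 0.0, 0
    # Score the text in chunks so every token (except the first) is predicted once.
    for start in range(0, ids.size(1) - 1, max_len):
        chunk = ids[:, start:start + max_len + 1]
        if chunk.size(1) < 2:
            break
        with torch.no_grad():
            out = model(chunk, labels=chunk)  # loss = mean NLL of next-token predictions
        nll += out.loss.item() * (chunk.size(1) - 1)
        n_tokens += chunk.size(1) - 1
    return math.exp(nll / n_tokens)

# Multiple-choice scoring: pick the answer with the highest log-likelihood.
question = "Q: What is the boiling point of water at sea level?\nA:"
choices = [" 50 degrees Celsius", " 100 degrees Celsius", " 150 degrees Celsius"]
scores = [choice_logprob(question, c) for c in choices]
print(choices[scores.index(max(scores))])

# Perplexity over a toy text (the benchmark evaluates over C4 documents).
print(perplexity("The quick brown fox jumps over the lazy dog. " * 200))
```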
Models
We evaluate the performance of GPTQ and picoLLM on the following models:
Results
The figures below show the MMLU score, ARC score, and perplexity achieved by each compression engine across the evaluated models.
MMLU Score Comparison
ARC Score Comparison
Perplexity Comparison
Usage
The data and code used to create this benchmark are available on GitHub under the permissive Apache 2.0 license. Detailed instructions for benchmarking individual engines are provided in the following documents: