Memory-Efficient Fine-Tuning of Compressed Large Language Models via sub-4-bit Integer Quantization

[TL;DR]
PEQA is a novel fine-tuning approach that integrates Parameter-Efficient Fine-Tuning (PEFT) with weight-only quantized LLMs by updating only the quantization scales while keeping the low-bit integer weight matrices frozen. This yields substantial memory savings during fine-tuning and deployment, seamless task switching, and accelerated inference.

Highlights

  • Integrates PEFT with quantized LLMs, updating only the quantization scales while keeping integer matrices frozen.
  • Reduces memory consumption during fine-tuning and deployment, making LLM adaptation feasible even for resource-constrained settings.
  • Maintains quantization benefits post fine-tuning, ensuring accelerated inference.
  • Recovers performance well even for sub-4-bit quantized models fine-tuned on large-scale instruction datasets.
  • Scales up to 65B parameter models while achieving performance close to full-precision LoRA fine-tuning.


Summary

  • Problem Statement: LLM fine-tuning is memory-intensive even with PEFT methods such as LoRA, because the full-precision pre-trained weights still dominate memory usage. Quantization reduces memory, but it is typically applied only after training (PTQ), which limits adaptability to new tasks.
  • Solution: PEQA bridges this gap by fine-tuning only the quantization scales of a pre-quantized LLM while keeping the integer weights frozen. This enables task-specific adaptation with minimal overhead.
  • PEQA Framework:
    • Step 1 (Decomposition): Pre-trained weights are quantized into sub-4-bit integer matrices with per-channel scales and zero-points.
    • Step 2 (Fine-tuning): Only the quantization scales are updated while the integer matrix stays frozen, drastically reducing the number of learnable parameters (a minimal sketch follows below).
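
As a concrete illustration of these two steps, here is a minimal PyTorch sketch of a PEQA-style linear layer. The name PEQALinear, the min-max calibration of \(\mathbf{s}_0, \mathbf{z}_0\), and the 4-bit default are illustrative assumptions, not the authors' implementation:

```python
import torch
import torch.nn as nn


class PEQALinear(nn.Module):
    """Illustrative PEQA-style layer: frozen low-bit weights, trainable per-channel scale."""

    def __init__(self, weight_fp: torch.Tensor, bits: int = 4):
        super().__init__()
        qmax = 2 ** bits - 1
        # Step 1 (decomposition): per-output-channel asymmetric min-max quantization.
        w_min = weight_fp.min(dim=1, keepdim=True).values
        w_max = weight_fp.max(dim=1, keepdim=True).values
        scale = (w_max - w_min).clamp(min=1e-8) / qmax              # s0, shape (n, 1)
        zero = torch.round(-w_min / scale)                          # z0, shape (n, 1)
        w_bar = torch.clamp(torch.round(weight_fp / scale) + zero, 0, qmax) - zero
        # Frozen tensors: integer-valued matrix (kept as float here) and zero-point.
        self.register_buffer("w_bar", w_bar)
        self.register_buffer("zero", zero)
        # Step 2 (fine-tuning): the per-channel scale is the only learnable tensor.
        self.scale = nn.Parameter(scale)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Reconstruct weights on the fly as scale * W_bar (bias omitted for brevity).
        return x @ (self.scale * self.w_bar).t()
```

Only scale is registered as a trainable parameter, so gradients and optimizer states cover \(n\) values per layer instead of \(n \times m\).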


Key Advantages

Memory Efficiency

  • Fine-tunes only the per-channel quantization scales, so gradients and optimizer states cover only a tiny fraction of the weights.
  • Stores the frozen weights as low-bit integers (≤ 4-bit) while maintaining high accuracy; a rough comparison follows below.
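
For a rough sense of scale (illustrative numbers, not taken from the paper): a single 4096 × 4096 projection leaves PEQA with 4,096 trainable per-channel scales, whereas rank-8 LoRA adds 8 × (4096 + 4096) = 65,536 trainable parameters for the same layer, and storing the frozen weights as 4-bit integers instead of FP16 shrinks their footprint by roughly 4× (ignoring the small overhead of scales and zero-points).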

Seamless Task Switching

  • PEQA enables quick and efficient adaptation across tasks by swapping the per-task quantization scales instead of retraining or storing entire models (see the sketch below).
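
A hedged sketch of what task switching could look like, reusing the illustrative PEQALinear module above (task_scales, switch_task, and the task name are hypothetical):

```python
import torch

# Keep one set of fine-tuned scales per task; the shared integer matrices stay loaded.
task_scales = {
    "alpaca": {name: module.scale.detach().clone()
               for name, module in model.named_modules()
               if isinstance(module, PEQALinear)},
    # ... one entry per fine-tuned task
}

def switch_task(model: torch.nn.Module, task: str) -> None:
    """Swap in the per-task quantization scales; the integer weights never move."""
    with torch.no_grad():
        for name, module in model.named_modules():
            if isinstance(module, PEQALinear):
                module.scale.copy_(task_scales[task][name])

switch_task(model, "alpaca")   # ready for inference on this task
```

Because each task is represented only by its scale tensors, the per-task storage is orders of magnitude smaller than a full checkpoint.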

Faster Inference

  • The frozen integer matrix stays intact after fine-tuning, so the model can still be served with weight-only quantized inference kernels and retains the quantization speedup (an export sketch follows below).
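
A sketch of what exporting the fine-tuned model could look like, again using the illustrative PEQALinear: the deployment artifact is just unsigned b-bit codes plus per-channel (scale, zero-point) pairs, the layout that generic weight-only kernels dequantize as scale * (q - zero):

```python
import torch

# Fold the fine-tuned scale (s0 + Δs) into a plain weight-only checkpoint.
export = {}
for name, module in model.named_modules():
    if isinstance(module, PEQALinear):
        export[name] = {
            "q": (module.w_bar + module.zero).to(torch.uint8),  # codes in [0, 2^b - 1]
            "scale": module.scale.detach().clone(),             # s0 + Δs, already merged
            "zero": module.zero.clone(),
        }
torch.save(export, "peqa_int4_checkpoint.pt")                   # hypothetical filename
```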

Experiments

Memory and General Comparison


PEQA vs. QAT vs. PEFT+PTQ

  • PEQA achieves performance close to QAT, significantly outperforming LoRA + PTQ at 3-bit and 4-bit precision.
  • Lower perplexity than the PEFT + PTQ pipeline indicates that quantized models can be fine-tuned effectively without sacrificing accuracy.


Instruction-Tuning with the Alpaca Dataset

  • Evaluated on common-sense reasoning and in-context learning tasks (ARC, PIQA, HellaSwag).
  • Performance comparable to LoRA, with additional memory savings and inference acceleration.


Notations

Quantized Weights and Fine-Tuning

  • Weight-only asymmetric quantization:
    Given the pre-trained weights \(\mathbf{W}_0 \in \mathbb{R}^{n \times m}\) of a fully-connected layer, a bit-width \(b\), and per-channel scales and zero-points \(\mathbf{s}_0, \mathbf{z}_0 \in \mathbb{R}^{n \times 1}\), the asymmetrically quantized pre-trained weights \(\widehat{\mathbf{W}}_0\) can be written as
\[\widehat{\mathbf{W}}_0 = \mathbf{s}_0 \cdot \overline{\mathbf{W}}_0 = \mathbf{s}_0 \cdot \left( \text{clamp} \left( \left\lfloor \frac{\mathbf{W}_0}{\mathbf{s}_0} \right\rceil + \mathbf{z}_0, 0, 2^b - 1 \right) - \mathbf{z}_0 \right),\]
    where \(\lfloor \cdot \rceil\) denotes rounding to the nearest integer and \(\overline{\mathbf{W}}_0\) is the low-bit matrix that PEQA keeps frozen.
  • PEQA fine-tuning updates only the quantization scales:
    \(\widehat{\mathbf{W}} = (\mathbf{s}_0 + \Delta\mathbf{s}) \cdot \overline{\mathbf{W}}_0 = (\mathbf{s}_0 + \Delta\mathbf{s}) \cdot \left( \text{clamp} \left( \left\lfloor \frac{\mathbf{W}_0}{\mathbf{s}_0} \right\rceil + \mathbf{z}_0, 0, 2^b - 1 \right) - \mathbf{z}_0 \right),\) where \(\overline{\mathbf{W}}_0\) is frozen and \(\Delta\mathbf{s} \in \mathbb{R}^{n \times 1}\) is the update to \(\mathbf{s}_0\) obtained by gradient descent on the downstream task (a toy numeric sketch follows below).
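
The following toy sketch checks these two equations numerically (the min-max calibration of \(\mathbf{s}_0, \mathbf{z}_0\) is an illustrative assumption; the formulas above leave the calibration unspecified). Only \(\Delta\mathbf{s}\) receives gradients, while \(\overline{\mathbf{W}}_0\) stays a constant tensor:

```python
import torch

torch.manual_seed(0)
b = 4                                  # bit-width
W0 = torch.randn(4, 8)                 # toy pre-trained weight (n = 4, m = 8)

# Per-channel asymmetric quantization (min-max calibration, illustration only).
qmax = 2 ** b - 1
w_min = W0.min(dim=1, keepdim=True).values
w_max = W0.max(dim=1, keepdim=True).values
s0 = (w_max - w_min) / qmax
z0 = torch.round(-w_min / s0)
W_bar0 = torch.clamp(torch.round(W0 / s0) + z0, 0, qmax) - z0   # frozen, integer-valued

# PEQA: only the per-channel scale offset Δs is trainable.
delta_s = torch.zeros_like(s0, requires_grad=True)

x = torch.randn(2, 8)                                  # dummy activations
W_hat = (s0 + delta_s) * W_bar0                        # (s0 + Δs) · W_bar0
loss = (x @ W_hat.t()).pow(2).mean()                   # dummy task loss
loss.backward()

print(delta_s.grad.shape)    # torch.Size([4, 1]) -- gradients exist only for Δs
print(W_bar0.requires_grad)  # False -- the integer matrix is never updated
```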

Conclusion

PEQA presents a memory-efficient fine-tuning approach for weight-only quantized LLMs. By updating only the quantization scales while keeping the integer matrices fixed, PEQA achieves:

  • Comparable accuracy to full-precision PEFT methods
  • Significant memory savings (up to 4× reduction)
  • Seamless adaptation to new tasks
  • Faster inference without additional post-processing

PEQA enables scalable and efficient adaptation of large language models and makes deployment practical on memory-constrained devices.