Memory-Efficient Fine-Tuning of Compressed Large Language Models via sub-4-bit Integer Quantization

[TL;DR]
PEQA is a novel fine-tuning approach that integrates Parameter-Efficient Fine-Tuning (PEFT) with weight-only quantized LLMs by updating only the quantization scales while keeping the low-bit integer weight matrices frozen. This yields substantial memory savings during fine-tuning and deployment, seamless task switching, and accelerated inference.

Highlights

  • Integrates PEFT with quantized LLMs, updating only the quantization scales while keeping integer matrices frozen.
  • Reduces memory consumption during fine-tuning and deployment, making LLM adaptation feasible even for resource-constrained settings.
  • Maintains quantization benefits post fine-tuning, ensuring accelerated inference.
  • Recovers performance well even for sub-4-bit quantized models fine-tuned on large-scale instruction datasets.
  • Scales up to 65B parameter models while achieving performance close to full-precision LoRA fine-tuning.


Summary

  • Problem Statement: LLM fine-tuning is memory-intensive even with PEFT methods such as LoRA, because the full-precision pre-trained weights still dominate memory usage. Quantization reduces memory, but it is typically applied only after training (PTQ), which limits adaptability to new tasks.
  • Solution: PEQA bridges this gap by fine-tuning only the quantization scales of a pre-quantized LLM while keeping the integer weights frozen. This enables task-specific adaptation with minimal overhead.
  • PEQA Framework:
    • Step 1 (Decomposition): Pre-trained weights are quantized into sub-4-bit integer matrices with per-channel scales and zero-points.
    • Step 2 (Fine-tuning): Only the quantization scales are updated while the integer matrix stays frozen, drastically reducing the number of learnable parameters (a minimal sketch follows below).
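
As a concrete illustration of these two steps, here is a minimal PyTorch sketch of a PEQA-style linear layer. The name PEQALinear, the min-max calibration of \(\mathbf{s}_0, \mathbf{z}_0\), and the 4-bit default are illustrative assumptions, not the authors' implementation:

```python
import torch
import torch.nn as nn


class PEQALinear(nn.Module):
    """Illustrative PEQA-style layer: frozen low-bit weights, trainable per-channel scale."""

    def __init__(self, weight_fp: torch.Tensor, bits: int = 4):
        super().__init__()
        qmax = 2 ** bits - 1
        # Step 1 (decomposition): per-output-channel asymmetric min-max quantization.
        w_min = weight_fp.min(dim=1, keepdim=True).values
        w_max = weight_fp.max(dim=1, keepdim=True).values
        scale = (w_max - w_min).clamp(min=1e-8) / qmax              # s0, shape (n, 1)
        zero = torch.round(-w_min / scale)                          # z0, shape (n, 1)
        w_bar = torch.clamp(torch.round(weight_fp / scale) + zero, 0, qmax) - zero
        # Frozen tensors: integer-valued matrix (kept as float here) and zero-point.
        self.register_buffer("w_bar", w_bar)
        self.register_buffer("zero", zero)
        # Step 2 (fine-tuning): the per-channel scale is the only learnable tensor.
        self.scale = nn.Parameter(scale)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Reconstruct weights on the fly as scale * W_bar (bias omitted for brevity).
        return x @ (self.scale * self.w_bar).t()
```

Only scale is registered as a trainable parameter, so gradients and optimizer states cover \(n\) values per layer instead of \(n \times m\).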


Key Advantages

Memory Efficiency

  • Fine-tunes only the per-channel quantization scales, so gradients and optimizer states cover only a tiny fraction of the weights.
  • Stores the frozen weights as low-bit integers (≤ 4-bit) while maintaining high accuracy; a rough comparison follows below.
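
For a rough sense of scale (illustrative numbers, not taken from the paper): a single 4096 × 4096 projection leaves PEQA with 4,096 trainable per-channel scales, whereas rank-8 LoRA adds 8 × (4096 + 4096) = 65,536 trainable parameters for the same layer, and storing the frozen weights as 4-bit integers instead of FP16 shrinks their footprint by roughly 4× (ignoring the small overhead of scales and zero-points).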

Seamless Task Switching

  • PEQA enables quick and efficient adaptation across tasks by swapping the per-task quantization scales instead of retraining or storing entire models (see the sketch below).
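
A hedged sketch of what task switching could look like, reusing the illustrative PEQALinear module above (task_scales, switch_task, and the task name are hypothetical):

```python
import torch

# Keep one set of fine-tuned scales per task; the shared integer matrices stay loaded.
task_scales = {
    "alpaca": {name: module.scale.detach().clone()
               for name, module in model.named_modules()
               if isinstance(module, PEQALinear)},
    # ... one entry per fine-tuned task
}

def switch_task(model: torch.nn.Module, task: str) -> None:
    """Swap in the per-task quantization scales; the integer weights never move."""
    with torch.no_grad():
        for name, module in model.named_modules():
            if isinstance(module, PEQALinear):
                module.scale.copy_(task_scales[task][name])

switch_task(model, "alpaca")   # ready for inference on this task
```

Because each task is represented only by its scale tensors, the per-task storage is orders of magnitude smaller than a full checkpoint.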

Faster Inference

  • The frozen integer matrix stays intact after fine-tuning, so the model can still be served with weight-only quantized inference kernels and retains the quantization speedup (an export sketch follows below).
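
A sketch of what exporting the fine-tuned model could look like, again using the illustrative PEQALinear: the deployment artifact is just unsigned b-bit codes plus per-channel (scale, zero-point) pairs, the layout that generic weight-only kernels dequantize as scale * (q - zero):

```python
import torch

# Fold the fine-tuned scale (s0 + Δs) into a plain weight-only checkpoint.
export = {}
for name, module in model.named_modules():
    if isinstance(module, PEQALinear):
        export[name] = {
            "q": (module.w_bar + module.zero).to(torch.uint8),  # codes in [0, 2^b - 1]
            "scale": module.scale.detach().clone(),             # s0 + Δs, already merged
            "zero": module.zero.clone(),
        }
torch.save(export, "peqa_int4_checkpoint.pt")                   # hypothetical filename
```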

Experiments

Memory and General Comparison


PEQA vs. QAT vs. PEFT+PTQ

  • PEQA achieves performance close to QAT, significantly outperforming LoRA + PTQ at 3-bit and 4-bit precision.
  • Lower perplexity than the PEFT + PTQ pipeline indicates that quantized models can be fine-tuned effectively without sacrificing accuracy.


Instruction-Tuning with the Alpaca Dataset

  • Evaluated on common-sense reasoning and in-context learning tasks (ARC, PIQA, HellaSwag).
  • Performance comparable to LoRA, with additional memory savings and inference acceleration.


Notations

Quantized Weights and Fine-Tuning

  • Weight-only asymmetric quantization:
    Given the pre-trained weights \(\mathbf{W}_0 \in \mathbb{R}^{n \times m}\) of a fully-connected layer, a bit-width \(b\), and per-channel scales and zero-points \(\mathbf{s}_0, \mathbf{z}_0 \in \mathbb{R}^{n \times 1}\), the asymmetrically quantized pre-trained weights \(\widehat{\mathbf{W}}_0\) can be written as
\[\widehat{\mathbf{W}}_0 = \mathbf{s}_0 \cdot \overline{\mathbf{W}}_0 = \mathbf{s}_0 \cdot \left( \text{clamp} \left( \left\lfloor \frac{\mathbf{W}_0}{\mathbf{s}_0} \right\rceil + \mathbf{z}_0, 0, 2^b - 1 \right) - \mathbf{z}_0 \right),\]
    where \(\lfloor \cdot \rceil\) denotes rounding to the nearest integer and \(\overline{\mathbf{W}}_0\) is the low-bit matrix that PEQA keeps frozen.
  • PEQA fine-tuning updates only the quantization scales:
    \(\widehat{\mathbf{W}} = (\mathbf{s}_0 + \Delta\mathbf{s}) \cdot \overline{\mathbf{W}}_0 = (\mathbf{s}_0 + \Delta\mathbf{s}) \cdot \left( \text{clamp} \left( \left\lfloor \frac{\mathbf{W}_0}{\mathbf{s}_0} \right\rceil + \mathbf{z}_0, 0, 2^b - 1 \right) - \mathbf{z}_0 \right),\) where \(\overline{\mathbf{W}}_0\) is frozen and \(\Delta\mathbf{s} \in \mathbb{R}^{n \times 1}\) is the update to \(\mathbf{s}_0\) obtained by gradient descent on the downstream task (a toy numeric sketch follows below).
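
The following toy sketch checks these two equations numerically (the min-max calibration of \(\mathbf{s}_0, \mathbf{z}_0\) is an illustrative assumption; the formulas above leave the calibration unspecified). Only \(\Delta\mathbf{s}\) receives gradients, while \(\overline{\mathbf{W}}_0\) stays a constant tensor:

```python
import torch

torch.manual_seed(0)
b = 4                                  # bit-width
W0 = torch.randn(4, 8)                 # toy pre-trained weight (n = 4, m = 8)

# Per-channel asymmetric quantization (min-max calibration, illustration only).
qmax = 2 ** b - 1
w_min = W0.min(dim=1, keepdim=True).values
w_max = W0.max(dim=1, keepdim=True).values
s0 = (w_max - w_min) / qmax
z0 = torch.round(-w_min / s0)
W_bar0 = torch.clamp(torch.round(W0 / s0) + z0, 0, qmax) - z0   # frozen, integer-valued

# PEQA: only the per-channel scale offset Δs is trainable.
delta_s = torch.zeros_like(s0, requires_grad=True)

x = torch.randn(2, 8)                                  # dummy activations
W_hat = (s0 + delta_s) * W_bar0                        # (s0 + Δs) · W_bar0
loss = (x @ W_hat.t()).pow(2).mean()                   # dummy task loss
loss.backward()

print(delta_s.grad.shape)    # torch.Size([4, 1]) -- gradients exist only for Δs
print(W_bar0.requires_grad)  # False -- the integer matrix is never updated
```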

Conclusion

PEQA presents a memory-efficient fine-tuning approach for weight-only quantized LLMs. By updating only the quantization scales while keeping the integer matrices fixed, PEQA achieves:

  • Comparable accuracy to full-precision PEFT methods
  • Significant memory savings (up to 4× reduction)
  • Seamless adaptation to new tasks
  • Faster inference without additional post-processing

PEQA enables scalable and efficient adaptation of large language models and makes deployment practical on memory-constrained devices.