Quamba: A Post-Training Quantization Recipe for Selective State Space Models

1The University of Texas at Austin, 2National Yang Ming Chiao Tung University
* Equal contribution
Paper   Code



:zap: 8-bit quantization (W8A8) for Mamba blocks     :rocket: 1.7× speedup on Orin Nano 8G     :small_red_triangle_down: 2× memory reduction


Real-time Generation on Edge GPUs

We compared Quamba 2.8B with Mamba 2.8B on an NVIDIA Orin Nano 8G. Quamba (W8A8) runs \(1.7\times\) faster than Mamba (FP16) on the Nano. The real-time generation speed is shown in the demo.
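
As a rough sanity check (our own back-of-the-envelope arithmetic, not a measurement from the paper), halving the bit width of a 2.8B-parameter model is what makes it fit comfortably alongside activations and state on an 8 GB device:

```python
# Back-of-the-envelope weight-memory estimate for a 2.8B-parameter model.
# These numbers are our own arithmetic, not measurements from the paper.
params = 2.8e9
fp16_gb = params * 2 / 1e9   # 2 bytes per weight -> ~5.6 GB
int8_gb = params * 1 / 1e9   # 1 byte per weight  -> ~2.8 GB
print(f"FP16 weights: {fp16_gb:.1f} GB, INT8 weights: {int8_gb:.1f} GB")
```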


Long Input Sequences on Edge GPUs

We compared Quamba with an 8-bit Transformer on an NVIDIA Orin Nano 8G. Quamba handles long input sequences (over 8k tokens) within the limited memory and compute budget of edge devices.


Zero-shot Accuracy

Zero-shot accuracy of quantized models on six common-sense tasks. Quamba is a static per-tensor quantization method that closes the performance gap and outperforms same-sized Transformers (Pythia) in accuracy. (Bold is the best; underline is the second best.)
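
For readers unfamiliar with the term, the sketch below illustrates static per-tensor symmetric W8A8 quantization in plain PyTorch. It is a simplified illustration under our own assumptions (max-calibrated scales, a stand-in linear layer), not the exact Quamba recipe; see the paper and code for the full method.

```python
# A minimal sketch of static per-tensor symmetric W8A8 quantization.
# This is a simplified illustration, not the exact Quamba recipe.
import torch

def calibrate_scale(x: torch.Tensor) -> float:
    # One scale for the whole tensor ("per-tensor"), fixed after
    # calibration ("static"), so no scales are computed at runtime.
    return x.abs().max().item() / 127.0

def quantize(x: torch.Tensor, scale: float) -> torch.Tensor:
    return torch.clamp(torch.round(x / scale), -128, 127).to(torch.int8)

# W8A8: both the weight and the activation are kept in int8.
w = torch.randn(256, 256)   # stand-in for a linear-layer weight
a = torch.randn(8, 256)     # stand-in for a calibration activation
w_scale, a_scale = calibrate_scale(w), calibrate_scale(a)
w_q, a_q = quantize(w, w_scale), quantize(a, a_scale)

# Real deployments use fused int8 GEMM kernels; here we emulate the math
# in float for portability and rescale the result back to real values.
y = (a_q.float() @ w_q.float().t()) * (a_scale * w_scale)
```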


Perplexity Evaluation

Perplexity of different quantization methods applied to the Mamba model family, evaluated on subsets of the Pile and Wikitext2 datasets. SmQ stands for SmoothQuant. Quamba is a static per-tensor quantization method that closes the perplexity gap and outperforms same-sized Transformers (Pythia).
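
For reference, perplexity is the exponential of the average per-token negative log-likelihood on held-out text. The sketch below shows the standard computation for a causal LM; the Hugging Face checkpoint name and the `transformers` API usage are illustrative assumptions, not the project's actual evaluation pipeline.

```python
# A minimal sketch of perplexity evaluation for a causal LM:
# ppl = exp(mean negative log-likelihood per predicted token).
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "state-spaces/mamba-130m-hf"   # placeholder checkpoint
model = AutoModelForCausalLM.from_pretrained(name).eval()
tokenizer = AutoTokenizer.from_pretrained(name)

text = "..."  # placeholder: e.g. the concatenated Wikitext2 test split
ids = tokenizer(text, return_tensors="pt").input_ids

nll, n_tokens, stride = 0.0, 0, 1024
with torch.no_grad():
    for start in range(0, ids.size(1) - 1, stride):
        chunk = ids[:, start:start + stride + 1]
        # labels == inputs: the model shifts internally and returns the
        # mean cross-entropy over the predicted tokens in this chunk.
        loss = model(chunk, labels=chunk).loss
        nll += loss.item() * (chunk.size(1) - 1)
        n_tokens += chunk.size(1) - 1

print("perplexity:", math.exp(nll / n_tokens))
```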


Quantizing Jamba: A Large-Scale Hybrid Mamba-Transformer LLM

Jamba is a hybrid Transformer-Mamba language model with 52B parameters, built from self-attention, Mixture-of-Experts (MoE), and Mamba blocks. We combine off-the-shelf quantization methods with our method and report zero-shot LAMBADA accuracy.

Citation

@article{chiang2024quamba,
  title={Quamba: A Post-Training Quantization Recipe for Selective State Space Models},
  author={Chiang, Hung-Yueh and Chang, Chi-Chih and Frumkin, Natalia and Wu, Kai-Chiang and Marculescu, Diana},
  journal={arXiv preprint arXiv:2410.13229},
  year={2024}
}


Acknowledgements

This work was supported in part by the ONR Minerva program, NSF CCF Grant No. 2107085, iMAGiNE - the Intelligent Machine Engineering Consortium at UT Austin, UT Cockrell School of Engineering Doctoral Fellowships, and Taiwan’s NSTC Grant No. 111-2221-E-A49-148-MY3.