Quamba2: A Robust and Scalable Post-training Quantization Framework for Selective State Space Models

arXiv
1The University of Texas at Austin, 2Cornell University, 3National Yang Ming Chiao Tung University
Paper   Code   🤗 Models


:wrench: Supports W4A8 / W4A16 / W4AX / W8A8 for Mamba1 and Mamba2
:small_red_triangle_down: Up to 4× memory reduction
:rocket: 13 tokens per second on the Orin Nano 8G with Mamba2-8B


4-bit Mamba1 and Mamba2 blocks

  • W4A8, W4A16, W4AX, and W8A8 for both Mamba1 and Mamba2
  • Head-to-toe (H2T) 4/8-bit quantization from the embedding layer, through the SSM blocks, to the final output layer (see the sketch below)
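
The core of W4 post-training quantization is mapping FP16 weights to 4-bit integers with per-tensor or per-group scales. The snippet below is only a minimal illustration of symmetric 4-bit weight quantization in plain PyTorch; it is not the Quamba2 kernel or API (the real implementation uses fused low bit-width kernels and the calibration described in the paper), and the group size of 128 is an assumption for the example.

```python
import torch

def quantize_w4_symmetric(w: torch.Tensor, group_size: int = 128):
    """Toy symmetric 4-bit quantization with per-group scales.

    Illustration only: Quamba2's actual quantizer, scaling granularity,
    and packed storage format live in the released code.
    """
    out_features, in_features = w.shape
    w_groups = w.reshape(out_features, in_features // group_size, group_size)
    # One scale per group, chosen so the max magnitude maps to the int4 limit (7).
    scales = w_groups.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / 7.0
    q = torch.clamp(torch.round(w_groups / scales), -8, 7).to(torch.int8)
    return q.reshape(out_features, in_features), scales.squeeze(-1)

def dequantize_w4(q: torch.Tensor, scales: torch.Tensor, group_size: int = 128):
    out_features, in_features = q.shape
    q_groups = q.reshape(out_features, in_features // group_size, group_size).float()
    return (q_groups * scales.unsqueeze(-1)).reshape(out_features, in_features)

if __name__ == "__main__":
    w = torch.randn(256, 512)
    q, s = quantize_w4_symmetric(w)
    w_hat = dequantize_w4(q, s)
    print("mean abs error:", (w - w_hat).abs().mean().item())
```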


Storage reduction

  • Achieve 4× memory reduction with head-to-toe (H2T) 4-bit quantization (see the worked example below)
  • Enable deploying Mamba2-8B on the Orin Nano 8G
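
The 4× figure follows directly from the storage arithmetic: an 8B-parameter model at 16 bits per weight needs roughly 15 GiB, while 4-bit weights need roughly 4 GiB, which leaves room within the Orin Nano's 8 GB. A back-of-the-envelope check (ignoring quantization scales, activations, and runtime buffers; the parameter count is approximate):

```python
# Back-of-the-envelope weight storage for Mamba2-8B.
params = 8e9

def weight_gib(bits_per_weight: float) -> float:
    """Weight storage in GiB, ignoring scales and runtime buffers."""
    return params * bits_per_weight / 8 / 2**30

print(f"FP16: {weight_gib(16):.1f} GiB")  # ~14.9 GiB -> does not fit in 8 GB
print(f"W8  : {weight_gib(8):.1f} GiB")   # ~7.5 GiB
print(f"W4  : {weight_gib(4):.1f} GiB")   # ~3.7 GiB -> fits on an 8 GB Orin Nano
```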


End-to-end latency speedup

  • Speed up generation by 3× on the A5000 GPU
  • Generate 13 tokens/second on the Orin Nano 8G


Generalization and robustness

We search for W4AX mixed-precision configurations (the last row in red) to improve the generalization and robustness of low bit-width SSMs. We evaluate the low bit-width SSMs on the large multitask dataset MMLU.
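
The idea behind W4AX is to keep 4-bit weights throughout while letting blocks that are more sensitive to activation quantization keep higher-precision activations. The toy sketch below shows one way such an assignment could be made from a per-block sensitivity proxy; it is not the search procedure used in the paper, and the sensitivity values and greedy rule are made up purely for illustration.

```python
# Hypothetical illustration: choose A8 or A16 activations per block.
# Quamba2's actual W4AX search is described in the paper and implemented
# in the released code; this only conveys the flavor of the decision.

def assign_activation_bits(sensitivities, num_a16_blocks):
    """Give the `num_a16_blocks` most sensitive blocks 16-bit activations, the rest 8-bit."""
    ranked = sorted(range(len(sensitivities)), key=lambda i: sensitivities[i], reverse=True)
    keep_a16 = set(ranked[:num_a16_blocks])
    return [16 if i in keep_a16 else 8 for i in range(len(sensitivities))]

if __name__ == "__main__":
    # e.g. per-block loss increase on a calibration set when its activations are quantized to 8 bits
    sens = [0.02, 0.31, 0.05, 0.44, 0.01, 0.12]
    print(assign_activation_bits(sens, num_a16_blocks=2))  # -> [8, 16, 8, 16, 8, 8]
```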


Zero-shot evaluation


Citation

@article{chiang2025quamba2,
  title={Quamba2: A Robust and Scalable Post-training Quantization Framework for Selective State Space Models},
  author={Chiang, Hung-Yueh and Chang, Chi-Chih and Frumkin, Natalia and Wu, Kai-Chiang and Abdelfattah, Mohamed S. and Marculescu, Diana},
  journal={arXiv preprint arXiv:2503.22879},
  year={2025}
}


Acknowledgements

This work was supported in part by the ONR Minerva program, NSF CCF Grant No. 2107085, iMAGiNE - the Intelligent Machine Engineering Consortium at UT Austin, UT Cockrell School of Engineering Doctoral Fellowships, NSF Grant No. 2339084, and Taiwan’s NSTC Grant No. 111-2221-E-A49-148-MY3.