Quamba: Post-training Quantization for Selective State Space Models

(Under review; more details are coming soon...)

¹The University of Texas at Austin, ²National Yang Ming Chiao Tung University
* Equal contribution



:zap: 8-bit quantization (W8A8) for Mamba blocks     :rocket: 1.7× speedup on Orin Nano 8G     :small_red_triangle_down: 2× memory reduction


Real-time Generation on Edge GPUs

We compared Quamba 2.8B with Mamba 2.8B on an NVIDIA Orin Nano 8G. Quamba (W8A8) is 1.7× faster than Mamba (FP16) on the Nano; the real-time generation speed is shown in the demo.
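
Below is a minimal sketch of how such a tokens-per-second comparison can be measured. It times the FP16 Mamba 2.8B baseline through Hugging Face Transformers; the prompt and generation length are illustrative, and loading the quantized Quamba model is not shown since its loader is repo-specific.

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# FP16 baseline (swap in the quantized Quamba checkpoint/loader to compare).
model_id = "state-spaces/mamba-2.8b-hf"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16
).to("cuda")

inputs = tok("The University of Texas at Austin is", return_tensors="pt").to("cuda")

# Warm up once, then time a fixed-length generation.
model.generate(**inputs, max_new_tokens=16)
torch.cuda.synchronize()
start = time.time()
out = model.generate(**inputs, max_new_tokens=128)
torch.cuda.synchronize()
elapsed = time.time() - start

new_tokens = out.shape[1] - inputs["input_ids"].shape[1]
print(f"{new_tokens / elapsed:.1f} tokens/s")
```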


Long Input Sequences on Edge GPUs

We compared Quamba with an 8-bit Transformer on an NVIDIA Orin Nano 8G. Because the SSM keeps a fixed-size recurrent state rather than a KV cache that grows with sequence length, Quamba handles long input sequences (over 8k tokens) within the limited memory and compute budget of edge devices.
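
To see why long prompts are hard for a Transformer but not for an SSM on an 8 GB device, here is a back-of-the-envelope estimate; the layer count and hidden size are Pythia-2.8B-like and the fp16 cache assumption is illustrative.

```python
# Rough memory estimate for an 8k-token prompt (illustrative numbers).
seq_len = 8192

# Transformer KV cache (Pythia-2.8B-like: 32 layers, hidden size 2560, fp16):
layers, d_model, bytes_per_elem = 32, 2560, 2
kv_cache = 2 * layers * d_model * seq_len * bytes_per_elem  # keys + values
print(f"KV cache at 8k tokens: {kv_cache / 2**30:.2f} GiB")  # ~2.5 GiB, grows linearly

# Mamba/Quamba: the recurrent SSM state has a fixed size independent of
# seq_len, so long prompts add no per-token cache on top of the weights.
```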


Zero-shot Accuracy

Zero-shot accuracy of the quantized models on six common-sense reasoning tasks. Quamba is a static per-tensor quantization method that closes the accuracy gap and outperforms same-sized Transformers (Pythia). (Bold is the best; underline is the second best.)
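
For reference, zero-shot accuracy of this kind is typically measured with EleutherAI's lm-evaluation-harness. The sketch below evaluates an unquantized baseline; the exact task list is an assumption, and the quantized Quamba model would need its own loader.

```python
import lm_eval

# Zero-shot evaluation of a baseline model on common-sense tasks.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=state-spaces/mamba-2.8b-hf",
    tasks=["lambada_openai", "hellaswag", "piqa",
           "arc_easy", "arc_challenge", "winogrande"],
    batch_size=8,
)
print(results["results"])
```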


Perplexity Evaluation

Perplexity of different quantization methods applied to the Mamba model family. We evaluate the quantized models on subsets of the Pile and WikiText2 datasets; SmQ stands for SmoothQuant. Quamba is a static per-tensor quantization method that closes the perplexity gap and outperforms same-sized Transformers (Pythia).
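
For readers unfamiliar with the terminology, here is a generic sketch of static per-tensor symmetric 8-bit quantization. It is not Quamba's actual implementation (which also calibrates the SSM input and output activations); it only illustrates the basic idea behind "static" and "per-tensor".

```python
import torch

def per_tensor_scale(x: torch.Tensor, n_bits: int = 8) -> torch.Tensor:
    # Symmetric per-tensor scale derived offline from calibration statistics.
    q_max = 2 ** (n_bits - 1) - 1  # 127 for int8
    return x.abs().amax().clamp(min=1e-8) / q_max

def quantize_per_tensor(x: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # "Static" means the scale is fixed ahead of time, so no per-batch
    # statistics are computed at inference.
    return torch.clamp(torch.round(x / scale), -128, 127).to(torch.int8)

# Example: quantize a weight matrix and check the rounding error.
w = torch.randn(2560, 2560)
scale = per_tensor_scale(w)
w_int8 = quantize_per_tensor(w, scale)
w_hat = w_int8.float() * scale  # dequantize
print((w - w_hat).abs().mean())
```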


Quantizing Jamba: A Large-Scale Hybrid Mamba-Transformer LLM

Jamba is a 52B-parameter hybrid Transformer-Mamba language model built from self-attention, Mixture-of-Experts (MoE), and Mamba blocks. We combine off-the-shelf quantization methods with our method and report zero-shot LAMBADA accuracy.
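
A conceptual sketch of how per-block-type quantization could be wired up for such a hybrid model; the attribute and helper names below are placeholders, not the actual APIs of Quamba, Jamba, or the off-the-shelf methods.

```python
def quantize_hybrid_model(model, quantize_mamba, quantize_attention, quantize_moe):
    # Dispatch each block to the quantizer suited to its type: a Quamba-style
    # quantizer for Mamba blocks, off-the-shelf W8A8 methods for the rest.
    for block in model.blocks:         # placeholder attribute
        if block.kind == "mamba":      # placeholder attribute
            quantize_mamba(block)
        elif block.kind == "attention":
            quantize_attention(block)
        elif block.kind == "moe":
            quantize_moe(block)
    return model
```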