Quamba: Post-training Quantization for Selective State Space Models
(Under review; more details are coming soon...)
- 8-bit quantization (W8A8) for Mamba blocks
- 1.7× speedup on Orin Nano 8G
- 2× memory reduction
Real-time Generation on Edge GPUs
We compared Quamba 2.8B with Mamba 2.8B on an NVIDIA Orin Nano 8G. Quamba (W8A8) is \(1.7\times\) faster than Mamba (FP16) on the Nano. The real-time generation speed is shown in the demo.
Long Input Sequences on Edge GPUs
We compared Quamba with an 8-bit Transformer on an NVIDIA Orin Nano 8G. Quamba handles long input sequences (over 8k tokens) within the limited memory and compute budget of edge devices.
Zero-shot Accuracy
Zero-shot accuracy of quantized models on six commonsense tasks. Quamba is a static per-tensor quantization method that closes the performance gap and outperforms same-sized Transformers (Pythia) in accuracy. (Bold is the best; underline is the second best.)
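To illustrate what static per-tensor W8A8 quantization means in general (this is a generic sketch, not Quamba's actual calibration or kernel code): a single scale is computed offline for the whole tensor, and values are rounded to INT8 at inference time. The function names and the symmetric max-calibration rule below are illustrative assumptions.

```python
import numpy as np

def per_tensor_int8_quantize(x: np.ndarray):
    # Static per-tensor symmetric quantization: one scale for the entire
    # tensor, computed once (e.g., from calibration data) and then fixed.
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -128, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    # Recover an approximation of the original FP values.
    return q.astype(np.float32) * scale

x = np.random.randn(4, 8).astype(np.float32)
q, s = per_tensor_int8_quantize(x)
x_hat = dequantize(q, s)
# Rounding error per element is at most half a quantization step (s / 2).
```

Storing weights and activations as INT8 instead of FP16 is what gives the roughly 2× memory reduction reported above; the speedup comes from running the matrix multiplies in INT8.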