State Space Models
Table of Contents
Introduction
This document provides the mathematic derivations of the State Space Models that help readers understand the formulas. The contents are extracted and summarized from the original papers cited in references.
Continuous State Space Model
Consider a continuous state space model equation:
\[\label{c_ssm} \tag{1} \Large \begin{aligned} \dot{h}(t)=Ah(t)+Bx(t) \\ y(t)=Ch(t)+Dx(t) \end{aligned}\]There are three continuous signals in the system: \(x(t), h(t), y(t)\), which are:
-
\(x(t)\in \mathbb{R}^p\) is the
input (or control) vector
that describes the input continuous signal. Some papers would use \(u\) as the input vector, but we use \(x\) for better aligning with Mamba series. -
\(h(t)\in \mathbb{R}^n\) is the
state vector
that describes the system states, and \(\dot{h}\) is the system dynamics i.e. \(\frac{\partial h(t)}{\partial t}\). Some papers would use \(x\) as the state vector, but we use \(h\) for better aligning with Mamba series. -
\(y(t) \in \mathbb{R}^q\) is the
output vector
that describes the output continuous signal from the system.
There are four transition matrices in the system: \(A, B, C,\) and \(D\), which are:
-
\(A, \texttt{Dim}(A)=n\times n\) is the
state matrix
of the system that describes the linear relations of the parameters in the system. -
\(B, \texttt{Dim}(B)=n\times p\) is the
input matrix
that describes how input signals interact with the system. -
\(C, \texttt{Dim}(C)=q\times n\) is the
output matrix
that describes the system output. -
\(D, \texttt{Dim}(D)=q\times p\) is the
feed-forward matrix
that describes how input signals are directly involved in the output.
Discretize Continuous State Space Model
The general formulation of an ODE is \(\begin{aligned} \frac{d}{dt}h(t) = A h(t) + B x(t). \end{aligned}\)
From the differential equation, we have \(\begin{aligned} h(t+\Delta t) - h(t) = \int_{s=t}^{s=t+\Delta t} A h(s) + B x(s)ds . \end{aligned}\)
Bilinear Transform
By the Trapezoidal rule, we have
\[\begin{aligned} h(t+\Delta t) - h(t) &= \Delta t \frac{(Ah(t) + Bx(t) + Ah(t+\Delta t) + B x(t+\Delta t))}{2} \\ &= \frac{\Delta t}{2} Ah(t) + \frac{\Delta t}{2}Bx(t) + \frac{\Delta t}{2}Ah(t+\Delta t) + \frac{\Delta t}{2}Bx(t+\Delta t) \end{aligned}\]By combining the terms, we have
\[\begin{aligned} \textcolor{red}{h(t+\Delta t)} - \frac{\Delta t}{2}A\textcolor{red}{h(t+\Delta t)} &= \textcolor{green}{h(t)} + \frac{\Delta t}{2} A \textcolor{green}{h(t)} + \textcolor{blue}{\frac{\Delta t}{2}B} x(t) + \textcolor{blue}{\frac{\Delta t}{2}B} x(t+\Delta t) \\ ( \mathbf{I} - \frac{\Delta t}{2}A)h(t+\Delta t) &= ( \mathbf{I}+ \frac{\Delta t}{2} A)h(t) + \frac{\Delta t}{2} B(x(t) + x(t+\Delta t)) \\ h(t+\Delta t) &= ( \mathbf{I} - \frac{\Delta t}{2}A)^{-1} ( \mathbf{I}+ \frac{\Delta t}{2} A) h(t) + \frac{\Delta t}{2} ( \mathbf{I} - \frac{\Delta t}{2}A)^{-1} B (x(t) + x(t+\Delta t)) \end{aligned}\]Since \(x\) is the constant and does no involve in the integral, we have \(x(t) = x(t+\Delta t)\)
\[\begin{aligned} h(t+\Delta t) &= ( \mathbf{I} - \frac{\Delta t}{2}A)^{-1} ( \mathbf{I}+ \frac{\Delta t}{2} A) h(t) + \Delta t ( \mathbf{I} - \frac{\Delta t}{2} A)^{-1} B x(t+\Delta t) \end{aligned}\]We can replace \(\frac{1}{2}\) with a variable \(\alpha\), so that we have Generalized Bilinear Transformation (GBT).
\[\begin{aligned} h(t+\Delta t) &= \textcolor{red}{( \mathbf{I} - \Delta t \alpha A)^{-1} ( \mathbf{I}+ \Delta t \alpha A)} h(t) + \textcolor{green}{\Delta t ( \mathbf{I} - \Delta t \alpha A)^{-1} B} x(t+\Delta t) \end{aligned}\]We can write the discretized state space model as
\[\begin{aligned} h[t+1] &= \overline{A} h[t] + \overline{B} x[t+1] \\ y[t+1] &= Ch[t+1]+Dx[t+1] \end{aligned}\]where \(\overline{A}, \overline{B}\) are discretized matrices
\[\begin{aligned} \overline{A} &= \textcolor{red}{(\mathbf{I} - \Delta t \alpha A)^{-1} ( \mathbf{I}+ \Delta t \alpha A)} \\ \overline{B} &= \textcolor{green}{\Delta t ( \mathbf{I} - \Delta t \alpha A)^{-1} B} \end{aligned}\]Zero-order Hold Transform (ZOH)
State solution
We derive the state solution from the continuous time-invariant dynamic equation \(\frac{d}{dt}h(t) = A h(t) + Bx(t)\) and move \(A h(t)\) to the left-hand side
\[\begin{aligned} \textcolor{red}{ \frac{d}{dt}h(t) - A h(t) } = Bx(t) . \end{aligned}\]Using the fact
\[\begin{aligned} \textcolor{red}{\frac{d}{dt}[e^{-At}h(t)]} = e^{-At} \frac{d}{dt}h(t) - e^{-At} A h(t) = \textcolor{green}{e^{-At}} [ \textcolor{red}{\frac{d}{dt}h(t) - A h(t)} ] \end{aligned}\], we have
\[\begin{aligned} \textcolor{red}{\frac{d}{dt}[e^{-At}h(t)]} = \textcolor{green}{e^{-At}} Bx(t) \\ \end{aligned}\]We integrate the both side from \(0\) to \(t\)
\[\begin{aligned} \int_0^t \frac{d}{d \tau}[e^{-A\tau}h(\tau)] d\tau = \textcolor{red}{e^{-At}h(t) - h(0) = \int_0^t e^{-A\tau} Bx(\tau) d \tau} \\ \end{aligned}\]and move \(h(0)\) to the right-hand side
\[\begin{aligned} e^{-At}h(t) = h(0) + \int_0^t e^{-A\tau} Bx(\tau) d \tau \end{aligned}\]After multiplying \(e^{At}\) on both sides, we have the state solution
\[\label{state_solution} \tag{eq:1} \begin{aligned} h(t) = e^{At}h(0) + \int_0^t e^{A(t-\tau)} Bx(\tau) d \tau \end{aligned}\]Note that \(\int_0^t e^{A(t-\tau)} Bx(\tau) d \tau\) is a convolution, which can be done efficiently by multiplying the scalars in the frequency domain. We will use this result to derive the Zero-order Hold discretization.
Zero-order Hold Discretization
We first define the relationship between discretized signals and continuous signals.
\[\begin{aligned} h[k] := h(k\Delta t), \quad \text{and} \quad x[k] := x(k\Delta t) \end{aligned}\]where \(h[x]\) and \(c[k]\) is a zero-order transformed signals from \(h(x)\) and \(x(x)\) with a sampling step \(k\).
Then using [eq:1], we have
\[\begin{aligned} h[k+1] &:= h(\textcolor{red}{(k+1)\Delta t}) \\ &= e^{A\textcolor{red}{(k+1)\Delta t}}h(0) + \int_0^{\textcolor{red}{(k+1)\Delta t}} e^{A(\textcolor{red}{(k+1)\Delta t}-\tau)} Bx(\tau) d \tau \\ &= e^{A(k+1)\Delta t}h(0) + \int_{\textcolor{green}{0}}^{\textcolor{green}{k\Delta t}} e^{A((k+1)\Delta t-\tau)} Bx(\tau) d \tau + \int_{\textcolor{green}{k\Delta t}}^{\textcolor{green}{(k+1)\Delta t}} e^{A((k+1)\Delta t-\tau)} Bx(\tau) d \tau \\ &= \textcolor{red}{e^{A\Delta t}}e^{Ak\Delta t}h(0) + \int_0^{k\Delta t} \textcolor{red}{e^{A\Delta t}} e^{A(k\Delta t-\tau)} Bx(\tau) d \tau + \int_{k\Delta t}^{(k+1)\Delta t} e^{A((k+1)\Delta t-\tau)} Bx(\tau) d \tau \\ &= e^{A\Delta t}\left( \textcolor{green}{ e^{Ak\Delta t}h(0) + \int_0^{k\Delta t} e^{A(k\Delta t-\tau)} Bx(\tau) d \tau} \right) + \int_{k\Delta t}^{(k+1)\Delta t} e^{A((k+1)\Delta t-\tau)} Bx(\tau) d \tau \\ &= e^{A\Delta t} \textcolor{green}{h(k\Delta t)} + \int_{k\Delta t}^{(k+1)\Delta t} e^{A((k+1)\Delta t-\tau)} Bx(\tau) d \tau \\ &= e^{A\Delta t} h[k] + e^{A((k+1)\Delta t} B \textcolor{red}{\int_{k\Delta t}^{(k+1)\Delta t} e^{-A\tau} x(\tau) d \tau} \\ &= e^{A\Delta t} h[k] + e^{A((k+1)\Delta t} B (\textcolor{red}{-A^{-1}e^{-A\tau}\Big|_{k\Delta t}^{(k+1)\Delta t}})x((k+1)\Delta t) \\ &= e^{A\Delta t} h[k] + e^{A((k+1)\Delta t} B (\textcolor{red}{-A^{-1}e^{-A(k+1)\Delta t} + -A^{-1}e^{-A k\Delta t}}) x((k+1)\Delta t) \\ &= e^{A\Delta t} h[k] + e^{A((k+1)\Delta t} B (\textcolor{red}{-A^{-1}})(e^{-A(k+1)\Delta t} + e^{-A k\Delta t}) x((k+1)\Delta t) \\ &= e^{A\Delta t} h[k] + \textcolor{green}{e^{A((k+1)\Delta t}} B (-A^{-1}) (\textcolor{green}{e^{-A(k+1)\Delta t} + e^{-A k\Delta t}}) x((k+1)\Delta t) \\ &= e^{A\Delta t} h[k] + BA^{-1} (\textcolor{green}{-\mathbf{I} + e^{A \Delta t}}) x[k+1] \\ &= \textcolor{red}{e^{A\Delta t}} h[k] + \textcolor{green}{\frac{\Delta t B}{\Delta t A} (e^{A \Delta t} - \mathbf{I})} x[k+1] \\ \end{aligned}\]Note that \(x\) is constant from \(k\Delta t\) to \((k+1)\Delta t\), and thereby \(x(k\Delta t) = x((k+1)\Delta t)\)
We can write the discretized state space model as
\[\begin{aligned} h[t+1] &= \overline{A} h[t] + \overline{B} x[t+1] \\ y[t+1] &= Ch[t+1]+Dx[t+1] \end{aligned}\]where \(\overline{A}, \overline{B}\) are discretized matrices
\[\begin{aligned} \overline{A} &= \textcolor{red}{e^{A\Delta t}} \\ \overline{B} &= \textcolor{green}{ {(A \Delta t)}^{-1} (e^{A \Delta t} - \mathbf{I}) \Delta t B} \end{aligned}\]We can perform an Euler approximation to simplify \(B\). The first-order Taylor expansion of \(e^{A \Delta t}\) around zero is given by:
\[\begin{aligned} e^{A \Delta t} \approx I + A \Delta t \end{aligned}\]Therefore, we have
\[\begin{aligned} \overline{B} &= {(A \Delta t)}^{-1} (\textcolor{red}{e^{A \Delta t}} - \mathbf{I}) \Delta t B \approx {(A \Delta t)}^{-1} (\textcolor{red}{I + A \Delta t} - \mathbf{I}) \Delta t B = \Delta t B \end{aligned}\]Discrete-time State Space Model
We can write the continuous-time state space model
\[\begin{aligned} \dot{x}(t)=Ax(t)+Bu(t) \\ y(t)=Cx(t)+Du(t) \end{aligned}\]as a discrete-time state space model
\[\begin{aligned} x[t] &= \overline{A} x[t-1] + \overline{B} u[t] \\ y[t] &= C x[t] + D u[t] \\ \end{aligned}\]by using the GBL or ZOH
\[\begin{aligned}[c] &\text{Bilinear:}\\ \overline{A}&= (\mathbf{I} - \Delta t \alpha A)^{-1} ( \mathbf{I}+ \Delta t \alpha A)\\ \overline{B} &= \Delta t ( \mathbf{I} - \Delta t \alpha A)^{-1} B \end{aligned} \quad\quad \begin{aligned}[c] &\text{ZOH:}\\ \overline{A} &= e^{A\Delta t}\\ \overline{B} &= {(\Delta t A)}^{-1} (e^{A \Delta t} - \mathbf{I}) \Delta t B \approx \Delta t B\\ \end{aligned}\]Convolution Form
Letting \(h_{-1} = 0\) and unrolling the \(y_k\)
\[\begin{aligned}[c] &\text{t=0}\\ h_0 &= \overline{B}x_0\\ y_0 &= C\overline{B}x_0 + Dx_0\\ \end{aligned} \quad \begin{aligned}[c] &\text{t=1}\\ h_1 &= \overline{A}\overline{B}x_0 + \overline{B}x_1\\ y_1 &= C\overline{A}\overline{B}x_0 + C\overline{B}x_1 + Dx_1\\ \end{aligned} \quad \begin{aligned}[c] &\text{t=2}\\ h_2 &= \overline{A}^2\overline{B}x_0 + \overline{A}\overline{B}x_1 + \overline{B}x_2\\ y_2 &= C\overline{A}^2\overline{B}x_0 + C\overline{A}\overline{B}x_1 + C\overline{B}x_2 + Dx_2\\ \end{aligned}\]Therefore, when \(t=k\)
\[\begin{aligned} h_k &= \overline{A}^k\overline{B}x_0 +...+ \overline{A}\overline{B}x_{k-1} + \overline{B}x_k\\ y_k &= C\overline{A}^k\overline{B}x_0 + C\overline{A}^{k-1}\overline{B}x_1 +...+ C\overline{A}\overline{B}x_{k-1} + C\overline{B}x_k + Dx_k \\ \end{aligned}\]In other words, \(y_t\) is a single (non-circular) convolution
\[\begin{aligned} y_t = \sum_{\tau=0}^k \overline{C}\overline{A}^{\tau}\overline{B} x_{t-\tau} + Dx_t \end{aligned}\]which can be computed very efficiently with FFTs, provided that \(\overline{K}\) is known.
\(\begin{aligned} y_t = \overline{K} \ast x + Dx_t \end{aligned}\) where \(\begin{aligned} \overline{K} \in \mathbb{R}^L := \kappa_L(\overline{A}, \overline{B}, \overline{C}) := (\overline{C}\overline{A}^i\overline{B})_{i \in [L]} = (\overline{C}\overline{B}, \overline{C}\overline{A}\overline{B}, ..., , \overline{C}\overline{A}^{L-1}\overline{B}) \end{aligned}\)
Parallel Associative Scan
Consider an example for illustrating a parallel associative scan (or parallel prefix sum/scan): \(\begin{aligned} h_t &= \overline{A} h_{t-1} + \overline{B} x_t \\ \end{aligned}\)
We unroll the recursive equation with \(h_0 = 0\).
\[\begin{aligned} h_1 &= \overline{B} x_1 \\ h_2 &= \overline{A} h_1 + \overline{B} x_2 = \overline{A}\overline{B} x_1 + \overline{B} x_2\\ h_3 &= \overline{A} h_2 + \overline{B} x_3 = \overline{A}(\overline{A}\overline{B} x_1 + \overline{B} x_2) + \overline{B} x_3 = \overline{A}^2\overline{B} x_1 + \overline{A}\overline{B} x_2 + \overline{B} x_3\\ h_4 &= \overline{A} h_3 + \overline{B} x_4 = \overline{A}(\overline{A}^2\overline{B} x_1 + \overline{A}\overline{B} x_2 + \overline{B} x_3) + \overline{B} x_4 = \overline{A}^3\overline{B} x_1 + \overline{A}^2\overline{B} x_2 + \overline{A}\overline{B} x_3 + \overline{B} x_4\\ \end{aligned}\]Define an operand \(e_i = (\overline{A}, \overline{B}x_i)\) and operator \(e_i \cdot e_j\) for parallel associative scan such that
\[\begin{aligned} e_1 \cdot e_2 &= (\overline{A}, \overline{B}x_1) \cdot (\overline{A}, \overline{B}x_2) = (\overline{A}^2, \overline{A}\overline{B}x_1 + \overline{B}x_2) = (\overline{A}^2, h_2) \\ e_1 \cdot e_2 \cdot e_3 &= (\overline{A}^2, \overline{A}\overline{B}x_1 + \overline{B}x_2) \cdot (\overline{A}, \overline{B}x_3) = (\overline{A}^3, \overline{A}^2\overline{B} x_1 + \overline{A}\overline{B} x_2 + \overline{B} x_3) = (\overline{A}^2, h_3) \\ \end{aligned}\]Therefore, we can parallelize the computation with a parallel associative scan into a tree structure, for example, \(h_4\) can be computed by
\[\begin{aligned} (e_1 \cdot e_2 \cdot e_3 \cdot e_4) &= (e_1 \cdot e_2) \cdot (e_3 \cdot e_4) \\ &= (\overline{A}^2, \overline{A}\overline{B}x_1 + \overline{B}x_2) \cdot (\overline{A}^2, \overline{A}\overline{B}x_3 + \overline{B}x_4) \\ & = (\overline{A}^4, \overline{A}^3\overline{B} x_1 + \overline{A}^2\overline{B} x_2 + \overline{A}\overline{B} x_3 + \overline{B} x_4) \\ & = (\overline{A}^4, h_4) \\ \end{aligned}\]References
[1] A. Gu, T. Dao, S. Ermon, A. Rudra, and C. Ré, “Hippo: Recurrent memory with optimal polynomial projections,” Advances in neural information processing systems, vol. 33, pp. 1474– 1487, 2020.
[2] A. Gu, K. Goel, A. Gupta, and C. Ré, “On the parameterization and initialization of diagonal state space models,” Advances in Neural Information Processing Systems, vol. 35, pp. 35971–35983, 2022.
[3] A. Gu and T. Dao, “Mamba: Linear-time sequence modeling with selective state spaces,” arXiv preprint arXiv:2312.00752, 2023.
[4] https://en.wikipedia.org/wiki/State-space_representation