Continuous Space MDPs
Last Time
- Neural Network Function Approximation
Guiding Questions
- What tools do we have to solve MDPs with continuous \(S\) and \(A\)?
Current Tool-Belt
Today: Four Tools
Notation: Continuous Random Variables
| Term | Definition | Coinflip Example: \(X \sim \text{Bernoulli}(0.5)\) | Uniform Example: \(X \sim \mathcal{U}([0,1])\) |
|---|---|---|---|
| support(\(X\)) | All the values \(x \in X\) that \(X\) can take | \(\{h, t\}\) or \(\{0,1\}\) | \([0,1]\) |
| Distribution (Discrete: PMF; Continuous: PDF) | Maps each value in the support to a real number indicating its probability | \(P(X=1) = 0.5\), \(P(X=0) = 0.5\); \(P(X)\) is a table: \(P(0) = 0.5\), \(P(1) = 0.5\) | \(p(x) = \mathbf{1}_{[0,1]}(x) = \begin{cases} 1 & \text{if } x \in [0, 1] \\ 0 & \text{o.w.} \end{cases}\); \(P(X=0.5) = 0\); \(P(X \in [a, b]) = \int_a^b p(x)\,dx\) |
| Expectation | First moment of the random variable, "mean": \(E[X]\) | \(E[X] = \sum_{x \in X} x\,P(x) = 0.5\) | \(E[X] = \int_{x \in X} x\,p(x)\,dx = 0.5\) |
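The uniform-example column can be checked numerically. This is a minimal sketch (assuming NumPy) that approximates the integrals with a midpoint Riemann sum:

```python
import numpy as np

# Midpoint Riemann sum checks for X ~ Uniform([0, 1])
N = 100000
dx = 1.0 / N
x = dx / 2 + np.arange(N) * dx        # midpoints of [0, 1]
p = np.ones_like(x)                   # p(x) = 1 on the support

total = np.sum(p * dx)                # integral of the pdf, should be 1
prob_ab = np.sum(p[(x >= 0.2) & (x <= 0.5)] * dx)   # P(X in [0.2, 0.5])
mean = np.sum(x * p * dx)             # E[X], should be 0.5
```

Note that \(P(X = 0.5) = 0\) even though \(p(0.5) = 1\): a single point has zero width, so its integral is zero.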
Rules for Continuous RVs

Discrete:
1) a) \(0 \leq P(X \mid Y) \leq 1\)
   b) \(\sum_{x \in X} P(x \mid Y) = 1\)
2) \(P(X) = \sum_{y \in Y} P(X, y)\)
3) \(P(X \mid Y) = \frac{P(X, Y)}{P(Y)}\), i.e. \(P(X, Y) = P(X \mid Y)\,P(Y)\)

Continuous:
1) a) \(0 \leq p(X \mid Y)\) (a density can exceed 1!)
   b) \(\int_X p(x \mid Y)\,dx = 1\)
2) \(p(X) = \int_{Y} p(X, y)\,dy\)
3) \(p(X \mid Y) = \frac{p(X, Y)}{p(Y)}\), i.e. \(p(X, Y) = p(X \mid Y)\,p(Y)\)
Multivariate Gaussian Distribution
Joint Distribution
Conditional Distribution
Marginal Distribution
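For the multivariate Gaussian, the marginal and conditional distributions are available in closed form. A concrete sketch (the numbers here are illustrative, not from the slides), assuming NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])

# Joint: (X1, X2) ~ N(mu, Sigma)
X = rng.multivariate_normal(mu, Sigma, size=100000)

# Marginal: X1 ~ N(mu[0], Sigma[0,0]) -- just read off the block
marg_mean, marg_var = mu[0], Sigma[0, 0]

# Conditional: X1 | X2 = b is Gaussian with the standard formulas
b = 2.0
cond_mean = mu[0] + Sigma[0, 1] / Sigma[1, 1] * (b - mu[1])
cond_var = Sigma[0, 0] - Sigma[0, 1] ** 2 / Sigma[1, 1]
```

The key convenience: marginalizing or conditioning a Gaussian never requires numerical integration, only matrix arithmetic on blocks of \(\mu\) and \(\Sigma\).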
Continuous \(S\) and \(A\)
e.g. \(S \subseteq \mathbb{R}^n\), \(A \subseteq \mathbb{R}^m\)
The old rules still work!
1. Linear Dynamics, Quadratic Reward
\(S = \mathbb{R}^n\), \(A = \mathbb{R}^m\)
\(w \sim \mathcal{N}(0, \Sigma)\)
\(T(s' \mid s, a) = \mathcal{N}(T_s s + T_a a, \Sigma)\)
\(s' = T_s s + T_a a + w \)
\(R(s, a) = s^\top R_s s + a^\top R_a a\) with \(R_s \preceq 0\), \(R_a \prec 0\) (quadratic costs)
\[U_h^*(s) = \max_{\pi} E\left[ \sum_{t=0}^{h} R(s_t, a_t) \right]\]
Finite Horizon:
\(\pi^*_h\) is "optimal h-step policy"
We will show that \(U_h^*(s) = s^\top V_h s + q_h\) and \(\pi^*_h(s) = -K_h s\)
(Also works with other zero-mean \(w\).)
1. Linear Dynamics, Quadratic Reward
We will show that \(U_h^*(s) = s^\top V_h s + q_h\) and \(\pi^*_h(s) = -K_h s\)
by induction.
Base: \(U^*_1(s) = \max_{a} \left( s^\top R_s s + a^\top R_a a \right) = s^\top R_s s\) (maximized at \(a = 0\) since \(R_a \prec 0\))
Inductive step: show that if \(U^*_t = s^\top V_t s + q_t\), then \(U^*_{t+1} = s^\top V_{t+1} s + q_{t+1}\).
\(U^*_{t+1}(s) = \max_{a} \left( R(s, a) + E\left[ U^*_t(s') \right] \right)\)
\(= \max_{a} \left( s^\top R_s s + a^\top R_a a + \int p(w)\, U^*_t(T_s s + T_a a + w)\, dw \right)\)
\(= s^\top R_s s + \max_{a} \left( a^\top R_a a + \int p(w) \left[ (T_s s + T_a a + w)^\top V_t (T_s s + T_a a + w) + q_t \right] dw \right)\)
\(= s^\top R_s s + s^\top T_s^\top V_t T_s s + \max_{a} \left( a^\top R_a a + 2s^\top T_s^\top V_t T_a a + a^\top T_a^\top V_t T_a a \right) + \int p(w) w^\top V_t w\, dw + q_t\)
(the cross terms linear in \(w\) vanish because \(E[w] = 0\))
\(U^*_{t+1}(s) = s^\top \underbrace{\left( R_s + T_s^\top V_t T_s - (T_a^\top V_t T_s)^\top (R_a + T_a^\top V_t T_a)^{-1} (T_a^\top V_t T_s) \right)}_{V_{t+1}} s + \underbrace{\int p(w) w^\top V_t w dw + q_t}_{q_{t+1}}\)
\(U^*_{t+1}(s) = s^\top V_{t+1} s + q_{t+1} \qquad \square\)
\(a^*\) is where \(\nabla_a (\text{max term}) = 0\)
\(0 = 2R_a a^* + 2T_a^\top V_t T_s s + 2T_a^\top V_t T_a a^*\)
\(a^* = -\underbrace{(R_a + T_a^\top V_t T_a)^{-1} T_a^\top V_t T_s}_{K_t} s\)
1. Linear Dynamics, Quadratic Reward
As \(h \to \infty\)
\(V_\infty = T_s^\top \left( V_\infty - V_\infty T_a \left( T_a^\top V_\infty T_a + R_a \right)^{-1} T_a^\top V_\infty \right) T_s + R_s\)
\(K_\infty = \left( T_a^\top V_\infty T_a + R_a \right)^{-1} T_a^\top V_\infty T_s\)
\(\pi^*_\infty (s) = -K_\infty s\)
(\(K_\infty\) has no dependence on \(\Sigma\))
Certainty-Equivalence Principle: For Linear-Quadratic problems, the optimal policy with noise is the same as the optimal policy without noise!
Practical Implication: If a continuous problem has roughly linear dynamics, a convex cost function, and roughly zero-mean additive noise, you can use certainty-equivalent control, i.e. control as if there is no noise.
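The \(V_\infty\) fixed point can be found by simply iterating the Riccati equation. A minimal NumPy sketch, assuming a hypothetical discretized double-integrator system and negative-definite reward matrices (quadratic costs, matching the sign convention above):

```python
import numpy as np

# Hypothetical double-integrator dynamics: s' = Ts s + Ta a + w, dt = 0.1
Ts = np.array([[1.0, 0.1],
               [0.0, 1.0]])
Ta = np.array([[0.0],
               [0.1]])
Rs = -np.diag([1.0, 0.1])    # quadratic reward = negative cost
Ra = -np.array([[0.1]])

# Iterate the Riccati equation until (approximate) convergence
V = Rs.copy()
for _ in range(500):
    M = Ta.T @ V @ Ta + Ra
    V = Ts.T @ (V - V @ Ta @ np.linalg.inv(M) @ Ta.T @ V) @ Ts + Rs

# Steady-state gain; optimal policy is a = -K s
K = np.linalg.inv(Ta.T @ V @ Ta + Ra) @ Ta.T @ V @ Ts
```

Since \(K_\infty\) never touches \(\Sigma\), the same code works for any zero-mean noise level, which is the certainty-equivalence principle in action.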
If not Linear Quadratic...
Offline:
- Approximate Dynamic Programming (ADP)
- Policy Search
Online:
- Model Predictive Control (MPC)
- Sparse Tree Search/Progressive Widening



2. Value Function Approximation
\(V_\theta (s) = f_\theta (s)\) (e.g. neural network)
\(V_\theta (s) = \theta^\top \beta(s)\) (linear feature)
Fitted Value Iteration
while not converged:
- \(\hat{V}' \gets B_{\text{approx}}[V_\theta]\)
- \(\theta' \gets \text{fit}(\hat{V}')\)
- \(\theta \gets \theta'\)
\[B_{\text{MC}(N)} [V_\theta ](s) = \max_a \left(R(s, a) + \gamma \frac{1}{N} \sum_{i = 1}^N V_\theta(G(s, a, w_i))\right)\]
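A minimal fitted value iteration sketch, assuming NumPy, a hypothetical 1-D generative model \(G\), polynomial features, and a small discretized action set for the max:

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.9
actions = np.array([-1.0, 0.0, 1.0])    # discretized action set (assumption)

def G(s, a, w):                          # generative model: s' = s + 0.5 a + w
    return np.clip(s + 0.5 * a + w, -5.0, 5.0)

def R(s, a):                             # quadratic cost as reward
    return -(s**2) - 0.1 * a**2

def beta(s):                             # polynomial features [1, s, s^2]
    return np.stack([np.ones_like(s), s, s**2], axis=-1)

theta = np.zeros(3)                      # V_theta(s) = theta . beta(s)
S = rng.uniform(-5, 5, size=200)         # fitting states
for _ in range(100):
    # Monte Carlo approximate Bellman backup B_MC(N) at each fitting state
    targets = np.full_like(S, -np.inf)
    for a in actions:
        w = rng.normal(0, 0.1, size=(10, S.size))        # N = 10 noise samples
        q = R(S, a) + gamma * (beta(G(S, a, w)) @ theta).mean(axis=0)
        targets = np.maximum(targets, q)
    theta, *_ = np.linalg.lstsq(beta(S), targets, rcond=None)  # fit step
```

Here `fit` is just linear least squares because \(V_\theta(s) = \theta^\top \beta(s)\); with a neural network it would be a regression training loop instead.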


Function Approximation: Mountain Car




(Figure: Mountain Car value function fit with Fourier basis, 17 params; polynomial basis, 28 params; kernel methods, > 100 params)
Function Approximation
- Global: (e.g. Fourier, neural network)
- Local: (e.g. simplex interpolation)

(Figure: grid-based interpolation in \(d\) dimensions uses a weighting of \(2^d\) points; simplex interpolation uses a weighting of only \(d+1\) points!)
Policy Search


\[\underset{\theta}{\text{maximize}} \quad U(\pi_\theta)\]
Common Approaches
- Evolutionary Algorithms
- Cross Entropy Method
- Policy Gradient (will cover in RL section)
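The cross entropy method is easy to sketch: sample policy parameters from a search distribution, evaluate rollout returns, and refit the distribution to the elite samples. A toy example, assuming NumPy and a hypothetical 1-D problem with a linear policy \(a = -\theta s\):

```python
import numpy as np

rng = np.random.default_rng(0)

def rollout_return(theta):
    # hypothetical 1-D MDP: s' = s + 0.5 a + w, reward -s^2
    s, total = 3.0, 0.0
    for _ in range(20):
        a = -theta[0] * s                       # linear policy
        s = s + 0.5 * a + rng.normal(0, 0.05)
        total += -(s**2)
    return total

# Cross entropy method: Gaussian search distribution over theta
mu, sigma = np.zeros(1), np.ones(1)
for _ in range(30):
    pop = rng.normal(mu, sigma, size=(50, 1))   # sample candidates
    scores = np.array([rollout_return(th) for th in pop])
    elite = pop[np.argsort(scores)[-10:]]       # keep the top 20%
    mu, sigma = elite.mean(axis=0), elite.std(axis=0) + 1e-3
```

Note that CEM only needs rollout returns, not gradients, so it works with any parameterization of \(\pi_\theta\), including non-differentiable ones.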

Break
What will a Monte Carlo Tree Search tree look like if run on a problem with continuous spaces?
3. Sparse Tree Search/Progressive Widening
add a new branch if \(C < k N^\alpha\), where \(C\) is the number of children, \(N\) is the node's visit count, and \(\alpha < 1\)
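The widening rule can be sketched in isolation (the `Node` class and action sampler here are hypothetical stand-ins for a full MCTS implementation over continuous \(A\)):

```python
class Node:
    def __init__(self):
        self.visits = 0
        self.children = {}

def visit(node, k=2.0, alpha=0.5, sample_action=float):
    node.visits += 1
    # progressive widening: add a new child only while C < k * N^alpha
    if len(node.children) < k * node.visits ** alpha:
        a = sample_action(len(node.children))   # stand-in for sampling from A
        node.children[a] = Node()

root = Node()
for _ in range(1000):
    visit(root)
```

Because \(\alpha < 1\), the number of children grows sublinearly in the visit count (here roughly \(2\sqrt{1000} \approx 63\)), so the tree keeps revisiting existing branches deeply instead of branching infinitely at every node.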

4. Model Predictive Control
\[\underset{a_{1:d},s_{1:d}}{\text{maximize}} \quad \sum_{t=1}^d \gamma^t R(s_t, a_t)\]
\[\text{subject to} \quad s_{t+1} = \text{E}_{s' \sim T(\cdot \mid s_t, a_t)}[s'] \quad \forall t\]
\[\underset{a_{1:d},s^{(1:m)}_{1:d}}{\text{maximize}} \quad \frac{1}{m} \sum_{i=1}^m \sum_{t=1}^d \gamma^t R(s^{(i)}_t, a_t)\]
\[\text{subject to} \quad s^{(i)}_{t+1} = G(s^{(i)}_t, a_t, w^{(i)}_t) \quad \forall t, i\]
\[\underset{a_{1:d}^{(1:m)},s^{(1:m)}_{1:d}}{\text{maximize}} \quad \frac{1}{m} \sum_{i=1}^m \sum_{t=1}^d \gamma^t R(s^{(i)}_t, a_t^{(i)})\]
\[\text{subject to} \quad s^{(i)}_{t+1} = G(s^{(i)}_t, a^{(i)}_t, w^{(i)}_t) \quad \forall t, i\]
\[a_1^{(i)} = a_1^{(j)} \quad \forall i, j\]
(Use off-the-shelf optimization software, e.g. Ipopt)
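A certainty-equivalent MPC step is just a small optimization problem. This sketch assumes NumPy, hypothetical 1-D mean dynamics \(s' = s + 0.5a\), and reward \(-s^2 - 0.1a^2\); it uses plain numerical gradient ascent as a stand-in for an off-the-shelf solver like Ipopt:

```python
import numpy as np

# Certainty-equivalent MPC: plan over the mean dynamics only
d, gamma, s0, lr = 10, 0.95, 3.0, 0.05

def objective(a):
    s, total = s0, 0.0
    for t in range(d):
        s = s + 0.5 * a[t]                     # roll out the mean dynamics
        total += gamma**t * (-(s**2) - 0.1 * a[t]**2)
    return total

a = np.zeros(d)
for _ in range(2000):                          # numerical gradient ascent
    g = np.zeros(d)                            # (stand-in for a real NLP solver)
    for i in range(d):
        e = np.zeros(d); e[i] = 1e-4
        g[i] = (objective(a + e) - objective(a - e)) / 2e-4
    a = a + lr * g

a1 = a[0]    # execute only the first action, then re-plan at the next step
```

The closed loop comes from re-planning: only \(a_1\) is executed, the true (noisy) next state is observed, and the optimization is solved again from there.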
(The three formulations above, in order: Certainty-Equivalent, Open-Loop, and Hindsight Optimization.)
Guiding Questions
- What tools do we have to solve MDPs with continuous \(S\) and \(A\)?
080 Continuous MDPs
By Zachary Sunberg