Continuous Space MDPs
Last Time
What are the differences between online and offline solutions?
Are there solution techniques that are independent of the state space size?
Guiding Questions
- What tools do we have to solve MDPs with continuous \(S\) and \(A\)?
Current Tool-Belt
Continuous \(S\) and \(A\)
e.g. \(S \subseteq \mathbb{R}^n\), \(A \subseteq \mathbb{R}^m\)
The old rules still work!
Today: Four Tools
1. Linear Dynamics, Quadratic Reward
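With linear dynamics and quadratic reward, the optimal policy is a linear feedback law that can be found with a backward Riccati recursion. The following is a minimal scalar sketch, assuming dynamics \(s' = a_c s + b_c u\), reward \(-(q s^2 + r u^2)\), no discounting, and a terminal cost of \(q s^2\). The coefficient names and numeric values are illustrative assumptions, not taken from the slides.

```python
def lqr_gains(a_coef, b_coef, q, r, horizon):
    """Backward Riccati recursion for a scalar linear-quadratic problem.

    Returns the time-ordered feedback gains k_t (policy u_t = -k_t * s_t)
    and the cost-to-go coefficient p at time 0 (cost-to-go is p * s^2).
    """
    p = q  # terminal cost-to-go coefficient (assumed equal to q)
    gains = []
    for _ in range(horizon):
        # Minimizing r*u^2 + p*(a*s + b*u)^2 over u gives u = -k*s.
        k = (a_coef * b_coef * p) / (r + b_coef * b_coef * p)
        # Substitute the minimizer back in to get the new coefficient.
        p = q + a_coef * a_coef * p - a_coef * b_coef * p * k
        gains.append(k)
    gains.reverse()  # recursion runs backward in time
    return gains, p


gains, p0 = lqr_gains(a_coef=1.0, b_coef=1.0, q=1.0, r=1.0, horizon=50)
print(gains[0], p0)
```

For long horizons the gain stops changing, which is why a single constant gain is often used in practice.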
If not Linear Quadratic...
Offline:
- Approximate Dynamic Programming (ADP)
- Policy Search
Online:
- Model Predictive Control (MPC)
- Sparse Tree Search/Progressive Widening
\[V_\theta\]
2. Value Function Approximation
\(V_\theta (s) = f_\theta (s)\) (e.g. neural network)
\(V_\theta (s) = \theta^\top \beta(s)\) (linear in the features \(\beta(s)\))
Fitted Value Iteration
while not converged
\(\theta \gets \theta'\)
\(\hat{V}' \gets B_{\text{approx}}[V_\theta]\)
\(\theta' \gets \text{fit}(\hat{V}')\)
\[B_{\text{MC}(N)} [V_\theta ](s) = \max_a \left(R(s, a) + \frac{\gamma}{N} \sum_{i = 1}^N V_\theta(G(s, a, w_i))\right)\]
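The loop above can be sketched on a toy problem. This is a minimal example, assuming a 1D state in \([-1, 1]\), three discrete actions, a generative model \(G(s, a, w) = \text{clip}(s + a + w)\) with Gaussian noise, and reward \(-s^2\); all of these are illustrative assumptions. The approximator is piecewise-linear interpolation on a grid, so \(\theta\) is the vector of grid values and the fit step simply stores the backed-up values. The backup averages \(N\) sampled next-state values. Because the backup is noisy, the sketch runs a fixed number of iterations instead of testing for convergence.

```python
import random

GAMMA = 0.9
STEP = 0.1
ACTIONS = [-0.1, 0.0, 0.1]
GRID = [-1.0 + STEP * i for i in range(21)]  # grid points on [-1, 1]


def reward(s, a):
    return -s * s


def generate(s, a, w):
    """Generative model G(s, a, w): noisy step, clipped to [-1, 1]."""
    return max(-1.0, min(1.0, s + a + w))


def value(theta, s):
    """V_theta(s): linear interpolation between the two neighboring grid points."""
    x = (s - GRID[0]) / STEP
    i = min(int(x), len(GRID) - 2)
    frac = x - i
    return (1.0 - frac) * theta[i] + frac * theta[i + 1]


def mc_backup(theta, s, rng, n=20):
    """Monte Carlo Bellman backup using n sampled next states per action."""
    best = float("-inf")
    for a in ACTIONS:
        total = sum(value(theta, generate(s, a, rng.gauss(0.0, 0.05)))
                    for _ in range(n))
        best = max(best, reward(s, a) + GAMMA * total / n)
    return best


def fitted_value_iteration(iterations=100, seed=0):
    rng = random.Random(seed)
    theta = [0.0] * len(GRID)
    for _ in range(iterations):
        targets = [mc_backup(theta, s, rng) for s in GRID]
        theta = targets  # "fit": for grid interpolation, theta is the grid values
    return theta


theta = fitted_value_iteration()
print(theta[0], theta[10], theta[20])
```

With a global approximator such as a neural network, the fit step would instead be a regression of the parameters onto the backed-up targets.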
Function Approximation
- Global: (e.g. Fourier, neural network)
- Local: (e.g. simplex interpolation)
Multilinear interpolation: weighting of \(2^d\) points
Simplex interpolation: weighting of only \(d+1\) points!
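The difference in the number of weighted points can be made concrete. The sketch below computes both sets of weights for a point in the unit hypercube, assuming each grid cell is split with a Kuhn (Freudenthal) triangulation, which is one common choice and not necessarily the one intended in the slides. The function names are hypothetical.

```python
from itertools import product


def multilinear_weights(x):
    """Weights on all 2^d corners of the unit hypercube containing x."""
    weights = {}
    for corner in product((0, 1), repeat=len(x)):
        w = 1.0
        for xi, ci in zip(x, corner):
            w *= xi if ci == 1 else 1.0 - xi
        weights[corner] = w
    return weights


def simplex_weights(x):
    """Weights on the d+1 vertices of the Kuhn simplex containing x."""
    # Sorting coordinates in descending order identifies the simplex.
    order = sorted(range(len(x)), key=lambda i: x[i], reverse=True)
    sorted_x = [x[i] for i in order]
    vertex = [0] * len(x)
    weights = {tuple(vertex): 1.0 - sorted_x[0]}
    for j, i in enumerate(order):
        vertex[i] = 1  # walk from the origin corner toward (1, ..., 1)
        nxt = sorted_x[j + 1] if j + 1 < len(x) else 0.0
        weights[tuple(vertex)] = sorted_x[j] - nxt
    return weights


point = (0.7, 0.2, 0.5)
mw = multilinear_weights(point)
sw = simplex_weights(point)
print(len(mw), len(sw))
```

In both schemes the weights are nonnegative and sum to one, so the interpolated value is a convex combination of the stored vertex values.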
Function Approximation: Mountain Car
(Fourier, 17 params)
(Polynomial, 28 params)
(Kernel, > 100 params)
Function Approximation
Policy Search
\[\underset{\theta}{\text{maximize}} \quad U(\pi_\theta)\]
Common Approaches
- Evolutionary Algorithms
- Cross Entropy Method
- Policy Gradient (will cover in RL section)
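As one illustration of these approaches, here is a minimal cross entropy method sketch. It assumes a one-parameter linear policy \(\pi_\theta(s) = -\theta s\) on the toy deterministic system \(s' = s + a\) with reward \(-(s^2 + 0.1 a^2)\); the problem, the Gaussian search distribution, and all numeric settings are illustrative assumptions.

```python
import random
import statistics


def utility(theta, horizon=20, s0=1.0):
    """U(pi_theta): return of the policy a = -theta * s from start state s0."""
    s, total = s0, 0.0
    for _ in range(horizon):
        a = -theta * s
        total += -(s * s + 0.1 * a * a)
        s = s + a
    return total


def cross_entropy_method(iterations=20, samples=50, elites=10, seed=0):
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0  # initial Gaussian search distribution over theta
    for _ in range(iterations):
        candidates = [rng.gauss(mu, sigma) for _ in range(samples)]
        candidates.sort(key=utility, reverse=True)  # best first
        best = candidates[:elites]
        # Refit the search distribution to the elite samples.
        mu = statistics.mean(best)
        sigma = statistics.pstdev(best)
    return mu


theta_star = cross_entropy_method()
print(theta_star, utility(theta_star))
```

Only evaluations of \(U(\pi_\theta)\) are needed, so the same loop applies when the utility is estimated from noisy rollouts.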
Break
What will a Monte Carlo Tree Search tree look like if run on a problem with continuous spaces?
3. Sparse Tree Search/Progressive Widening
Add a new branch if \(C < k N^\alpha\) (\(\alpha < 1\)), where \(C\) is the number of children of the node and \(N\) is its visit count
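The widening rule can be sketched for a single tree node. This minimal example assumes \(k = 2\), \(\alpha = 0.5\), actions sampled uniformly from \([-1, 1]\), and a uniform random choice among existing children as a placeholder for the usual selection rule (e.g. UCB); these are all illustrative assumptions.

```python
import random


def should_widen(num_children, num_visits, k=2.0, alpha=0.5):
    """Progressive widening test: add a new branch if C < k * N^alpha."""
    return num_children < k * num_visits ** alpha


def simulate_node(total_visits=100, seed=0):
    rng = random.Random(seed)
    children = []  # sampled continuous actions
    counts = []    # visit count of each child
    for n in range(1, total_visits + 1):
        if should_widen(len(children), n):
            children.append(rng.uniform(-1.0, 1.0))  # sample a new action
            counts.append(1)
        else:
            # Placeholder selection; a real tree search would use e.g. UCB.
            i = rng.randrange(len(children))
            counts[i] += 1
    return children, counts


children, counts = simulate_node()
print(len(children), sum(counts))
```

The number of children grows sublinearly in the visit count, so each existing branch keeps getting revisited and the tree can grow deeper than one level.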
4. Model Predictive Control
\[\underset{a_{1:d},s_{1:d}}{\text{maximize}} \quad \sum_{t=1}^d \gamma^t R(s_t, a_t)\]
\[\text{subject to} \quad s_{t+1} = \text{E}[T(s_t, a_t)] \quad \forall t\]
\[\underset{a_{1:d},s^{(1:m)}_{1:d}}{\text{maximize}} \quad \frac{1}{m} \sum_{i=1}^m \sum_{t=1}^d \gamma^t R(s^{(i)}_t, a_t)\]
\[\text{subject to} \quad s^{(i)}_{t+1} = G(s_t^{(i)}, a_t, w_t^{(i)}) \quad \forall t, i\]
\[\underset{a_{1:d}^{(1:m)},s^{(1:m)}_{1:d}}{\text{maximize}} \quad \frac{1}{m} \sum_{i=1}^m \sum_{t=1}^d \gamma^t R(s^{(i)}_t, a_t^{(i)})\]
\[\text{subject to} \quad s^{(i)}_{t+1} = G(s_t^{(i)}, a_t^{(i)}, w_t^{(i)}) \quad \forall t, i\]
\[a_1^{(i)} = a_1^{(j)} \quad \forall i, j\]
(Use off-the-shelf optimization software, e.g. Ipopt)
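The receding-horizon loop for the certainty-equivalent formulation can be sketched as follows. To stay self-contained, this sketch swaps the off-the-shelf solver for plain random shooting (sample many action sequences, keep the best), which is far cruder than Ipopt. The toy system \(s' = s + a + w\) with zero-mean noise, the reward, the action bounds, and all numeric settings are illustrative assumptions.

```python
import random

GAMMA = 0.95


def reward(s, a):
    return -(s * s + 0.1 * a * a)


def mean_dynamics(s, a):
    """Certainty-equivalent model: the zero-mean noise is replaced by its mean."""
    return s + a


def plan(s, rng, depth=3, samples=1000):
    """Random-shooting stand-in for a real solver: best sampled a_{1:d}."""
    best_seq, best_val = None, float("-inf")
    for _ in range(samples):
        seq = [rng.uniform(-0.5, 0.5) for _ in range(depth)]
        x, val = s, 0.0
        for t, a in enumerate(seq, start=1):
            val += GAMMA ** t * reward(x, a)
            x = mean_dynamics(x, a)
        if val > best_val:
            best_seq, best_val = seq, val
    return best_seq


def run_mpc(s0=1.0, steps=15, seed=0):
    rng = random.Random(seed)
    s = s0
    for _ in range(steps):
        a = plan(s, rng)[0]               # apply only the first planned action
        s = s + a + rng.gauss(0.0, 0.02)  # the true system is noisy
    return s


final_state = run_mpc()
print(final_state)
```

Replanning at every step is what lets the controller react to noise even though each individual plan ignores it.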
Certainty-Equivalent
Open-Loop
Hindsight Optimization
Guiding Questions
-
What tools do we have to solve MDPs with continuous \(S\) and \(A\)?
080 Continuous MDPs
By Zachary Sunberg