Continuous Space MDPs

Last Time

  • Neural Network Function Approximation

Guiding Questions

  • What tools do we have to solve MDPs with continuous \(S\) and \(A\)?

Current Tool-Belt

Today: Four Tools

Notation: Continuous Random Variables

| Term | Definition | Coinflip Example: \(\text{Bernoulli}(0.5)\) | Uniform Example: \(\mathcal{U}([0,1])\) |
| --- | --- | --- | --- |
| support(\(X\)) | All the values that \(X\) can take (we write \(x \in X\)) | \(\{h, t\}\) or \(\{0,1\}\) | \([0,1]\) |
| Distribution | Maps each value in the support to a real number indicating its probability. Discrete: PMF; Continuous: PDF | \(P(X=1) = 0.5\), \(P(X=0) = 0.5\); \(P(X)\) is a table: \(P(0) = 0.5\), \(P(1) = 0.5\) | \(p(x) = \mathbf{1}_{[0,1]}(x)\) (1 if \(x \in [0,1]\), 0 otherwise); \(P(X = 0.5) = 0\); \(P(X \in [a, b]) = \int_a^b p(x)\,dx\) |
| Expectation | First moment of the random variable, the "mean": \(E[X]\) | \(E[X] = \sum_{x \in X} x\,P(x) = 0.5\) | \(E[X] = \int_{x \in X} x\,p(x)\,dx = 0.5\) |
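These definitions are easy to check numerically; a minimal sketch (grid sizes are arbitrary):

```python
import numpy as np

def trapezoid(y, x):
    """Trapezoidal rule: approximate the integral of samples y over grid x."""
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x) / 2.0))

# Coinflip: discrete PMF over support {0, 1}
support = np.array([0.0, 1.0])
pmf = np.array([0.5, 0.5])
e_coin = float(np.sum(support * pmf))       # E[X] = sum_x x P(x) = 0.5

# Uniform([0,1]): continuous PDF p(x) = 1 on [0, 1]
xs = np.linspace(0.0, 1.0, 1001)
pdf = np.ones_like(xs)
e_unif = trapezoid(xs * pdf, xs)            # E[X] = ∫ x p(x) dx = 0.5

# P(X in [a, b]) = ∫_a^b p(x) dx; any single point has probability 0
a, b = 0.25, 0.75
xs_ab = np.linspace(a, b, 501)
prob_ab = trapezoid(np.ones_like(xs_ab), xs_ab)   # = b - a = 0.5

print(e_coin, round(e_unif, 6), round(prob_ab, 6))
```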

Rules for Continuous RVs

Discrete:

1) a) \(0 \leq P(X \mid Y) \leq 1\)     b) \(\sum_{x \in X} P(x \mid Y) = 1\)

2) \(P(X) = \sum_{y \in Y} P(X, y)\)

3) \(P(X \mid Y) = \frac{P(X, Y)}{P(Y)}\), equivalently \(P(X, Y) = P(X \mid Y) \, P(Y)\)

Continuous:

1) a) \(0 \leq p(X \mid Y)\) (no upper bound of 1: a density can exceed 1)     b) \(\int_X p(x \mid Y) \, dx = 1\)

2) \(p(X) = \int_{Y} p(X, y) \, dy\)

3) \(p(X \mid Y) = \frac{p(X, Y)}{p(Y)}\), equivalently \(p(X, Y) = p(X \mid Y) \, p(Y)\)
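The continuous rules can be verified numerically for a simple joint density; a sketch, with \(p(x, y) = x + y\) on the unit square chosen for illustration:

```python
import numpy as np

def trapezoid(y, x):
    """Trapezoidal rule: approximate the integral of samples y over grid x."""
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x) / 2.0))

xs = np.linspace(0.0, 1.0, 501)
# Joint density p(x, y) = x + y on [0,1]^2 (a valid density: it integrates to 1)
joint = xs[:, None] + xs[None, :]           # joint[i, j] = p(xs[i], xs[j])

# Rule 2: marginal p(x) = ∫ p(x, y) dy  (analytically x + 1/2)
marginal = np.array([trapezoid(joint[i], xs) for i in range(len(xs))])

# Rule 1b: the marginal (and hence the joint) integrates to 1
total = trapezoid(marginal, xs)

# Rule 3: conditional p(x | y = 0.5) = p(x, 0.5) / p(y = 0.5)
j = len(xs) // 2                            # grid index of y = 0.5
p_y = trapezoid(joint[:, j], xs)            # p(y = 0.5) = ∫ p(x, 0.5) dx = 1
cond = joint[:, j] / p_y                    # also integrates to 1

print(round(total, 6), round(trapezoid(cond, xs), 6))  # prints: 1.0 1.0
```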

Multivariate Gaussian Distribution

Joint Distribution

Conditional Distribution

Marginal Distribution

Continuous \(S\) and \(A\)

e.g. \(S \subseteq \mathbb{R}^n\), \(A \subseteq \mathbb{R}^m\)

The old rules still work!

1. Linear Dynamics, Quadratic Reward

\(S = \mathbb{R}^n\), \(A = \mathbb{R}^m\)

\(w \sim \mathcal{N}(0, \Sigma)\)

\(T(s' \mid s, a) = \mathcal{N}(T_s s + T_a a, \Sigma)\)

\(s' = T_s s + T_a a + w \)

\(R(s, a) = s^\top R_s s + a^\top R_a a\)      (with \(R_s \preceq 0\), \(R_a \prec 0\), so the reward penalizes state and action)

\[U_h^*(s) = \max_{\pi} E\left[ \sum_{t=0}^{h} R(s_t, a_t) \right]\]

Finite Horizon:

\(\pi^*_h\) is "optimal h-step policy"

We will show that    \(U_h^*(s) = s^\top V_h s + q_h\)     and    \(\pi^*_h(s) = -K_h s\)

(Also works with other zero-mean \(w\).)

1. Linear Dynamics, Quadratic Reward

We will show that    \(U_h^*(s) = s^\top V_h s + q_h\)     and    \(\pi^*_h(s) = -K_h s\)

by induction.

Base: \(U^*_1(s) = \max_{a} \left( s^\top R_s s + a^\top R_a a \right) = s^\top R_s s\)      (maximized at \(a = 0\) since \(R_a \prec 0\))

Inductive step: show that    if     \(U^*_t = s^\top V_t s + q_t\),     then     \(U^*_{t+1} = s^\top V_{t+1} s + q_{t+1}\).

\(U^*_{t+1}(s) = \max_{a} \left( R(s, a) + E\left[ U^*_t(s') \right] \right)\)

\(= \max_{a} \left( s^\top R_s s + a^\top R_a a + \int p(w)\, U^*_t(T_s s + T_a a + w)\, dw \right)\)

\(= s^\top R_s s + \max_{a} \left( a^\top R_a a + \int p(w)\left[ (T_s s + T_a a + w)^\top V_t (T_s s + T_a a + w) + q_t \right] dw \right)\)

\(= s^\top R_s s + s^\top T_s^\top V_t T_s s + \max_{a} \left( a^\top R_a a + 2s^\top T_s^\top V_t T_a a + a^\top T_a^\top V_t T_a a \right) + \int p(w) w^\top V_t w  dw + q_t\)

\(U^*_{t+1}(s) = s^\top \underbrace{\left( R_s + T_s^\top V_t T_s - (T_a^\top V_t T_s)^\top (R_a + T_a^\top V_t T_a)^{-1} (T_a^\top V_t T_s) \right)}_{V_{t+1}} s + \underbrace{\int p(w) w^\top V_t w  dw + q_t}_{q_{t+1}}\)

\(U^*_{t+1}(s) = s^\top V_{t+1} s + q_{t+1} \qquad \square\)

\(a^*\) is where \(\nabla_a (\text{max term}) = 0\)

\(0 = 2R_a a^* + 2T_a^\top V_t T_s s + 2T_a^\top V_t T_a a^*\)

\(a^* = -\underbrace{(R_a + T_a^\top V_t T_a)^{-1} T_a^\top V_t T_s}_{K_t} s\)
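The recursion above can be iterated numerically. A minimal sketch for an illustrative double-integrator system (the matrices are assumptions, with \(R_s, R_a\) negative definite so the maximization is well-posed):

```python
import numpy as np

# Toy double integrator (illustrative): s = [position, velocity], a = force
Ts = np.array([[1.0, 0.1],
               [0.0, 1.0]])
Ta = np.array([[0.0],
               [0.1]])
Rs = -np.diag([1.0, 0.1])    # negative definite: reward penalizes state
Ra = -np.array([[0.5]])      # negative definite: reward penalizes action

V = Rs                       # base case: V_1 = R_s
for _ in range(500):         # inductive step, iterated toward V_inf
    M = Ta.T @ V @ Ts                           # T_a^T V_t T_s
    K = np.linalg.solve(Ra + Ta.T @ V @ Ta, M)  # K_t
    V = Rs + Ts.T @ V @ Ts - M.T @ K            # V_{t+1}

print(np.round(K, 3))        # optimal policy: pi*(s) = -K s
```

A quick sanity check: the closed-loop dynamics \(T_s - T_a K\) should be stable (all eigenvalues inside the unit circle).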

1. Linear Dynamics, Quadratic Reward

As \(h \to \infty\)

\(V_\infty = T_s^\top \left( V_\infty - V_\infty T_a \left( T_a^\top V_\infty T_a + R_a \right)^{-1} T_a^\top V_\infty \right) T_s + R_s\)

\(K_\infty = \left( T_a^\top V_\infty T_a + R_a \right)^{-1} T_a^\top V_\infty T_s\)

\(\pi^*_\infty (s) = -K_\infty s\)

(\(K_\infty\) has no dependence on \(\Sigma\))

Certainty-Equivalence Principle: For Linear-Quadratic problems, the optimal policy with noise is the same as the optimal policy without noise!

Practical Implication: If a continuous problem has roughly linear dynamics, a convex cost function, and roughly zero-mean additive noise, you can use certainty-equivalent control, i.e. control as if there is no noise.

If not Linear Quadratic...

Offline:

  • Approximate Dynamic Programming (ADP)
  • Policy Search

Online:

  • Model Predictive Control (MPC)
  • Sparse Tree Search/Progressive Widening


2. Value Function Approximation

\(V_\theta (s) = f_\theta (s)\)      (e.g. neural network)

\(V_\theta (s) = \theta^\top \beta(s)\)      (linear in features \(\beta(s)\))

Fitted Value Iteration

\(\theta' \gets\) initial parameters

while not converged

    \(\theta \gets \theta'\)

    \(\hat{V}' \gets B_{\text{approx}}[V_\theta]\)

    \(\theta' \gets \text{fit}(\hat{V}')\)

\[B_{\text{MC}(N)} [V_\theta ](s) = \max_a \left(R(s, a) + \frac{\gamma}{N} \sum_{i = 1}^N V_\theta(G(s, a, w_i))\right)\]
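Putting the loop and the Monte Carlo backup together for a toy 1-D problem (the generative model \(G\), reward, and features here are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.95
actions = np.linspace(-1.0, 1.0, 5)

def G(s, a, w):                       # assumed generative model: noisy 1-D dynamics
    return np.clip(0.9 * s + 0.1 * a + w, -1.0, 1.0)

def R(s, a):                          # assumed reward: quadratic penalty
    return -s**2 - 0.1 * a**2

def beta(s):                          # linear features: [1, s, s^2]
    return np.array([np.ones_like(s), s, s**2])

theta = np.zeros(3)
S = np.linspace(-1.0, 1.0, 41)        # states to fit at
for _ in range(50):                   # fitted value iteration
    W = rng.normal(0.0, 0.05, size=10)
    # Monte Carlo backup: B[V](s) = max_a ( R(s,a) + gamma * mean_i V(G(s,a,w_i)) )
    targets = np.max([R(S, a) + gamma * np.mean(
        [theta @ beta(G(S, a, w)) for w in W], axis=0) for a in actions], axis=0)
    theta, *_ = np.linalg.lstsq(beta(S).T, targets, rcond=None)  # fit step

print(np.round(theta, 2))
```

Since the reward is a negative quadratic, the fitted quadratic coefficient \(\theta_3\) should come out negative.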

Function Approximation: Mountain Car

[Plots: Mountain Car value function approximated with a Fourier basis (17 params), polynomials (28 params), and a kernel method (> 100 params).]

Function Approximation

  • Global: each parameter affects the approximation everywhere (e.g. Fourier basis, neural network)
  • Local: each parameter affects the approximation only nearby (e.g. simplex interpolation)

Multilinear grid interpolation is a weighting of \(2^d\) points; simplex interpolation is a weighting of only \(d+1\) points!


Policy Search

\[\underset{\theta}{\text{maximize}} \quad U(\pi_\theta)\]

Common Approaches

  • Evolutionary Algorithms
  • Cross Entropy Method
  • Policy Gradient (will cover in RL section)
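As an example of the second approach, a minimal Cross Entropy Method sketch for a 1-D linear policy on illustrative toy dynamics: sample parameters from a Gaussian, score each by rollout return, and refit the Gaussian to the elite samples.

```python
import numpy as np

rng = np.random.default_rng(0)

def U(theta):
    """Return of the linear policy a = -theta[0] * s on toy 1-D dynamics (assumed)."""
    s, total = 1.0, 0.0
    for _ in range(20):
        a = -theta[0] * s
        s = 0.9 * s + 0.1 * a + rng.normal(0.0, 0.01)
        total += -s**2 - 0.01 * a**2
    return total

# Cross Entropy Method: maximize U(pi_theta) over theta
mu, sigma = np.zeros(1), np.ones(1)
for _ in range(30):
    samples = rng.normal(mu, sigma, size=(100, 1))   # sample policy parameters
    scores = np.array([U(th) for th in samples])
    elites = samples[np.argsort(scores)[-10:]]       # keep the top 10%
    mu, sigma = elites.mean(axis=0), elites.std(axis=0) + 1e-3

print(np.round(mu, 2))   # mean of the fitted Gaussian = estimated best gain
```

A positive gain drives the state toward zero faster than it decays on its own, so the search should settle on some \(\theta_1 > 0\).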

Break

What will a Monte Carlo Tree Search tree look like if run on a problem with continuous spaces?

3. Sparse Tree Search/Progressive Widening

add a new child branch if \(C < k N^\alpha\) (\(\alpha < 1\)), where \(C\) is the node's current number of children and \(N\) its visit count
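In code, this criterion gates whether a tree node expands a new branch or revisits an existing one; a minimal sketch (the constants \(k\) and \(\alpha\) are tuning parameters):

```python
def should_widen(num_children: int, num_visits: int,
                 k: float = 4.0, alpha: float = 0.5) -> bool:
    """Add a new child only while C < k * N^alpha (alpha < 1), so the
    branching factor grows sublinearly with the node's visit count."""
    return num_children < k * num_visits ** alpha

# Early on, new branches are added freely; later, existing ones are revisited.
print(should_widen(2, 1), should_widen(8, 4))   # prints: True False
```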

4. Model Predictive Control

\[\underset{a_{1:d},s_{1:d}}{\text{maximize}} \quad \sum_{t=1}^d \gamma^t R(s_t, a_t)\]

\[\text{subject to} \quad s_{t+1} = \text{E}_{s' \sim T(s' \mid s_t, a_t)}[s'] \quad \forall t\]

\[\underset{a_{1:d},s^{(1:m)}_{1:d}}{\text{maximize}} \quad \frac{1}{m} \sum_{i=1}^m \sum_{t=1}^d \gamma^t R(s^{(i)}_t, a_t)\]

\[\text{subject to} \quad s^{(i)}_{t+1} = G(s_t^{(i)}, a_t, w_t^{(i)}) \quad \forall t, i\]

\[\underset{a_{1:d}^{(1:m)},s^{(1:m)}_{1:d}}{\text{maximize}} \quad \frac{1}{m} \sum_{i=1}^m \sum_{t=1}^d \gamma^t R(s^{(i)}_t, a_t^{(i)})\]

\[\text{subject to} \quad s^{(i)}_{t+1} = G(s_t^{(i)}, a_t^{(i)}, w_t^{(i)}) \quad \forall t, i\]

\[a_1^{(i)} = a_1^{(j)} \quad \forall i, j\]

(Use off-the-shelf optimization software, e.g. Ipopt)
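A single certainty-equivalent MPC step can indeed be handed to a generic optimizer; a sketch using scipy rather than Ipopt (the dynamics and reward are illustrative, and only the first planned action is executed before re-planning):

```python
import numpy as np
from scipy.optimize import minimize

gamma, d = 0.95, 10
Ts = np.array([[1.0, 0.1],
               [0.0, 1.0]])          # illustrative linear dynamics
Ta = np.array([[0.0],
               [0.1]])

def neg_return(actions, s0):
    """Negative discounted return of an open-loop plan (for a minimizer)."""
    s, total = s0, 0.0
    for t, a in enumerate(actions):
        # certainty-equivalent: propagate the mean next state, ignoring noise
        s = Ts @ s + Ta.flatten() * a
        total += gamma**(t + 1) * (-(s @ s) - 0.1 * a**2)
    return -total

s0 = np.array([1.0, 0.0])            # start at position 1, velocity 0
res = minimize(neg_return, np.zeros(d), args=(s0,), method="L-BFGS-B",
               bounds=[(-2.0, 2.0)] * d)
a_first = res.x[0]                   # execute only the first action, then re-plan
print(res.success, round(a_first, 2))
```

Starting from positive position, the first planned force should be negative (braking toward the origin).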

The three programs above are, respectively: certainty-equivalent, open-loop, and hindsight optimization.

Guiding Questions

  • What tools do we have to solve MDPs with continuous \(S\) and \(A\)?