Continuous Space MDPs

Last Time

  • What are the differences between online and offline solutions?

  • Are there solution techniques that are independent of the state space size?

Guiding Questions

  • What tools do we have to solve MDPs with continuous \(S\) and \(A\)?

Current Tool-Belt

Continuous \(S\) and \(A\)

e.g. \(S \subseteq \mathbb{R}^n\), \(A \subseteq \mathbb{R}^m\)

The old rules still work!

Today: Four Tools

1. Linear Dynamics, Quadratic Reward
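For intuition, here is a minimal sketch of the linear-quadratic case in one dimension. It is not taken from the slides: it assumes scalar dynamics s' = a*s + b*u + w with zero-mean noise w, reward -(q*s^2 + r*u^2), a finite undiscounted horizon, and no terminal reward. The function name `scalar_lqr_gains` and all parameter names are hypothetical.

```python
def scalar_lqr_gains(a, b, q, r, horizon):
    """Backward Riccati recursion for s' = a*s + b*u + w, reward -(q*s^2 + r*u^2).

    Returns (gains, p): the policy at step t is u = -gains[t] * s, and p * s^2
    is the expected cost-to-go from the first step, up to a noise-only constant.
    """
    p = 0.0  # assumed: no cost after the final step
    gains = []
    for _ in range(horizon):
        k = a * b * p / (r + b * b * p)   # feedback gain for this step
        p = q + a * a * p - a * b * p * k  # Riccati update of the quadratic coefficient
        gains.append(k)
    gains.reverse()  # gains were built from the last step backward
    return gains, p


gains, p = scalar_lqr_gains(a=1.0, b=1.0, q=1.0, r=1.0, horizon=50)
```

Under these assumptions the optimal policy is linear in the state, and the zero-mean additive noise does not change the gains.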

If not Linear Quadratic...

Offline:

  • Approximate Dynamic Programming (ADP)
  • Policy Search

Online:

  • Model Predictive Control (MPC)
  • Sparse Tree Search/Progressive Widening


2. Value Function Approximation

\(V_\theta (s) = f_\theta (s)\)      (e.g. neural network)

\(V_\theta (s) = \theta^\top \beta(s)\)      (linear in the features \(\beta(s)\))

Fitted Value Iteration

while not converged

    \(\theta \gets \theta'\)

    \(\hat{V}' \gets B_{\text{approx}}[V_\theta]\)

    \(\theta' \gets \text{fit}(\hat{V}')\)

\[B_{\text{MC}(N)} [V_\theta ](s) = \max_a \left(R(s, a) + \gamma \frac{1}{N} \sum_{i = 1}^N V_\theta(G(s, a, w_i))\right)\]
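The loop and the Monte Carlo backup above can be sketched as follows. This is an illustration under assumptions, not the lecture's implementation. It uses a local approximator (piecewise-linear interpolation on a 1-D grid), so `fit` reduces to storing the backed-up value at each grid point, and it runs a fixed number of iterations instead of a convergence check. The toy problem (`R`, `G`, `grid`, `actions`) and all function names are made up for the example.

```python
import random


def interpolate(grid, theta, s):
    """Piecewise-linear V_theta(s); theta[i] is the value stored at grid[i]."""
    s = min(max(s, grid[0]), grid[-1])
    for i in range(len(grid) - 1):
        if s <= grid[i + 1]:
            w = (s - grid[i]) / (grid[i + 1] - grid[i])
            return (1 - w) * theta[i] + w * theta[i + 1]
    return theta[-1]


def mc_backup(grid, theta, s, actions, R, G, gamma, n, rng):
    """B_MC(N)[V_theta](s): max over actions of R plus the discounted sample mean."""
    best = float("-inf")
    for a in actions:
        total = 0.0
        for _ in range(n):
            total += interpolate(grid, theta, G(s, a, rng.gauss(0, 1)))
        best = max(best, R(s, a) + gamma * total / n)
    return best


def fitted_value_iteration(grid, actions, R, G, gamma, n=20, iters=50, seed=0):
    rng = random.Random(seed)
    theta = [0.0] * len(grid)
    for _ in range(iters):  # fixed iteration count stands in for "while not converged"
        # Back up at every grid point; for this local approximator, that is the fit.
        theta = [mc_backup(grid, theta, s, actions, R, G, gamma, n, rng) for s in grid]
    return theta


# Hypothetical toy problem: stay near the origin on the interval [-1, 1].
grid = [i / 10 - 1 for i in range(21)]
actions = [-0.1, 0.0, 0.1]
R = lambda s, a: -s * s
G = lambda s, a, w: min(max(s + a + 0.02 * w, -1.0), 1.0)
theta = fitted_value_iteration(grid, actions, R, G, gamma=0.9)
```

With a global approximator such as a neural network or Fourier basis, the list comprehension would instead produce regression targets, and a separate fitting step would update the parameters.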

Function Approximation

  • Global: (e.g. Fourier, neural network)
  • Local: (e.g. simplex interpolation)

Multilinear interpolation: weighting of \(2^d\) points

Simplex interpolation: weighting of only \(d+1\) points!
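To make the point counts concrete, the sketch below computes interpolation weights for a point inside the unit cube. The slides name simplex interpolation without fixing a scheme, so the choice here is an assumption: the simplex version uses the coordinate-sorting (Kuhn/Freudenthal) triangulation. Both function names are hypothetical.

```python
from itertools import product


def multilinear_weights(x):
    """Weights on all 2^d corners of the unit cube for a point x in [0, 1]^d."""
    weights = {}
    for corner in product((0, 1), repeat=len(x)):
        w = 1.0
        for xi, ci in zip(x, corner):
            w *= xi if ci else 1.0 - xi
        weights[corner] = w
    return weights


def simplex_weights(x):
    """Weights on only d+1 corners, found by sorting coordinates in descending order."""
    order = sorted(range(len(x)), key=lambda i: -x[i])
    corner = [0] * len(x)
    weights = {}
    prev = 1.0
    for i in order:
        weights[tuple(corner)] = prev - x[i]  # gap between consecutive sorted coordinates
        prev = x[i]
        corner[i] = 1  # step to the next vertex of the enclosing simplex
    weights[tuple(corner)] = prev
    return weights


x = [0.7, 0.2, 0.5]
print(len(multilinear_weights(x)), len(simplex_weights(x)))
```

In both cases the weights are nonnegative, sum to one, and reproduce the query point as a weighted average of the corners. The interpolated value is the same weighted average of the values stored at those corners.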

Function Approximation: Mountain Car

(Fourier, 17 params)

(Polynomial, 28 params)

(Kernel, > 100 params)


Policy Search

\[\underset{\theta}{\text{maximize}} \quad U(\pi_\theta)\]

Common Approaches

  • Evolutionary Algorithms
  • Cross Entropy Method
  • Policy Gradient (will cover in RL section)
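As one illustration of the approaches listed above, here is a minimal sketch of the cross entropy method with an independent Gaussian search distribution over \(\theta\). The utility function `rollout_utility` is a hypothetical stand-in for \(U(\pi_\theta)\): it rolls out the linear policy a = -theta[0] * s on an assumed deterministic 1-D system. None of the names or constants come from the slides.

```python
import random
import statistics


def cross_entropy_method(U, mean, std, iters=30, pop=50, n_elite=10, seed=0):
    """Maximize U(theta) by repeatedly refitting a Gaussian to the elite samples."""
    rng = random.Random(seed)
    mean, std = list(mean), list(std)
    for _ in range(iters):
        samples = [[rng.gauss(m, s) for m, s in zip(mean, std)] for _ in range(pop)]
        elite = sorted(samples, key=U, reverse=True)[:n_elite]
        for j in range(len(mean)):
            column = [theta[j] for theta in elite]
            mean[j] = statistics.fmean(column)
            std[j] = statistics.pstdev(column) + 1e-6  # small floor avoids a zero std
    return mean


def rollout_utility(theta, horizon=20):
    """Assumed toy problem: s' = s + a, reward -(s^2 + 0.1 a^2), start at s = 1."""
    s, total = 1.0, 0.0
    for _ in range(horizon):
        a = -theta[0] * s
        total += -(s * s + 0.1 * a * a)
        s = s + a
    return total


theta_best = cross_entropy_method(rollout_utility, mean=[0.0], std=[1.0])
```

For a stochastic problem, `U` would typically average the return over several sampled rollouts, which makes the elite selection noisy.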

Break

What will a Monte Carlo Tree Search tree look like if run on a problem with continuous spaces?

3. Sparse Tree Search/Progressive Widening

add new branch if \(C < k N^\alpha\)     (\(\alpha < 1\)), where \(C\) is the node's current number of children and \(N\) is its visit count
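The widening rule can be sketched in a few lines. The function names and the default values of `k` and `alpha` are assumptions chosen for illustration. The simulation only counts branches and does not build a real search tree.

```python
def should_widen(num_children, num_visits, k=1.0, alpha=0.5):
    """Progressive widening test: True when C < k * N**alpha."""
    return num_children < k * num_visits ** alpha


def simulate_widening(total_visits, k=1.0, alpha=0.5):
    """Track how many children a single node has after each visit."""
    children = 0
    history = []
    for visits in range(1, total_visits + 1):
        if should_widen(children, visits, k, alpha):
            children += 1  # in a real tree: sample a new action or next state here
        history.append(children)
    return history


history = simulate_widening(100)
```

Because \(\alpha < 1\), the number of children grows sublinearly in the visit count, so existing branches keep receiving visits and the tree can grow deeper rather than only wider.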

4. Model Predictive Control

\[\underset{a_{1:d},s_{1:d}}{\text{maximize}} \quad \sum_{t=1}^d \gamma^t R(s_t, a_t)\]

\[\text{subject to} \quad s_{t+1} = \text{E}[T(s_t, a_t)] \quad \forall t\]

\[\underset{a_{1:d},s^{(1:m)}_{1:d}}{\text{maximize}} \quad \frac{1}{m} \sum_{i=1}^m \sum_{t=1}^d \gamma^t R(s^{(i)}_t, a_t)\]

\[\text{subject to} \quad s^{(i)}_{t+1} = G(s_t^{(i)}, a_t, w_t^{(i)}) \quad \forall t, i\]

\[\underset{a_{1:d}^{(1:m)},s^{(1:m)}_{1:d}}{\text{maximize}} \quad \frac{1}{m} \sum_{i=1}^m \sum_{t=1}^d \gamma^t R(s^{(i)}_t, a_t^{(i)})\]

\[\text{subject to} \quad s^{(i)}_{t+1} = G(s_t^{(i)}, a_t^{(i)}, w_t^{(i)}) \quad \forall t, i\]

\[a_1^{(i)} = a_1^{(j)} \quad \forall i, j\]

(Use off-the-shelf optimization software, e.g. Ipopt)

In order, the three formulations above are:

  • Certainty-Equivalent
  • Open-Loop
  • Hindsight Optimization
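A minimal sketch of the certainty-equivalent variant follows. It deliberately swaps the nonlinear programming solver for plain random shooting so that it needs only the standard library. This is a cruder optimizer than Ipopt and is used here only to show the structure: simulate the expected dynamics, score each action sequence, and return the first action. The function name, `expected_step`, and the toy problem are assumptions.

```python
import random


def certainty_equivalent_mpc(s, expected_step, R, depth, gamma, bounds,
                             n_candidates=500, rng=random):
    """Return the first action of the best sampled open-loop action sequence."""
    lo, hi = bounds
    best_value, best_action = float("-inf"), None
    for _ in range(n_candidates):
        actions = [rng.uniform(lo, hi) for _ in range(depth)]
        value, state = 0.0, s
        for t, a in enumerate(actions, start=1):
            value += gamma ** t * R(state, a)   # objective: sum over t of gamma^t R(s_t, a_t)
            state = expected_step(state, a)     # constraint: s_{t+1} = E[T(s_t, a_t)]
        if value > best_value:
            best_value, best_action = value, actions[0]
    return best_action


# Hypothetical toy problem: drive the state toward zero with bounded actions.
expected_step = lambda s, a: s + a
R = lambda s, a: -(s * s + 0.01 * a * a)

s = 1.0
for _ in range(5):  # receding horizon: re-plan from the new state at every step
    a = certainty_equivalent_mpc(s, expected_step, R, depth=3, gamma=0.95,
                                 bounds=(-0.5, 0.5), rng=random.Random(0))
    s = expected_step(s, a)
```

Only the first action of each plan is executed. The open-loop and hindsight variants would replace the single expected trajectory with m sampled trajectories from G.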

Guiding Questions

  • What tools do we have to solve MDPs with continuous \(S\) and \(A\)?

080 Continuous MDPs

By Zachary Sunberg
