Continuous Space MDPs

Last Time

  • Neural Network Function Approximation

Guiding Questions

  • What tools do we have to solve MDPs with continuous \(S\) and \(A\)?

Current Tool-Belt

Today: Four Tools

Notation: Continuous Random Variables

| Term | Definition | Coinflip Example: \(\text{Bernoulli}(0.5)\) | Uniform Example: \(\mathcal{U}([0,1])\) |
| --- | --- | --- | --- |
| support(\(X\)) | All the values that \(X\) can take (we write \(x \in X\)) | \(\{h, t\}\) or \(\{0,1\}\) | \([0,1]\) |
| Distribution | Maps each value in the support to a real number indicating its probability. Discrete: PMF; Continuous: PDF | \(P(X=1) = 0.5\), \(P(X=0) = 0.5\); \(P(X)\) is a table: \(P(0) = 0.5\), \(P(1) = 0.5\) | \(p(x) = \mathbf{1}_{[0,1]}(x)\) (1 if \(x \in [0,1]\), 0 otherwise); \(P(X = 0.5) = 0\); \(P(X \in [a, b]) = \int_a^b p(x)\,dx\) |
| Expectation | First moment of the random variable, the "mean": \(E[X]\) | \(E[X] = \sum_{x \in X} x\,P(x) = 0.5\) | \(E[X] = \int_{x \in X} x\,p(x)\,dx = 0.5\) |
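These definitions are easy to check numerically; a minimal sketch (grid sizes are arbitrary):

```python
import numpy as np

def trapezoid(y, x):
    """Trapezoidal rule: approximate the integral of samples y over grid x."""
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x) / 2.0))

# Coinflip: discrete PMF over support {0, 1}
support = np.array([0.0, 1.0])
pmf = np.array([0.5, 0.5])
e_coin = float(np.sum(support * pmf))       # E[X] = sum_x x P(x) = 0.5

# Uniform([0,1]): continuous PDF p(x) = 1 on [0, 1]
xs = np.linspace(0.0, 1.0, 1001)
pdf = np.ones_like(xs)
e_unif = trapezoid(xs * pdf, xs)            # E[X] = ∫ x p(x) dx = 0.5

# P(X in [a, b]) = ∫_a^b p(x) dx; any single point has probability 0
a, b = 0.25, 0.75
xs_ab = np.linspace(a, b, 501)
prob_ab = trapezoid(np.ones_like(xs_ab), xs_ab)   # = b - a = 0.5

print(e_coin, round(e_unif, 6), round(prob_ab, 6))
```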

Rules for Continuous RVs

Discrete:

1) a) \(0 \leq P(X \mid Y) \leq 1\)     b) \(\sum_{x \in X} P(x \mid Y) = 1\)

2) \(P(X) = \sum_{y \in Y} P(X, y)\)

3) \(P(X \mid Y) = \frac{P(X, Y)}{P(Y)}\), equivalently \(P(X, Y) = P(X \mid Y) \, P(Y)\)

Continuous:

1) a) \(0 \leq p(X \mid Y)\) (no upper bound of 1: a density can exceed 1)     b) \(\int_X p(x \mid Y) \, dx = 1\)

2) \(p(X) = \int_{Y} p(X, y) \, dy\)

3) \(p(X \mid Y) = \frac{p(X, Y)}{p(Y)}\), equivalently \(p(X, Y) = p(X \mid Y) \, p(Y)\)
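The continuous rules can be verified numerically for a simple joint density; a sketch, with \(p(x, y) = x + y\) on the unit square chosen for illustration:

```python
import numpy as np

def trapezoid(y, x):
    """Trapezoidal rule: approximate the integral of samples y over grid x."""
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x) / 2.0))

xs = np.linspace(0.0, 1.0, 501)
# Joint density p(x, y) = x + y on [0,1]^2 (a valid density: it integrates to 1)
joint = xs[:, None] + xs[None, :]           # joint[i, j] = p(xs[i], xs[j])

# Rule 2: marginal p(x) = ∫ p(x, y) dy  (analytically x + 1/2)
marginal = np.array([trapezoid(joint[i], xs) for i in range(len(xs))])

# Rule 1b: the marginal (and hence the joint) integrates to 1
total = trapezoid(marginal, xs)

# Rule 3: conditional p(x | y = 0.5) = p(x, 0.5) / p(y = 0.5)
j = len(xs) // 2                            # grid index of y = 0.5
p_y = trapezoid(joint[:, j], xs)            # p(y = 0.5) = ∫ p(x, 0.5) dx = 1
cond = joint[:, j] / p_y                    # also integrates to 1

print(round(total, 6), round(trapezoid(cond, xs), 6))  # prints: 1.0 1.0
```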

Multivariate Gaussian Distribution

Joint Distribution

Conditional Distribution

Marginal Distribution

Continuous \(S\) and \(A\)

e.g. \(S \subseteq \mathbb{R}^n\), \(A \subseteq \mathbb{R}^m\)

The old rules still work!

1. Linear Dynamics, Quadratic Reward

\(S = \mathbb{R}^n\), \(A = \mathbb{R}^m\)

\(w \sim \mathcal{N}(0, \Sigma)\)

\(T(s' \mid s, a) = \mathcal{N}(T_s s + T_a a, \Sigma)\)

\(s' = T_s s + T_a a + w \)

\(R(s, a) = s^\top R_s s + a^\top R_a a\)      (with \(R_s \preceq 0\), \(R_a \prec 0\), so the reward penalizes state and action)

\[U_h^*(s) = \max_{\pi} E\left[ \sum_{t=0}^{h} R(s_t, a_t) \right]\]

Finite Horizon:

\(\pi^*_h\) is "optimal h-step policy"

We will show that    \(U_h^*(s) = s^\top V_h s + q_h\)     and    \(\pi^*_h(s) = -K_h s\)

(Also works with other zero-mean \(w\).)

1. Linear Dynamics, Quadratic Reward

We will show that    \(U_h^*(s) = s^\top V_h s + q_h\)     and    \(\pi^*_h(s) = -K_h s\)

by induction.

Base: \(U^*_1(s) = \max_{a} \left( s^\top R_s s + a^\top R_a a \right) = s^\top R_s s\)      (maximized at \(a = 0\) since \(R_a \prec 0\))

Inductive step: show that    if     \(U^*_t = s^\top V_t s + q_t\),     then     \(U^*_{t+1} = s^\top V_{t+1} s + q_{t+1}\).

\(U^*_{t+1}(s) = \max_{a} \left( R(s, a) + E\left[ U^*_t(s') \right] \right)\)

\(= \max_{a} \left( s^\top R_s s + a^\top R_a a + \int p(w)\, U^*_t(T_s s + T_a a + w)\, dw \right)\)

\(= s^\top R_s s + \max_{a} \left( a^\top R_a a + \int p(w)\left[ (T_s s + T_a a + w)^\top V_t (T_s s + T_a a + w) + q_t \right] dw \right)\)

\(= s^\top R_s s + s^\top T_s^\top V_t T_s s + \max_{a} \left( a^\top R_a a + 2s^\top T_s^\top V_t T_a a + a^\top T_a^\top V_t T_a a \right) + \int p(w) w^\top V_t w  dw + q_t\)

\(U^*_{t+1}(s) = s^\top \underbrace{\left( R_s + T_s^\top V_t T_s - (T_a^\top V_t T_s)^\top (R_a + T_a^\top V_t T_a)^{-1} (T_a^\top V_t T_s) \right)}_{V_{t+1}} s + \underbrace{\int p(w) w^\top V_t w  dw + q_t}_{q_{t+1}}\)

\(U^*_{t+1}(s) = s^\top V_{t+1} s + q_{t+1} \qquad \square\)

\(a^*\) is where \(\nabla_a (\text{max term}) = 0\)

\(0 = 2R_a a^* + 2T_a^\top V_t T_s s + 2T_a^\top V_t T_a a^*\)

\(a^* = -\underbrace{(R_a + T_a^\top V_t T_a)^{-1} T_a^\top V_t T_s}_{K_t} s\)
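The recursion above can be iterated numerically. A minimal sketch for an illustrative double-integrator system (the matrices are assumptions, with \(R_s, R_a\) negative definite so the maximization is well-posed):

```python
import numpy as np

# Toy double integrator (illustrative): s = [position, velocity], a = force
Ts = np.array([[1.0, 0.1],
               [0.0, 1.0]])
Ta = np.array([[0.0],
               [0.1]])
Rs = -np.diag([1.0, 0.1])    # negative definite: reward penalizes state
Ra = -np.array([[0.5]])      # negative definite: reward penalizes action

V = Rs                       # base case: V_1 = R_s
for _ in range(500):         # inductive step, iterated toward V_inf
    M = Ta.T @ V @ Ts                           # T_a^T V_t T_s
    K = np.linalg.solve(Ra + Ta.T @ V @ Ta, M)  # K_t
    V = Rs + Ts.T @ V @ Ts - M.T @ K            # V_{t+1}

print(np.round(K, 3))        # optimal policy: pi*(s) = -K s
```

A quick sanity check: the closed-loop dynamics \(T_s - T_a K\) should be stable (all eigenvalues inside the unit circle).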

1. Linear Dynamics, Quadratic Reward

As \(h \to \infty\)

\(V_\infty = T_s^\top \left( V_\infty - V_\infty T_a \left( T_a^\top V_\infty T_a + R_a \right)^{-1} T_a^\top V_\infty \right) T_s + R_s\)

\(K_\infty = \left( T_a^\top V_\infty T_a + R_a \right)^{-1} T_a^\top V_\infty T_s\)

\(\pi^*_\infty (s) = -K_\infty s\)

(\(K_\infty\) has no dependence on \(\Sigma\))

Certainty-Equivalence Principle: For Linear-Quadratic problems, the optimal policy with noise is the same as the optimal policy without noise!

Practical Implication: If a continuous problem has roughly linear dynamics, a convex cost function, and roughly zero-mean additive noise, you can use certainty-equivalent control, i.e. control as if there is no noise.

If not Linear Quadratic...

Offline:

  • Approximate Dynamic Programming (ADP)
  • Policy Search

Online:

  • Model Predictive Control (MPC)
  • Sparse Tree Search/Progressive Widening


2. Value Function Approximation

\(V_\theta (s) = f_\theta (s)\)      (e.g. neural network)

\(V_\theta (s) = \theta^\top \beta(s)\)      (linear in features \(\beta(s)\))

Fitted Value Iteration

\(\theta' \gets\) initial parameters

while not converged

    \(\theta \gets \theta'\)

    \(\hat{V}' \gets B_{\text{approx}}[V_\theta]\)

    \(\theta' \gets \text{fit}(\hat{V}')\)

\[B_{\text{MC}(N)} [V_\theta ](s) = \max_a \left(R(s, a) + \frac{\gamma}{N} \sum_{i = 1}^N V_\theta(G(s, a, w_i))\right)\]
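Putting the loop and the Monte Carlo backup together for a toy 1-D problem (the generative model \(G\), reward, and features here are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.95
actions = np.linspace(-1.0, 1.0, 5)

def G(s, a, w):                       # assumed generative model: noisy 1-D dynamics
    return np.clip(0.9 * s + 0.1 * a + w, -1.0, 1.0)

def R(s, a):                          # assumed reward: quadratic penalty
    return -s**2 - 0.1 * a**2

def beta(s):                          # linear features: [1, s, s^2]
    return np.array([np.ones_like(s), s, s**2])

theta = np.zeros(3)
S = np.linspace(-1.0, 1.0, 41)        # states to fit at
for _ in range(50):                   # fitted value iteration
    W = rng.normal(0.0, 0.05, size=10)
    # Monte Carlo backup: B[V](s) = max_a ( R(s,a) + gamma * mean_i V(G(s,a,w_i)) )
    targets = np.max([R(S, a) + gamma * np.mean(
        [theta @ beta(G(S, a, w)) for w in W], axis=0) for a in actions], axis=0)
    theta, *_ = np.linalg.lstsq(beta(S).T, targets, rcond=None)  # fit step

print(np.round(theta, 2))
```

Since the reward is a negative quadratic, the fitted quadratic coefficient \(\theta_3\) should come out negative.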

Function Approximation: Mountain Car

[Plots: Mountain Car value function approximated with a Fourier basis (17 params), polynomials (28 params), and a kernel method (> 100 params).]

Function Approximation

  • Global: each parameter affects the approximation everywhere (e.g. Fourier basis, neural network)
  • Local: each parameter affects the approximation only nearby (e.g. simplex interpolation)

Multilinear grid interpolation is a weighting of \(2^d\) points; simplex interpolation is a weighting of only \(d+1\) points!


Policy Search

\[\underset{\theta}{\text{maximize}} \quad U(\pi_\theta)\]

Common Approaches

  • Evolutionary Algorithms
  • Cross Entropy Method
  • Policy Gradient (will cover in RL section)
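As an example of the second approach, a minimal Cross Entropy Method sketch for a 1-D linear policy on illustrative toy dynamics: sample parameters from a Gaussian, score each by rollout return, and refit the Gaussian to the elite samples.

```python
import numpy as np

rng = np.random.default_rng(0)

def U(theta):
    """Return of the linear policy a = -theta[0] * s on toy 1-D dynamics (assumed)."""
    s, total = 1.0, 0.0
    for _ in range(20):
        a = -theta[0] * s
        s = 0.9 * s + 0.1 * a + rng.normal(0.0, 0.01)
        total += -s**2 - 0.01 * a**2
    return total

# Cross Entropy Method: maximize U(pi_theta) over theta
mu, sigma = np.zeros(1), np.ones(1)
for _ in range(30):
    samples = rng.normal(mu, sigma, size=(100, 1))   # sample policy parameters
    scores = np.array([U(th) for th in samples])
    elites = samples[np.argsort(scores)[-10:]]       # keep the top 10%
    mu, sigma = elites.mean(axis=0), elites.std(axis=0) + 1e-3

print(np.round(mu, 2))   # mean of the fitted Gaussian = estimated best gain
```

A positive gain drives the state toward zero faster than it decays on its own, so the search should settle on some \(\theta_1 > 0\).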

Break

What will a Monte Carlo Tree Search tree look like if run on a problem with continuous spaces?

3. Sparse Tree Search/Progressive Widening

add a new child branch if \(C < k N^\alpha\) (\(\alpha < 1\)), where \(C\) is the node's current number of children and \(N\) its visit count
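In code, this criterion gates whether a tree node expands a new branch or revisits an existing one; a minimal sketch (the constants \(k\) and \(\alpha\) are tuning parameters):

```python
def should_widen(num_children: int, num_visits: int,
                 k: float = 4.0, alpha: float = 0.5) -> bool:
    """Add a new child only while C < k * N^alpha (alpha < 1), so the
    branching factor grows sublinearly with the node's visit count."""
    return num_children < k * num_visits ** alpha

# Early on, new branches are added freely; later, existing ones are revisited.
print(should_widen(2, 1), should_widen(8, 4))   # prints: True False
```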

4. Model Predictive Control

\[\underset{a_{1:d},s_{1:d}}{\text{maximize}} \quad \sum_{t=1}^d \gamma^t R(s_t, a_t)\]

\[\text{subject to} \quad s_{t+1} = \text{E}_{s' \sim T(s' \mid s_t, a_t)}[s'] \quad \forall t\]

\[\underset{a_{1:d},s^{(1:m)}_{1:d}}{\text{maximize}} \quad \frac{1}{m} \sum_{i=1}^m \sum_{t=1}^d \gamma^t R(s^{(i)}_t, a_t)\]

\[\text{subject to} \quad s^{(i)}_{t+1} = G(s_t^{(i)}, a_t, w_t^{(i)}) \quad \forall t, i\]

\[\underset{a_{1:d}^{(1:m)},s^{(1:m)}_{1:d}}{\text{maximize}} \quad \frac{1}{m} \sum_{i=1}^m \sum_{t=1}^d \gamma^t R(s^{(i)}_t, a_t^{(i)})\]

\[\text{subject to} \quad s^{(i)}_{t+1} = G(s_t^{(i)}, a_t^{(i)}, w_t^{(i)}) \quad \forall t, i\]

\[a_1^{(i)} = a_1^{(j)} \quad \forall i, j\]

(Use off-the-shelf optimization software, e.g. Ipopt)
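A single certainty-equivalent MPC step can indeed be handed to a generic optimizer; a sketch using scipy rather than Ipopt (the dynamics and reward are illustrative, and only the first planned action is executed before re-planning):

```python
import numpy as np
from scipy.optimize import minimize

gamma, d = 0.95, 10
Ts = np.array([[1.0, 0.1],
               [0.0, 1.0]])          # illustrative linear dynamics
Ta = np.array([[0.0],
               [0.1]])

def neg_return(actions, s0):
    """Negative discounted return of an open-loop plan (for a minimizer)."""
    s, total = s0, 0.0
    for t, a in enumerate(actions):
        # certainty-equivalent: propagate the mean next state, ignoring noise
        s = Ts @ s + Ta.flatten() * a
        total += gamma**(t + 1) * (-(s @ s) - 0.1 * a**2)
    return -total

s0 = np.array([1.0, 0.0])            # start at position 1, velocity 0
res = minimize(neg_return, np.zeros(d), args=(s0,), method="L-BFGS-B",
               bounds=[(-2.0, 2.0)] * d)
a_first = res.x[0]                   # execute only the first action, then re-plan
print(res.success, round(a_first, 2))
```

Starting from positive position, the first planned force should be negative (braking toward the origin).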

The three programs above are, respectively: certainty-equivalent, open-loop, and hindsight optimization.

Guiding Questions

  • What tools do we have to solve MDPs with continuous \(S\) and \(A\)?