What are the differences between online and offline solutions?
Are there solution techniques that are independent of the state space size?
What tools do we have to solve MDPs with continuous \(S\) and \(A\)?
e.g. \(S \subseteq \mathbb{R}^n\), \(A \subseteq \mathbb{R}^m\)
The old rules still work!
Offline: compute a policy (or value function) over the whole state space before acting.
Online: plan from the current state while interacting with the environment.
Approximate the value function with a parametric form \(V_\theta\):
\(V_\theta (s) = f_\theta (s)\) (e.g. neural network)
\(V_\theta (s) = \theta^\top \beta(s)\) (linear combination of basis features \(\beta(s)\))
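As a minimal sketch of the linear form, assuming a scalar state and a hypothetical quadratic basis (the basis choice is illustrative, not from the source):

```python
import numpy as np

# Linear value approximation V_theta(s) = theta^T beta(s) for a scalar
# state, with a hypothetical quadratic basis beta(s) = [1, s, s^2].
def beta(s):
    return np.array([1.0, s, s**2])

def V(theta, s):
    return theta @ beta(s)

theta = np.array([0.5, -1.0, 2.0])
value = V(theta, 3.0)      # 0.5 - 3.0 + 2.0 * 9.0 = 15.5
```

The same interface works for the neural-network case: only `V` changes, not the code that calls it.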
Fitted Value Iteration
while not converged
\(\hat{V}' \gets B_{\text{approx}}[V_\theta]\) (backup at sampled states)
\(\theta' \gets \text{fit}(\hat{V}')\) (refit the parameters to the backed-up values)
\(\theta \gets \theta'\)
\[B_{\text{MC}(N)} [V_\theta ](s) = \max_a \left(R(s, a) + \gamma \frac{1}{N} \sum_{i = 1}^N V_\theta(G(s, a, w_i))\right)\]
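A sketch of this sample-based backup inside the fitted value iteration loop, on a toy 1-D problem (the generative model `G`, reward `R`, feature basis, and action set below are illustrative assumptions, not from the source):

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.9

# Toy generative model and reward (illustrative assumptions):
def G(s, a, w): return 0.8 * s + a + w     # sampled next state
def R(s, a):    return -abs(s)

def beta(s):                               # features for V_theta = theta^T beta
    return np.array([1.0, abs(s)])

def backup(theta, s, N=20):
    # B_MC(N): max over actions of reward plus the discounted *average*
    # of V_theta at N sampled next states.
    best = -np.inf
    for a in (-1.0, 1.0):
        w = rng.normal(scale=0.1, size=N)
        v = np.mean([theta @ beta(G(s, a, wi)) for wi in w])
        best = max(best, R(s, a) + gamma * v)
    return best

# Fitted value iteration: back up at sample states, then refit theta
# by least squares (the "fit" step).
theta = np.zeros(2)
S = np.linspace(-5, 5, 21)
for _ in range(30):
    targets = np.array([backup(theta, s) for s in S])
    Phi = np.array([beta(s) for s in S])
    theta = np.linalg.lstsq(Phi, targets, rcond=None)[0]
```

After the sweeps, the fitted slope on \(|s|\) is negative, reflecting that states far from the origin are worth less under this reward.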
Multilinear interpolation: weighting of \(2^d\) points
Simplex interpolation: weighting of only \(d+1\) points!
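A sketch of how the \(d+1\)-point weighting can be computed over the unit hypercube via the Kuhn triangulation (an illustrative implementation; the function name and details are assumptions, not from the source):

```python
import numpy as np

def simplex_weights(f):
    # Simplex interpolation over the unit hypercube (Kuhn triangulation):
    # the point with fractional coordinates f is written as a convex
    # combination of only d+1 cube vertices, vs 2^d for multilinear.
    f = np.asarray(f, dtype=float)
    d = len(f)
    order = np.argsort(-f)                 # coordinates in descending order
    verts = [np.zeros(d, dtype=int)]
    for k in order:                        # walk along the simplex edges
        v = verts[-1].copy(); v[k] = 1
        verts.append(v)
    fs = f[order]
    w = np.empty(d + 1)
    w[0] = 1.0 - fs[0]
    w[1:d] = fs[:-1] - fs[1:]              # successive gaps between coords
    w[d] = fs[-1]
    return verts, w                        # weights are nonnegative, sum to 1

verts, w = simplex_weights([0.7, 0.2])
```

The weighted sum of the returned vertices reproduces the query point exactly, so the scheme is a proper interpolation.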
(Figure: comparison of global approximators — Fourier, 17 params; Polynomial, 28 params; Kernel, > 100 params)
\[\underset{\theta}{\text{maximize}} \quad U(\pi_\theta)\]
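One way to attack this objective is derivative-free search over \(\theta\), estimating \(U(\pi_\theta)\) by Monte Carlo rollouts. The sketch below uses a toy linear-gain policy and dynamics (all of which are illustrative assumptions, not from the source):

```python
import numpy as np

rng = np.random.default_rng(1)
gamma = 0.95

def U(theta, n_rollouts=30, depth=30):
    # Monte Carlo estimate of the expected discounted return of the
    # linear policy pi_theta(s) = theta * s, under toy dynamics
    # s' = s + a + w and reward -s^2 (illustrative assumptions).
    total = 0.0
    for _ in range(n_rollouts):
        s = rng.normal(scale=2.0)          # random initial state
        for t in range(depth):
            a = theta * s
            total += gamma**t * -(s**2)
            s = s + a + rng.normal(scale=0.1)
    return total / n_rollouts

# Derivative-free policy search: pick the best gain from a candidate grid.
candidates = np.linspace(-1.5, 0.5, 9)
best_theta = max(candidates, key=U)
```

Grid search stands in for any black-box optimizer here; the point is that only rollout returns, not gradients of the dynamics, are needed.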
Common Approaches
What will a Monte Carlo Tree Search tree look like if run on a problem with continuous spaces?
add a new branch if \(C < k N^\alpha\), with \(\alpha < 1\), where \(C\) is the node's number of children and \(N\) its visit count (progressive widening)
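A minimal sketch of this widening test at a single node (the constants and the uniform action sampler are illustrative assumptions):

```python
import random

def maybe_widen(children, n_visits, k=2.0, alpha=0.5):
    # Progressive widening: add a new child action only while the number
    # of children C stays below k * N**alpha (alpha < 1), so the branching
    # factor grows sublinearly with the node's visit count N.
    if len(children) < k * n_visits**alpha:
        children.append(random.random())   # sample a fresh continuous action
    return children

children = []
for n in range(1, 101):                    # simulate 100 visits to one node
    maybe_widen(children, n)
```

The child count tracks \(kN^\alpha\) rather than growing with every visit, which keeps the tree finite despite the continuous action space.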
\[\underset{a_{1:d},s_{1:d}}{\text{maximize}} \quad \sum_{t=1}^d \gamma^t R(s_t, a_t)\]
\[\text{subject to} \quad s_{t+1} = \mathbb{E}[T(s_t, a_t)] \quad \forall t\]
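A sketch of certainty-equivalent planning on a toy problem, with the expected-transition constraint substituted directly into the rollout (the dynamics \(s' = s + a\), reward \(-s^2 - 0.1a^2\), and the crude optimizer are all illustrative assumptions; in practice one would hand this to a solver):

```python
import numpy as np

gamma, d, s0 = 0.95, 10, 5.0

def ce_return(actions):
    # Roll out the mean (certainty-equivalent) dynamics s' = s + a and
    # accumulate the discounted reward R(s, a) = -s^2 - 0.1 a^2.
    s, total = s0, 0.0
    for t, a in enumerate(actions, start=1):
        total += gamma**t * (-(s**2) - 0.1 * a**2)
        s = s + a
    return total

# Crude finite-difference gradient ascent stands in for a real solver.
a = np.zeros(d)
for _ in range(500):
    grad = np.zeros(d)
    for k in range(d):
        e = np.zeros(d); e[k] = 1e-5
        grad[k] = (ce_return(a + e) - ce_return(a - e)) / 2e-5
    a += 0.01 * grad
```

Because the stochastic transition is replaced by its mean, this is a single deterministic trajectory optimization, which is what makes the problem tractable for standard NLP solvers.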
\[\underset{a_{1:d},s^{(1:m)}_{1:d}}{\text{maximize}} \quad \frac{1}{m} \sum_{i=1}^m \sum_{t=1}^d \gamma^t R(s^{(i)}_t, a_t)\]
\[\text{subject to} \quad s^{(i)}_{t+1} = G(s_t^{(i)}, a_t, w_t^{(i)}) \quad \forall t, i\]
\[\underset{a_{1:d}^{(1:m)},s^{(1:m)}_{1:d}}{\text{maximize}} \quad \frac{1}{m} \sum_{i=1}^m \sum_{t=1}^d \gamma^t R(s^{(i)}_t, a_t^{(i)})\]
\[\text{subject to} \quad s^{(i)}_{t+1} = G(s_t^{(i)}, a_t^{(i)}, w_t^{(i)}) \quad \forall t, i\]
\[a_1^{(i)} = a_1^{(j)} \quad \forall i, j\]
(Use off-the-shelf optimization software, e.g. Ipopt)
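A sketch of the hindsight formulation on a toy problem: each scenario \(i\) gets its own action sequence, but the first action is shared, so only that shared action is actually committed to. All dynamics, rewards, and the crude finite-difference ascent (standing in for a solver like Ipopt) are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, d, m, s0 = 0.95, 5, 8, 5.0
w = rng.normal(scale=0.5, size=(m, d))    # pre-sampled disturbance scenarios

def hindsight_return(x):
    # x packs [shared first action a_1, then each scenario's a_2..a_d],
    # enforcing the constraint a_1^(i) = a_1^(j) by construction.
    a1, tails = x[0], x[1:].reshape(m, d - 1)
    total = 0.0
    for i in range(m):
        s = s0
        acts = np.concatenate(([a1], tails[i]))
        for t, a in enumerate(acts, start=1):
            total += gamma**t * (-(s**2) - 0.1 * a**2)
            s = s + a + w[i, t - 1]       # generative model G(s, a, w)
    return total / m

# Crude finite-difference gradient ascent stands in for a real NLP solver.
x = np.zeros(1 + m * (d - 1))
for _ in range(300):
    g = np.zeros_like(x)
    for k in range(x.size):
        e = np.zeros_like(x); e[k] = 1e-5
        g[k] = (hindsight_return(x + e) - hindsight_return(x - e)) / 2e-5
    x += 0.02 * g
```

Packing the shared first action once, rather than constraining \(m\) copies to be equal, keeps the decision variable small and makes the equality constraint trivially satisfied.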
Certainty-Equivalent
Open-Loop
Hindsight Optimization