Online Methods

Last Time

  • Does value iteration always converge?

  • Is the value function unique?

Last Time

  • Value Iteration
  • Policy Iteration

Guiding Questions

  • What are the differences between online and offline solutions?

  • Are there solution techniques that require computation time independent of the state space size?

Why Do We Need Something Else?

  • What problems might Policy and Value Iteration struggle with?

    • Path planning across the country, or interplanetary

    • More realistic car dynamics (continuous states)

  • Why are these problems hard?
    • State Space is massive (or infinite)

Curse of Dimensionality

\(n\) dimensions, \(k\) segments \(\,\rightarrow \, |\mathcal{S}| = k^n\)

1 dimension, 5 segments

\(|\mathcal{S}| = 5\)

2 dimensions, 5 segments

\(|\mathcal{S}| = 25\)

3 dimensions, 5 segments

\(|\mathcal{S}| = 125\)

Offline vs Online Solutions

Offline

  • Before Execution: find \(V^*\)/\(Q^*\)
  • During Execution: \(\pi^*(s) = \text{argmax}_a \, Q^*(s, a)\)

 

Online

  • Before Execution: (nothing)
  • During Execution: Consider actions and their consequences (all of the work happens here)
  • Why?
  • Online methods can be insensitive to the size of \(\mathcal{S}\)!

One Step Lookahead
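
A minimal sketch of one-step lookahead with a rollout value estimate, assuming a generative model `step(s, a)` that returns a sampled next state and reward and a list `actions`; these names and the sample counts are assumptions for illustration, not part of the slides.

```python
import random

def rollout(step, actions, s, depth, gamma):
    """Estimate V(s) by following a uniformly random policy for `depth` steps."""
    total, discount = 0.0, 1.0
    for _ in range(depth):
        s, r = step(s, random.choice(actions))
        total += discount * r
        discount *= gamma
    return total

def one_step_lookahead(step, actions, s, gamma, n_samples=10, rollout_depth=20):
    """Choose the action with the best sampled reward plus discounted rollout value."""
    def q_estimate(a):
        returns = []
        for _ in range(n_samples):
            sp, r = step(s, a)
            returns.append(r + gamma * rollout(step, actions, sp, rollout_depth, gamma))
        return sum(returns) / len(returns)
    return max(actions, key=q_estimate)
```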

Forward Search

\(O\left((|S|\times|A|)^d\right)\)

Forward search tree expanded to depth \(d\)
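
A sketch of forward search as a recursion, assuming an explicit model where `transition(s, a)` returns a dictionary mapping next states to probabilities and `reward(s, a)` gives the expected reward (hypothetical interface names). Expanding every action and every successor at every level is what gives the \(O\left((|S|\times|A|)^d\right)\) cost.

```python
def forward_search(s, d, transition, reward, actions, gamma):
    """Return (best action, value) by exhaustively expanding the tree to depth d."""
    if d == 0:
        return None, 0.0                          # leaf value (a heuristic could go here)
    best_a, best_q = None, float("-inf")
    for a in actions:                             # |A| actions ...
        q = reward(s, a)
        for sp, p in transition(s, a).items():    # ... times up to |S| successors per level
            _, v = forward_search(sp, d - 1, transition, reward, actions, gamma)
            q += gamma * p * v
        if q > best_q:
            best_a, best_q = a, q
    return best_a, best_q
```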

Sparse Sampling

\(O\left((m|A|)^d\right)\)

\[|V^{\text{SS}}(s) - V^*(s)| \leq \epsilon\]

\(m\), \(\epsilon\), and \(d\) are related to each other, but independent of \(|S|\)
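
A sketch of sparse sampling under the same assumed generative-model interface `step(s, a)`: instead of summing over every successor state, each action is evaluated with \(m\) samples, so the work is \(O\left((m|A|)^d\right)\) regardless of \(|S|\).

```python
def sparse_sampling(step, actions, s, d, m, gamma):
    """Return (best action, value estimate) using m samples per action at each node."""
    if d == 0:
        return None, 0.0
    best_a, best_q = None, float("-inf")
    for a in actions:
        q = 0.0
        for _ in range(m):                        # m sampled successors, not all of S
            sp, r = step(s, a)
            _, vp = sparse_sampling(step, actions, sp, d - 1, m, gamma)
            q += (r + gamma * vp) / m
        if q > best_q:
            best_a, best_q = a, q
    return best_a, best_q
```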

Break

Draw the trees produced by the following algorithms for a problem with 2 actions and 3 states:

  1. One-step lookahead with rollout
  2. Forward search (d=2)
  3. Sparse sampling (d=2, m=2)

Branch and Bound

Assume you have a lower bound \(\underline{V}(s)\) and an upper bound \(\bar{Q}(s, a)\)
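
A sketch of branch and bound layered on forward search, assuming the bounds from the slide are available as functions: `V_lo(s)` for \(\underline{V}(s)\) at the leaves and `Q_hi(s, a)` for \(\bar{Q}(s, a)\) to prune actions (the function names and the explicit-model interface are assumptions).

```python
def branch_and_bound(s, d, transition, reward, actions, gamma, V_lo, Q_hi):
    """Forward search that skips actions whose upper bound cannot beat the best so far."""
    if d == 0:
        return None, V_lo(s)                      # bound leaf values from below
    best_a, best_q = None, float("-inf")
    # visit the most promising actions first so pruning triggers as early as possible
    for a in sorted(actions, key=lambda a: Q_hi(s, a), reverse=True):
        if Q_hi(s, a) < best_q:
            break                                 # no remaining action can do better
        q = reward(s, a)
        for sp, p in transition(s, a).items():
            _, v = branch_and_bound(sp, d - 1, transition, reward, actions, gamma, V_lo, Q_hi)
            q += gamma * p * v
        if q > best_q:
            best_a, best_q = a, q
    return best_a, best_q
```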

Forward Search Sparse Sampling (FSSS)

Paper: https://cdn.aaai.org/ojs/7689/7689-13-11219-1-2-20201228.pdf

  • Sparse Sampling, but only look at potentially valuable states

Things it keeps track of:

  • \(Q(s,a)\): Estimate of the value of the state-action pair
  • \(U(s)\): Upper bound on the value of state \(s\)
  • \(L(s)\): Lower bound on the value of state \(s\)
  • \(U(s,a)\): Upper bound on the value of the state-action pair
  • \(L(s,a)\): Lower bound on the value of the state-action pair

Forward Search Sparse Sampling

If \(L(s,a^*)\geq \max_{a\neq a^*} U(s,a)\) for the best action (\(a^*=\arg\max_a U(s,a)\)):
    then the node is closed, because the best action has been found.
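
A small sketch of this closing test, assuming the bounds are stored in dictionaries `U` and `L` keyed by \((s, a)\) (the storage scheme is an assumption, not part of FSSS itself).

```python
def best_action_and_closed(s, actions, U, L):
    """Return (a*, closed): a* maximizes the upper bound, and the node is closed when
    even the lower bound of a* beats every other action's upper bound."""
    a_star = max(actions, key=lambda a: U[(s, a)])
    other_uppers = [U[(s, a)] for a in actions if a != a_star]
    closed = (not other_uppers) or L[(s, a_star)] >= max(other_uppers)
    return a_star, closed
```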

Monte Carlo Tree Search (MCTS/UCT)

  • FSSS, but with less to keep track of

Keep track of:

  • \(Q(s,a)\): Value estimate for the state-action pair
  • \(N(s,a)\): Number of times the state-action pair has been visited (\(N(s) = \sum_a N(s,a)\))

\[Q(s, a) + c \sqrt{\frac{\log N(s)}{N(s,a)}}\]

low \(N(s, a)/N(s)\) = high bonus

or

\[Q(s, a) + c \frac{N(s)^\beta}{\sqrt{N(s, a)}}\]

start with \(c = 2(\bar{V} - \underline{V})\), \(\beta = 1/4\)

The full story can be found at https://arxiv.org/pdf/1902.05213.pdf
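
A sketch of action selection with the logarithmic (UCB1-style) bonus above, assuming `Q` and `N` are dictionaries keyed by \((s, a)\) with \(N(s) = \sum_a N(s,a)\); unvisited actions get an infinite score so that every action is tried at least once.

```python
import math

def ucb_action(s, actions, Q, N, c):
    """Pick the action maximizing Q(s,a) + c * sqrt(log N(s) / N(s,a))."""
    n_s = sum(N.get((s, a), 0) for a in actions)
    def score(a):
        n_sa = N.get((s, a), 0)
        if n_sa == 0:
            return float("inf")                   # force at least one visit per action
        return Q.get((s, a), 0.0) + c * math.sqrt(math.log(n_s) / n_sa)
    return max(actions, key=score)
```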

Monte Carlo Tree Search (MCTS/UCT)

  1. Search
  2. Expansion
  3. Rollout
  4. Backup
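
A sketch of one MCTS simulation that ties the four steps together, reusing `ucb_action` and `rollout` from the sketches above and the assumed generative model `step(s, a)`; `Q` and `N` are dictionaries of statistics and `tree` is the set of expanded states (all assumed names). Running `simulate` many times from the current state and then taking \(\arg\max_a Q(s,a)\) gives the online action choice.

```python
def simulate(step, actions, s, d, gamma, Q, N, tree, c, rollout_depth=20):
    """Run one simulation from state s and return the sampled discounted return."""
    if d == 0:
        return 0.0
    if s not in tree:                             # Expansion: add a new leaf node
        tree.add(s)
        for a in actions:
            N[(s, a)], Q[(s, a)] = 0, 0.0
        return rollout(step, actions, s, rollout_depth, gamma)   # Rollout from the leaf
    a = ucb_action(s, actions, Q, N, c)           # Search: descend with the exploration bonus
    sp, r = step(s, a)
    q = r + gamma * simulate(step, actions, sp, d - 1, gamma, Q, N, tree, c, rollout_depth)
    N[(s, a)] += 1                                # Backup: update the running statistics
    Q[(s, a)] += (q - Q[(s, a)]) / N[(s, a)]
    return q
```

Because only one new node is added per simulation, the tree grows asymmetrically toward the actions the exploration bonus keeps selecting.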


Guiding Questions

  • What are the differences between online and offline solutions?

  • Are there solution techniques that require computation time independent of the state space size?

070-Online-Methods

By Zachary Sunberg
