Does value iteration always converge?
Is the value function unique?
What are the differences between online and offline solutions?
Are there solution techniques that require computation time independent of the state space size?
Which problems might policy and value iteration struggle with?
Path planning across the country, or interplanetary
More realistic car dynamics (continuous states)
\(n\) dimensions, \(k\) segments \(\,\rightarrow \, |\mathcal{S}| = k^n\)
1 dimension, 5 segments
\(|\mathcal{S}| = 5\)
2 dimensions, 5 segments
\(|\mathcal{S}| = 25\)
3 dimensions, 5 segments
\(|\mathcal{S}| = 125\)
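The growth \(|\mathcal{S}| = k^n\) can be checked numerically. The following is a minimal sketch; the helper name `num_states` is an assumption made for illustration and does not come from the slides.

```python
def num_states(n_dims: int, k_segments: int) -> int:
    """Number of grid cells when each of n_dims dimensions is split into k_segments."""
    return k_segments ** n_dims


# Reproduce the three cases above, then two larger ones to show the blow-up.
for n in (1, 2, 3, 6, 12):
    print(f"n = {n:2d}, k = 5 -> |S| = {num_states(n, 5):,}")
```

With only 5 segments per dimension, 12 dimensions already give hundreds of millions of states, which is why tabular policy and value iteration become impractical.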
Offline
Online
\(O\left((|S|\times|A|)^d\right)\)
\(O\left((m|A|)^d\right)\)
\[|V^{\text{SS}}(s) - V^*(s)| \leq \epsilon\]
\(m\), \(\epsilon\), and \(d\) related, but independent of \(|S|\)
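A minimal sketch of sparse sampling, assuming access to a generative model `step(s, a)` that returns a sampled next state and reward. The toy problem (`toy_step`, `ACTIONS`) is hypothetical and only there to make the example runnable. Note that the number of calls to `step` depends on \(m\), \(|A|\), and \(d\), but never on \(|S|\).

```python
import random

ACTIONS = ("left", "right")  # hypothetical two-action toy problem


def toy_step(s, a):
    """Assumed generative model: noisy move on the integers, reward 1 for 'right'."""
    move = 1 if a == "right" else -1
    next_state = s + move if random.random() < 0.8 else s
    reward = 1.0 if a == "right" else 0.0
    return next_state, reward


def sparse_sampling(s, d, m, actions, step, gamma=0.95):
    """Return (best_action, value_estimate) for state s using depth d and width m."""
    if d == 0:
        return None, 0.0
    best_a, best_q = None, float("-inf")
    for a in actions:
        q = 0.0
        for _ in range(m):  # m sampled successors per action
            sp, r = step(s, a)
            _, v = sparse_sampling(sp, d - 1, m, actions, step, gamma)
            q += (r + gamma * v) / m
        if q > best_q:
            best_a, best_q = a, q
    return best_a, best_q


print(sparse_sampling(0, d=3, m=4, actions=ACTIONS, step=toy_step, gamma=0.5))
```

Each level of the tree branches \(m|A|\) times, which is where the \(O\left((m|A|)^d\right)\) cost comes from.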
Draw the trees produced by the following algorithms for a problem with 2 actions and 3 states:
Assume you have \(\underline{V}(s)\) and \(\bar{Q}(s, a)\)
Forward Search Sparse Sampling (FSSS)
Paper: https://cdn.aaai.org/ojs/7689/7689-13-11219-1-2-20201228.pdf
Things it keeps track of:
\(Q(s,a)\): Estimate of the value for the state action pair
\(U(s)\): Upper bound for value of state s
\(L(s)\): Lower bound for value of state s
\(U(s,a)\): Upper bound for value of state-action
\(L(s,a)\): Lower bound for value of state-action
If \(L(s,a^*)\geq \max_{a\neq a^*} U(s,a)\) for the best action (\(a^*=\arg\max_a U(s,a)\)):
then, the node is closed because the best action is found.
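The closing test above can be sketched as a small function. This is an illustration only: the name `is_closed` and the dictionary representation of the bounds \(U(s,a)\) and \(L(s,a)\) are assumptions, not taken from the paper.

```python
def is_closed(upper: dict, lower: dict) -> bool:
    """Check the closing condition for one state node.

    upper[a] and lower[a] hold U(s, a) and L(s, a) for each action a.
    The node is closed when the lower bound of the action with the highest
    upper bound is at least the upper bound of every other action.
    """
    a_star = max(upper, key=upper.get)  # a* = argmax_a U(s, a)
    rivals = [u for a, u in upper.items() if a != a_star]
    return not rivals or lower[a_star] >= max(rivals)


# The bounds for 'right' dominate 'left', so no further sampling is needed here.
print(is_closed({"left": 0.4, "right": 1.0}, {"left": 0.1, "right": 0.5}))
```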
Keep track of:
\(Q(s,a)\): Value estimate of that state and action combo
\(N(s,a)\): Number of times we visit a state and action combo
UCB exploration:
\[Q(s, a) + c \sqrt{\frac{\log N(s)}{N(s,a)}}\]
or polynomial exploration:
\[Q(s, a) + c \frac{N(s)^\beta}{\sqrt{N(s, a)}}\]
low \(N(s, a)/N(s)\) = high bonus
start with \(c = 2(\bar{V} - \underline{V})\), \(\beta = 1/4\)
Full story can be found in https://arxiv.org/pdf/1902.05213.pdf
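The two exploration criteria can be compared with a short sketch. The function names are assumptions, and returning an infinite score for an untried action is a common convention rather than something stated on the slides.

```python
import math


def ucb_score(q, n_s, n_sa, c):
    """Q(s, a) + c * sqrt(log N(s) / N(s, a)); untried actions score +inf (assumed convention)."""
    if n_sa == 0:
        return math.inf
    return q + c * math.sqrt(math.log(n_s) / n_sa)


def poly_score(q, n_s, n_sa, c, beta=0.25):
    """Q(s, a) + c * N(s)^beta / sqrt(N(s, a)); untried actions score +inf (assumed convention)."""
    if n_sa == 0:
        return math.inf
    return q + c * n_s ** beta / math.sqrt(n_sa)


# Same value estimate, different visit counts: the rarely tried action gets the larger bonus.
for n_sa in (1, 10, 50):
    print(n_sa, ucb_score(0.5, 100, n_sa, c=1.0), poly_score(0.5, 100, n_sa, c=1.0))
```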
Search
Expansion
Rollout
Backup
Improve: Show which lines these steps correspond to, explain recursive structure
What are the differences between online and offline solutions?
Are there solution techniques that require computation time independent of the state space size?