Previously: approximately solve the original problem. Today: solve a slightly different problem.
Last Time
[Diagram: agent-environment loop for the Tiger problem. The Environment has true state \(s = TL\) and emits observation \(o = TL\); a Solver/Planner produces the Policy, which selects action \(a\). The policy can be conditioned on either of two representations of the past:]
Option 1: History \(h\), with \(h_t = (b_0, a_0, o_1, a_1, \ldots, a_{t-1}, o_{t})\)
Option 2: Belief Updater maintaining \(b\), with \(b_t = P(s_t \mid h_t)\), a distribution over \(\{TL, TR\}\)
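To make Option 2 concrete, here is a minimal sketch of the exact discrete Bayes update \(b_t(s') \propto O(o \mid a, s') \sum_s T(s' \mid s, a)\, b_{t-1}(s)\). The array layout and the 0.85 listening accuracy for the Tiger problem are illustrative assumptions, not values from these slides.

```python
import numpy as np

def update_belief(b, a, o, T, O):
    """Exact discrete Bayes filter: b'(s') ∝ O(o|a,s') * Σ_s T(s'|s,a) b(s).

    b: belief vector over states; T[a][s, s']: transition probabilities;
    O[a][s', o]: observation probabilities. All names are illustrative.
    """
    predicted = b @ T[a]                      # Σ_s b(s) T(s'|s,a)
    unnormalized = O[a][:, o] * predicted     # multiply by observation likelihood
    return unnormalized / unnormalized.sum()  # normalize to a distribution

# Tiger-style example with states TL=0, TR=1; the 0.85 listening accuracy
# is an assumed value for illustration.
T = {"listen": np.eye(2)}                          # listening does not move the tiger
O = {"listen": np.array([[0.85, 0.15],             # P(o | listen, s'=TL)
                         [0.15, 0.85]])}           # P(o | listen, s'=TR)
b0 = np.array([0.5, 0.5])
b1 = update_belief(b0, "listen", 0, T, O)          # observed o = TL
print(b1)                                          # [0.85, 0.15]
```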
[Ross et al., Online Planning Algorithms for POMDPs, 2008]
Monte Carlo tree search: each iteration performs Search, Expansion, Rollout, and Backup.
Actions are selected with the UCB criterion
\[Q(s, a) + c \sqrt{\frac{\log N(s)}{N(s,a)}}\]
A low visit ratio \(N(s, a)/N(s)\) yields a high exploration bonus; a reasonable starting point is \(c = 2(\bar{V} - \underline{V})\).
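A minimal sketch of those four steps, assuming a generative model `step(s, a) -> (s_next, r)`, a discrete action list, hashable states, a random rollout policy, and a running-mean backup; these are illustrative choices, not the exact algorithm from the slides.

```python
import math, random
from collections import defaultdict

def mcts(s0, step, actions, c, depth=20, n_iters=1000, gamma=0.95):
    """UCB-guided Monte Carlo tree search sketch (Search, Expansion, Rollout, Backup)."""
    N = defaultdict(int)      # N(s, a): action visit counts
    Ns = defaultdict(int)     # N(s): state node visit counts
    Q = defaultdict(float)    # Q(s, a): running-mean value estimates

    def rollout(s, d):
        # Rollout: simulate to depth d with a random policy
        if d == 0:
            return 0.0
        s_next, r = step(s, random.choice(actions))
        return r + gamma * rollout(s_next, d - 1)

    def simulate(s, d):
        if d == 0:
            return 0.0
        if Ns[s] == 0:
            # Expansion: first visit to this node, estimate its value by rollout
            Ns[s] = 1
            return rollout(s, d)
        # Search: descend by maximizing Q(s,a) + c*sqrt(log N(s) / N(s,a))
        a = max(actions, key=lambda a: float("inf") if N[(s, a)] == 0
                else Q[(s, a)] + c * math.sqrt(math.log(Ns[s]) / N[(s, a)]))
        s_next, r = step(s, a)
        q = r + gamma * simulate(s_next, d - 1)
        # Backup: update counts and the running mean of Q(s, a)
        Ns[s] += 1
        N[(s, a)] += 1
        Q[(s, a)] += (q - Q[(s, a)]) / N[(s, a)]
        return q

    for _ in range(n_iters):
        simulate(s0, depth)
    return max(actions, key=lambda a: Q[(s0, a)])
```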
Somani, A., Ye, N., Hsu, D., & Lee, W. "DESPOT: Online POMDP Planning with Regularization." Journal of Artificial Intelligence Research, 2017
POMCP
POMCPOW
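Both planners represent the belief at each node with state particles rather than an exact distribution (POMCP with unweighted particles, POMCPOW with weighted particles and progressive widening). Below is a minimal sketch of a weighted particle belief update in that spirit; `gen(s, a) -> (s_next, o, r)` and the likelihood `obs_weight(o, a, s_next)` are assumed interfaces, not the planners' actual code.

```python
import random

def particle_update(particles, weights, a, o, gen, obs_weight, n=100):
    """Weighted particle belief update: resample from the current belief,
    propagate each particle through the generative model, and reweight
    by the observation likelihood."""
    new_particles, new_weights = [], []
    for _ in range(n):
        s = random.choices(particles, weights=weights, k=1)[0]  # resample
        s_next, _, _ = gen(s, a)                                # propagate
        new_particles.append(s_next)
        new_weights.append(obs_weight(o, a, s_next))            # P(o | a, s_next)
    total = sum(new_weights)
    if total == 0.0:
        return new_particles, [1.0 / n] * n  # degenerate case: fall back to uniform
    return new_particles, [w / total for w in new_weights]
```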
For any \(\epsilon>0\) and \(\delta>0\), if \(C\) (the number of particles) is high enough,
\[|Q_{\mathbf{P}}^*(b,a) - Q_{\mathbf{M}_{\mathbf{P}}}^*(\bar{b},a)| \leq \epsilon \quad \text{w.p. } 1-\delta,\]
where \(\mathbf{M}_\mathbf{P}\) is the particle belief MDP approximation of the POMDP \(\mathbf{P}\).
[Lim, Becker, Kochenderfer, Tomlin, & Sunberg, 2023]
No dependence on \(|\mathcal{S}|\) or \(|\mathcal{O}|\)!
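In practice this means everything is computed from \(C\) sampled state particles and a generative model, with no sums over \(\mathcal{S}\) or \(\mathcal{O}\). The sketch below estimates an action value from a particle belief by rollouts under a fixed policy; it only illustrates where \(C\) enters (it is not the optimal \(Q^*_{\mathbf{M}_{\mathbf{P}}}\) from the bound), and `gen` and `rollout_policy` are assumed names.

```python
import random

def q_estimate(b_particles, a, gen, rollout_policy, gamma=0.95, depth=20):
    """Monte Carlo action-value estimate from a particle belief of size C.
    Only sampled states are touched: no enumeration of S or O."""
    total = 0.0
    for s in b_particles:                  # C = len(b_particles) rollouts
        s_next, _, r = gen(s, a)           # gen(s, a) -> (s_next, o, r), assumed
        ret, disc = r, gamma
        for _ in range(depth - 1):
            s_next, _, r = gen(s_next, rollout_policy(s_next))
            ret += disc * r
            disc *= gamma
        total += ret
    return total / len(b_particles)        # average return over the C particles
```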