(we think)
(approximately solve original problem)
(solve a slightly different problem)
SARSOP
(Next time)
Today!
\[\pi^* = \underset{\pi : B \to A}{\text{argmax}} \,\, \text{E}\left[\sum_{t=0}^\infty \gamma^t R(s_t, \pi(b_t))\right]\]
\[b' = \tau(b, a, o)\]
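For a discrete POMDP, the belief update \(\tau\) can be sketched as a Bayes filter. The array layouts `T[a, s, s']` and `Z[a, s', o]`, the function name `update_belief`, and the two-state model are assumptions made for illustration only.

```python
import numpy as np

def update_belief(b, a, o, T, Z):
    """Discrete Bayes filter: b'(s') is proportional to Z[a, s', o] * sum_s T[a, s, s'] * b(s)."""
    predicted = b @ T[a]                    # sum_s b(s) T(s' | s, a)
    unnormalized = Z[a][:, o] * predicted   # weight by observation likelihood
    total = unnormalized.sum()
    if total == 0.0:
        raise ValueError("observation has zero probability under this belief")
    return unnormalized / total

# Hypothetical 2-state, 1-action, 2-observation model
T = np.array([[[1.0, 0.0], [0.0, 1.0]]])   # T[a, s, s']: the state never changes
Z = np.array([[[0.8, 0.2], [0.3, 0.7]]])   # Z[a, s', o]
b = np.array([0.5, 0.5])
b_next = update_belief(b, a=0, o=0, T=T, Z=Z)
```

Observation 0 is more likely in state 0, so the updated belief shifts toward state 0.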
\[\pi_{\text{CE}}(b) = \pi_s (\hat{s}(b))\]
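A minimal sketch of the certainty-equivalent policy for a discrete problem. It assumes that \(\hat{s}(b)\) is the most likely state under \(b\) (the mean is another common choice) and that \(\pi_s\) is a lookup table from an MDP solver; `mdp_policy` and its values are hypothetical.

```python
import numpy as np

def certainty_equivalent_action(b, mdp_policy):
    """Act as if the most likely state were the true state."""
    s_hat = int(np.argmax(b))    # assumption: s_hat(b) is the mode of the belief
    return mdp_policy[s_hat]     # pi_s(s_hat)

mdp_policy = np.array([1, 0, 2])  # hypothetical MDP policy: state index -> action
b = np.array([0.2, 0.5, 0.3])
action = certainty_equivalent_action(b, mdp_policy)
```

Here the mode of the belief is state 1, so the policy returns the MDP action for state 1.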
Optimal for Linear-Quadratic-Gaussian (LQG)
(Analogous to LQR MDP)
\(\pi^*_\text{LQG}(b) = -K_\text{LQR} \,\mu_b\)
LQG POMDP
\(S = \mathbb{R}^n, A = \mathbb{R}^m, O = \mathbb{R}^p\)
\(R(s, a) = -s^\top R_s s - a^\top R_a a\)
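The LQG policy above can be sketched as an LQR gain applied to the belief mean. This sketch assumes linear dynamics \(s' = A s + B a + w\) with Gaussian noise (the matrices `A` and `B` are not given on the slide and are hypothetical here), folds the discount \(\gamma\) into the Riccati iteration, and takes \(\mu_b\) as given; in practice it would come from a Kalman filter, which is not shown.

```python
import numpy as np

def lqr_gain(A, B, R_s, R_a, gamma=0.95, iters=500):
    """Fixed-point iteration on the discounted Riccati equation; returns K_LQR."""
    P = R_s.copy()
    for _ in range(iters):
        # Minimizer of a' R_a a + gamma (A s + B a)' P (A s + B a) is a = -K s
        K = gamma * np.linalg.solve(R_a + gamma * B.T @ P @ B, B.T @ P @ A)
        P = R_s + gamma * A.T @ P @ (A - B @ K)
    return K

def lqg_action(K, mu_b):
    """pi_LQG(b) = -K_LQR mu_b, where mu_b is the mean of the Gaussian belief."""
    return -K @ mu_b

# Hypothetical double-integrator system
A = np.array([[1.0, 1.0], [0.0, 1.0]])
B = np.array([[0.0], [1.0]])
R_s = np.eye(2)
R_a = np.array([[1.0]])
K = lqr_gain(A, B, R_s, R_a)
a = lqg_action(K, np.array([1.0, 0.0]))
```

Only the belief mean enters the action; the belief covariance does not affect it.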
\[\pi_\text{QMDP}(b) = \underset{a \in A}{\text{argmax}} \,\, \underset{s\sim b}{\text{E}}\left[Q_\text{MDP}(s, a)\right]\]
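A minimal sketch of QMDP for a discrete problem: solve the underlying MDP with value iteration, then pick the action with the highest belief-weighted \(Q_\text{MDP}\). The array layouts `T[a, s, s']` and `R[s, a]`, the function names, and the two-state example are assumptions for illustration.

```python
import numpy as np

def solve_q_mdp(T, R, gamma=0.95, iters=1000):
    """Value iteration on the fully observable MDP. T[a, s, s'], R[s, a]."""
    n_actions, n_states, _ = T.shape
    Q = np.zeros((n_states, n_actions))
    for _ in range(iters):
        V = Q.max(axis=1)
        Q = R + gamma * np.einsum("ast,t->sa", T, V)  # expected next-state value
    return Q

def qmdp_action(b, Q):
    """argmax_a E_{s ~ b}[Q_MDP(s, a)]."""
    return int(np.argmax(b @ Q))

# Hypothetical model: action 0 stays, action 1 switches state; state 1 pays reward 1
T = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[0.0, 1.0], [1.0, 0.0]]])
R = np.array([[0.0, 0.0],
              [1.0, 1.0]])
Q = solve_q_mdp(T, R)
a_probably_state_0 = qmdp_action(np.array([0.9, 0.1]), Q)
a_probably_state_1 = qmdp_action(np.array([0.1, 0.9]), Q)
```

The only work beyond the MDP solve is one dot product per decision.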
Figure: State vs. Timestep, with Accurate Observations. Goal: \(a=0\) at \(s=0\). Labels on the Optimal Policy: Localize, \(a=0\).
QMDP is the same as assuming full observability from the next step onward
(Break)
QMDP
Full POMDP
INDUSTRIAL GRADE
Used for ACAS X
[Kochenderfer, 2011]
QMDP is very effective for many POMDPs and cheap (same cost as solving the underlying MDP)
Two cases where it may not work well:
A.K.A. Open Loop Feedback Control (OLFC)
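One way to sketch open loop feedback control for a small discrete problem: at each step, score every fixed action sequence of a given depth from the current belief while ignoring future observations, execute the first action of the best sequence, then update the belief and replan. The brute-force enumeration, the `depth` parameter, the array layouts, and the two-state model are assumptions for illustration.

```python
import itertools
import numpy as np

def open_loop_value(b, actions, T, R, gamma):
    """Expected discounted reward of a fixed action sequence starting from belief b."""
    total, discount, dist = 0.0, 1.0, b
    for a in actions:
        total += discount * (dist @ R[:, a])
        dist = dist @ T[a]        # propagate the state distribution; no observations used
        discount *= gamma
    return total

def olfc_action(b, T, R, gamma=0.95, depth=3):
    """Return the first action of the best open-loop sequence (brute-force search)."""
    n_actions = T.shape[0]
    best = max(itertools.product(range(n_actions), repeat=depth),
               key=lambda seq: open_loop_value(b, seq, T, R, gamma))
    return best[0]

# Hypothetical model: action 0 stays, action 1 switches state; state 1 pays reward 1
T = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[0.0, 1.0], [1.0, 0.0]]])
R = np.array([[0.0, 0.0],
              [1.0, 1.0]])
first_action = olfc_action(np.array([0.9, 0.1]), T, R)
```

The feedback comes from replanning at the updated belief after each observation. The plan itself never anticipates future observations.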