Offline POMDP Algorithms
Last time: POMDP Value Iteration (horizon \(d\))
\(\Gamma^0 \gets \emptyset\)
for \(n \in 1\ldots d\)
Construct \(\Gamma^n\) by expanding with \(\Gamma^{n-1}\)
Prune \(\Gamma^n\)
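The prune step discards \(\alpha\)-vectors that are never the maximizer at any belief. A complete prune requires a linear program per vector; below is a minimal sketch of the cheaper pointwise-dominance check only (function name and array layout are my own, not from the slides):

```python
import numpy as np

def prune_dominated(alphas):
    """Remove alpha vectors that are pointwise dominated by another
    vector in the set. This is only a partial prune: vectors dominated
    by a *combination* of others need an LP test to detect."""
    kept = []
    for i, a in enumerate(alphas):
        dominated = any(
            j != i and np.all(b >= a) and np.any(b > a)
            for j, b in enumerate(alphas)
        )
        if not dominated:
            kept.append(a)
    return kept
```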
Finite Horizon POMDP Value Iteration

Infinite-Horizon POMDP Lower Bound Improvement
\(\Gamma \gets\) blind lower bound
loop
\(\Gamma \gets \Gamma \cup \text{backup}(\Gamma)\)
\(\Gamma \gets \text{prune}(\Gamma)\)
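The blind lower bound keeps one \(\alpha\)-vector per action: the value of repeating that action forever, which satisfies \(\alpha_a = R(\cdot, a) + \gamma T_a \alpha_a\) and can be solved directly. A sketch, assuming dense NumPy arrays (\(R\) is \(|S| \times |A|\), \(T\) is \(|A| \times |S| \times |S|\); these shapes are my assumption):

```python
import numpy as np

def blind_lower_bound(R, T, gamma):
    """One alpha vector per action: the discounted value of always
    taking that action, alpha_a = (I - gamma * T_a)^{-1} R[:, a]."""
    S = R.shape[0]
    return [np.linalg.solve(np.eye(S) - gamma * T[a], R[:, a])
            for a in range(R.shape[1])]
```

Each of these vectors is a valid lower bound on the optimal value at every belief, since always taking one action is a (bad) policy.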
Point-Based Value Iteration (PBVI)
function point_backup\((\Gamma, b)\)
for \(a \in A\)
for \(o \in O\)
\(b' \gets \tau(b, a, o)\)
\(\alpha_{a,o} \gets \underset{\alpha \in \Gamma}{\text{argmax}} \; \alpha^\top b'\)
for \(s \in S\)
\(\alpha_a[s] = R(s, a) + \gamma \sum_{s', o} T(s'\mid s, a) \,Z(o \mid a, s') \, \alpha_{a, o}[s']\)
return \(\underset{\alpha_a}{\text{argmax}} \; \alpha_a^\top b\)
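The point_backup pseudocode above can be sketched in Python for a small dense POMDP (array shapes and helper names are my assumptions: \(T\) is \(|A| \times |S| \times |S|\), \(Z\) is \(|A| \times |S'| \times |O|\)):

```python
import numpy as np

def tau(b, a, o, T, Z):
    """Belief update: b'(s') proportional to Z[a,s',o] * sum_s T[a,s,s'] b(s)."""
    bp = Z[a, :, o] * (b @ T[a])
    s = bp.sum()
    return bp / s if s > 0 else bp

def point_backup(Gamma, b, R, T, Z, gamma):
    """Point-based backup at belief b, following the pseudocode above:
    pick the best existing alpha vector at each successor belief,
    then assemble one new alpha vector per action and keep the best."""
    nA, nS, nO = T.shape[0], T.shape[1], Z.shape[2]
    best, best_val = None, -np.inf
    for a in range(nA):
        alpha_ao = []
        for o in range(nO):
            bp = tau(b, a, o, T, Z)
            alpha_ao.append(max(Gamma, key=lambda al: al @ bp))
        alpha_a = np.array([
            R[s, a] + gamma * sum(
                T[a, s, sp] * Z[a, sp, o] * alpha_ao[o][sp]
                for sp in range(nS) for o in range(nO))
            for s in range(nS)])
        if alpha_a @ b > best_val:
            best, best_val = alpha_a, alpha_a @ b
    return best
```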
Original PBVI
\(B \gets \{b_0\}\)
loop
for \(b \in B\)
\(\Gamma \gets \Gamma \cup \{\text{point\_backup}(\Gamma, b)\}\)
\(B' \gets \emptyset\)
for \(b \in B\)
\(\tilde{B} \gets \{\tau(b, a, o) : a \in A, o \in O\}\)
\(B' \gets B' \cup \left\{\underset{b' \in \tilde{B}}{\text{argmax}} \; \min_{b'' \in B} \lVert b'' - b' \rVert_1\right\}\)
\(B \gets B \cup B'\)
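The expansion step above can be sketched as follows: for each current belief, generate all one-step successors and keep the one farthest (in \(L_1\) distance) from the current set, so the belief set spreads over the reachable space. Names and shapes are my assumptions:

```python
import numpy as np

def tau(b, a, o, T, Z):
    bp = Z[a, :, o] * (b @ T[a])
    return bp / bp.sum()

def expand_beliefs(B, T, Z):
    """One PBVI expansion step: for each b in B, add the successor
    belief with the largest L1 distance to the current set."""
    nA, nO = T.shape[0], Z.shape[2]
    new = []
    for b in B:
        candidates = [tau(b, a, o, T, Z)
                      for a in range(nA) for o in range(nO)]
        far = max(candidates,
                  key=lambda bp: min(np.abs(bp - bb).sum() for bb in B))
        new.append(far)
    return B + new
```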
PERSEUS: Randomly Selected Beliefs
Two Phases:
- Random Exploration
- Value Backup
Random Exploration:
\(B \gets \emptyset\)
\(b \gets b_0\)
loop until \(\lvert B \rvert = n\)
\(a \gets \text{rand}(A)\)
\(o \gets \text{rand}(P(o \mid b, a))\)
\(b \gets \tau(b, a, o)\)
\(B \gets B \cup \{b\}\)
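The random exploration phase above can be sketched as a short simulation loop (function name, seeding, and array layout are my assumptions):

```python
import numpy as np

def random_beliefs(b0, n, T, Z, seed=0):
    """PERSEUS exploration phase: collect n beliefs by simulating
    uniformly random actions and observations sampled from P(o | b, a)."""
    rng = np.random.default_rng(seed)
    def tau(b, a, o):
        bp = Z[a, :, o] * (b @ T[a])
        return bp / bp.sum()
    B, b = [], np.array(b0, dtype=float)
    while len(B) < n:
        a = rng.integers(T.shape[0])
        po = Z[a].T @ (b @ T[a])        # P(o | b, a) for each o
        o = rng.choice(len(po), p=po / po.sum())
        b = tau(b, a, o)
        B.append(b)
    return B
```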
Heuristic Search Value Iteration (HSVI)
while \(\overline{V}(b_0) - \underline{V}(b_0) > \epsilon \)
explore\((b_0, 0)\)
function explore(b, t)
if \(\overline{V}(b) - \underline{V}(b) > \epsilon \gamma^{-t}\)
\(a^* = \underset{a}{\text{argmax}} \; \overline{Q}(b, a)\)
\(o^* = \underset{o}{\text{argmax}} \; P(o \mid b, a^*) \left(\overline{V}(\tau(b, a^*, o)) - \underline{V}(\tau(b, a^*, o)) - \epsilon \gamma^{-(t+1)}\right)\)
explore(\(\tau(b, a^*, o^*), t+1\))
\(\underline{\Gamma} \gets \underline{\Gamma} \cup \{\text{point\_backup}(\underline{\Gamma}, b)\}\)
\(\overline{V}(b) \gets \max_a \left[R(b, a) + \gamma \sum_o P(o \mid b, a)\, \overline{V}(\tau(b, a, o))\right]\)
Sawtooth Upper Bounds
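A sawtooth representation stores the upper bound at the corner beliefs \(e_s\) plus a set of \((b_i, v_i)\) point bounds; the bound at an arbitrary \(b\) is the corner interpolation minus the best improvement any single point can certify. A sketch under those conventions (function name and data layout are my assumptions):

```python
import numpy as np

def sawtooth(b, corner_vals, points):
    """Sawtooth upper-bound interpolation at belief b.
    corner_vals[s] is the bound at corner belief e_s; points is a list
    of (belief, value) pairs that tighten the bound locally."""
    v_corner = b @ corner_vals
    best = v_corner
    for bp, vp in points:
        mask = bp > 0
        # largest weight c such that b - c * bp stays nonnegative
        c = np.min(b[mask] / bp[mask])
        best = min(best, v_corner + c * (vp - bp @ corner_vals))
    return best
```

Evaluating this is \(O(|S|)\) per stored point, which is why HSVI-style solvers prefer it to projecting the upper bound onto new beliefs with a linear program.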
SARSOP
Successive Approximation of Reachable Space under Optimal Policies
Offline POMDP Algorithms
Policy Graphs
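A policy graph assigns an action to each node and a successor node to each observation; its value is one \(\alpha\)-vector per node, found by solving the linear system \(\alpha_n(s) = R(s, a_n) + \gamma \sum_{s', o} T(s' \mid s, a_n)\, Z(o \mid a_n, s')\, \alpha_{\text{succ}(n, o)}(s')\). A sketch with dense matrices (function name and shapes are my assumptions):

```python
import numpy as np

def evaluate_policy_graph(nodes, edges, R, T, Z, gamma):
    """nodes[n] is the action at node n; edges[n][o] is the successor
    node after observation o. Returns an (N, S) array of alpha vectors
    by solving the N*S linear evaluation equations jointly."""
    N, S, O = len(nodes), T.shape[1], Z.shape[2]
    A_mat = np.eye(N * S)
    rhs = np.zeros(N * S)
    for n, a in enumerate(nodes):
        for s in range(S):
            row = n * S + s
            rhs[row] = R[s, a]
            for sp in range(S):
                for o in range(O):
                    col = edges[n][o] * S + sp
                    A_mat[row, col] -= gamma * T[a, s, sp] * Z[a, sp, o]
    return np.linalg.solve(A_mat, rhs).reshape(N, S)
```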
Monte Carlo Value Iteration (MCVI)
180 Offline POMDP Algorithms
By Zachary Sunberg