Offline POMDP Algorithms
Last time: POMDP Value Iteration (horizon \(d\))
\(\Gamma^0 \gets \emptyset\)
for \(n \in 1\ldots d\)
Construct \(\Gamma^n\) by expanding with \(\Gamma^{n-1}\)
Prune \(\Gamma^n\)
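The loop above can be sketched in a few lines of Python. This is only an illustrative sketch: the Tiger-style model (`T`, `Z`, `R`, `gamma`) and the helper names `expand`, `prune`, `value_iteration`, and `value` are assumptions introduced here, not part of the slides, and the prune step only removes duplicates and pointwise-dominated vectors rather than running a full linear-program prune.

```python
from itertools import product

# Assumed toy model (Tiger): state = tiger door, action 0 = listen, 1/2 = open door 0/1.
S, A, O = range(2), range(3), range(2)
gamma = 0.95

def T(s, a, sp):
    # Listening keeps the state; opening a door resets it uniformly.
    return float(s == sp) if a == 0 else 0.5

def Z(a, sp, o):
    # Listening is 85% accurate; after opening, observations are uninformative.
    return (0.85 if o == sp else 0.15) if a == 0 else 0.5

def R(s, a):
    if a == 0:
        return -1.0
    return -100.0 if a - 1 == s else 10.0

def expand(Gamma):
    """Build all one-step-longer plans: an action plus one old alpha per observation."""
    if not Gamma:
        return [tuple(R(s, a) for s in S) for a in A]
    new = []
    for a in A:
        for plan in product(Gamma, repeat=len(O)):
            new.append(tuple(
                R(s, a) + gamma * sum(T(s, a, sp) * Z(a, sp, o) * plan[o][sp]
                                      for sp in S for o in O)
                for s in S))
    return new

def prune(Gamma):
    """Cheap prune (assumption): drop duplicates and pointwise-dominated vectors."""
    unique = list(dict.fromkeys(Gamma))
    return [al for al in unique
            if not any(be != al and all(x >= y for x, y in zip(be, al))
                       for be in unique)]

def value_iteration(d):
    Gamma = []
    for _ in range(d):
        Gamma = prune(expand(Gamma))
    return Gamma

def value(Gamma, b):
    """V(b) = max over alpha vectors of alpha^T b."""
    return max(sum(x * p for x, p in zip(al, b)) for al in Gamma)
```

For this assumed model, `value(value_iteration(2), [0.5, 0.5])` is about -1.95 (listen twice). The number of candidate vectors grows as \(|A| \, |\Gamma^{n-1}|^{|O|}\) per step, which is why pruning matters.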
Finite Horizon POMDP Value Iteration

Infinite-Horizon POMDP Lower Bound Improvement
\(\Gamma \gets\) blind lower bound
\(\Gamma \gets \Gamma \cup \text{backup}(\Gamma)\)
\(\Gamma \gets \text{prune}(\Gamma)\)
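One way to obtain the starting set is sketched below, taking "blind lower bound" to mean the value of repeating a single action forever regardless of observations, so that each action \(a\) contributes the vector satisfying \(\alpha_a = R_a + \gamma T_a \alpha_a\). The Tiger-style model and the name `blind_lower_bound` are assumptions made for illustration; the fixed point is approximated by simple iteration rather than a linear solve.

```python
# Assumed toy model (Tiger): state = tiger door, action 0 = listen, 1/2 = open door 0/1.
S, A = range(2), range(3)
gamma = 0.95

def T(s, a, sp):
    # Listening keeps the state; opening a door resets it uniformly.
    return float(s == sp) if a == 0 else 0.5

def R(s, a):
    if a == 0:
        return -1.0
    return -100.0 if a - 1 == s else 10.0

def blind_lower_bound(iterations=2000):
    """One alpha vector per action: the value of taking that action forever."""
    Gamma = []
    for a in A:
        alpha = [0.0 for _ in S]
        for _ in range(iterations):
            # Fixed-point iteration on alpha = R_a + gamma * T_a * alpha.
            alpha = [R(s, a) + gamma * sum(T(s, a, sp) * alpha[sp] for sp in S)
                     for s in S]
        Gamma.append(alpha)
    return Gamma
```

Each vector is the value of a policy that can actually be executed, so the maximum over these vectors is a valid lower bound on the optimal value at every belief. In this assumed model the "always listen" vector is approximately \((-20, -20)\).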

Point-Based Value Iteration (PBVI)
function point_backup\((\Gamma, b)\)
for \(a \in A\)
for \(o \in O\)
\(b' \gets \tau(b, a, o)\)
\(\alpha_{a,o} \gets \underset{\alpha \in \Gamma}{\text{argmax}} \; \alpha^\top b'\)
for \(s \in S\)
\(\alpha_a[s] = R(s, a) + \gamma \sum_{s', o} T(s'\mid s, a) \,Z(o \mid a, s') \, \alpha_{a, o}[s']\)
return \(\underset{\alpha_a}{\text{argmax}} \; \alpha_a^\top b\)
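The point backup above translates almost line for line into Python. The sketch below assumes a Tiger-style toy model and helper names (`dot`, `update`) that do not come from the slides; `update` plays the role of \(\tau\).

```python
# Assumed toy model (Tiger): state = tiger door, action 0 = listen, 1/2 = open door 0/1.
S, A, O = range(2), range(3), range(2)
gamma = 0.95

def T(s, a, sp):
    return float(s == sp) if a == 0 else 0.5

def Z(a, sp, o):
    return (0.85 if o == sp else 0.15) if a == 0 else 0.5

def R(s, a):
    if a == 0:
        return -1.0
    return -100.0 if a - 1 == s else 10.0

def dot(alpha, b):
    return sum(x * p for x, p in zip(alpha, b))

def update(b, a, o):
    """Belief update tau(b, a, o); returns None if o has zero probability."""
    post = [Z(a, sp, o) * sum(T(s, a, sp) * b[s] for s in S) for sp in S]
    total = sum(post)
    return [x / total for x in post] if total > 0 else None

def point_backup(Gamma, b):
    best = None
    for a in A:
        alpha_ao = {}
        for o in O:
            bp = update(b, a, o)
            # For an unreachable observation the choice does not affect alpha_a^T b.
            alpha_ao[o] = Gamma[0] if bp is None else max(Gamma, key=lambda al: dot(al, bp))
        alpha_a = [R(s, a) + gamma * sum(T(s, a, sp) * Z(a, sp, o) * alpha_ao[o][sp]
                                         for sp in S for o in O)
                   for s in S]
        if best is None or dot(alpha_a, b) > dot(best, b):
            best = alpha_a
    return best
```

Unlike the exact expansion, this produces exactly one new alpha vector per belief, at a cost of roughly \(|A| \, |O| \, |\Gamma| \, |S|\) for the inner argmax plus \(|A| \, |S|^2 \, |O|\) for assembling the vectors.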

Original PBVI
\(B \gets \{b_0\}\)
for \(b \in B\)
\(\Gamma \gets \Gamma \cup \{\text{point\_backup}(\Gamma, b)\}\)
for \(b \in B\)
\(\tilde{B} \gets \{\tau(b, a, o) : a \in A, o \in O\}\)
\(B' \gets B' \cup \left\{\underset{b' \in \tilde{B}}{\text{argmax}} \; \lVert B, b' \rVert\right\}\)
\(B \gets B \cup B'\)
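The belief expansion step can be sketched as follows, reading \(\lVert B, b' \rVert\) as the smallest L1 distance from \(b'\) to any belief already in \(B\). That reading, the Tiger-style model, the helper names, and the small tolerance used to skip successors already in the set are all assumptions made for this illustration.

```python
# Assumed toy model (Tiger): state = tiger door, action 0 = listen, 1/2 = open door 0/1.
S, A, O = range(2), range(3), range(2)

def T(s, a, sp):
    return float(s == sp) if a == 0 else 0.5

def Z(a, sp, o):
    return (0.85 if o == sp else 0.15) if a == 0 else 0.5

def update(b, a, o):
    """Belief update tau(b, a, o); returns None if o has zero probability."""
    post = [Z(a, sp, o) * sum(T(s, a, sp) * b[s] for s in S) for sp in S]
    total = sum(post)
    return [x / total for x in post] if total > 0 else None

def distance_to_set(B, bp):
    """Smallest L1 distance from bp to any belief in B."""
    return min(sum(abs(x - y) for x, y in zip(b, bp)) for b in B)

def expand_beliefs(B):
    B_new = []
    for b in B:
        candidates = []
        for a in A:
            for o in O:
                bp = update(b, a, o)
                if bp is not None:
                    candidates.append(bp)
        # Keep the successor of b that is farthest from the current belief set.
        far = max(candidates, key=lambda bp: distance_to_set(B, bp))
        if distance_to_set(B + B_new, far) > 1e-9:  # skip beliefs we already have
            B_new.append(far)
    return B + B_new
```

Starting from `[[0.5, 0.5]]`, one call adds a belief with probability 0.85 on one state, the result of a single listen. Since each belief contributes at most one successor, the set at most doubles per expansion.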

PERSEUS: Randomly Selected Beliefs
Two Phases:
- Random Exploration
- Value Backup
Random Exploration:
\(B \gets \emptyset\)
\(b \gets b_0\)
loop until \(\lvert B \rvert = n\)
\(a \gets \text{rand}(A)\)
\(o \gets \text{rand}(P(o \mid b, a))\)
\(b \gets \tau(b, a, o)\)
\(B \gets B \cup \{b\}\)
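A minimal sketch of the random exploration phase follows. The Tiger-style model, the function name `random_exploration`, and the `seed` parameter are assumptions for illustration. The sketch stores beliefs in a list, so it stops after exactly \(n\) steps even if some beliefs repeat.

```python
import random

# Assumed toy model (Tiger): state = tiger door, action 0 = listen, 1/2 = open door 0/1.
S, A, O = range(2), range(3), range(2)

def T(s, a, sp):
    return float(s == sp) if a == 0 else 0.5

def Z(a, sp, o):
    return (0.85 if o == sp else 0.15) if a == 0 else 0.5

def random_exploration(b0, n, seed=0):
    rng = random.Random(seed)
    B, b = [], list(b0)
    while len(B) < n:
        a = rng.choice(list(A))
        # Predicted next-state distribution, then P(o | b, a) for each observation.
        pred = [sum(T(s, a, sp) * b[s] for s in S) for sp in S]
        p_obs = [sum(Z(a, sp, o) * pred[sp] for sp in S) for o in O]
        o = rng.choices(list(O), weights=p_obs)[0]
        # Belief update tau(b, a, o).
        b = [Z(a, sp, o) * pred[sp] / p_obs[o] for sp in S]
        B.append(b)
    return B
```

Every belief collected this way is reachable from \(b_0\) under some action sequence, though random actions do not necessarily concentrate on beliefs that a good policy would visit.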
Heuristic Search Value Iteration (HSVI)
while \(\overline{V}(b_0) - \underline{V}(b_0) > \epsilon \)
explore\((b_0, 0)\)
function explore(b, t)
if \(\overline{V}(b) - \underline{V}(b) > \epsilon \gamma^{-t}\)
\(a^* = \underset{a}{\text{argmax}} \; \overline{Q}(b, a)\)
\(o^* = \underset{o}{\text{argmax}} \; P(o \mid b, a^*) \left(\overline{V}(\tau(b, a^*, o)) - \underline{V}(\tau(b, a^*, o)) - \epsilon \gamma^{-(t+1)}\right)\)
explore(\(\tau(b, a^*, o^*), t+1\))
\(\underline{\Gamma} \gets \underline{\Gamma} \cup \{\text{point\_backup}(\underline{\Gamma}, b)\}\)
\(\overline{V}(b) = B_b \left[ \overline{V}(b) \right]\)
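The action and observation choices inside explore can be sketched on their own. In the sketch below the bounds are passed in as plain callables `V_hi` and `V_lo`, and the depth-dependent tolerance is passed in as a number `tol`; these names, the Tiger-style model, and the helper `outcomes` are assumptions for illustration, and the sketch does not maintain or update the bounds.

```python
# Assumed toy model (Tiger): state = tiger door, action 0 = listen, 1/2 = open door 0/1.
S, A, O = range(2), range(3), range(2)
gamma = 0.95

def T(s, a, sp):
    return float(s == sp) if a == 0 else 0.5

def Z(a, sp, o):
    return (0.85 if o == sp else 0.15) if a == 0 else 0.5

def R(s, a):
    if a == 0:
        return -1.0
    return -100.0 if a - 1 == s else 10.0

def outcomes(b, a):
    """Reachable (o, P(o | b, a), tau(b, a, o)) triples."""
    pred = [sum(T(s, a, sp) * b[s] for s in S) for sp in S]
    result = []
    for o in O:
        post = [Z(a, sp, o) * pred[sp] for sp in S]
        p = sum(post)
        if p > 0:
            result.append((o, p, [x / p for x in post]))
    return result

def select(b, V_hi, V_lo, tol):
    """Pick a* by the upper bound, then o* by probability-weighted excess gap."""
    def Q_hi(a):
        r = sum(b[s] * R(s, a) for s in S)
        return r + gamma * sum(p * V_hi(bp) for _, p, bp in outcomes(b, a))
    a_star = max(A, key=Q_hi)
    o_star, _, b_next = max(
        outcomes(b, a_star),
        key=lambda t: t[1] * (V_hi(t[2]) - V_lo(t[2]) - tol))
    return a_star, o_star, b_next
```

Choosing the action optimistically by the upper bound means a suboptimal action is eventually ruled out as its upper bound tightens, while the observation choice steers the search toward successor beliefs where the remaining gap, weighted by how likely they are, is largest.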

Sawtooth Upper Bounds

Successive Approximation of Reachable Space under Optimal Policies

Offline POMDP Algorithms

Policy Graphs

Monte Carlo Value Iteration (MCVI)

180 Offline POMDP Algorithms
By Zachary Sunberg