Offline POMDP Algorithms

Last time: POMDP Value Iteration (horizon \(d\))

\(\Gamma^0 \gets \emptyset\)

for \(n \in 1\ldots d\)

    Construct \(\Gamma^n\) by expanding \(\Gamma^{n-1}\)

    Prune \(\Gamma^n\)
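
A minimal Python sketch of the pruning step, assuming alpha vectors are stored as numpy arrays of length \(|S|\). Exact pruning removes vectors dominated everywhere on the belief simplex (typically with linear programs); the simpler pointwise-dominance check below is a common cheap substitute.

import numpy as np

def prune_pointwise(Gamma):
    """Remove alpha vectors that are pointwise dominated by another vector."""
    kept = []
    for i, alpha in enumerate(Gamma):
        dominated = any(
            j != i and np.all(other >= alpha) and np.any(other > alpha)
            for j, other in enumerate(Gamma)
        )
        if not dominated:
            kept.append(alpha)
    return kept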

Finite Horizon POMDP Value Iteration

Infinite-Horizon POMDP Lower Bound Improvement

\(\Gamma \gets\) blind lower bound

loop

    \(\Gamma \gets \Gamma \cup \text{backup}(\Gamma)\)

    \(\Gamma \gets \text{prune}(\Gamma)\)
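
A minimal Python sketch of the blind lower bound, assuming R[s, a] = \(R(s, a)\) and T[a, s, s'] = \(T(s' \mid s, a)\). Each alpha vector is the value of always taking one fixed action, which lower-bounds the optimal value function.

import numpy as np

def blind_lower_bound(R, T, gamma, iters=100):
    nS, nA = R.shape
    worst = R.min() / (1.0 - gamma)      # pessimistic initial value
    Gamma = []
    for a in range(nA):
        alpha = np.full(nS, worst)
        for _ in range(iters):           # fixed-point iteration for the policy "always take a"
            alpha = R[:, a] + gamma * T[a] @ alpha
        Gamma.append(alpha)
    return Gamma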

Point-Based Value Iteration (PBVI)

function point_backup\((\Gamma, b)\)

    for \(a \in A\)

        for \(o \in O\)

            \(b' \gets \tau(b, a, o)\)

            \(\alpha_{a,o} \gets \underset{\alpha \in \Gamma}{\text{argmax}} \; \alpha^\top b'\)

        for \(s \in S\)

            \(\alpha_a[s] = R(s, a) + \gamma \sum_{s', o} T(s'\mid s, a) \,Z(o \mid a, s') \, \alpha_{a, o}[s']\)

    return \(\underset{\alpha_a}{\text{argmax}} \; \alpha_a^\top b\)
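
A minimal Python sketch of point_backup under the same conventions, additionally assuming Z[a, s', o] = \(Z(o \mid a, s')\) and a belief-update function tau(b, a, o) that returns None for zero-probability observations.

import numpy as np

def point_backup(Gamma, b, R, T, Z, gamma, tau):
    nS, nA = R.shape
    nO = Z.shape[2]
    best_alpha, best_value = None, -np.inf
    for a in range(nA):
        # choose the best existing alpha vector at each updated belief
        alpha_ao = np.zeros((nO, nS))
        for o in range(nO):
            bp = tau(b, a, o)
            if bp is not None:           # impossible observations contribute nothing
                alpha_ao[o] = max(Gamma, key=lambda alpha: alpha @ bp)
        # one-step backup through action a:
        # alpha_a[s] = R(s,a) + gamma * sum_{s',o} T(s'|s,a) Z(o|a,s') alpha_{a,o}[s']
        alpha_a = R[:, a] + gamma * np.einsum("ij,jo,oj->i", T[a], Z[a], alpha_ao)
        if alpha_a @ b > best_value:
            best_alpha, best_value = alpha_a, alpha_a @ b
    return best_alpha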

Original PBVI

\(\Gamma \gets\) blind lower bound

\(B \gets \{b_0\}\)

loop

    for \(b \in B\)

        \(\Gamma \gets \Gamma \cup \{\text{point\_backup}(\Gamma, b)\}\)

    \(B' \gets \emptyset\)

    for \(b \in B\)

        \(\tilde{B} \gets \{\tau(b, a, o) : a \in A, o \in O\}\)

        \(B' \gets B' \cup \left\{\underset{b' \in \tilde{B}}{\text{argmax}} \; \underset{b'' \in B}{\min} \, \lVert b' - b'' \rVert_1\right\}\)

    \(B \gets B \cup B'\)
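
A minimal Python sketch of this belief-set expansion, assuming the same tau helper and measuring the distance from a candidate belief to the current set \(B\) with the 1-norm.

import numpy as np

def expand_beliefs(B, actions, observations, tau):
    new_points = []
    for b in B:
        # all one-step reachable beliefs from b
        candidates = [tau(b, a, o) for a in actions for o in observations]
        candidates = [bp for bp in candidates if bp is not None]
        # keep the candidate farthest from the current belief set
        def dist_to_B(bp):
            return min(np.sum(np.abs(bp - bb)) for bb in B)
        new_points.append(max(candidates, key=dist_to_B))
    return B + new_points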

PERSEUS: Randomly Selected Beliefs

Two Phases:

  1. Random Exploration
  2. Value Backup

 

Random Exploration:

\(B \gets \emptyset\)

\(b \gets b_0\)

loop until \(\lvert B \rvert = n\)

    \(a \gets \text{rand}(A)\)

    \(o \gets \text{rand}(P(o \mid b, a))\)

    \(b \gets \tau(b, a, o)\)

    \(B \gets B \cup \{b\}\)
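
A minimal Python sketch of the random-exploration phase, assuming a helper sample_observation(b, a) that draws \(o \sim P(o \mid b, a)\). Roughly, the value-backup phase (phase 2 above, not spelled out here) then repeatedly applies point_backup at randomly chosen beliefs from B until every belief's value has improved.

import random

def collect_beliefs(b0, actions, sample_observation, tau, n):
    B, b = [], b0
    while len(B) < n:
        a = random.choice(actions)        # random action
        o = sample_observation(b, a)      # o ~ P(o | b, a)
        b = tau(b, a, o)                  # belief update
        B.append(b)
    return B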

Heuristic Search Value Iteration (HSVI)

while \(\overline{V}(b_0) - \underline{V}(b_0) > \epsilon \)

    explore\((b_0, 0)\)

function explore(b, t)

    if \(\overline{V}(b) - \underline{V}(b) > \epsilon \gamma^{-t}\)

        \(a^* = \underset{a}{\text{argmax}} \; \overline{Q}(b, a)\)

        \(o^* = \underset{o}{\text{argmax}} \; P(o \mid b, a^*) \left(\overline{V}(\tau(b, a^*, o)) - \underline{V}(\tau(b, a^*, o)) - \epsilon \gamma^{-(t+1)}\right)\)

        explore(\(\tau(b, a^*, o^*), t+1\))

        \(\underline{\Gamma} \gets \underline{\Gamma} \cup \{\text{point\_backup}(\underline{\Gamma}, b)\}\)

        \(\overline{V}(b) \gets \underset{a}{\max} \; \overline{Q}(b, a)\)
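
A minimal Python sketch of the explore recursion, assuming upper- and lower-bound objects with value(b), Q(b, a), and update(b) methods and a probability helper P_obs(o, b, a); these names are illustrative, not from the slides.

def explore(b, t, Vbar, Vlow, tau, P_obs, actions, observations, gamma, eps):
    if Vbar.value(b) - Vlow.value(b) > eps * gamma ** (-t):
        # act greedily with respect to the upper bound
        a_star = max(actions, key=lambda a: Vbar.Q(b, a))
        # pick the observation with the largest weighted excess gap
        def weighted_excess(o):
            p = P_obs(o, b, a_star)
            if p == 0.0:
                return -float("inf")      # impossible observation: never chosen
            bp = tau(b, a_star, o)
            return p * (Vbar.value(bp) - Vlow.value(bp) - eps * gamma ** (-(t + 1)))
        o_star = max(observations, key=weighted_excess)
        explore(tau(b, a_star, o_star), t + 1,
                Vbar, Vlow, tau, P_obs, actions, observations, gamma, eps)
        # back up both bounds at b on the way out of the recursion
        Vlow.update(b)    # e.g. add point_backup(Gamma_lower, b)
        Vbar.update(b)    # e.g. Bellman/sawtooth update of the upper bound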

 

Sawtooth Upper Bounds

SARSOP

Successive Approximation of Reachable Space under Optimal Policies

Offline POMDP Algorithms

Policy Graphs

Monte Carlo Value Iteration (MCVI)

By Zachary Sunberg