Exact POMDP Solutions:
\(\alpha\)-vectors
Recap
- POMDP
- Belief Updates
\((S, A, O, R, T, Z, \gamma)\)
\(b_t(s) = P(s_t = s \mid h_t)\)
\(b' = \tau (b, a, o)\)
\[b'(s') \propto Z(o \mid a, s') \sum_{s} T(s' \mid s, a) \, b(s)\]
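A minimal code sketch of this update (assuming \(T\) and \(Z\) are available as functions and the belief is an array indexed like \(S\)):

```python
import numpy as np

def update_belief(b, a, o, T, Z, S):
    """Discrete Bayes filter: b'(s') ∝ Z(o | a, s') * sum_s T(s' | s, a) * b(s)."""
    bp = np.array([Z(o, a, sp) * sum(T(sp, s, a) * b[i] for i, s in enumerate(S))
                   for sp in S])
    if bp.sum() == 0.0:
        raise ValueError("observation has zero probability under this belief and action")
    return bp / bp.sum()   # normalize so the new belief sums to one
```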
MDP Sense-Plan-Act Loop
[Diagram: the environment provides the state \(s\) to the policy; the policy returns an action \(a\).]
POMDP Sense-Plan-Act Loop
[Diagram: the environment's true state (e.g., \(s = TL\)) generates an observation (e.g., \(o = TL\)); the agent keeps either Option 1: the history \(h_t = (b_0, a_0, o_1, a_1, \ldots, a_{t-1}, o_{t})\), or Option 2: a belief \(b_t = P(s_t \mid h_t)\) over \(\{TL, TR\}\) maintained by a belief updater; the policy maps \(h\) or \(b\) to an action \(a\).]
Exercise 1: Crying Baby Belief Update
\[S = \{h, \neg h\}\]
\[A = \{f, \neg f\}\]
\[O = \{c, \neg c\}\]
\[R(s)= \begin{cases}-10 \text{ if } s = h \\0 \text{ otherwise}\end{cases}\]
\[R(a)= \begin{cases}-5 \text{ if } a = f \\0 \text{ otherwise}\end{cases}\]
\[R(s, a) = R(s) + R(a)\]
\[T(h \mid h, \neg f) = 1.0\]
\[T(h\mid \neg h, \neg f) = 0.1\]
\[T(\neg h \mid \cdot , f) = 1.0\]
\[Z(c \mid \cdot, h) = 0.8\]
\[Z(c \mid \cdot, \neg h) = 0.1\]
\[\gamma = 0.9\]
\[b'(s') \propto Z(o \mid a, s') \sum_{s} T(s' \mid s, a) \, b(s)\]
Starting from \(b(h) = 0\), calculate \(b'\) with \(a=\neg f\) and \(o = c\).
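For checking your answer, substituting the given model into the update with \(b(h) = 0\), \(a = \neg f\), \(o = c\):
\[\begin{aligned}
b'(h) &\propto Z(c \mid \neg f, h)\left[T(h \mid h, \neg f)\, b(h) + T(h \mid \neg h, \neg f)\, b(\neg h)\right] = 0.8\,(1.0 \cdot 0 + 0.1 \cdot 1) = 0.08\\
b'(\neg h) &\propto Z(c \mid \neg f, \neg h)\left[T(\neg h \mid h, \neg f)\, b(h) + T(\neg h \mid \neg h, \neg f)\, b(\neg h)\right] = 0.1\,(0 \cdot 0 + 0.9 \cdot 1) = 0.09\\
b'(h) &= \frac{0.08}{0.08 + 0.09} \approx 0.47
\end{aligned}\]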
Belief Dynamics
\[b'(s') \propto Z(o \mid a, s') \sum_{s} T(s' \mid s, a) \, b(s)\]
POMDP Sense-Plan-Act Loop
[Diagram repeated from the POMDP sense-plan-act loop slide above.]
Guiding Question
How do we calculate the optimal action in a POMDP?
Solving the Tiger POMDP
One-step utility
Exercise 2: Crying Baby 1-Step Utility
\[S = \{h, \neg h\}\]
\[A = \{f, \neg f\}\]
\[O = \{c, \neg c\}\]
\[R(s)= \begin{cases}-10 \text{ if } s = h \\0 \text{ otherwise}\end{cases}\]
\[R(a)= \begin{cases}-5 \text{ if } a = f \\0 \text{ otherwise}\end{cases}\]
\[R(s, a) = R(s) + R(a)\]
\[T(h \mid h, \neg f) = 1.0\]
\[T(h\mid \neg h, \neg f) = 0.1\]
\[T(\neg h \mid \cdot , f) = 1.0\]
\[Z(c \mid \cdot, h) = 0.8\]
\[Z(c \mid \cdot, \neg h) = 0.1\]
\[\gamma = 0.9\]
Draw the 1-step utility \(\alpha\)-vectors for the Crying Baby problem.
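As a quick check (ordering the states as \((h, \neg h)\)), the two 1-step plans, one per action, give:
\[\alpha_{f} = \begin{bmatrix} R(h, f) \\ R(\neg h, f)\end{bmatrix} = \begin{bmatrix} -15 \\ -5 \end{bmatrix}, \qquad \alpha_{\neg f} = \begin{bmatrix} R(h, \neg f) \\ R(\neg h, \neg f)\end{bmatrix} = \begin{bmatrix} -10 \\ 0 \end{bmatrix}\]
Plotted as lines over \(b(h)\), \(\alpha_{\neg f}\) lies above \(\alpha_{f}\) everywhere, so the 1-step value function is the single line \(\alpha_{\neg f}^\top b\).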
Alpha Vectors for Conditional Plans
Conditional Plans: fixed-depth history-based policies
1 Step: a single action
2 Step: an action, then an action for each observation
Number of depth-\(h\) conditional plans: \[|A|^{\frac{|O|^h - 1}{|O|-1}}\]
For the Tiger POMDP (\(|A| = 3\), \(|O| = 2\), \(h = 2\)): \(3^3 = 27\) two-step plans!
Alpha Vectors for Conditional Plans
For 1-step: \(U^\pi (s) = R(s, \pi())\)
Alpha Vectors for Conditional Plans
For 1-step: \(U^\pi (s) = R(s, \pi())\)
For deeper plans: \(U^\pi (s) = R(s, \pi()) + \gamma \sum_{s'} T(s' \mid s, \pi()) \sum_{o} Z(o \mid \pi(), s') \, U^{\pi(o)}(s')\)
e.g., the 2-step plan "listen, then listen": \(U^\pi (s) = -1+\gamma (-1)\) in both states
e.g., the 2-step plan "listen, then open the door opposite the heard growl":
\(U^\pi (TL) = -1 + 0.95 \left(1 \cdot (0.85 \times 10 + 0.15 \times (-100))\right) = -7.175\)
Alpha Vectors for Conditional Plans
For 1-step: \(U^\pi (s) = R(s, \pi())\)
e.g., the 2-step plan "open one door, then listen": \(U^\pi (TL) = 10 + \gamma(-1)\), \(\quad U^\pi (TR) = -100 + \gamma(-1)\)
e.g., the 2-step plan "listen, then open that same door": \(U^\pi(TL) = -1 + \gamma \, 10\), \(\quad U^\pi(TR) = -1 + \gamma (-100)\)
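Below is a sketch that evaluates a conditional plan's \(\alpha\)-vector with this recursion, using the Tiger numbers on this slide (\(\gamma = 0.95\), listening costs \(1\), hearing is \(85\%\) accurate, opening the tiger door gives \(-100\) and the other door \(+10\)); the state/action/observation labels and the uniform reset after opening a door are my assumed conventions, not necessarily the lecture's:

```python
import numpy as np

S = ["TL", "TR"]                               # tiger behind left / right door
A = ["listen", "open-left", "open-right"]
O = ["hear-left", "hear-right"]
gamma = 0.95

def R(s, a):
    if a == "listen":
        return -1.0
    tiger_side = "left" if s == "TL" else "right"
    opened = a.split("-")[1]
    return -100.0 if opened == tiger_side else 10.0

def T(sp, s, a):
    if a == "listen":                          # listening leaves the tiger where it is
        return 1.0 if sp == s else 0.0
    return 0.5                                 # assumed: opening resets the tiger uniformly

def Z(o, a, sp):
    if a == "listen":                          # hear the correct side with probability 0.85
        correct = "hear-left" if sp == "TL" else "hear-right"
        return 0.85 if o == correct else 0.15
    return 0.5                                 # uninformative after opening a door

def plan_alpha(plan):
    """Alpha vector of a conditional plan given as (action, {observation: subplan})."""
    a, subplans = plan
    sub = {o: plan_alpha(p) for o, p in subplans.items()}   # subplan alpha vectors
    alpha = np.zeros(len(S))
    for i, s in enumerate(S):
        u = R(s, a)
        for j, sp in enumerate(S):
            for o, alpha_o in sub.items():
                u += gamma * T(sp, s, a) * Z(o, a, sp) * alpha_o[j]
        alpha[i] = u
    return alpha

# "listen, then open the door opposite the heard growl"
plan = ("listen", {"hear-left": ("open-right", {}), "hear-right": ("open-left", {})})
print(plan_alpha(plan))    # first entry matches the -7.175 computed above
```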
POMDP Value Functions
\[V^*(b) = \underset{\alpha \in \Gamma}{\max}\, \alpha^\top b\]
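A minimal sketch of evaluating this, assuming \(\Gamma\) is a list of NumPy \(\alpha\)-vectors and \(b\) is a belief vector with the same state ordering:

```python
import numpy as np

def value(b, Gamma):
    """V*(b) = max over alpha vectors of alpha' b (the upper surface)."""
    return max(float(alpha @ b) for alpha in Gamma)

def best_vector(b, Gamma):
    """Index of the maximizing alpha vector; its plan's root action is the greedy action."""
    return int(np.argmax([alpha @ b for alpha in Gamma]))
```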
Exercise: 2-Step Crying Baby \(\alpha\) Vectors
\[S = \{h, \neg h\}\]
\[A = \{f, \neg f\}\]
\[O = \{c, \neg c\}\]
\[R(s)= \begin{cases}-10 \text{ if } s = h \\0 \text{ otherwise}\end{cases}\]
\[R(a)= \begin{cases}-5 \text{ if } a = f \\0 \text{ otherwise}\end{cases}\]
\[R(s, a) = R(s) + R(a)\]
\[T(h \mid h, \neg f) = 1.0\]
\[T(h\mid \neg h, \neg f) = 0.1\]
\[T(\neg h \mid \cdot , f) = 1.0\]
\[Z(c \mid \cdot, h) = 0.8\]
\[Z(c \mid \cdot, \neg h) = 0.1\]
\[\gamma = 0.9\]
\(\alpha\)-Vector Pruning
\[\underset{\delta, b}{\text{maximize}} \quad \delta\]
\[\text{subject to} \quad b \geq 0\]
\[\mathbf{1}^\top b = 1\]
\[\alpha^\top b \geq \alpha'^\top b + \delta \quad \forall \alpha' \in \Gamma\]
If the optimal \(\delta > 0\), \(\alpha\) is not dominated; the optimal \(b\) is sometimes called a "witness".
"Linear Program"
Alpha Vector Expansion
POMDP Value Iteration (horizon \(d\))
\(\Gamma^0 \gets \emptyset\)
for \(n \in 1\ldots d\)
Construct \(\Gamma^n\) by expanding with \(\Gamma^{n-1}\)
Prune \(\Gamma^n\)
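A sketch of this loop in matrix form, reusing `prune` from the LP sketch above (here `R` is \(|S|\times|A|\), `T[a]` is \(|S|\times|S'|\), and `Z[a]` is \(|S'|\times|O|\) as NumPy arrays; these conventions are my own):

```python
import itertools
import numpy as np

def expand(Gamma_prev, R, T, Z, gamma):
    """Build all n-step plan alpha vectors from the (n-1)-step set: one vector per
    action and per assignment of a previous vector (subplan) to each observation."""
    nA = R.shape[1]
    nO = Z[0].shape[1]
    if not Gamma_prev:                                   # depth-1 plans are single actions
        return [R[:, a].copy() for a in range(nA)]
    Gamma = []
    for a in range(nA):
        # enumerate all |Gamma_prev|^|O| subplan assignments
        for sub in itertools.product(Gamma_prev, repeat=nO):
            # alpha(s) = R(s,a) + gamma * sum_{s'} T(s'|s,a) * sum_o Z(o|a,s') * sub_o(s')
            future = sum(Z[a][:, o] * sub[o] for o in range(nO))
            Gamma.append(R[:, a] + gamma * T[a] @ future)
    return Gamma

def pomdp_value_iteration(d, R, T, Z, gamma):
    """Finite-horizon POMDP value iteration: expand then prune for d steps."""
    Gamma = []
    for _ in range(d):
        Gamma = prune(expand(Gamma, R, T, Z, gamma))
    return Gamma
```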
Finite Horizon POMDP Value Iteration
Recap
- A POMDP is an MDP on the ___________
- The value function of a discrete POMDP can be represented by a set of ________________
- Each \(\alpha\) vector corresponds to a _________________
belief space
\(\alpha\)-vectors
conditional plan
Exact POMDP Solutions: Alpha Vectors
By Zachary Sunberg