Exact POMDP Solutions:
\(\alpha\)-vectors
Recap
- POMDP
- Belief Updates
\((S, A, O, R, T, Z, \gamma)\)
\(b_t(s) = P(s_t = s \mid h_t)\)
\(b' = \tau (b, a, o)\)
\[b'(s') \propto Z(o \mid a, s') \sum_{s} T(s' \mid s, a) \, b(s)\]
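A minimal code sketch of this update (assuming \(T\) and \(Z\) are available as functions and the belief is an array indexed like \(S\)):

```python
import numpy as np

def update_belief(b, a, o, T, Z, S):
    """Discrete Bayes filter: b'(s') ∝ Z(o | a, s') * sum_s T(s' | s, a) * b(s)."""
    bp = np.array([Z(o, a, sp) * sum(T(sp, s, a) * b[i] for i, s in enumerate(S))
                   for sp in S])
    if bp.sum() == 0.0:
        raise ValueError("observation has zero probability under this belief and action")
    return bp / bp.sum()   # normalize so the new belief sums to one
```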
MDP Sense-Plan-Act Loop
[Diagram: the environment provides the state \(s\) to the policy; the policy returns an action \(a\).]
POMDP Sense-Plan-Act Loop
[Diagram: the environment's true state (e.g., \(s = TL\)) generates an observation (e.g., \(o = TL\)); the agent keeps either Option 1: the history \(h_t = (b_0, a_0, o_1, a_1, \ldots, a_{t-1}, o_{t})\), or Option 2: a belief \(b_t = P(s_t \mid h_t)\) over \(\{TL, TR\}\) maintained by a belief updater; the policy maps \(h\) or \(b\) to an action \(a\).]
Exercise 1: Crying Baby Belief Update
\[S = \{h, \neg h\}\]
\[A = \{f, \neg f\}\]
\[O = \{c, \neg c\}\]
\[R(s)= \begin{cases}-10 \text{ if } s = h \\0 \text{ otherwise}\end{cases}\]
\[R(a)= \begin{cases}-5 \text{ if } a = f \\0 \text{ otherwise}\end{cases}\]
\[R(s, a) = R(s) + R(a)\]
\[T(h \mid h, \neg f) = 1.0\]
\[T(h\mid \neg h, \neg f) = 0.1\]
\[T(\neg h \mid \cdot , f) = 1.0\]
\[Z(c \mid \cdot, h) = 0.8\]
\[Z(c \mid \cdot, \neg h) = 0.1\]
\[\gamma = 0.9\]
\[b'(s') \propto Z(o \mid a, s') \sum_{s} T(s' \mid s, a) \, b(s)\]
Starting from \(b(h) = 0\), calculate \(b'\) with \(a=\neg f\) and \(o = c\).
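For checking your answer, substituting the given model into the update with \(b(h) = 0\), \(a = \neg f\), \(o = c\):
\[\begin{aligned}
b'(h) &\propto Z(c \mid \neg f, h)\left[T(h \mid h, \neg f)\, b(h) + T(h \mid \neg h, \neg f)\, b(\neg h)\right] = 0.8\,(1.0 \cdot 0 + 0.1 \cdot 1) = 0.08\\
b'(\neg h) &\propto Z(c \mid \neg f, \neg h)\left[T(\neg h \mid h, \neg f)\, b(h) + T(\neg h \mid \neg h, \neg f)\, b(\neg h)\right] = 0.1\,(0 \cdot 0 + 0.9 \cdot 1) = 0.09\\
b'(h) &= \frac{0.08}{0.08 + 0.09} \approx 0.47
\end{aligned}\]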
Belief Dynamics
\[b'(s') \propto Z(o \mid a, s') \sum_{s} T(s' \mid s, a) \, b(s)\]
POMDP Sense-Plan-Act Loop
[Diagram repeated from the POMDP sense-plan-act loop slide above.]
Guiding Question
How do we calculate the optimal action in a POMDP?
Solving the Tiger POMDP
One-step utility
Exercise 2: Crying Baby 1-Step Utility
\[S = \{h, \neg h\}\]
\[A = \{f, \neg f\}\]
\[O = \{c, \neg c\}\]
\[R(s)= \begin{cases}-10 \text{ if } s = h \\0 \text{ otherwise}\end{cases}\]
\[R(a)= \begin{cases}-5 \text{ if } a = f \\0 \text{ otherwise}\end{cases}\]
\[R(s, a) = R(s) + R(a)\]
\[T(h \mid h, \neg f) = 1.0\]
\[T(h\mid \neg h, \neg f) = 0.1\]
\[T(\neg h \mid \cdot , f) = 1.0\]
\[Z(c \mid \cdot, h) = 0.8\]
\[Z(c \mid \cdot, \neg h) = 0.1\]
\[\gamma = 0.9\]
Draw the 1-step utility \(\alpha\)-vectors for the Crying Baby problem.
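As a quick check (ordering the states as \((h, \neg h)\)), the two 1-step plans, one per action, give:
\[\alpha_{f} = \begin{bmatrix} R(h, f) \\ R(\neg h, f)\end{bmatrix} = \begin{bmatrix} -15 \\ -5 \end{bmatrix}, \qquad \alpha_{\neg f} = \begin{bmatrix} R(h, \neg f) \\ R(\neg h, \neg f)\end{bmatrix} = \begin{bmatrix} -10 \\ 0 \end{bmatrix}\]
Plotted as lines over \(b(h)\), \(\alpha_{\neg f}\) lies above \(\alpha_{f}\) everywhere, so the 1-step value function is the single line \(\alpha_{\neg f}^\top b\).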
Alpha Vectors for Conditional Plans
Conditional Plans: fixed-depth history-based policies
1 Step: a single action
2 Step: an action, then an action for each observation
Number of depth-\(h\) conditional plans: \[|A|^{\frac{|O|^h - 1}{|O|-1}}\]
For the Tiger POMDP (\(|A| = 3\), \(|O| = 2\), \(h = 2\)): \(3^3 = 27\) two-step plans!
Alpha Vectors for Conditional Plans
For 1-step: \(U^\pi (s) = R(s, \pi())\)
Alpha Vectors for Conditional Plans
For 1-step: \(U^\pi (s) = R(s, \pi())\)
For deeper plans: \(U^\pi (s) = R(s, \pi()) + \gamma \sum_{s'} T(s' \mid s, \pi()) \sum_{o} Z(o \mid \pi(), s') \, U^{\pi(o)}(s')\)
e.g., the 2-step plan "listen, then listen": \(U^\pi (s) = -1+\gamma (-1)\) in both states
e.g., the 2-step plan "listen, then open the door opposite the heard growl":
\(U^\pi (TL) = -1 + 0.95 \left(1 \cdot (0.85 \times 10 + 0.15 \times (-100))\right) = -7.175\)
Alpha Vectors for Conditional Plans
For 1-step: \(U^\pi (s) = R(s, \pi())\)
e.g., the 2-step plan "open one door, then listen": \(U^\pi (TL) = 10 + \gamma(-1)\), \(\quad U^\pi (TR) = -100 + \gamma(-1)\)
e.g., the 2-step plan "listen, then open that same door": \(U^\pi(TL) = -1 + \gamma \, 10\), \(\quad U^\pi(TR) = -1 + \gamma (-100)\)
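Below is a sketch that evaluates a conditional plan's \(\alpha\)-vector with this recursion, using the Tiger numbers on this slide (\(\gamma = 0.95\), listening costs \(1\), hearing is \(85\%\) accurate, opening the tiger door gives \(-100\) and the other door \(+10\)); the state/action/observation labels and the uniform reset after opening a door are my assumed conventions, not necessarily the lecture's:

```python
import numpy as np

S = ["TL", "TR"]                               # tiger behind left / right door
A = ["listen", "open-left", "open-right"]
O = ["hear-left", "hear-right"]
gamma = 0.95

def R(s, a):
    if a == "listen":
        return -1.0
    tiger_side = "left" if s == "TL" else "right"
    opened = a.split("-")[1]
    return -100.0 if opened == tiger_side else 10.0

def T(sp, s, a):
    if a == "listen":                          # listening leaves the tiger where it is
        return 1.0 if sp == s else 0.0
    return 0.5                                 # assumed: opening resets the tiger uniformly

def Z(o, a, sp):
    if a == "listen":                          # hear the correct side with probability 0.85
        correct = "hear-left" if sp == "TL" else "hear-right"
        return 0.85 if o == correct else 0.15
    return 0.5                                 # uninformative after opening a door

def plan_alpha(plan):
    """Alpha vector of a conditional plan given as (action, {observation: subplan})."""
    a, subplans = plan
    sub = {o: plan_alpha(p) for o, p in subplans.items()}   # subplan alpha vectors
    alpha = np.zeros(len(S))
    for i, s in enumerate(S):
        u = R(s, a)
        for j, sp in enumerate(S):
            for o, alpha_o in sub.items():
                u += gamma * T(sp, s, a) * Z(o, a, sp) * alpha_o[j]
        alpha[i] = u
    return alpha

# "listen, then open the door opposite the heard growl"
plan = ("listen", {"hear-left": ("open-right", {}), "hear-right": ("open-left", {})})
print(plan_alpha(plan))    # first entry matches the -7.175 computed above
```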
POMDP Value Functions
\[V^*(b) = \underset{\alpha \in \Gamma}{\max}\, \alpha^\top b\]
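A minimal sketch of evaluating this, assuming \(\Gamma\) is a list of NumPy \(\alpha\)-vectors and \(b\) is a belief vector with the same state ordering:

```python
import numpy as np

def value(b, Gamma):
    """V*(b) = max over alpha vectors of alpha' b (the upper surface)."""
    return max(float(alpha @ b) for alpha in Gamma)

def best_vector(b, Gamma):
    """Index of the maximizing alpha vector; its plan's root action is the greedy action."""
    return int(np.argmax([alpha @ b for alpha in Gamma]))
```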
Exercise: 2-Step Crying Baby \(\alpha\) Vectors
\[S = \{h, \neg h\}\]
\[A = \{f, \neg f\}\]
\[O = \{c, \neg c\}\]
\[R(s)= \begin{cases}-10 \text{ if } s = h \\0 \text{ otherwise}\end{cases}\]
\[R(a)= \begin{cases}-5 \text{ if } a = f \\0 \text{ otherwise}\end{cases}\]
\[R(s, a) = R(s) + R(a)\]
\[T(h \mid h, \neg f) = 1.0\]
\[T(h\mid \neg h, \neg f) = 0.1\]
\[T(\neg h \mid \cdot , f) = 1.0\]
\[Z(c \mid \cdot, h) = 0.8\]
\[Z(c \mid \cdot, \neg h) = 0.1\]
\[\gamma = 0.9\]
\(\alpha\)-Vector Pruning
\[\underset{\delta, b}{\text{maximize}} \quad \delta\]
\[\text{subject to} \quad b \geq 0\]
\[\mathbf{1}^\top b = 1\]
\[\alpha^\top b \geq \alpha'^\top b + \delta \quad \forall \alpha' \in \Gamma\]
If the optimal \(\delta > 0\), \(\alpha\) is not dominated; the optimal \(b\) is sometimes called a "witness".
"Linear Program"
Alpha Vector Expansion
POMDP Value Iteration (horizon \(d\))
\(\Gamma^0 \gets \emptyset\)
for \(n \in 1\ldots d\)
Construct \(\Gamma^n\) by expanding with \(\Gamma^{n-1}\)
Prune \(\Gamma^n\)
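A sketch of this loop in matrix form, reusing `prune` from the LP sketch above (here `R` is \(|S|\times|A|\), `T[a]` is \(|S|\times|S'|\), and `Z[a]` is \(|S'|\times|O|\) as NumPy arrays; these conventions are my own):

```python
import itertools
import numpy as np

def expand(Gamma_prev, R, T, Z, gamma):
    """Build all n-step plan alpha vectors from the (n-1)-step set: one vector per
    action and per assignment of a previous vector (subplan) to each observation."""
    nA = R.shape[1]
    nO = Z[0].shape[1]
    if not Gamma_prev:                                   # depth-1 plans are single actions
        return [R[:, a].copy() for a in range(nA)]
    Gamma = []
    for a in range(nA):
        # enumerate all |Gamma_prev|^|O| subplan assignments
        for sub in itertools.product(Gamma_prev, repeat=nO):
            # alpha(s) = R(s,a) + gamma * sum_{s'} T(s'|s,a) * sum_o Z(o|a,s') * sub_o(s')
            future = sum(Z[a][:, o] * sub[o] for o in range(nO))
            Gamma.append(R[:, a] + gamma * T[a] @ future)
    return Gamma

def pomdp_value_iteration(d, R, T, Z, gamma):
    """Finite-horizon POMDP value iteration: expand then prune for d steps."""
    Gamma = []
    for _ in range(d):
        Gamma = prune(expand(Gamma, R, T, Z, gamma))
    return Gamma
```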
Finite Horizon POMDP Value Iteration
Recap
- A POMDP is an MDP on the ___________
- The value function of a discrete POMDP can be represented by a set of ________________
- Each \(\alpha\) vector corresponds to a _________________
belief space
\(\alpha\)-vectors
conditional plan
Exact POMDP Solutions: Alpha Vectors
By Zachary Sunberg