\((S, A, O, R, T, Z, \gamma)\)
\(b_t(s) = P(s_t = s \mid h_t)\)
\(b' = \tau (b, a, o)\)
\[b'(s') \propto Z(o \mid a, s') \sum_{s} T(s' \mid s, a) \, b(s)\]
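The update above can be sketched in a few lines of plain Python. This is a minimal illustration rather than a reference implementation: the function name `update_belief`, the dictionary-based belief, and the callables `T` and `Z` are assumptions made for this example. The numbers are the Crying Baby values used in these notes, with the complementary probabilities filled in (for example \(T(\neg h \mid \neg h, \neg f) = 0.9\)).

```python
def update_belief(b, a, o, T, Z):
    """Return b' with b'(s') proportional to Z(o | a, s') * sum_s T(s' | s, a) * b(s).

    b maps each state to its probability; T(sp, s, a) and Z(o, a, sp) are callables.
    """
    unnormalized = {
        sp: Z(o, a, sp) * sum(T(sp, s, a) * b[s] for s in b)
        for sp in b
    }
    total = sum(unnormalized.values())
    if total == 0.0:
        raise ValueError("observation has zero probability under this belief and action")
    return {sp: w / total for sp, w in unnormalized.items()}


# Crying Baby model (state names "h" / "not_h" are labels chosen for this sketch).
def T(sp, s, a):
    if a == "f":
        p_h = 0.0                      # feeding always leads to "not hungry"
    else:
        p_h = 1.0 if s == "h" else 0.1
    return p_h if sp == "h" else 1.0 - p_h


def Z(o, a, sp):
    p_c = 0.8 if sp == "h" else 0.1    # probability of crying given the next state
    return p_c if o == "c" else 1.0 - p_c


b0 = {"h": 0.0, "not_h": 1.0}
b1 = update_belief(b0, "not_f", "c", T, Z)
print(b1)  # b1["h"] is roughly 0.47
```

The normalization step is what turns the proportionality into an equality; the zero check guards against conditioning on an observation the model says is impossible.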
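Figure: in an MDP, the environment sends the state \(s\) to the policy, which returns an action \(a\). In a POMDP, the environment (true state, e.g. \(s = TL\)) sends only an observation (e.g. \(o = TL\)), and the policy chooses \(a\) from one of two inputs:
Option 1: History \(h\), with \(h_t = (b_0, a_0, o_1, a_1, \ldots, a_{t-1}, o_{t})\)
Option 2: Belief Updater output \(b\), with \(b_t = P(s_t \mid h_t)\), a distribution over the states \(TL\) and \(TR\)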
\[S = \{h, \neg h\}\]
\[A = \{f, \neg f\}\]
\[O = \{c, \neg c\}\]
\[R(s) = \begin{cases} -10 & \text{if } s = h \\ 0 & \text{otherwise} \end{cases}\]
\[R(a) = \begin{cases} -5 & \text{if } a = f \\ 0 & \text{otherwise} \end{cases}\]
\[R(s, a) = R(s) + R(a)\]
\[T(h \mid h, \neg f) = 1.0\]
\[T(h\mid \neg h, \neg f) = 0.1\]
\[T(\neg h \mid \cdot , f) = 1.0\]
\[Z(c \mid \cdot, h) = 0.8\]
\[Z(c \mid \cdot, \neg h) = 0.1\]
\[\gamma = 0.9\]
\[b'(s') \propto Z(o \mid a, s') \sum_{s} T(s' \mid s, a) \, b(s)\]
Starting from the belief \(b(h) = 0\), calculate \(b'\) after taking \(a = \neg f\) and observing \(o = c\).
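One way to work this exercise, using the transition and observation probabilities listed above and assuming the complementary entry \(T(\neg h \mid \neg h, \neg f) = 0.9\). First predict the next state, then weight by the observation likelihood, then normalize:
\[\sum_{s} T(h \mid s, \neg f) \, b(s) = 1.0 \cdot 0 + 0.1 \cdot 1 = 0.1, \qquad \sum_{s} T(\neg h \mid s, \neg f) \, b(s) = 0 \cdot 0 + 0.9 \cdot 1 = 0.9\]
\[b'(h) \propto Z(c \mid \neg f, h) \cdot 0.1 = 0.8 \cdot 0.1 = 0.08, \qquad b'(\neg h) \propto Z(c \mid \neg f, \neg h) \cdot 0.9 = 0.1 \cdot 0.9 = 0.09\]
\[b'(h) = \frac{0.08}{0.08 + 0.09} \approx 0.47, \qquad b'(\neg h) = \frac{0.09}{0.08 + 0.09} \approx 0.53\]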
How do we calculate the optimal action in a POMDP?
Draw the 1-step utility \(\alpha\)-vectors for the Crying Baby problem.
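As a sketch of the answer: a one-step plan consists of a single action, so each \(\alpha\)-vector simply lists \(R(s, a)\) over the states. The ordering \((h, \neg h)\) of the components is a choice made for this illustration.
\[\alpha_{f} = \begin{bmatrix} R(h, f) \\ R(\neg h, f) \end{bmatrix} = \begin{bmatrix} -15 \\ -5 \end{bmatrix}, \qquad \alpha_{\neg f} = \begin{bmatrix} R(h, \neg f) \\ R(\neg h, \neg f) \end{bmatrix} = \begin{bmatrix} -10 \\ 0 \end{bmatrix}\]
Plotted against \(b(h)\), \(\alpha_{\neg f}\) is the line from \(0\) at \(b(h) = 0\) to \(-10\) at \(b(h) = 1\), and \(\alpha_{f}\) is the parallel line from \(-5\) to \(-15\). With only one step, not feeding is therefore better at every belief.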
1 Step:
2 Step:
\[|A|^{\frac{|O|^h - 1}{|O| - 1}}\]
27 two-step plans!
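The count can be checked with a short Python helper. The function name and arguments are hypothetical, introduced only for this sketch; the exponent is the number of nodes in a plan tree of depth \(h\) with \(|O|\) branches per node. For instance, three actions and two observations over two steps give the 27 plans quoted above.

```python
def count_conditional_plans(num_actions, num_observations, horizon):
    """Number of distinct conditional plans of the given depth.

    A plan tree has 1 + |O| + ... + |O|^(h-1) nodes, and each node
    independently picks one of |A| actions.
    """
    if num_observations == 1:
        nodes = horizon  # the geometric series degenerates to h terms of 1
    else:
        nodes = (num_observations ** horizon - 1) // (num_observations - 1)
    return num_actions ** nodes


for h in range(1, 5):
    print(h, count_conditional_plans(3, 2, h))
```

The doubly exponential growth in the horizon is why enumerating every plan quickly becomes impractical.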
For a 1-step plan: \(U^\pi (s) = R(s, \pi())\), where \(\pi()\) is the action at the root of the plan.
\[U^\pi (s) = -1+\gamma (-1)\]
\(U^\pi (TL) = -1 + 0.95 \, (1 \, (0.85 \times 10 + 0.15 \times (-100)))\)
\(= -7.175\)
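Expanding the arithmetic step by step, using only the numbers in the expression above:
\[-1 + 0.95 \, (0.85 \times 10 + 0.15 \times (-100)) = -1 + 0.95 \, (8.5 - 15) = -1 + 0.95 \times (-6.5) = -1 - 6.175 = -7.175\]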
\(U^\pi (TL) = 10 + \gamma(-1)\)
\(U^\pi(TL) = -1 + \gamma \, 10\)
\(U^\pi(TR) = -1 + \gamma ( -100)\)
\(U^\pi (TR) = -100 + \gamma(-1)\)
\[V^*(b) = \underset{\alpha \in \Gamma}{\max}\, \alpha^\top b\]
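Evaluating this maximum is a dot product per \(\alpha\)-vector. The sketch below is illustrative only: the helper names are hypothetical, and the two vectors are two-step Crying Baby values worked out by hand for this example (components ordered as \((h, \neg h)\)), one for a plan that feeds first and one for a plan that never feeds.

```python
def dot(alpha, b):
    """Inner product alpha^T b for equal-length sequences."""
    return sum(a_s * b_s for a_s, b_s in zip(alpha, b))


def best_alpha(b, gamma_set):
    """Return (V(b), label), where V(b) is the max over alpha of alpha^T b."""
    label = max(gamma_set, key=lambda k: dot(gamma_set[k], b))
    return dot(gamma_set[label], b), label


# Assumed example vectors, keyed by a description of the plan they belong to.
GAMMA_SET = {
    "feed_first": (-15.0, -5.0),
    "never_feed": (-19.0, -0.9),
}

for p_hungry in (0.0, 0.5, 1.0):
    b = (p_hungry, 1.0 - p_hungry)
    print(p_hungry, best_alpha(b, GAMMA_SET))
```

Because each \(\alpha^\top b\) is linear in \(b\), the maximum over the set is piecewise linear and convex, and the maximizing vector identifies which plan to follow at that belief.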
\[\underset{\delta, b}{\text{maximize}} \quad \delta\]
\[\text{subject to} \quad b \geq 0\]
\[\mathbf{1}^\top b = 1\]
\[\alpha^\top b \geq \alpha'^\top b + \delta \quad \forall \alpha' \in \Gamma\]
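If the optimal \(\delta\) is negative, \(\alpha\) is dominated by the other vectors in \(\Gamma\) and can be pruned; otherwise \(\alpha\) is not dominated, and the maximizing \(b\) is sometimes called a "witness".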
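For a two-state problem the belief is a single number \(p = b(s_1)\), so this program can be solved without an LP solver: the objective is the minimum of a set of lines in \(p\), and its maximum over \([0, 1]\) lies at an endpoint or where two lines cross. The sketch below uses that special case only; the function name is hypothetical, and a problem with more states would need a general linear-programming routine instead.

```python
from itertools import combinations


def utility_gap(alpha, others):
    """Two-state domination check.

    Maximizes delta subject to alpha^T b >= alpha'^T b + delta for every
    alpha' in `others`, with b = (p, 1 - p). Returns (delta, witness belief).
    """
    if not others:
        raise ValueError("need at least one competing alpha vector")
    # Each competitor yields a line in p: gap(p) = intercept + slope * p.
    lines = []
    for other in others:
        d0 = alpha[0] - other[0]   # gap when p = 1
        d1 = alpha[1] - other[1]   # gap when p = 0
        lines.append((d0 - d1, d1))

    # Candidate optima: interval endpoints and pairwise line intersections.
    candidates = [0.0, 1.0]
    for (m1, c1), (m2, c2) in combinations(lines, 2):
        if m1 != m2:
            p = (c2 - c1) / (m1 - m2)
            if 0.0 <= p <= 1.0:
                candidates.append(p)

    def gap(p):
        return min(m * p + c for m, c in lines)

    best_p = max(candidates, key=gap)
    return gap(best_p), (best_p, 1.0 - best_p)


# One-step Crying Baby vectors, ordered (h, not h).
print(utility_gap((-10.0, 0.0), [(-15.0, -5.0)]))  # positive gap: keep
print(utility_gap((-15.0, -5.0), [(-10.0, 0.0)]))  # negative gap: prune
```

Note that a vector can be dominated by a combination of others without any single one beating it everywhere, which is why the check is posed over beliefs rather than component by component.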
"Linear Program"
\(\Gamma^0 \gets \emptyset\)
for \(n \in 1, \ldots, d\)
Construct \(\Gamma^n\) by expanding with \(\Gamma^{n-1}\)
Prune \(\Gamma^n\)
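The loop above can be sketched for the Crying Baby problem as follows. Several things here are assumptions made for illustration: the function names, the string labels for states, actions, and observations, the treatment of the empty \(\Gamma^0\) as "no continuation value", and the pruning rule, which only removes vectors that are pointwise dominated by a single other vector (simpler and weaker than LP-based pruning).

```python
from itertools import product

GAMMA = 0.9
STATES = ("h", "not_h")
ACTIONS = ("f", "not_f")
OBS = ("c", "not_c")


def R(s, a):
    return (-10.0 if s == "h" else 0.0) + (-5.0 if a == "f" else 0.0)


def T(sp, s, a):
    if a == "f":
        p_h = 0.0
    else:
        p_h = 1.0 if s == "h" else 0.1
    return p_h if sp == "h" else 1.0 - p_h


def Z(o, a, sp):
    p_c = 0.8 if sp == "h" else 0.1
    return p_c if o == "c" else 1.0 - p_c


def expand(gamma_prev):
    """Build the next set of alpha vectors: one per (action, observation -> alpha) choice."""
    if not gamma_prev:
        # With no continuation plans, a one-step plan is just the immediate reward.
        return [tuple(R(s, a) for s in STATES) for a in ACTIONS]
    new = []
    for a in ACTIONS:
        for choice in product(gamma_prev, repeat=len(OBS)):
            alpha = []
            for s in STATES:
                future = sum(
                    T(sp, s, a) * Z(o, a, sp) * choice[k][j]
                    for j, sp in enumerate(STATES)
                    for k, o in enumerate(OBS)
                )
                alpha.append(R(s, a) + GAMMA * future)
            new.append(tuple(alpha))
    return new


def prune(vectors):
    """Drop duplicates and vectors pointwise dominated by another single vector."""
    unique = list(dict.fromkeys(vectors))
    return [
        v for v in unique
        if not any(w != v and all(wi >= vi for wi, vi in zip(w, v)) for w in unique)
    ]


def value_iteration(depth):
    gamma_set = []
    for _ in range(depth):
        gamma_set = prune(expand(gamma_set))
    return gamma_set


print(value_iteration(1))
print(value_iteration(2))
```

Pruning after every expansion matters because the expansion step multiplies the number of candidate vectors by \(|A| \, |\Gamma^{n-1}|^{|O|}\) at each iteration.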
belief space
\(\alpha\)-vectors
conditional plan