How do we reason about the future consequences of actions in an MDP?
What are the basic algorithms for solving MDPs?
Does value iteration always converge?
Is the value function unique?
Can there be multiple optimal policies?
Is there always a deterministic optimal policy?
Algorithm: Value Iteration
\(V \gets 0\) (or any initial guess)
\(V' \gets B[V]\)
while \(\lVert V - V' \rVert_\infty > \epsilon\)
\(V \gets V'\)
\(V' \gets B[V]\)
return \(V'\)
\[B[V](s) = \max_{a \in A}\left(R(s, a) + \gamma \sum_{s' \in S} T(s'|s,a) V(s')\right)\]
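As a concrete sketch, value iteration with this backup can be written in a few lines for a small tabular MDP (the transition and reward arrays below are made up purely for illustration, not from the notes):

```python
import numpy as np

def value_iteration(T, R, gamma, eps=1e-8):
    """Tabular value iteration.
    T[s, a, s2] = T(s2 | s, a); R[s, a] = R(s, a)."""
    S, _ = R.shape
    V = np.zeros(S)
    while True:
        # Bellman backup: B[V](s) = max_a (R(s,a) + gamma * sum_s2 T(s2|s,a) V(s2))
        V_new = (R + gamma * T @ V).max(axis=1)
        if np.max(np.abs(V_new - V)) <= eps:
            return V_new
        V = V_new

# Illustrative 2-state, 2-action MDP
T = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.7, 0.3], [0.1, 0.9]]])
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])
V_star = value_iteration(T, R, gamma=0.9)
```

At convergence the returned \(V\) satisfies \(V = B[V]\) up to the tolerance \(\epsilon\).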
Theorem 1: Let \(\{V_1, \ldots, V_\infty\}\) be a sequence of value functions for a discrete MDP generated by the recurrence \(V_{k+1} = B[V_k]\). If \(\gamma < 1\), then \(\lim_{k\to\infty} V_k = V^*\).
Definition: Let \(M\) be a set. A metric on \(M\) is a function \(d: M \times M \to [0, \infty)\) which satisfies the following three conditions for all \(x, y, z \in M\):
1. \(d(x, y) = 0\) if and only if \(x = y\)
2. \(d(x, y) = d(y, x)\) (symmetry)
3. \(d(x, z) \leq d(x, y) + d(y, z)\) (triangle inequality)
Definition: A contraction mapping on metric space \((M, d)\) is a function \(f: M \to M\) satisfying
\[d(f(x), f(y)) \leq \alpha \, d(x, y)\]
for some \(\alpha\) with \(0 \leq \alpha < 1\) and all \(x\) and \(y\) in \(M\).
Definition: \(x^*\) is said to be a fixed point of \(f\) if \(f(x^*) = x^*\).
Script: contraction_mapping.jl
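The script itself isn't reproduced in these notes; a minimal Python sketch of the same idea (fixed-point iteration of a contraction) might look like the following, using \(f(x) = \cos x\), which is a contraction on \([0, 1]\) with modulus \(\alpha = \sin 1 \approx 0.84\):

```python
import math

def fixed_point_iterate(f, x0, n):
    """Apply f repeatedly, starting from x0."""
    x = x0
    for _ in range(n):
        x = f(x)
    return x

# cos maps [0, 1] into itself and |cos'(x)| = sin(x) <= sin(1) < 1 there,
# so by Banach's theorem the iterates converge to the unique fixed point,
# i.e. the solution of cos(x) = x (approximately 0.739).
x_star = fixed_point_iterate(math.cos, 0.5, 200)
```

After enough iterations, \(x^*\) satisfies \(\cos(x^*) = x^*\) to machine precision.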
Theorem (Banach): If \(f\) is a contraction mapping on a complete metric space \((M, d)\), then
1. \(f\) has a unique fixed point \(x^* \in M\), and
2. for every \(x_0 \in M\), the sequence defined by \(x_{k+1} = f(x_k)\) converges to \(x^*\).
Lemma 1: \(\left( \mathbb{R}^{|S|}, \lVert \cdot \rVert_{\infty}\right)\) is a metric space.
Proof:
Non-negativity, symmetry, and \(\lVert x - y \rVert_\infty = 0 \iff x = y\) are immediate from \(\lVert x - y \rVert_{\infty} = \max_i |x_i - y_i|\). For the triangle inequality:
\(\lVert x - z \rVert_\infty = \max_i |x_i - z_i|\)
\(\leq \max_i \left(|x_i - y_i| + |y_i - z_i|\right)\)
\(\leq \max_i |x_i - y_i| + \max_i |y_i - z_i|\)
\(= \lVert x - y \rVert_\infty + \lVert y - z \rVert_\infty\)
Lemma 2: \(B\) is a \(\gamma\)-contraction mapping on \((\mathbb{R}^{|S|}, \lVert \cdot \rVert_\infty)\).
Proof:
Note: \(\left|\max_a f(a) - \max_a g(a)\right| \leq \max_a \left|f(a) - g(a)\right|\)
\[\lVert B[V_1] - B[V_2] \rVert_\infty = \max_{s\in S}\left|B[V_1](s) - B[V_2](s)\right|\]
\[= \max_{s \in S} \left| \max_{a \in A} \left(R(s, a) + \gamma \sum_{s'\in S} T(s'|s,a) V_1(s')\right) - \max_{a \in A} \left(R(s, a) + \gamma \sum_{s'\in S} T(s'|s,a) V_2(s')\right)\right|\]
\[\leq \max_{s \in S} \max_{a \in A} \left| R(s, a) + \gamma \sum_{s'\in S} T(s'|s,a) V_1(s') - R(s, a) - \gamma \sum_{s'\in S} T(s'|s,a) V_2(s')\right|\]
\[= \max_{s \in S, a \in A} \left|\gamma \sum_{s'\in S} T(s'|s,a) \left(V_1(s') - V_2(s')\right)\right|\]
\[\leq \max_{s \in S, a \in A} \gamma \sum_{s'\in S} T(s'|s,a) \left| V_1(s') - V_2(s')\right|\]
\[\leq \max_{s \in S, a \in A} \gamma \sum_{s'\in S} T(s'|s,a) \lVert V_1 - V_2\rVert_\infty\]
\[= \gamma \lVert V_1 - V_2\rVert_\infty \max_{s \in S, a \in A} \sum_{s'\in S} T(s'|s,a)\]
\[= \gamma \lVert V_1 - V_2\rVert_\infty\]
The last equality holds because \(\sum_{s'} T(s'|s,a) = 1\) for every \((s, a)\).
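Lemma 2 is easy to sanity-check numerically: on a randomly generated MDP, the backed-up distance between any two value functions should shrink by at least a factor of \(\gamma\) (a sketch; the MDP here is random, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma = 4, 3, 0.9

# Random MDP: normalize so each T[s, a, :] is a probability distribution
T = rng.random((S, A, S))
T /= T.sum(axis=2, keepdims=True)
R = rng.random((S, A))

def B(V):
    """Bellman backup B[V](s) = max_a (R(s,a) + gamma * sum_s2 T(s2|s,a) V(s2))."""
    return (R + gamma * T @ V).max(axis=1)

# ||B[V1] - B[V2]||_inf should never exceed gamma * ||V1 - V2||_inf
worst_ratio = 0.0
for _ in range(1000):
    V1, V2 = rng.normal(size=S), rng.normal(size=S)
    ratio = np.max(np.abs(B(V1) - B(V2))) / np.max(np.abs(V1 - V2))
    worst_ratio = max(worst_ratio, ratio)
print(worst_ratio <= gamma + 1e-12)   # True
```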
Theorem 1: Let \(\{V_1, \ldots, V_\infty\}\) be a sequence of value functions for a discrete MDP generated by the recurrence \(V_{k+1} = B[V_k]\). If \(\gamma < 1\), then \(\lim_{k\to\infty} V_k = V^*\).
Proof:
By Lemma 2 and Banach's theorem (part 2), repeated application of the Bellman operator converges to a limit \(\hat{V}\). (Note that \(\mathbb{R}^{|S|}\) is complete under \(\lVert \cdot \rVert_\infty\), so Banach's theorem applies.)
By Banach's theorem (part 1), \(\hat{V} = B[\hat{V}]\). Since \(\hat{V}\) satisfies Bellman's equation, it is optimal and \(\hat{V} = V^*\).
Theorem: Policy iteration converges to an optimal policy for a finite MDP in finite time.
Proof (sketch):
Policy evaluation computes \(V^\pi\) exactly. Policy improvement produces a policy \(\pi'\) with \(V^{\pi'} \geq V^\pi\) componentwise, with strict improvement in at least one state unless \(\pi\) is already optimal. A finite MDP has at most \(|A|^{|S|}\) deterministic policies, so the strictly improving sequence must terminate, and it can only terminate at an optimal policy.
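A sketch of policy iteration itself, with exact policy evaluation via a linear solve (again on a made-up 2-state MDP, purely for illustration):

```python
import numpy as np

def policy_iteration(T, R, gamma):
    """Tabular policy iteration. T[s, a, s2] = T(s2|s,a); R[s, a] = R(s, a)."""
    S, _ = R.shape
    pi = np.zeros(S, dtype=int)             # arbitrary initial deterministic policy
    while True:
        # Policy evaluation: solve (I - gamma * T_pi) V = R_pi exactly for V = V^pi
        T_pi = T[np.arange(S), pi]          # (S, S) transition matrix under pi
        R_pi = R[np.arange(S), pi]
        V = np.linalg.solve(np.eye(S) - gamma * T_pi, R_pi)
        # Policy improvement: act greedily with respect to V^pi
        pi_new = (R + gamma * T @ V).argmax(axis=1)
        if np.array_equal(pi_new, pi):      # no change => pi is optimal
            return pi, V
        pi = pi_new

T = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.7, 0.3], [0.1, 0.9]]])
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])
pi_star, V_star = policy_iteration(T, R, gamma=0.9)
```

At termination the greedy policy leaves \(V^\pi\) unchanged, so \(V^\pi\) satisfies the Bellman optimality equation.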
Does value iteration always converge?
Is the value function unique?
Can there be multiple optimal policies?
Is there always a deterministic optimal policy?
Conservation MDP