Value Iteration Convergence
Review
- How do we reason about the future consequences of actions in an MDP?
- What are the basic algorithms for solving MDPs?
Guiding Questions
- Does value iteration always converge?
- Is the value function unique?
- Can there be multiple optimal policies?
- Is there always a deterministic optimal policy?
Value Iteration: The Bellman Operator
Algorithm: Value Iteration
while \(\lVert V - V' \rVert_\infty > \epsilon\)
    \(V \gets V'\)
    \(V' \gets B[V]\)
return \(V'\)
where \(B\) is the Bellman operator:
\[B[V](s) = \max_{a \in A}\left(R(s, a) + \gamma \mathop{E}_{s' \sim T(\cdot \mid s, a)} \left[V(s')\right]\right) = \max_{a \in A}\left(R(s, a) + \gamma \sum_{s' \in S} T(s' \mid s, a)\, V(s')\right)\]
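As a concrete reference, here is a minimal Julia sketch of this loop, in the spirit of the course's Julia scripts; the names `R`, `T`, and the indexing convention are illustrative assumptions, not from the slides.

```julia
# Minimal value iteration sketch. Assumes states and actions are indexed
# 1:n_s and 1:n_a, R is an n_s × n_a reward matrix, and T[s, a, s′] is the
# transition probability — these names and shapes are illustrative.
function value_iteration(R, T, γ; ϵ=1e-6)
    n_s, n_a = size(R)
    V′ = zeros(n_s)
    V = fill(Inf, n_s)                     # guarantees at least one iteration
    while maximum(abs.(V - V′)) > ϵ        # ‖V − V′‖∞ > ϵ
        V = V′
        # Bellman operator: B[V](s) = max_a (R(s,a) + γ Σ_s′ T(s′|s,a) V(s′))
        V′ = [maximum(R[s, a] + γ * sum(T[s, a, s′] * V[s′] for s′ in 1:n_s)
                      for a in 1:n_a) for s in 1:n_s]
    end
    return V′
end
```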
Value Iteration Convergence
Theorem 1: Let \(\{V_1, \ldots, V_\infty\}\) be a sequence of value functions for a discrete MDP generated by the recurrence \(V_{k+1} = B[V_k]\). If \(\gamma < 1\), then \(\lim_{k\to\infty} V_k = V^*\).
Metrics
Definition: Let \(M\) be a set. A metric on \(M\) is a function \(d: M \times M \to [0, \infty)\) which satisfies the following three conditions for all \(x, y, z \in M\):
- \(d(x, y) = 0\) if and only if \(x=y\)
- \(d(x, y) = d(y, x)\)
- \(d(x, y) \leq d(x, z) + d(z, y)\)
Contraction Mappings
Definition: A contraction mapping on metric space \((M, d)\) is a function \(f: M \to M\) satisfying
\[d(f(x), f(y)) \leq \alpha \, d(x, y)\]
for some \(\alpha\) with \(0 \leq \alpha < 1\) and all \(x\) and \(y\) in \(M\).
Definition: \(x^*\) is said to be a fixed point of \(f\) if \(f(x^*) = x^*\).
Script: contraction_mapping.jl
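The script itself is not reproduced in these notes; below is a stand-in sketch of what it might demonstrate, using the hypothetical contraction \(f(x) = 0.5x + 1\) with \(\alpha = 0.5\) and fixed point \(x^* = 2\).

```julia
# Stand-in for contraction_mapping.jl (the actual script is not shown here).
# Iterating a contraction from any starting point converges to its fixed point.
f(x) = 0.5x + 1                     # α = 0.5, fixed point x* = 2

function iterate_contraction(f, x0, n)
    x = x0
    for k in 1:n
        x = f(x)
        println("k = $k   x = $x   |x - x*| = $(abs(x - 2))")
    end
    return x
end

iterate_contraction(f, 10.0, 10)    # distance to x* shrinks by α each step
```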
Banach's Theorem
Theorem (Banach): If \(f\) is a contraction mapping on metric space \((M, d)\), then
- \(f\) has exactly one fixed point \(x^*\).
- If \(\{x_k\}\) is a sequence defined by \(x_{k+1} = f(x_k)\), then \(\lim_{k\to\infty} x_k = x^*\).
Max Norm
Lemma 1: \(\left( \mathbb{R}^{|S|}, \lVert \cdot \rVert_{\infty}\right)\) is a metric space.
Proof:
Note: \(\lVert x-y \rVert_{\infty} = \max_i |x_i-y_i|\)
- \(\max_i |x_i - y_i| = 0\) if and only if \(x_i = y_i \;\forall i\), i.e., \(x = y\)
- \(|x_i - y_i| = |-(x_i - y_i)| = |y_i - x_i|\), therefore \(\max_i |x_i - y_i| = \max_i |y_i - x_i|\)
- \(\max_i |x_i - z_i| = \max_i |x_i - y_i + y_i - z_i| \leq \max_i \left(|x_i - y_i| + |y_i - z_i|\right) \leq \max_i |x_i - y_i| + \max_i |y_i - z_i|\)
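A quick numerical spot check of the three conditions in Julia (the vectors are chosen arbitrarily for illustration):

```julia
using LinearAlgebra

d(x, y) = norm(x - y, Inf)            # max norm distance: max_i |x_i − y_i|

x, y, z = [1.0, -2.0], [0.5, 3.0], [2.0, 0.0]
@assert d(x, x) == 0                  # d(x, y) = 0 iff x = y (one direction)
@assert d(x, y) == d(y, x)            # symmetry
@assert d(x, y) ≤ d(x, z) + d(z, y)   # triangle inequality
```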
Bellman Operator Contraction
Lemma 2: \(B\) is a \(\gamma\)-contraction mapping on \((\mathbb{R}^{|S|}, \lVert \cdot \rVert_\infty)\).
Proof:
\[\lVert B[V_1] - B[V_2] \rVert_\infty = \max_{s\in S}\left|B[V_1](s) - B[V_2](s)\right|\]
\[= \max_{s \in S} \left| \max_{a \in A} \left(R(s, a) + \gamma \sum_{s'\in S} T(s'|s,a) V_1(s')\right) - \max_{a \in A} \left(R(s, a) + \gamma \sum_{s'\in S} T(s'|s,a) V_2(s')\right)\right|\]
Since \(\left|\max_a f(a) - \max_a g(a)\right| \leq \max_a \left|f(a) - g(a)\right|\),
\[\leq \max_{s \in S, a \in A} \left|R(s, a) + \gamma \sum_{s'\in S} T(s'|s,a) V_1(s') - R(s, a) - \gamma \sum_{s'\in S} T(s'|s,a) V_2(s')\right|\]
\[= \max_{s \in S, a \in A} \left|\gamma \sum_{s'\in S} T(s'|s,a) \left(V_1(s') - V_2(s')\right)\right|\]
\[\leq \max_{s \in S, a \in A} \gamma \sum_{s'\in S} T(s'|s,a) \left| V_1(s') - V_2(s')\right|\]
\[\leq \max_{s \in S, a \in A} \gamma \sum_{s'\in S} T(s'|s,a) \lVert V_1 - V_2\rVert_\infty\]
\[= \gamma \lVert V_1 - V_2\rVert_\infty \max_{s \in S, a \in A} \sum_{s'\in S} T(s'|s,a) = \gamma \lVert V_1 - V_2\rVert_\infty\]
since the transition probabilities sum to one for every \(s\) and \(a\).
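A numerical spot check of Lemma 2 on a small random MDP (shapes and names follow the `value_iteration` sketch above; illustrative only):

```julia
n_s, n_a, γ = 4, 2, 0.9
R = randn(n_s, n_a)
T = rand(n_s, n_a, n_s)
T ./= sum(T, dims=3)                       # each T[s, a, :] sums to 1

B(V) = [maximum(R[s, a] + γ * sum(T[s, a, s′] * V[s′] for s′ in 1:n_s)
                for a in 1:n_a) for s in 1:n_s]

V1, V2 = randn(n_s), randn(n_s)
lhs = maximum(abs.(B(V1) - B(V2)))         # ‖B[V₁] − B[V₂]‖∞
rhs = γ * maximum(abs.(V1 - V2))           # γ ‖V₁ − V₂‖∞
@assert lhs ≤ rhs + 1e-12                  # the contraction bound holds
```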
Value Iteration Convergence
Theorem 1: Let \(\{V_1, \ldots, V_\infty\}\) be a sequence of value functions for a discrete MDP generated by the recurrence \(V_{k+1} = B[V_k]\). If \(\gamma < 1\), then \(\lim_{k\to\infty} V_k = V^*\).
Proof:
By Lemma 2 and Banach's theorem (part 2), the sequence generated by repeated application of the Bellman operator converges to the fixed point of \(B\); call it \(\hat{V}\).
By the definition of a fixed point, \(\hat{V} = B[\hat{V}]\), so \(\hat{V}\) satisfies the Bellman equation and is therefore the optimal value function: \(\hat{V} = V^*\). By Banach's theorem (part 1), this fixed point is unique, so \(V^*\) is the only value function satisfying the Bellman equation.
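A spot check of Theorem 1, reusing the `B` from the previous sketch: iterating the Bellman operator from two very different starting points reaches the same fixed point.

```julia
# Two sequences V_{k+1} = B[V_k] from different initializations converge to
# the same V* (convergence from Banach part 2, uniqueness from part 1).
Va = foldl((V, _) -> B(V), 1:500; init=zeros(n_s))
Vb = foldl((V, _) -> B(V), 1:500; init=1000 .* ones(n_s))
@assert maximum(abs.(Va - Vb)) < 1e-8
```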
Does Policy Iteration Converge?
Theorem: Policy iteration converges to an optimal policy for a finite MDP in finite time.
Proof (sketch):
1. The policy either improves or stays the same at each iteration.
2. The policy stays the same if and only if \(V^\pi = V^*\).
3. There are a finite number of possible policies.
By (1), (2), and (3), the policy improves at each iteration until it reaches an optimal policy, and since only finitely many policies exist, this happens in finite time.
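A minimal Julia sketch of policy iteration, under the same illustrative `R`, `T`, `γ` conventions as the earlier sketches (exact policy evaluation via a linear solve; not the lecture's reference implementation):

```julia
using LinearAlgebra

function policy_iteration(R, T, γ)
    n_s, n_a = size(R)
    π = ones(Int, n_s)                         # arbitrary initial policy
    while true
        # Policy evaluation: solve (I − γ T_π) V = R_π exactly
        Tπ = [T[s, π[s], s′] for s in 1:n_s, s′ in 1:n_s]
        Rπ = [R[s, π[s]] for s in 1:n_s]
        V = (I - γ * Tπ) \ Rπ
        # Policy improvement: act greedily with respect to V
        π′ = [argmax([R[s, a] + γ * sum(T[s, a, s′] * V[s′] for s′ in 1:n_s)
                      for a in 1:n_a]) for s in 1:n_s]
        π′ == π && return π, V                 # unchanged ⇒ optimal, per (2)
        π = π′
    end
end
```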
Is there always a deterministic optimal policy?
Guiding Questions
- Does value iteration always converge?
- Is the value function unique?
- Can there be multiple optimal policies?
- Is there always a deterministic optimal policy?
Break
Conservation MDP
060 Value Iteration Convergence
By Zachary Sunberg