How do we reason about the future consequences of actions in an MDP?
What are the basic algorithms for solving MDPs?
Does value iteration always converge?
Is the value function unique?
Can there be multiple optimal policies?
Is there always a deterministic optimal policy?
Algorithm: Value Iteration
\(V \gets 0\) (or any initial guess)
\(V' \gets B[V]\)
while \(\lVert V - V' \rVert_\infty > \epsilon\)
\(V \gets V'\)
\(V' \gets B[V]\)
return \(V'\)
\[B[V](s) = \max_{a \in A}\left(R(s, a) + \gamma \sum_{s' \in S} T(s'|s,a) V(s')\right)\]
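As a concrete sketch, value iteration with this backup can be written in a few lines for a small tabular MDP (the transition and reward arrays below are made up purely for illustration, not from the notes):

```python
import numpy as np

def value_iteration(T, R, gamma, eps=1e-8):
    """Tabular value iteration.
    T[s, a, s2] = T(s2 | s, a); R[s, a] = R(s, a)."""
    S, _ = R.shape
    V = np.zeros(S)
    while True:
        # Bellman backup: B[V](s) = max_a (R(s,a) + gamma * sum_s2 T(s2|s,a) V(s2))
        V_new = (R + gamma * T @ V).max(axis=1)
        if np.max(np.abs(V_new - V)) <= eps:
            return V_new
        V = V_new

# Illustrative 2-state, 2-action MDP
T = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.7, 0.3], [0.1, 0.9]]])
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])
V_star = value_iteration(T, R, gamma=0.9)
```

At convergence the returned \(V\) satisfies \(V = B[V]\) up to the tolerance \(\epsilon\).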
Theorem 1: Let \(\{V_1, \ldots, V_\infty\}\) be a sequence of value functions for a discrete MDP generated by the recurrence \(V_{k+1} = B[V_k]\). If \(\gamma < 1\), then \(\lim_{k\to\infty} V_k = V^*\).
Definition: Let \(M\) be a set. A metric on \(M\) is a function \(d: M \times M \to [0, \infty)\) which satisfies the following three conditions for all \(x, y, z \in M\):
1. \(d(x, y) = 0\) if and only if \(x = y\)
2. \(d(x, y) = d(y, x)\) (symmetry)
3. \(d(x, z) \leq d(x, y) + d(y, z)\) (triangle inequality)
Definition: A contraction mapping on metric space \((M, d)\) is a function \(f: M \to M\) satisfying
\[d(f(x), f(y)) \leq \alpha \, d(x, y)\]
for some \(\alpha\) with \(0 \leq \alpha < 1\) and all \(x\) and \(y\) in \(M\).
Definition: \(x^*\) is said to be a fixed point of \(f\) if \(f(x^*) = x^*\).
Script: contraction_mapping.jl
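The script itself isn't reproduced in these notes; a minimal Python sketch of the same idea (fixed-point iteration of a contraction) might look like the following, using \(f(x) = \cos x\), which is a contraction on \([0, 1]\) with modulus \(\alpha = \sin 1 \approx 0.84\):

```python
import math

def fixed_point_iterate(f, x0, n):
    """Apply f repeatedly, starting from x0."""
    x = x0
    for _ in range(n):
        x = f(x)
    return x

# cos maps [0, 1] into itself and |cos'(x)| = sin(x) <= sin(1) < 1 there,
# so by Banach's theorem the iterates converge to the unique fixed point,
# i.e. the solution of cos(x) = x (approximately 0.739).
x_star = fixed_point_iterate(math.cos, 0.5, 200)
```

After enough iterations, \(x^*\) satisfies \(\cos(x^*) = x^*\) to machine precision.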
Theorem (Banach): If \(f\) is a contraction mapping on a complete metric space \((M, d)\), then
1. \(f\) has a unique fixed point \(x^* \in M\), and
2. for every \(x_0 \in M\), the sequence defined by \(x_{k+1} = f(x_k)\) converges to \(x^*\).
Lemma 1: \(\left( \mathbb{R}^{|S|}, \lVert \cdot \rVert_{\infty}\right)\) is a metric space.
Proof:
Non-negativity, symmetry, and \(\lVert x - y \rVert_\infty = 0 \iff x = y\) are immediate from \(\lVert x - y \rVert_{\infty} = \max_i |x_i - y_i|\). For the triangle inequality:
\(\lVert x - z \rVert_\infty = \max_i |x_i - z_i|\)
\(\leq \max_i \left(|x_i - y_i| + |y_i - z_i|\right)\)
\(\leq \max_i |x_i - y_i| + \max_i |y_i - z_i|\)
\(= \lVert x - y \rVert_\infty + \lVert y - z \rVert_\infty\)
Lemma 2: \(B\) is a \(\gamma\)-contraction mapping on \((\mathbb{R}^{|S|}, \lVert \cdot \rVert_\infty)\).
Proof:
Note: \(\left|\max_a f(a) - \max_a g(a)\right| \leq \max_a \left|f(a) - g(a)\right|\)
\[\lVert B[V_1] - B[V_2] \rVert_\infty = \max_{s\in S}\left|B[V_1](s) - B[V_2](s)\right|\]
\[= \max_{s \in S} \left| \max_{a \in A} \left(R(s, a) + \gamma \sum_{s'\in S} T(s'|s,a) V_1(s')\right) - \max_{a \in A} \left(R(s, a) + \gamma \sum_{s'\in S} T(s'|s,a) V_2(s')\right)\right|\]
\[\leq \max_{s \in S} \max_{a \in A} \left| R(s, a) + \gamma \sum_{s'\in S} T(s'|s,a) V_1(s') - R(s, a) - \gamma \sum_{s'\in S} T(s'|s,a) V_2(s')\right|\]
\[= \max_{s \in S, a \in A} \left|\gamma \sum_{s'\in S} T(s'|s,a) \left(V_1(s') - V_2(s')\right)\right|\]
\[\leq \max_{s \in S, a \in A} \gamma \sum_{s'\in S} T(s'|s,a) \left| V_1(s') - V_2(s')\right|\]
\[\leq \max_{s \in S, a \in A} \gamma \sum_{s'\in S} T(s'|s,a) \lVert V_1 - V_2\rVert_\infty\]
\[= \gamma \lVert V_1 - V_2\rVert_\infty \max_{s \in S, a \in A} \sum_{s'\in S} T(s'|s,a)\]
\[= \gamma \lVert V_1 - V_2\rVert_\infty\]
The last equality holds because \(\sum_{s'} T(s'|s,a) = 1\) for every \((s, a)\).
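Lemma 2 is easy to sanity-check numerically: on a randomly generated MDP, the backed-up distance between any two value functions should shrink by at least a factor of \(\gamma\) (a sketch; the MDP here is random, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma = 4, 3, 0.9

# Random MDP: normalize so each T[s, a, :] is a probability distribution
T = rng.random((S, A, S))
T /= T.sum(axis=2, keepdims=True)
R = rng.random((S, A))

def B(V):
    """Bellman backup B[V](s) = max_a (R(s,a) + gamma * sum_s2 T(s2|s,a) V(s2))."""
    return (R + gamma * T @ V).max(axis=1)

# ||B[V1] - B[V2]||_inf should never exceed gamma * ||V1 - V2||_inf
worst_ratio = 0.0
for _ in range(1000):
    V1, V2 = rng.normal(size=S), rng.normal(size=S)
    ratio = np.max(np.abs(B(V1) - B(V2))) / np.max(np.abs(V1 - V2))
    worst_ratio = max(worst_ratio, ratio)
print(worst_ratio <= gamma + 1e-12)   # True
```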
Theorem 1: Let \(\{V_1, \ldots, V_\infty\}\) be a sequence of value functions for a discrete MDP generated by the recurrence \(V_{k+1} = B[V_k]\). If \(\gamma < 1\), then \(\lim_{k\to\infty} V_k = V^*\).
Proof:
By Lemma 2 and Banach's theorem (part 2), repeated application of the Bellman operator converges to a limit \(\hat{V}\). (Note that \(\mathbb{R}^{|S|}\) is complete under \(\lVert \cdot \rVert_\infty\), so Banach's theorem applies.)
By Banach's theorem (part 1), \(\hat{V} = B[\hat{V}]\). Since \(\hat{V}\) satisfies Bellman's equation, it is optimal and \(\hat{V} = V^*\).
Theorem: Policy iteration converges to an optimal policy for a finite MDP in finite time.
Proof (sketch):
Policy evaluation computes \(V^\pi\) exactly. Policy improvement produces a policy \(\pi'\) with \(V^{\pi'} \geq V^\pi\) componentwise, with strict improvement in at least one state unless \(\pi\) is already optimal. A finite MDP has at most \(|A|^{|S|}\) deterministic policies, so the strictly improving sequence must terminate, and it can only terminate at an optimal policy.
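A sketch of policy iteration itself, with exact policy evaluation via a linear solve (again on a made-up 2-state MDP, purely for illustration):

```python
import numpy as np

def policy_iteration(T, R, gamma):
    """Tabular policy iteration. T[s, a, s2] = T(s2|s,a); R[s, a] = R(s, a)."""
    S, _ = R.shape
    pi = np.zeros(S, dtype=int)             # arbitrary initial deterministic policy
    while True:
        # Policy evaluation: solve (I - gamma * T_pi) V = R_pi exactly for V = V^pi
        T_pi = T[np.arange(S), pi]          # (S, S) transition matrix under pi
        R_pi = R[np.arange(S), pi]
        V = np.linalg.solve(np.eye(S) - gamma * T_pi, R_pi)
        # Policy improvement: act greedily with respect to V^pi
        pi_new = (R + gamma * T @ V).argmax(axis=1)
        if np.array_equal(pi_new, pi):      # no change => pi is optimal
            return pi, V
        pi = pi_new

T = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.7, 0.3], [0.1, 0.9]]])
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])
pi_star, V_star = policy_iteration(T, R, gamma=0.9)
```

At termination the greedy policy leaves \(V^\pi\) unchanged, so \(V^\pi\) satisfies the Bellman optimality equation.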
Does value iteration always converge?
Is the value function unique?
Can there be multiple optimal policies?
Is there always a deterministic optimal policy?
Conservation MDP