How is a Markov decision process defined?
What is a policy?
How do we evaluate policies?
(MDP notebook)
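Before turning to algorithms, a minimal sketch of how the tuple \((S, A, R, T, \gamma, b)\) might be represented in code. The class name, field names, and the two-state example are all illustrative assumptions, not the representation used in the notebook.

```python
# Hypothetical minimal MDP container; names are illustrative, not from the notebook.
from dataclasses import dataclass

@dataclass
class MDP:
    S: list       # states
    A: list       # actions
    T: dict       # T[s][a] -> list of (next_state, probability) pairs
    R: dict       # R[(s, a)] -> expected immediate reward
    gamma: float  # discount factor in [0, 1)
    b: dict       # initial state distribution: state -> probability

# Toy two-state example: "stay" keeps the state; "go" switches with prob 0.9.
mdp = MDP(
    S=["s1", "s2"],
    A=["stay", "go"],
    T={"s1": {"stay": [("s1", 1.0)], "go": [("s2", 0.9), ("s1", 0.1)]},
       "s2": {"stay": [("s2", 1.0)], "go": [("s1", 0.9), ("s2", 0.1)]}},
    R={("s1", "stay"): 0.0, ("s1", "go"): 1.0,
       ("s2", "stay"): 0.5, ("s2", "go"): 0.0},
    gamma=0.9,
    b={"s1": 1.0, "s2": 0.0},
)
```

Storing \(T\) as sparse lists of (state, probability) pairs keeps small examples readable; a matrix form is the usual choice at scale.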
How do we reason about the future consequences of actions in an MDP?
What are the basic algorithms for solving MDPs?
For this lecture, the arrow => means the same as ->> (notation that distinguishes these diagrams from Bayes net arrows)
Bellman's Principle of Optimality: every sub-policy of an optimal policy is itself optimal for the subproblem it faces
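In equation form, the principle licenses a recursive characterization of the optimal value function \(V^*\), the Bellman optimality equation (written here in terms of the MDP tuple \((S, A, R, T, \gamma)\)):

\[
V^*(s) = \max_{a \in A} \left[ R(s, a) + \gamma \sum_{s' \in S} T(s' \mid s, a)\, V^*(s') \right]
\]

Both policy iteration and value iteration below can be read as ways of solving this fixed-point equation.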
Algorithm: Policy Iteration
Given: MDP \((S, A, R, T, \gamma, b)\)
(Policy iteration notebook)
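A minimal sketch of policy iteration on a hypothetical two-state MDP (the states, actions, and rewards below are illustrative, not from the notebook). Policy evaluation is done iteratively here for simplicity; solving the linear system directly is equally valid.

```python
# Toy MDP (illustrative): "stay" keeps the state; "go" switches with prob 0.9.
GAMMA = 0.9
S = ["s1", "s2"]
A = ["stay", "go"]
T = {"s1": {"stay": [("s1", 1.0)], "go": [("s2", 0.9), ("s1", 0.1)]},
     "s2": {"stay": [("s2", 1.0)], "go": [("s1", 0.9), ("s2", 0.1)]}}
R = {("s1", "stay"): 0.0, ("s1", "go"): 1.0,
     ("s2", "stay"): 0.5, ("s2", "go"): 0.0}

def q(s, a, V):
    # One-step lookahead: immediate reward plus discounted next-state value.
    return R[(s, a)] + GAMMA * sum(p * V[s2] for s2, p in T[s][a])

def evaluate(pi, tol=1e-8):
    # Iterative policy evaluation of a fixed policy pi.
    V = {s: 0.0 for s in S}
    while True:
        V_new = {s: q(s, pi[s], V) for s in S}
        if max(abs(V_new[s] - V[s]) for s in S) < tol:
            return V_new
        V = V_new

def policy_iteration():
    pi = {s: A[0] for s in S}  # arbitrary initial policy
    while True:
        V = evaluate(pi)
        # Greedy improvement with respect to the evaluated values.
        pi_new = {s: max(A, key=lambda a: q(s, a, V)) for s in S}
        if pi_new == pi:  # policy is stable => optimal
            return pi, V
        pi = pi_new

pi_star, V_star = policy_iteration()
print(pi_star)  # -> {'s1': 'go', 's2': 'stay'}
```

Each round strictly improves the policy (or terminates), so with finitely many policies the loop must stop at an optimal one.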
Algorithm: Value Iteration
Given: MDP \((S, A, R, T, \gamma, b)\), tolerance \(\epsilon\)
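A matching sketch of value iteration on the same hypothetical two-state MDP (again illustrative, not the notebook's example). Instead of evaluating a policy exactly, it sweeps the Bellman optimality backup until the residual drops below the tolerance \(\epsilon\), then extracts the greedy policy.

```python
# Toy MDP (illustrative): "stay" keeps the state; "go" switches with prob 0.9.
GAMMA = 0.9
S = ["s1", "s2"]
A = ["stay", "go"]
T = {"s1": {"stay": [("s1", 1.0)], "go": [("s2", 0.9), ("s1", 0.1)]},
     "s2": {"stay": [("s2", 1.0)], "go": [("s1", 0.9), ("s2", 0.1)]}}
R = {("s1", "stay"): 0.0, ("s1", "go"): 1.0,
     ("s2", "stay"): 0.5, ("s2", "go"): 0.0}

def q(s, a, V):
    # One-step lookahead: immediate reward plus discounted next-state value.
    return R[(s, a)] + GAMMA * sum(p * V[s2] for s2, p in T[s][a])

def value_iteration(eps=1e-6):
    V = {s: 0.0 for s in S}
    while True:
        # Bellman optimality backup: maximize over actions at every state.
        V_new = {s: max(q(s, a, V) for a in A) for s in S}
        # Stop once the Bellman residual falls below eps.
        if max(abs(V_new[s] - V[s]) for s in S) < eps:
            # Extract the greedy policy with respect to the final values.
            pi = {s: max(A, key=lambda a: q(s, a, V_new)) for s in S}
            return V_new, pi
        V = V_new

V, pi = value_iteration()
```

The backup is a \(\gamma\)-contraction, so the iterates converge geometrically; a residual below \(\epsilon\) bounds the distance to \(V^*\) by \(\epsilon \gamma / (1 - \gamma)\).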
"In any small change he will have to consider only these quantitative indices (or "values") in which all the relevant information is concentrated; and by adjusting the quantities one by one, he can appropriately rearrange his dispositions without having to solve the whole puzzle ab initio, or without needing at any stage to survey it at once in all its ramifications."
-- F. A. Hayek, "The Use of Knowledge in Society", 1945