Policy and Value Iteration

Last Time

  • How is a Markov decision process defined?

  • What is a policy?

  • How do we evaluate policies?

(MDP notebook)

Guiding Questions

  • How do we reason about the future consequences of actions in an MDP?

  • What are the basic algorithms for solving MDPs?

Value-Based Policy Evaluation

MDP Example: Up-Down Problem

For this lecture, the arrow \(\Rightarrow\) means the same as \(\twoheadrightarrow\); the distinct arrow style distinguishes these diagrams from Bayes nets.

Dynamic Programming and Value Backup

Bellman's Principle of Optimality: every sub-policy of an optimal policy is itself optimal for the remaining subproblem.
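
In update form, one step of value backup is the dynamic-programming recursion

\(U_{k+1}(s) = \underset{a \in A}{\text{max}} \left(R(s, a) + \gamma \sum_{s' \in S} T(s' | s, a) \, U_k(s') \right)\)

so a state's value is computed by backing up the values of its successor states.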

Break: DIA Run

Policy Iteration

Algorithm: Policy Iteration

Given: MDP \((S, A, R, T, \gamma, b)\)

  1. initialize \(\pi\), \(\pi'\) (differently)
  2. while \(\pi \neq \pi'\)
  3.     \(\pi \gets \pi'\)
  4.     \(U^\pi \gets (I - \gamma T^\pi )^{-1} R^\pi\)
  5.     \(\pi'(s) \gets \underset{a \in A}{\text{argmax}} \left(R(s, a) + \gamma \sum_{s' \in S} T(s' | s, a) U^\pi (s') \right) \quad \forall s \in S\)
  6. return \(\pi\)
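
A minimal NumPy sketch of this algorithm (the function name and array layout are illustrative assumptions, not from the course notebooks; checking that the improved policy is unchanged is equivalent to the \(\pi \neq \pi'\) test in step 2):

    import numpy as np

    def policy_iteration(R, T, gamma):
        """Exact policy iteration for a finite MDP.

        R: |S| x |A| reward matrix, R[s, a]
        T: |A| x |S| x |S| transition tensor, T[a, s, sp] = T(s' | s, a)
        gamma: discount factor in [0, 1)
        """
        n_states = R.shape[0]
        pi = np.zeros(n_states, dtype=int)        # arbitrary initial policy
        while True:
            # Policy evaluation (step 4): solve (I - gamma T^pi) U^pi = R^pi
            T_pi = T[pi, np.arange(n_states), :]  # |S| x |S| transitions under pi
            R_pi = R[np.arange(n_states), pi]     # |S| rewards under pi
            U = np.linalg.solve(np.eye(n_states) - gamma * T_pi, R_pi)
            # Policy improvement (step 5): greedy one-step lookahead
            Q = R + gamma * np.einsum('ast,t->sa', T, U)  # Q[s, a]
            pi_new = Q.argmax(axis=1)
            if np.array_equal(pi_new, pi):        # policy stable: done
                return pi, U
            pi = pi_new

The exact linear solve in the evaluation step costs \(O(|S|^3)\) per iteration, which is one reason value iteration's cheaper backups are often preferred for large state spaces.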

(Policy iteration notebook)

Value Iteration

Algorithm: Value Iteration

Given: MDP \((S, A, R, T, \gamma, b)\), tolerance \(\epsilon\)

  1. initialize \(U\), \(U'\) (differently)
  2. while \(\lVert U - U' \rVert_\infty > \epsilon\)
  3.     \(U \gets U'\)
  4.     \(U'(s) \gets \underset{a \in A}{\text{max}} \left(R(s, a) + \gamma \sum_{s' \in S} T(s' | s, a) U (s') \right)  \quad \forall s \in S\)
  5. return \(U'\)
  • Returned \(U'\) will be close to \(U^*\)!
  • \(\pi^*\) is easy to extract: \(\pi^*(s) = \underset{a \in A}{\text{argmax}} \left(R(s, a) + \gamma \sum_{s' \in S} T(s' | s, a) \, U^*(s') \right)\)
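
A matching NumPy sketch (same illustrative array layout as the policy iteration sketch above):

    import numpy as np

    def value_iteration(R, T, gamma, eps=1e-6):
        """Value iteration to sup-norm tolerance eps.

        R: |S| x |A| reward matrix; T: |A| x |S| x |S| transition tensor,
        with T[a, s, sp] = T(s' | s, a).
        """
        U = np.zeros(R.shape[0])
        while True:
            # Bellman backup (step 4): U'(s) = max_a [R(s,a) + gamma sum T(s'|s,a) U(s')]
            Q = R + gamma * np.einsum('ast,t->sa', T, U)
            U_new = Q.max(axis=1)
            if np.max(np.abs(U_new - U)) <= eps:  # ||U' - U||_inf <= eps
                # extract the greedy policy from the near-optimal values
                return U_new, Q.argmax(axis=1)
            U = U_new

With \(\gamma < 1\), the backup is a contraction in the sup norm, so the loop converges from any initialization.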

Bellman's Equations
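
Both algorithms above are fixed-point methods for the standard Bellman equations. For a fixed policy \(\pi\):

\(U^\pi(s) = R(s, \pi(s)) + \gamma \sum_{s' \in S} T(s' | s, \pi(s)) \, U^\pi(s')\)

and for the optimal value function:

\(U^*(s) = \underset{a \in A}{\text{max}} \left(R(s, a) + \gamma \sum_{s' \in S} T(s' | s, a) \, U^*(s') \right)\)

Policy iteration solves the first equation exactly at each step; value iteration repeatedly applies the second as an update.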

Guiding Questions

  • How do we reason about the future consequences of actions in an MDP?

  • What are the basic algorithms for solving MDPs?

"In any small change he will have to consider only these quantitative indices (or "values") in which all the relevant information is concentrated; and by adjusting the quantities one by one, he can appropriately rearrange his dispositions without having to solve the whole puzzle ab initio, or without needing at any stage to survey it at once in all its ramifications."

-- F. A. Hayek, "The use of knowledge in society", 1945

050 Policy and Value Iteration

By Zachary Sunberg
