Policy and Value Iteration
Last Time
- How is a Markov decision process defined?
- What is a policy?
- How do we evaluate policies?
(MDP notebook)
Guiding Questions
- How do we reason about the future consequences of actions in an MDP?
- What are the basic algorithms for solving MDPs?
Value-Based Policy Evaluation
MDP Example: Up-Down Problem
For this lecture, => is the same as ->> (to distinguish these diagrams from a Bayes net)
Dynamic Programming and Value Backup
Bellman's Principle of Optimality: Every sub-policy in an optimal policy is locally optimal
Break: DIA Run
Policy Iteration
Algorithm: Policy Iteration
Given: MDP \((S, A, R, T, \gamma, b)\)
- initialize \(\pi\), \(\pi'\) (differently)
- while \(\pi \neq \pi'\)
    - \(\pi \gets \pi'\)
    - \(U^\pi \gets (I - \gamma T^\pi )^{-1} R^\pi\)
    - \(\pi'(s) \gets \underset{a \in A}{\text{argmax}} \left(R(s, a) + \gamma \sum_{s' \in S} T(s' | s, a) U^\pi (s') \right) \quad \forall s \in S\)
- return \(\pi\)
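The steps above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the course's notebook code: the array layout (`T[a, s, s']` for transitions, `R[s, a]` for rewards) and the two-state toy MDP are assumptions made up for this example, and the matrix inverse is replaced by an equivalent linear solve.

```python
import numpy as np

def policy_iteration(T, R, gamma):
    """Policy iteration sketch.

    Assumed layout (not from the slides): T[a, s, s2] = T(s2 | s, a),
    R[s, a] = R(s, a). Returns (policy, U) with one action index per state.
    """
    n_states = R.shape[0]
    states = np.arange(n_states)
    pi = np.zeros(n_states, dtype=int)  # arbitrary initial policy
    while True:
        # Policy evaluation: solve (I - gamma T^pi) U = R^pi
        T_pi = T[pi, states, :]         # row s is T(. | s, pi(s))
        R_pi = R[states, pi]            # entry s is R(s, pi(s))
        U = np.linalg.solve(np.eye(n_states) - gamma * T_pi, R_pi)
        # Policy improvement: act greedily with respect to U
        Q = R + gamma * np.einsum("ast,t->sa", T, U)
        pi_new = np.argmax(Q, axis=1)
        if np.array_equal(pi_new, pi):  # policy stopped changing
            return pi, U
        pi = pi_new

# Hypothetical toy MDP: action 0 stays put, action 1 switches state;
# the only reward is 1 for staying in state 1.
T = np.array([[[1.0, 0.0], [0.0, 1.0]],   # action 0: stay
              [[0.0, 1.0], [1.0, 0.0]]])  # action 1: switch
R = np.array([[0.0, 0.0],
              [1.0, 0.0]])
pi_star, U_star = policy_iteration(T, R, gamma=0.9)
```

In this toy problem the loop stops after two evaluations, with the policy "switch in state 0, stay in state 1".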
(Policy iteration notebook)
Value Iteration
Algorithm: Value Iteration
Given: MDP \((S, A, R, T, \gamma, b)\), tolerance \(\epsilon\)
- initialize \(U\), \(U'\) (differently)
- while \(\lVert U - U' \rVert_\infty > \epsilon\)
    - \(U \gets U'\)
    - \(U'(s) \gets \underset{a \in A}{\text{max}} \left(R(s, a) + \gamma \sum_{s' \in S} T(s' | s, a) U (s') \right) \quad \forall s \in S\)
- return \(U'\)
- Returned \(U'\) will be close to \(U^*\)!
- \(\pi^*\) is easy to extract: \(\pi^*(s) = \underset{a \in A}{\text{argmax}} \left(R(s, a) + \gamma E[U^*(s') \mid s, a]\right)\)
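A rough NumPy sketch of value iteration and the policy extraction step follows. As before, the array layout (`T[a, s, s']`, `R[s, a]`), the helper name `greedy_policy`, and the two-state toy MDP are assumptions for illustration only, not the lecture's own code.

```python
import numpy as np

def value_iteration(T, R, gamma, epsilon=1e-8):
    """Value iteration sketch.

    Assumed layout (not from the slides): T[a, s, s2] = T(s2 | s, a),
    R[s, a] = R(s, a). Returns an approximation of U*.
    """
    U = np.zeros(R.shape[0])
    while True:
        # Bellman backup for every state at once
        Q = R + gamma * np.einsum("ast,t->sa", T, U)
        U_new = Q.max(axis=1)
        # Stop once the largest change is within the tolerance
        if np.max(np.abs(U_new - U)) <= epsilon:
            return U_new
        U = U_new

def greedy_policy(T, R, gamma, U):
    """Hypothetical helper: extract the greedy policy for a value function U."""
    Q = R + gamma * np.einsum("ast,t->sa", T, U)
    return np.argmax(Q, axis=1)

# Hypothetical toy MDP: action 0 stays put, action 1 switches state;
# the only reward is 1 for staying in state 1.
T = np.array([[[1.0, 0.0], [0.0, 1.0]],   # action 0: stay
              [[0.0, 1.0], [1.0, 0.0]]])  # action 1: switch
R = np.array([[0.0, 0.0],
              [1.0, 0.0]])
U_approx = value_iteration(T, R, gamma=0.9)
pi_approx = greedy_policy(T, R, 0.9, U_approx)
```

Unlike policy iteration, this loop never solves a linear system; each sweep is a cheap backup, at the cost of needing many more sweeps to reach the tolerance.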
Bellman's Equations
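The slide body is not preserved here. As a reconstruction in the notation of the two algorithms above (an assumption, not the original slide content), the standard Bellman equations for a fixed policy \(\pi\) and for the optimal value function are:

\[U^\pi(s) = R(s, \pi(s)) + \gamma \sum_{s' \in S} T(s' \mid s, \pi(s)) \, U^\pi(s') \quad \forall s \in S\]

\[U^*(s) = \max_{a \in A} \left(R(s, a) + \gamma \sum_{s' \in S} T(s' \mid s, a) \, U^*(s') \right) \quad \forall s \in S\]

Stacking the first equation over all states gives \(U^\pi = R^\pi + \gamma T^\pi U^\pi\), which rearranges to the evaluation step \(U^\pi = (I - \gamma T^\pi)^{-1} R^\pi\) used in policy iteration. The second equation is the fixed point that the value iteration update converges toward.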
Guiding Questions
- How do we reason about the future consequences of actions in an MDP?
- What are the basic algorithms for solving MDPs?
"In any small change he will have to consider only these quantitative indices (or "values") in which all the relevant information is concentrated; and by adjusting the quantities one by one, he can appropriately rearrange his dispositions without having to solve the whole puzzle ab initio, or without needing at any stage to survey it at once in all its ramifications."
-- F. A. Hayek, "The use of knowledge in society", 1945
050 Policy and Value Iteration
By Zachary Sunberg