How is a Markov decision process defined?
What is a policy?
How do we evaluate policies?
(MDP notebook)
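Before turning to algorithms, a minimal sketch of how the tuple \((S, A, R, T, \gamma, b)\) might be represented in code. The class name, field names, and the two-state example are all illustrative assumptions, not the representation used in the notebook.

```python
# Hypothetical minimal MDP container; names are illustrative, not from the notebook.
from dataclasses import dataclass

@dataclass
class MDP:
    S: list       # states
    A: list       # actions
    T: dict       # T[s][a] -> list of (next_state, probability) pairs
    R: dict       # R[(s, a)] -> expected immediate reward
    gamma: float  # discount factor in [0, 1)
    b: dict       # initial state distribution: state -> probability

# Toy two-state example: "stay" keeps the state; "go" switches with prob 0.9.
mdp = MDP(
    S=["s1", "s2"],
    A=["stay", "go"],
    T={"s1": {"stay": [("s1", 1.0)], "go": [("s2", 0.9), ("s1", 0.1)]},
       "s2": {"stay": [("s2", 1.0)], "go": [("s1", 0.9), ("s2", 0.1)]}},
    R={("s1", "stay"): 0.0, ("s1", "go"): 1.0,
       ("s2", "stay"): 0.5, ("s2", "go"): 0.0},
    gamma=0.9,
    b={"s1": 1.0, "s2": 0.0},
)
```

Storing \(T\) as sparse lists of (state, probability) pairs keeps small examples readable; a matrix form is the usual choice at scale.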
How do we reason about the future consequences of actions in an MDP?
What are the basic algorithms for solving MDPs?
For this lecture, the arrow => means the same as ->> (notation that distinguishes these diagrams from Bayes net arrows)
Bellman's Principle of Optimality: every sub-policy of an optimal policy is itself optimal for the subproblem it faces
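In equation form, the principle licenses a recursive characterization of the optimal value function \(V^*\), the Bellman optimality equation (written here in terms of the MDP tuple \((S, A, R, T, \gamma)\)):

\[
V^*(s) = \max_{a \in A} \left[ R(s, a) + \gamma \sum_{s' \in S} T(s' \mid s, a)\, V^*(s') \right]
\]

Both policy iteration and value iteration below can be read as ways of solving this fixed-point equation.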
Algorithm: Policy Iteration
Given: MDP \((S, A, R, T, \gamma, b)\)
(Policy iteration notebook)
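A minimal sketch of policy iteration on a hypothetical two-state MDP (the states, actions, and rewards below are illustrative, not from the notebook). Policy evaluation is done iteratively here for simplicity; solving the linear system directly is equally valid.

```python
# Toy MDP (illustrative): "stay" keeps the state; "go" switches with prob 0.9.
GAMMA = 0.9
S = ["s1", "s2"]
A = ["stay", "go"]
T = {"s1": {"stay": [("s1", 1.0)], "go": [("s2", 0.9), ("s1", 0.1)]},
     "s2": {"stay": [("s2", 1.0)], "go": [("s1", 0.9), ("s2", 0.1)]}}
R = {("s1", "stay"): 0.0, ("s1", "go"): 1.0,
     ("s2", "stay"): 0.5, ("s2", "go"): 0.0}

def q(s, a, V):
    # One-step lookahead: immediate reward plus discounted next-state value.
    return R[(s, a)] + GAMMA * sum(p * V[s2] for s2, p in T[s][a])

def evaluate(pi, tol=1e-8):
    # Iterative policy evaluation of a fixed policy pi.
    V = {s: 0.0 for s in S}
    while True:
        V_new = {s: q(s, pi[s], V) for s in S}
        if max(abs(V_new[s] - V[s]) for s in S) < tol:
            return V_new
        V = V_new

def policy_iteration():
    pi = {s: A[0] for s in S}  # arbitrary initial policy
    while True:
        V = evaluate(pi)
        # Greedy improvement with respect to the evaluated values.
        pi_new = {s: max(A, key=lambda a: q(s, a, V)) for s in S}
        if pi_new == pi:  # policy is stable => optimal
            return pi, V
        pi = pi_new

pi_star, V_star = policy_iteration()
print(pi_star)  # -> {'s1': 'go', 's2': 'stay'}
```

Each round strictly improves the policy (or terminates), so with finitely many policies the loop must stop at an optimal one.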
Algorithm: Value Iteration
Given: MDP \((S, A, R, T, \gamma, b)\), tolerance \(\epsilon\)
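A matching sketch of value iteration on the same hypothetical two-state MDP (again illustrative, not the notebook's example). Instead of evaluating a policy exactly, it sweeps the Bellman optimality backup until the residual drops below the tolerance \(\epsilon\), then extracts the greedy policy.

```python
# Toy MDP (illustrative): "stay" keeps the state; "go" switches with prob 0.9.
GAMMA = 0.9
S = ["s1", "s2"]
A = ["stay", "go"]
T = {"s1": {"stay": [("s1", 1.0)], "go": [("s2", 0.9), ("s1", 0.1)]},
     "s2": {"stay": [("s2", 1.0)], "go": [("s1", 0.9), ("s2", 0.1)]}}
R = {("s1", "stay"): 0.0, ("s1", "go"): 1.0,
     ("s2", "stay"): 0.5, ("s2", "go"): 0.0}

def q(s, a, V):
    # One-step lookahead: immediate reward plus discounted next-state value.
    return R[(s, a)] + GAMMA * sum(p * V[s2] for s2, p in T[s][a])

def value_iteration(eps=1e-6):
    V = {s: 0.0 for s in S}
    while True:
        # Bellman optimality backup: maximize over actions at every state.
        V_new = {s: max(q(s, a, V) for a in A) for s in S}
        # Stop once the Bellman residual falls below eps.
        if max(abs(V_new[s] - V[s]) for s in S) < eps:
            # Extract the greedy policy with respect to the final values.
            pi = {s: max(A, key=lambda a: q(s, a, V_new)) for s in S}
            return V_new, pi
        V = V_new

V, pi = value_iteration()
```

The backup is a \(\gamma\)-contraction, so the iterates converge geometrically; a residual below \(\epsilon\) bounds the distance to \(V^*\) by \(\epsilon \gamma / (1 - \gamma)\).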
"In any small change he will have to consider only these quantitative indices (or "values") in which all the relevant information is concentrated; and by adjusting the quantities one by one, he can appropriately rearrange his dispositions without having to solve the whole puzzle ab initio, or without needing at any stage to survey it at once in all its ramifications."
-- F. A. Hayek, "The Use of Knowledge in Society", 1945