What is a Markov decision process?
What is a policy?
How do we evaluate policies?
Finite horizon:
\[\text{E} \left[ \sum_{t=0}^T r_t \right]\]
Average reward:
\[\lim_{n \rightarrow \infty} \text{E} \left[\frac{1}{n}\sum_{t=0}^n r_t \right]\]
Infinite horizon with discounting:
\[\text{E} \left[\sum_{t=0}^\infty \gamma^t r_t\right]\]
Infinite time, but a terminal state (no reward, no leaving) is always reached with probability 1.
discount \(\gamma \in [0, 1)\)
typically 0.9, 0.95, 0.99
if \(\underline{r} \leq r_t \leq \bar{r}\) for all \(t\)
then \[\frac{\underline{r}}{1-\gamma} \leq \sum_{t=0}^\infty \gamma^t r_t \leq \frac{\bar{r}}{1-\gamma} \]
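This follows from the geometric series \(\sum_{t=0}^\infty \gamma^t = \frac{1}{1-\gamma}\); for the upper bound, for instance,
\[\sum_{t=0}^\infty \gamma^t r_t \leq \sum_{t=0}^\infty \gamma^t \bar{r} = \frac{\bar{r}}{1-\gamma}\]
and the lower bound is analogous with \(\underline{r}\).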
MDP tuple: \((S, A, T, R, \gamma)\)
(plus an initial state distribution \(b\) and/or a set of terminal states \(S_T\) in some contexts)
Example state spaces:
\(\{1,2,3\}\)
\(\{\text{healthy},\text{pre-cancer},\text{cancer}\}\)
\((x,y) \in \mathbb{R}^2\)
\((s, i, r) \in \mathbb{N}^3\)
\(\{0,1\}\times\mathbb{R}^4\)
Example action spaces:
\(\{1,2,3\}\)
\(\{\text{test},\text{wait},\text{treat}\}\)
\(\mathbb{R}^2\)
\(\{0,1\}\times\mathbb{R}^2\)
Transition model: \(T(s' \mid s, a)\)
Reward model: \(R(s, a)\) or \(R(s, a, s')\)
Generative model: \(s', r = G(s, a)\)
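To make these components concrete, here is a minimal Python sketch (a toy example of my own, not from the slides) of a small MDP with a tabular transition model, a tabular reward model, and a generative model \(G\); the later sketches reuse this representation.

```python
import random

# Hypothetical 3-state, 2-action MDP used only for illustration.
S = [0, 1, 2]          # state space
A = [0, 1]             # action space
gamma = 0.95           # discount factor

# T[s][a][s'] plays the role of T(s' | s, a); each row sums to 1.
T = {
    0: {0: [0.9, 0.1, 0.0], 1: [0.1, 0.8, 0.1]},
    1: {0: [0.0, 0.9, 0.1], 1: [0.0, 0.2, 0.8]},
    2: {0: [0.0, 0.0, 1.0], 1: [0.0, 0.0, 1.0]},   # state 2 is absorbing
}

# R[s][a] plays the role of R(s, a).
R = {
    0: {0: 0.0, 1: -1.0},
    1: {0: 0.0, 1: -1.0},
    2: {0: 10.0, 1: 10.0},
}

def G(s, a):
    """Generative model: sample s' ~ T(. | s, a) and return (s', r)."""
    s_next = random.choices(S, weights=T[s][a])[0]
    return s_next, R[s][a]
```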
[Figure: decision network with chance, decision, and utility nodes]
[Figure: MDP dynamic decision network]
Imagine it's a cold day and you're ready to go to work. You have to decide whether to bike or drive.
Algorithm: Rollout Simulation
Inputs: MDP \((S, A, T, R, \gamma, b)\) (only a generative model \(G\) is needed), policy \(\pi\), horizon \(H\)
Outputs: Utility estimate \(\hat{u}\)
\(s \gets \text{sample}(b)\)
\(\hat{u} \gets 0\)
for \(t\) in \(0 \ldots H-1\)
\(a \gets \text{sample}(\pi(a \mid s))\)
\(s', r \gets G(s, a)\)
\(\hat{u} \gets \hat{u} + \gamma^t r\)
\(s \gets s'\)
return \(\hat{u}\)
cf. Alg. 9.1, p. 184
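A minimal Python sketch of this rollout, assuming a generative model G(s, a) that returns (s', r), a sampler sample_b() for the initial state distribution \(b\), and a sampler sample_action(s) for \(\pi(a \mid s)\) (these function names are mine, not the slides'):

```python
def rollout(sample_b, sample_action, G, gamma, H):
    """Single rollout of horizon H; returns the discounted utility estimate u_hat."""
    s = sample_b()                 # s <- sample(b)
    u_hat = 0.0
    for t in range(H):             # t = 0, ..., H-1
        a = sample_action(s)       # a <- sample(pi(a | s))
        s_next, r = G(s, a)        # s', r <- G(s, a)
        u_hat += gamma**t * r      # u_hat <- u_hat + gamma^t * r
        s = s_next                 # s <- s'
    return u_hat
```

With the toy MDP sketched above, for example, rollout(lambda: 0, lambda s: random.choice(A), G, gamma, H=100) estimates the utility of a uniformly random policy starting from state 0.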
MDP Objective:
\[U(\pi) = \text{E} \left[\sum_{t=0}^\infty \gamma^t r_t \mid \pi \right]\]
Algorithm: Rollout Simulation
Given: MDP \((S, A, T, R, \gamma, b)\), deterministic policy \(\pi\), horizon \(H\)
\(s \gets \text{sample}(b)\)
\(\hat{u} \gets 0\)
for \(t\) in \(0 \ldots H-1\)
\(a \gets \pi(s)\)
\(s', r \gets G(s, a)\)
\(\hat{u} \gets \hat{u} + \gamma^t r\)
\(s \gets s'\)
return \(\hat{u}\)
Slide not on Exam
Naive Policy Evaluation not on Exam
Let \(\tau = (s_0, a_0, r_0, s_1, \ldots, s_T)\) be a trajectory of the MDP, with discounted return \(R(\tau) = \sum_{t=0}^{T-1} \gamma^t r_t\)
\[U(\pi) \approx \frac{1}{m} \sum_{i=1}^m R(\tau^{(i)})\]
\[U(\pi) \approx \bar{u}_m = \frac{1}{m} \sum_{i=1}^m \hat{u}^{(i)}\]
where \(\hat{u}^{(i)}\) is generated by a rollout simulation
How can we quantify the accuracy of \(\bar{u}_m\)?
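One standard answer (a sketch of my own, not from the slides): report the sample standard error of the \(m\) rollout estimates alongside \(\bar{u}_m\); it shrinks like \(1/\sqrt{m}\) and gives an approximate confidence interval for \(U(\pi)\).

```python
import math

def monte_carlo_policy_evaluation(run_rollout, m):
    """Average m independent rollout estimates and report the standard error of the mean."""
    u_hats = [run_rollout() for _ in range(m)]              # u_hat^(1), ..., u_hat^(m)
    u_bar = sum(u_hats) / m                                 # u_bar_m, the estimate of U(pi)
    var = sum((u - u_bar) ** 2 for u in u_hats) / (m - 1)   # sample variance
    sem = math.sqrt(var / m)                                # standard error of u_bar_m
    return u_bar, sem

# An approximate 95% confidence interval is u_bar +/- 1.96 * sem.
```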
What is a Markov decision process?
What is a policy?
How do we evaluate policies?
How do we reason about the future consequences of actions in an MDP?
What are the basic algorithms for solving MDPs?
Bellman's Principle of Optimality: every sub-policy of an optimal policy is itself optimal for the subproblem it faces
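In the notation of this section, the principle leads to the Bellman optimality equation for the optimal utility \(U^*\), which the algorithms below solve:
\[U^*(s) = \max_{a \in A}\left(R(s, a) + \gamma \sum_{s' \in S} T(s' \mid s, a)\, U^*(s')\right)\]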
Algorithm: Policy Iteration
Given: MDP \((S, A, R, T, \gamma)\)
(Policy iteration notebook)
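Since the slide defers to the notebook, here is a minimal tabular sketch of policy iteration (my own Python, using the same T[s][a][s'] and R[s][a] representation as the toy MDP above), alternating exact policy evaluation with greedy policy improvement:

```python
import numpy as np

def policy_iteration(S, A, T, R, gamma):
    """Tabular policy iteration; returns an optimal policy and its utilities.

    Assumes S = [0, 1, ..., n-1] so that states index arrays directly.
    """
    n = len(S)
    pi = {s: A[0] for s in S}                              # arbitrary initial policy
    while True:
        # Policy evaluation: solve (I - gamma * T_pi) U = R_pi for U^pi.
        T_pi = np.array([[T[s][pi[s]][sp] for sp in S] for s in S])
        R_pi = np.array([R[s][pi[s]] for s in S])
        U = np.linalg.solve(np.eye(n) - gamma * T_pi, R_pi)
        # Policy improvement: act greedily with respect to U^pi.
        pi_new = {s: max(A, key=lambda a: R[s][a] + gamma * sum(
            T[s][a][sp] * U[sp] for sp in S)) for s in S}
        if pi_new == pi:                                   # no change: policy is optimal
            return pi, U
        pi = pi_new
```

With the toy MDP above, it can be called as policy_iteration(S, A, T, R, gamma).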
Algorithm: Value Iteration
Given: MDP \((S, A, R, T, \gamma)\), tolerance \(\epsilon\)
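A matching sketch for value iteration (again my own Python, same tabular representation), applying Bellman backups until the largest change across states falls below the tolerance \(\epsilon\), then extracting a greedy policy:

```python
def value_iteration(S, A, T, R, gamma, eps):
    """Tabular value iteration; returns a greedy policy and the converged utilities."""
    U = {s: 0.0 for s in S}
    while True:
        # Bellman backup at every state.
        U_new = {s: max(R[s][a] + gamma * sum(T[s][a][sp] * U[sp] for sp in S)
                        for a in A) for s in S}
        residual = max(abs(U_new[s] - U[s]) for s in S)    # Bellman residual
        U = U_new
        if residual < eps:
            break
    # Extract a policy that is greedy with respect to the converged U.
    pi = {s: max(A, key=lambda a: R[s][a] + gamma * sum(T[s][a][sp] * U[sp] for sp in S))
          for s in S}
    return pi, U
```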
How do we reason about the future consequences of actions in an MDP?
What are the basic algorithms for solving MDPs?
"In any small change he will have to consider only these quantitative indices (or "values") in which all the relevant information is concentrated; and by adjusting the quantities one by one, he can appropriately rearrange his dispositions without having to solve the whole puzzle ab initio, or without needing at any stage to survey it at once in all its ramifications."
-- F. A. Hayek, "The Use of Knowledge in Society", 1945