What is a Markov decision process?
What is a policy?
How do we evaluate policies?
Decision Network
Chance node
Decision node
Utility node
MDP Dynamic Decision Network
MDP Optimization problem
\[\text{maximize} \quad \text{E}\left[\sum_{t=1}^\infty r_t\right]\]
Not well formulated: the expected sum of rewards can be infinite.
Finite horizon: \[\text{E} \left[ \sum_{t=0}^T r_t \right]\]
Average reward: \[\lim_{n \rightarrow \infty} \text{E} \left[\frac{1}{n}\sum_{t=0}^n r_t \right] \]
Discounted reward: \[\text{E} \left[\sum_{t=0}^\infty \gamma^t r_t\right]\]
Terminal state: infinite time, but a terminal state (no reward, no leaving) is always reached with probability 1.
discount factor \(\gamma \in [0, 1)\)
typically 0.9, 0.95, or 0.99
if \(\underline{r} \leq r_t \leq \bar{r}\)
then \[\frac{\underline{r}}{1-\gamma} \leq \sum_{t=0}^\infty \gamma^t r_t \leq \frac{\bar{r}}{1-\gamma} \]
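As a quick sanity check (an illustrative instance, not from the slide): with \(\gamma = 0.9\) and rewards bounded by \(0 \leq r_t \leq 1\),
\[0 \leq \sum_{t=0}^\infty \gamma^t r_t \leq \frac{1}{1-0.9} = 10\]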
\((S, A, T, R, \gamma)\)
(and an initial state distribution \(b\) in some contexts)
Example state spaces \(S\):
\(\{1,2,3\}\)
\(\{\text{healthy},\text{pre-cancer},\text{cancer}\}\)
\((x,y) \in \mathbb{R}^2\)
\((s, i, r) \in \mathbb{N}^3\)
\(\{0,1\}\times\mathbb{R}^4\)

Example action spaces \(A\):
\(\{1,2,3\}\)
\(\{\text{test},\text{wait},\text{treat}\}\)
\(\mathbb{R}^2\)
\(\{0,1\}\times\mathbb{R}^2\)
Transition model: \(T(s' \mid s, a)\)
Reward model: \(R(s, a)\) or \(R(s, a, s')\)
Generative model: \(s', r = G(s, a)\)
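As a minimal sketch of these three interfaces, here is a hypothetical two-state Python toy problem (the states, actions, and probabilities are invented for illustration; only the \(T\), \(R\), \(G\) signatures mirror the definitions above):

```python
import random

STATES = [0, 1]            # state space S
ACTIONS = ["stay", "go"]   # action space A

def T(s_next, s, a):
    """Transition model T(s' | s, a): probability of landing in s_next."""
    p_switch = 0.8 if a == "go" else 0.1
    return p_switch if s_next != s else 1.0 - p_switch

def R(s, a):
    """Reward model R(s, a)."""
    return 1.0 if (s == 1 and a == "stay") else 0.0

def G(s, a):
    """Generative model: sample s', r = G(s, a), consistent with T and R."""
    r = R(s, a)
    s_next = random.choices(STATES, weights=[T(sp, s, a) for sp in STATES])[0]
    return s_next, r
```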
Imagine it's a cold day and you're ready to go to work. You have to decide whether to bike or drive.
Algorithm: Rollout Simulation
Given: MDP \((S, A, R, T, \gamma, b)\), policy \(\pi\), horizon \(T\)
\(s \gets \text{sample}(b)\)
\(\hat{u} \gets 0\)
for \(t\) in \(0 \ldots T-1\)
\(a \gets \pi(s)\)
\(s', r \gets G(s, a)\)
\(\hat{u} \gets \hat{u} + \gamma^t r\)
\(s \gets s'\)
return \(\hat{u}\)
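A direct Python rendering of this pseudocode, under the assumption that `sample_b` draws an initial state from \(b\), `policy` implements \(\pi\), and `G` is a generative model like the sketch above (all names are illustrative):

```python
def rollout(sample_b, policy, G, gamma, horizon):
    """Simulate one trajectory and return its discounted reward sum."""
    s = sample_b()               # s <- sample(b)
    u_hat = 0.0                  # u_hat <- 0
    for t in range(horizon):     # for t in 0 ... T-1
        a = policy(s)            # a <- pi(s)
        s_next, r = G(s, a)      # s', r <- G(s, a)
        u_hat += gamma**t * r    # u_hat <- u_hat + gamma^t r
        s = s_next               # s <- s'
    return u_hat
```

For the toy model above, `rollout(lambda: 0, lambda s: "go", G, 0.95, 100)` produces one sampled return.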
Naive Policy Evaluation (not on the exam)
Let \(\tau = (s_0, a_0, r_0, s_1, \ldots, s_T)\) be a trajectory of the MDP
\[U(\pi) \approx \frac{1}{m} \sum_{i=1}^m R(\tau^{(i)})\]
where \(R(\tau) = \sum_t \gamma^t r_t\) is the discounted return of trajectory \(\tau\); equivalently,
\[U(\pi) \approx \bar{u}_m = \frac{1}{m} \sum_{i=1}^m \hat{u}^{(i)}\]
where \(\hat{u}^{(i)}\) is generated by a rollout simulation
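A short sketch of this Monte Carlo estimate, reusing the hypothetical `rollout` above:

```python
def evaluate_policy(sample_b, policy, G, gamma, horizon, m):
    """Monte Carlo policy evaluation: average m independent rollout returns."""
    returns = [rollout(sample_b, policy, G, gamma, horizon) for _ in range(m)]
    return sum(returns) / m, returns   # (u_bar_m, the individual u_hat's)
```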
How can we quantify the accuracy of \(\bar{u}_m\)?
Standard Error of the Mean: \[\text{SE} = \frac{\hat{\sigma}}{\sqrt{m}}\] where \(\hat{\sigma}\) is the sample standard deviation of \(\hat{u}^{(1)}, \ldots, \hat{u}^{(m)}\)
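Continuing the sketch, the standard error can be computed from the rollout returns (assuming at least two samples):

```python
import statistics

def standard_error(returns):
    """Standard error of the mean: sample standard deviation over sqrt(m)."""
    return statistics.stdev(returns) / len(returns) ** 0.5
```

An approximate 95% confidence interval for \(U(\pi)\) is then \(\bar{u}_m \pm 1.96 \, \text{SE}\).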
What is a Markov decision process?
What is a policy?
How do we evaluate policies?