What is a Markov decision process?
What is a policy?
How do we evaluate policies?
Decision Network
Chance node
Decision node
Utility node
MDP Dynamic Decision Network
MDP Optimization problem
\[\text{maximize} \quad \text{E}\left[\sum_{t=0}^\infty r_t\right]\]
Not well formulated!
The infinite sum can diverge: e.g., any policy earning \(r_t = 1\) forever has infinite expected reward, so policies cannot be compared.
MDP Decision Network
Finite horizon: \[\text{E} \left[ \sum_{t=0}^T r_t \right]\]
Average reward: \[\lim_{n \rightarrow \infty} \text{E} \left[\frac{1}{n}\sum_{t=0}^n r_t \right]\]
Discounted reward: \[\text{E} \left[\sum_{t=0}^\infty \gamma^t r_t\right]\]
Terminal state: infinite time, but a terminal state (no reward, no leaving) is always reached with probability 1.
discount \(\gamma \in [0, 1)\)
typically 0.9, 0.95, 0.99
if \(\underline{r} \leq r_t \leq \bar{r}\)
then \[\frac{\underline{r}}{1-\gamma} \leq \sum_{t=0}^\infty \gamma^t r_t \leq \frac{\bar{r}}{1-\gamma} \]
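Both bounds follow from the geometric series: for \(\gamma \in [0, 1)\),
\[\sum_{t=0}^\infty \gamma^t = \frac{1}{1-\gamma}\]
so bounded rewards make the discounted return finite and the objective well defined.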
\((S, A, T, R, \gamma)\)
(and \(b\) and/or \(S_T\) in some contexts)
\(S = \{1,2,3\}\), e.g. \(\{\text{healthy},\text{pre-cancer},\text{cancer}\}\)
\(A = \{1,2,3\}\), e.g. \(\{\text{test},\text{wait},\text{treat}\}\)
\(T(s' \mid s, a)\)
\(R(s, a)\) or \(R(s, a, s')\)
\(s', r = G(s, a)\)
"Generative Model": Alternative to \(T\) and \(R\)
Imagine it's a cold day and you're ready to go to work. You have to decide whether to bike or drive.
Algorithm: Rollout Simulation
Inputs: MDP \((S, A, T, R, \gamma, b)\) (only need generative model, \(G\)), Policy \(\pi\), horizon \(H\)
Outputs: Utility estimate \(\hat{u}\)
\(s \gets \text{sample}(b)\)
\(\hat{u} \gets 0\)
for \(t\) in \(0 \ldots H-1\)
\(a \gets \text{sample}(\pi(a \mid s))\)
\(s', r \gets G(s, a)\)
\(\hat{u} \gets \hat{u} + \gamma^t r\)
\(s \gets s'\)
return \(\hat{u}\)
cf. Alg. 9.1, p. 184
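A direct Python transcription of this rollout (a sketch: \(G\), the initial-state sampler for \(b\), and the policy are assumed to be callables matching the inputs above):

```python
def rollout(G, b, policy, H, gamma):
    """Estimate utility with a single H-step rollout (after Alg. 9.1).

    G      -- generative model: (s, a) -> (s', r)
    b      -- callable sampling an initial state s ~ b
    policy -- callable sampling an action: s -> a
    H      -- rollout horizon
    gamma  -- discount factor
    """
    s = b()                    # s <- sample(b)
    u_hat = 0.0
    for t in range(H):         # t = 0, ..., H-1
        a = policy(s)          # a <- sample(pi(a | s))
        s_prime, r = G(s, a)   # s', r <- G(s, a)
        u_hat += gamma**t * r  # accumulate discounted reward
        s = s_prime
    return u_hat
```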
MDP Objective:
\[\text{maximize}\;U(\pi) = \text{E} \left[\sum_{t=0}^\infty \gamma^t r_t \mid \pi \right]\]
Naive Policy Evaluation (not on exam)
Let \(\tau = (s_0, a_0, r_0, s_1, a_1, r_1, \ldots)\) be a trajectory of the MDP, and \(R(\tau) = \sum_{t=0}^\infty \gamma^t r_t\) its discounted return
\[U(\pi) = \text{E} \left[\sum_{t=0}^\infty \gamma^t r_t \mid \pi \right]\]
\[U(\pi) = \text{E} \left[R(\tau) \mid \pi \right] = \sum_{\tau} R(\tau) P(\tau \mid \pi)\]
\[P(\tau \mid \pi) = b(s_0)\prod_{t=0}^\infty T(s_{t+1} \mid s_t, \pi(s_t))\]
(for a deterministic policy \(\pi\); a stochastic policy contributes additional factors \(\pi(a_t \mid s_t)\))
\[U(\pi) \approx \frac{1}{m} \sum_{i=1}^m R(\tau^{(i)})\]
\[U(\pi) \approx \bar{u}_m = \frac{1}{m} \sum_{i=1}^m \hat{u}^{(i)}\]
where \(\hat{u}^{(i)}\) is generated by a rollout simulation
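With the rollout sketch above, \(\bar{u}_m\) is just the average of \(m\) independent rollouts (a hypothetical helper building on that sketch):

```python
def evaluate_policy(G, b, policy, H, gamma, m):
    """Naive Monte Carlo policy evaluation: average m independent rollouts."""
    return sum(rollout(G, b, policy, H, gamma) for _ in range(m)) / m
```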
How can we quantify the accuracy of \(\bar{u}_m\)?
What is a Markov decision process?
What is a policy?
How do we evaluate policies?