Markov Decision Processes
Last Time
- What does "Markov" mean in "Markov Process"?
Guiding Questions
- What is a Markov decision process?
- What is a policy?
- How do we evaluate policies?
Decision Networks and MDPs
Decision Network
Chance node
Decision node
Utility node
MDP Dynamic Decision Network
MDP Optimization problem
\[\text{maximize} \quad \text{E}\left[\sum_{t=0}^\infty r_t\right]\]
Not well formulated: the sum can be infinite.
Finite MDP Objectives
- Finite time: \[\text{E} \left[ \sum_{t=0}^T r_t \right]\]
- Average reward: \[\underset{n \rightarrow \infty}{\text{lim}} \text{E} \left[\frac{1}{n}\sum_{t=0}^n r_t \right] \]
- Discounting: \[\text{E} \left[\sum_{t=0}^\infty \gamma^t r_t\right]\]
- Terminal states: infinite time, but a terminal state (no reward, no leaving) is always reached with probability 1.
Discount \(\gamma \in [0, 1)\), typically 0.9, 0.95, or 0.99.
If \(\underline{r} \leq r_t \leq \bar{r}\) for all \(t\), then \[\frac{\underline{r}}{1-\gamma} \leq \sum_{t=0}^\infty \gamma^t r_t \leq \frac{\bar{r}}{1-\gamma} \]
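The bound on the discounted sum follows from the geometric series. A short derivation of the upper bound (the lower bound is symmetric, with \(\underline{r}\) in place of \(\bar{r}\)):
\[\sum_{t=0}^\infty \gamma^t r_t \leq \sum_{t=0}^\infty \gamma^t \bar{r} = \bar{r} \sum_{t=0}^\infty \gamma^t = \frac{\bar{r}}{1-\gamma}\]
As an illustration with assumed numbers: for \(\gamma = 0.95\) and rewards in \([-1, 1]\), the discounted sum lies in \([-20, 20]\).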
MDP "Tuple Definition"
\((S, A, T, R, \gamma)\)
(and \(b\) in some contexts)
- \(S\) (state space) - set of all possible states
- \(A\) (action space) - set of all possible actions
- \(T\) (transition distribution) - explicit or implicit ("generative") model of how the state changes
- \(R\) (reward function) - maps each state and action to a reward
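- \(\gamma\) (discount factor) - weights a reward received at step \(t\) by \(\gamma^t\)
- \(b\) (initial state distribution) - distribution from which the first state is sampled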
Example state spaces: \(\{1,2,3\}\), \(\{\text{healthy},\text{pre-cancer},\text{cancer}\}\), \((x,y) \in \mathbb{R}^2\), \((s, i, r) \in \mathbb{N}^3\), \(\{0,1\}\times\mathbb{R}^4\)
Example action spaces: \(\{1,2,3\}\), \(\{\text{test},\text{wait},\text{treat}\}\), \(\mathbb{R}^2\), \(\{0,1\}\times\mathbb{R}^2\)
Explicit transition distribution: \(T(s' \mid s, a)\)
Reward function: \(R(s, a)\) or \(R(s, a, s')\)
Generative model: \(s', r = G(s, a)\)
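To make the tuple concrete, here is a minimal Python sketch of a container for \((S, A, T, R, \gamma)\) with a generative model \(G\) built from an explicit \(T\). The class name `TabularMDP`, its field names, and the three-state toy problem are hypothetical and chosen only for illustration; they are not from the lecture.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, Hashable, List, Tuple

State = Hashable
Action = Hashable


@dataclass
class TabularMDP:
    """Hypothetical container mirroring the tuple (S, A, T, R, gamma)."""
    states: List[State]                                         # S
    actions: List[Action]                                       # A
    transition: Dict[Tuple[State, Action], Dict[State, float]]  # T(s' | s, a)
    reward: Callable[[State, Action], float]                    # R(s, a)
    gamma: float                                                # discount factor

    def generate(self, s: State, a: Action, rng: random.Random) -> Tuple[State, float]:
        """Generative model s', r = G(s, a), sampled from the explicit T and R."""
        dist = self.transition[(s, a)]
        next_states = list(dist.keys())
        weights = list(dist.values())
        s_next = rng.choices(next_states, weights=weights, k=1)[0]
        return s_next, self.reward(s, a)


# Made-up toy problem: "advance" moves one state to the right (capped at 3),
# "stay" does nothing, and being in state 3 pays a reward of 1.
states = [1, 2, 3]
actions = ["stay", "advance"]
transition = {}
for s in states:
    transition[(s, "stay")] = {s: 1.0}
    transition[(s, "advance")] = {min(s + 1, 3): 1.0}

toy = TabularMDP(states, actions, transition,
                 reward=lambda s, a: 1.0 if s == 3 else 0.0, gamma=0.95)
```

The explicit form stores every \(T(s' \mid s, a)\), which only works for small discrete spaces; for continuous spaces such as \(\mathbb{R}^2\), typically only the generative form \(G\) would be implemented.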
MDP Example
Imagine it's a cold day and you're ready to go to work. You have to decide whether to bike or drive.
- If you drive, you will have to pay $15 for parking; biking is free.
- On 1% of cold days, the ground is covered in ice and you will crash if you bike, but you can't discover this until you start riding. After your crash, you limp home with pain equivalent to losing $100.
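One possible encoding of this example is sketched below. The state names (`"home"`, `"work"`, `"crashed"`), the single-decision structure, and the helper names are assumptions made for illustration; other formulations (for example, ones that allow turning back) are equally valid. Only the 1% ice probability, the $15 parking cost, and the $100 crash cost come from the example.

```python
import random

P_ICE = 0.01          # from the example: 1% of cold days are icy
PARKING_COST = 15.0   # dollars, paid when driving
CRASH_COST = 100.0    # dollar-equivalent pain of a crash


def outcomes(s, a):
    """List (probability, next_state, reward) for each outcome of (s, a)."""
    if s != "home":
        # Assumed terminal states: no reward, no leaving.
        return [(1.0, s, 0.0)]
    if a == "drive":
        return [(1.0, "work", -PARKING_COST)]
    if a == "bike":
        return [(P_ICE, "crashed", -CRASH_COST),
                (1.0 - P_ICE, "work", 0.0)]
    raise ValueError(f"unknown action: {a}")


def G(s, a, rng):
    """Generative model: sample (s', r) from the outcome list."""
    outs = outcomes(s, a)
    weights = [p for p, _, _ in outs]
    _, s_next, r = rng.choices(outs, weights=weights, k=1)[0]
    return s_next, r


def expected_reward(s, a):
    """Expected immediate reward of taking action a in state s."""
    return sum(p * r for p, _, r in outcomes(s, a))
```

Under these assumptions, `expected_reward("home", a)` compares the two actions on average, while `G` produces individual sampled outcomes for simulation.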
Policies and Simulation
- A policy, denoted \(\pi\) as in \(a_t = \pi(s_t)\), is a function mapping every state to an action.
- When a policy is combined with a Markov decision process, the state sequence becomes a Markov process with \[P(s' \mid s) = T(s' \mid s, \pi(s))\]
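The relation \(P(s' \mid s) = T(s' \mid s, \pi(s))\) can be sketched for a small discrete problem as a dictionary lookup. The function name `induced_chain`, the two-state problem, and its probabilities are made up for illustration.

```python
def induced_chain(transition, policy):
    """Return P[s][s'] = T(s' | s, pi(s)) for every state the policy covers."""
    return {s: dict(transition[(s, a)]) for s, a in policy.items()}


# Hypothetical two-state MDP with made-up transition probabilities.
T = {
    ("a", "stay"): {"a": 0.9, "b": 0.1},
    ("a", "go"):   {"a": 0.2, "b": 0.8},
    ("b", "stay"): {"b": 1.0},
    ("b", "go"):   {"a": 0.5, "b": 0.5},
}
pi = {"a": "go", "b": "stay"}  # a deterministic policy: state -> action

P = induced_chain(T, pi)
```

Each row `P[s]` is a distribution over next states that no longer depends on an action, which is what makes the result a Markov process rather than a decision process.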
MDP Simulation
Algorithm: Rollout Simulation
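Given: MDP \((S, A, T, R, \gamma, b)\), policy \(\pi\), number of steps \(T\) (here \(T\) in the loop bound is the horizon, not the transition distribution)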
\(s \gets \text{sample}(b)\)
\(\hat{u} \gets 0\)
for \(t\) in \(0 \ldots T-1\)
    \(a \gets \pi(s)\)
    \(s', r \gets G(s, a)\)
    \(\hat{u} \gets \hat{u} + \gamma^t r\)
    \(s \gets s'\)
return \(\hat{u}\)
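A direct Python transcription of the rollout algorithm might look like the sketch below. The function signature, the explicit `rng` argument, and the counter-style toy problem are assumptions for illustration; the loop body follows the pseudocode step by step.

```python
import random


def rollout(G, policy, sample_initial, gamma, horizon, rng):
    """Simulate one trajectory and return its discounted return u_hat."""
    s = sample_initial(rng)          # s <- sample(b)
    u_hat = 0.0
    for t in range(horizon):         # t = 0 ... T-1
        a = policy(s)                # a <- pi(s)
        s_next, r = G(s, a, rng)     # s', r <- G(s, a)
        u_hat += gamma**t * r        # accumulate discounted reward
        s = s_next
    return u_hat


# Hypothetical toy problem: the state is a step counter and every step pays 1.
def toy_G(s, a, rng):
    return s + 1, 1.0


u = rollout(toy_G, policy=lambda s: "noop", sample_initial=lambda rng: 0,
            gamma=0.5, horizon=3, rng=random.Random(0))
```

For this deterministic toy problem the return is \(1 + 0.5 + 0.25 = 1.75\), which gives a quick check that the discounting is applied with the right exponent.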
Break
- Suggest a policy that you think is optimal for the icy day problem
Utility
Slide not on Exam
Policy Evaluation
Naive Policy Evaluation not on Exam
Monte Carlo Policy Evaluation
- Running a large number of simulations and averaging the accumulated reward is called Monte Carlo evaluation.
Let \(\tau = (s_0, a_0, r_0, s_1, \ldots, s_T)\) be a trajectory of the MDP
\[U(\pi) \approx \frac{1}{m} \sum_{i=1}^m R(\tau^{(i)})\]
\[U(\pi) \approx \bar{u}_m = \frac{1}{m} \sum_{i=1}^m \hat{u}^{(i)}\]
where \(\hat{u}^{(i)}\) is generated by a rollout simulation
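The estimate \(\bar{u}_m\) can be sketched in Python as an average over \(m\) independent rollouts. The function names and the coin-flip toy problem are hypothetical; the compact `rollout` helper is included only to keep the example self-contained.

```python
import random


def rollout(G, policy, s0, gamma, horizon, rng):
    """One rollout: accumulate discounted reward for `horizon` steps."""
    s, u_hat = s0, 0.0
    for t in range(horizon):
        s, r = G(s, policy(s), rng)
        u_hat += gamma**t * r
    return u_hat


def monte_carlo_evaluate(G, policy, sample_initial, gamma, horizon, m, rng):
    """Return (u_bar_m, list of the m individual rollout returns)."""
    returns = [rollout(G, policy, sample_initial(rng), gamma, horizon, rng)
               for _ in range(m)]
    return sum(returns) / m, returns


# Hypothetical toy problem: each step pays 1 with probability 0.5, else 0.
def coin_G(s, a, rng):
    return s, (1.0 if rng.random() < 0.5 else 0.0)


u_bar, returns = monte_carlo_evaluate(
    coin_G, policy=lambda s: "noop", sample_initial=lambda rng: 0,
    gamma=0.9, horizon=1, m=10_000, rng=random.Random(0))
```

With a one-step horizon the true value of this toy problem is 0.5, so `u_bar` should land close to that, with the remaining gap shrinking as `m` grows.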
How can we quantify the accuracy of \(\bar{u}_m\)?
Standard Error of the Mean
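A common estimate of the standard error of the mean is \(\text{SEM} = s / \sqrt{m}\), where \(s\) is the sample standard deviation of the \(m\) rollout returns \(\hat{u}^{(i)}\). A minimal sketch, assuming the returns are independent and already collected in a list (the function name `mean_and_sem` is hypothetical):

```python
import math
import statistics


def mean_and_sem(returns):
    """Return (sample mean, standard error of the mean) of rollout returns."""
    m = len(returns)
    u_bar = statistics.mean(returns)
    # statistics.stdev uses the (m - 1) denominator, i.e. the sample std dev.
    sem = statistics.stdev(returns) / math.sqrt(m)
    return u_bar, sem


# Made-up returns, for illustration only.
u_bar, sem = mean_and_sem([1.0, 2.0, 3.0, 4.0, 5.0])
```

Because the SEM shrinks like \(1/\sqrt{m}\), halving the uncertainty in \(\bar{u}_m\) takes roughly four times as many rollouts.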
Value Function-Based Policy Evaluation
Guiding Questions
- What is a Markov decision process?
- What is a policy?
- How do we evaluate policies?
040 Markov Decision Processes
By Zachary Sunberg