Zachary Sunberg
Assistant Professor
CU Boulder
Waymo Image By Dllu - Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=64517567
Two Objectives for Autonomy
Minimize resource use
(especially time)
Minimize the risk of harm to oneself and others
Safety often opposes Efficiency
Pareto Optimization
[Plot: Pareto curves of safety vs. efficiency for Model \(M_1\), Algorithm \(A_1\) and Model \(M_2\), Algorithm \(A_2\); better performance lies toward high safety and high efficiency]
$$\underset{\pi}{\mathop{\text{maximize}}} \, J^\pi = J^\pi_\text{E} + \lambda J^\pi_\text{S}$$
\(J^\pi_\text{E}\): efficiency objective; \(\lambda\): weight; \(J^\pi_\text{S}\): safety objective
Types of Uncertainty
Aleatory
Epistemic (Static)
Epistemic (Dynamic)
$$\underset{\pi}{\mathop{\text{maximize}}} \sum_{t=0}^\infty \gamma^t R(s_t, a_t)$$
$$s_{t+1} = f(s_t, a_t, w_t)$$
Policy: \(a_t = \pi(s_t)\)
Markov Decision Process (MDP)
$$\underset{\pi}{\mathop{\text{maximize}}} \sum_{t=0}^\infty \gamma^t R(s_t, a_t)$$
$$s_{t+1} = f(s_t, a_t, w_t)$$
Policy: \(a_t = \pi(a_0, o_0, a_1, o_1, ... o_{t-1})\)
$$o_{t} = h(s_t, a_t, s_{t+1}, v_t)$$
Partially Observable Markov Decision Process (POMDP)
[Plot: 1-D localization problem, state vs. timestep, with accurate observations available only in one region. Goal: take \(a=0\) at \(s=0\). Optimal policy: first localize where observations are accurate, then take \(a=0\) at \(s=0\).]
[Diagram: Environment → Belief Updater → Policy loop; the belief updater passes belief \(b\) to the policy, which returns action \(a\) to the environment]
\[b_t(s) = P\left(s_t = s \mid a_0, o_1, \ldots, a_{t-1}, o_{t}\right) \\ \quad\quad\quad\propto Z(o_{t} \mid a_{t-1}, s) \sum_{s'} T(s \mid s', a_{t-1}) \, b_{t-1}(s')\]
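A minimal sketch of this update as a discrete Bayes filter in Julia; the callables `T(sp, a, s)` and `Z(o, a, sp)` and the `states` list are hypothetical stand-ins for the models above:

```julia
# Discrete Bayes filter sketch (hypothetical T, Z, and state list).
# b[j] = P(s_{t-1} = states[j] | history); returns the updated belief.
function update_belief(b::Vector{Float64}, a, o, states, T, Z)
    bp = similar(b)
    for (i, sp) in enumerate(states)
        pred = sum(T(sp, a, s) * b[j] for (j, s) in enumerate(states))  # prediction step
        bp[i] = Z(o, a, sp) * pred                                      # measurement update
    end
    return bp ./ sum(bp)  # normalize (the ∝ in the equation)
end
```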
[Belief update example: true state \(s = 7\), observation \(o = -0.21\)]
Cancer Screening and Treatment
Your Task:
Solving MDPs - The Value Function
$$\underset{\pi:\, \mathcal{S}\to\mathcal{A}}{\mathop{\text{maximize}}} \, V^\pi(s) = E\left[\sum_{t=0}^{\infty} \gamma^t R(s_t, \pi(s_t)) \bigm| s_0 = s \right]$$
Value = expected sum of future rewards; involves all future time
$$V^*(s) = \underset{a\in\mathcal{A}}{\max} \left\{R(s, a) + \gamma E\Big[V^*\left(s_{t+1}\right) \mid s_t=s, a_t=a\Big]\right\}$$
Involves only \(t\) and \(t+1\)
$$Q(s,a) = R(s, a) + \gamma E\Big[V^* (s_{t+1}) \mid s_t = s, a_t=a\Big]$$
Repeatedly Apply Bellman Equation
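Repeatedly applying the Bellman equation to a finite MDP is value iteration. A minimal Julia sketch, assuming a hypothetical reward matrix `R` and per-action transition matrices `T`:

```julia
# Value iteration: apply the Bellman backup until V stops changing.
# T[a] is an |S|×|S| transition matrix, R is an |S|×|A| reward matrix (hypothetical).
function value_iteration(T::Vector{Matrix{Float64}}, R::Matrix{Float64};
                         γ::Float64=0.95, tol::Float64=1e-6)
    nS, nA = size(R)
    V = zeros(nS)
    while true
        Q = [R[s, a] + γ * (T[a][s, :]' * V) for s in 1:nS, a in 1:nA]
        Vnew = vec(maximum(Q, dims=2))              # V(s) = max_a Q(s, a)
        maximum(abs.(Vnew - V)) < tol && return Vnew
        V = Vnew
    end
end
```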
Online: Tree Search
[Tree diagram: depth corresponds to time] Estimate \(Q(s, a)\) based on children
$$Q(s,a) = R(s, a) + \gamma E\Big[V^* (s_{t+1}) \mid s_t = s, a_t=a\Big]$$
\[V(s) = \max_a Q(s,a)\]
Monte Carlo Tree Search
Image by Dicksonlaw583 (CC 4.0)
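A minimal UCT-style sketch of the idea in Julia, assuming a hypothetical generative model `step(s, a) -> (sp, r)` and finite action list `acts`; it illustrates the \(Q\) backup above, not the exact solver used later:

```julia
# One node per visited state: visit counts N and value estimates Q per action.
mutable struct Node
    N::Dict{Any,Int}
    Q::Dict{Any,Float64}
end
Node(acts) = Node(Dict(a => 0 for a in acts), Dict(a => 0.0 for a in acts))

# Recursive simulation: select by UCB1, expand, backup a running mean of Q.
function simulate!(tree::Dict, s, d::Int, step, acts; γ=0.95, c=1.0)
    d == 0 && return 0.0
    node = get!(tree, s) do
        Node(acts)
    end
    Ns = sum(values(node.N)) + 1
    # UCB1: favor actions with high Q or low visit count
    a = argmax(a -> node.Q[a] + c * sqrt(log(Ns) / (node.N[a] + 1)), acts)
    sp, r = step(s, a)
    q = r + γ * simulate!(tree, sp, d - 1, step, acts; γ=γ, c=c)
    node.N[a] += 1
    node.Q[a] += (q - node.Q[a]) / node.N[a]   # incremental mean backup
    return q
end
```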
[Diagram: Environment → Belief Updater → Policy loop; observation \(o\) feeds the belief updater, belief \(b\) feeds the policy, and action \(a\) returns to the environment]
[Ross, 2008] [Silver, 2010]
*(Partially Observable Monte Carlo Planning)
Tweet by Nitin Gupta
29 April 2018
https://twitter.com/nitguptaa/status/990683818825736192
Intelligent Driver Model (IDM)
[Treiber, et al., 2000] [Kesting, et al., 2007] [Kesting, et al., 2009]
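For reference, the IDM acceleration rule from [Treiber, et al., 2000] in Julia; the parameter defaults here are typical published values, not the calibration used in these experiments:

```julia
# Intelligent Driver Model acceleration.
# v: ego speed, Δv: approach rate to the leader, s: gap to the leader.
function idm_accel(v, Δv, s; v0=33.3, T=1.5, s0=2.0, a=1.4, b=2.0, δ=4)
    s_star = s0 + max(0.0, v*T + v*Δv / (2*sqrt(a*b)))  # desired gap
    return a * (1 - (v/v0)^δ - (s_star/s)^2)
end
```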
[Plot: simulation results; planners compared: all drivers normal (no learning, MDP), omniscient (internal states known), and POMCPOW (ours) [Sunberg, 2017]]
Internal parameter distributions: uniform marginal distribution with a copula conditional distribution; shown for \(\rho = 0\), \(\rho = 0.75\), and \(\rho = 1\)
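A sketch of how such correlated parameters can be drawn with a Gaussian copula (uniform marginals, correlation \(\rho\)); this is an illustrative construction, not necessarily the exact one used:

```julia
using Distributions, LinearAlgebra

# Draw n samples of dim correlated parameters, each uniform on [0,1].
# Valid for ρ < 1; ρ = 1 is the fully correlated (singular) limit.
function sample_copula(ρ::Float64, n::Int; dim::Int=2)
    Σ = fill(ρ, dim, dim) + (1 - ρ) * I      # correlation matrix
    z = rand(MvNormal(zeros(dim), Σ), n)     # correlated Gaussians, dim×n
    return cdf.(Normal(), z)                 # push through Φ → uniform marginals
end
```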
[Plot: simulation results comparing assume-normal, no-learning (MDP), omniscient, mean MPC, QMDP, and POMCPOW (ours) planners [Sunberg, 2017]]
POMCP: -18.46
POMCP-DPW: -18.46
POMCPOW: 51.85
$$a' = a + \eta \nabla_a Q(s, a)$$
[Lim, Tomlin, & Sunberg CDC 2021]
[Mern, Sunberg, et al. AAAI 2021]
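A small sketch of this refinement step in Julia, using ForwardDiff for the gradient; `Q̂` is a hypothetical differentiable value estimate, not the estimator from these papers:

```julia
using ForwardDiff

# One gradient-ascent step on a continuous action: a' = a + η ∇ₐ Q̂(s, a).
function refine_action(Q̂, s, a::AbstractVector; η=0.1)
    return a .+ η .* ForwardDiff.gradient(a -> Q̂(s, a), a)
end
```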
POMDPs.jl - An interface for defining and solving MDPs and POMDPs in Julia
[Egorov, Sunberg, et al., 2017]
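For flavor, a minimal problem definition with the POMDPs.jl explicit interface; this tiny tiger-style problem is hypothetical, not from the paper:

```julia
using POMDPs
using POMDPTools: Deterministic, SparseCat

# Bool states (tiger left?), Symbol actions, Bool observations.
struct TigerLike <: POMDP{Bool, Symbol, Bool} end

POMDPs.states(::TigerLike)       = (true, false)
POMDPs.actions(::TigerLike)      = (:listen, :open_left, :open_right)
POMDPs.observations(::TigerLike) = (true, false)
POMDPs.discount(::TigerLike)     = 0.95
POMDPs.initialstate(::TigerLike) = SparseCat((true, false), (0.5, 0.5))

# Listening leaves the state unchanged; opening a door resets it.
POMDPs.transition(::TigerLike, s, a) =
    a == :listen ? Deterministic(s) : SparseCat((true, false), (0.5, 0.5))
# Listening gives a noisy reading of the state; opening gives no information.
POMDPs.observation(::TigerLike, a, sp) =
    a == :listen ? SparseCat((sp, !sp), (0.85, 0.15)) : SparseCat((true, false), (0.5, 0.5))
POMDPs.reward(::TigerLike, s, a) =
    a == :listen ? -1.0 : (a == :open_left) == s ? -100.0 : 10.0
```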
Celeste Project: 1.54 petaflops
Previous C++ framework: APPL
"At the moment, the three packages are independent. Maybe one day they will be merged in a single coherent framework."
[Egorov, Sunberg, et al., 2017]
The content of my research reflects my opinions and conclusions, and is not necessarily endorsed by my funding organizations.