Professor Zachary Sunberg
May 12th, 2022
Waymo Image By Dllu - Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=64517567
Two Objectives for Autonomy
Minimize resource use
(especially time)
Minimize the risk of harm to oneself and others
Safety often opposes Efficiency
Tweet by Nitin Gupta
29 April 2018
https://twitter.com/nitguptaa/status/990683818825736192
Pareto Optimization
(Pareto plot: safety vs. efficiency, with better performance further out on the frontier; separate frontiers for Model \(M_1\) with Algorithm \(A_1\) and Model \(M_2\) with Algorithm \(A_2\).)
$$\underset{\pi}{\mathop{\text{maximize}}} \, V^\pi = V^\pi_\text{E} + \lambda V^\pi_\text{S}$$
where \(V^\pi_\text{E}\) is the efficiency value, \(V^\pi_\text{S}\) is the safety value, and \(\lambda\) is the safety weight.
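As a small illustration of this scalarization (all numbers and names below are hypothetical, not results from the talk), sweeping the weight \(\lambda\) selects different points along the Pareto frontier:

```julia
# Hypothetical candidate policies with estimated efficiency (V_E) and
# safety (V_S) values; larger is better for both.
candidates = [(V_E = 10.0, V_S = -2.0),
              (V_E = 6.0,  V_S = -0.5),
              (V_E = 2.0,  V_S = -0.1)]

# For a given safety weight λ, the scalarized objective V_E + λ V_S
# picks one Pareto-optimal candidate.
best_for(λ) = argmax(c -> c.V_E + λ * c.V_S, candidates)

for λ in (0.1, 1.0, 10.0)
    c = best_for(λ)
    println("λ = $λ: V_E = $(c.V_E), V_S = $(c.V_S)")
end
```

Increasing \(\lambda\) shifts the chosen policy toward safety at the cost of efficiency.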
Aleatory
Epistemic (Static)
Epistemic (Dynamic)
Interaction
MDP
RL
POMDP
Game
Markov Decision Process (MDP)
Aleatory
Partially Observable Markov Decision Process (POMDP)
Aleatory
Epistemic (Static)
Epistemic (Dynamic)
Game
Aleatory
Epistemic (Static)
Epistemic (Dynamic)
Interaction
Image from Russell and Norvig
Solving MDPs - The Value Function
$$\underset{\pi:\, \mathcal{S}\to\mathcal{A}}{\mathop{\text{maximize}}} \, V^\pi(s) = E\left[\sum_{t=0}^{\infty} \gamma^t R(s_t, \pi(s_t)) \bigm| s_0 = s \right]$$
Involves all future time
$$V^*(s) = \underset{a\in\mathcal{A}}{\max} \left\{R(s, a) + \gamma E\Big[V^*\left(s_{t+1}\right) \mid s_t=s, a_t=a\Big]\right\}$$
Involves only \(t\) and \(t+1\)
$$Q(s,a) = R(s, a) + \gamma E\Big[V^* (s_{t+1}) \mid s_t = s, a_t=a\Big]$$
Value = expected sum of future rewards
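A minimal Julia sketch of value iteration, which applies the Bellman backup above until \(V\) converges (the toy transition matrices and rewards are illustrative, not from the talk):

```julia
# Value iteration on a toy 2-state, 2-action MDP.
# T[a][s, s′] = P(s′ | s, a); R[s, a] is the immediate reward.
function value_iteration(T, R, γ; tol=1e-9, maxiter=1000)
    nS, nA = size(R)
    V = zeros(nS)
    for _ in 1:maxiter
        # Bellman backup: V(s) = max_a { R(s,a) + γ E[V(s′) | s, a] }
        Vnew = [maximum(R[s, a] + γ * sum(T[a][s, sp] * V[sp] for sp in 1:nS)
                        for a in 1:nA) for s in 1:nS]
        maximum(abs.(Vnew .- V)) < tol && return Vnew
        V = Vnew
    end
    return V
end

T = [[0.9 0.1; 0.2 0.8],   # action 1
     [0.5 0.5; 0.0 1.0]]   # action 2
R = [1.0 0.0; -1.0 2.0]
println(value_iteration(T, R, 0.95))
```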
(Figure: optimal policy for a 1D problem, state vs. timestep; observations are accurate in one region of the state space, so the optimal policy first moves there to localize, then takes \(a=0\) at the goal \(s=0\).)
(Diagram: the environment produces observations; the belief updater maintains the belief \(b\); the policy/planner maps \(b\) to an action \(a\), which is applied to the environment.)
\[b_t(s) = P\left(s_t = s \mid a_1, o_1, \ldots, a_{t-1}, o_{t-1}\right)\]
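For a finite state space, this belief can be maintained with an exact Bayes filter. A minimal Julia sketch (the transition function `T`, observation likelihood `Z`, and signature are illustrative, not the POMDPs.jl updater API):

```julia
# Exact discrete Bayes filter: after taking action a and observing o,
#   b′(s′) ∝ Z(o | s′, a) Σ_s T(s′ | s, a) b(s)
function update_belief(b, a, o, T, Z)
    bp = [Z(o, sp, a) * sum(T(sp, s, a) * b[s] for s in eachindex(b))
          for sp in eachindex(b)]
    return bp ./ sum(bp)   # normalize so the belief sums to 1
end
```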
Example: true state \(s = 7\); observation \(o = -0.21\).
SARSOP can solve some POMDPs with thousands of states offline,
but finite-horizon POMDP planning is PSPACE-complete:
intractable in general!
Online Tree Search in MDPs
Time
Estimate \(Q(s, a)\) based on children
$$Q(s,a) = R(s, a) + \gamma E\Big[V^* (s_{t+1}) \mid s_t = s, a_t=a\Big]$$
\[V(s) = \max_a Q(s,a)\]
Expand for all actions (\(\left|\mathcal{A}\right| = 2\) in this case)
...
Instead of expanding for all \(\left|\mathcal{S}\right|\) states, sample only \(C=3\) states
...
[Kearns et al., 2002]
1. Near-optimal policy: \(\left|V^A(s) - V^*(s) \right|\leq \epsilon\)
2. Running time independent of state space size:
\(O \left( ( \left|\mathcal{A} \right|C )^H \right) \)
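A sketch of this sparse sampling recursion in Julia (`step` is an assumed black-box simulator returning a reward and a sampled next state; it is not from any particular library):

```julia
# Sparse sampling in the spirit of [Kearns et al., 2002]: expand every
# action, but sample only C next states per action, to depth H.
# Runtime is O((|A|C)^H), independent of the size of the state space.
function sparse_sampling_V(step, actions, s, C, H, γ)
    H == 0 && return 0.0
    function Q(a)
        total = 0.0
        for _ in 1:C
            r, sp = step(s, a)   # black-box simulator: (s, a) ↦ (r, s′)
            total += r + γ * sparse_sampling_V(step, actions, sp, C, H - 1, γ)
        end
        return total / C         # Monte Carlo estimate of Q(s, a)
    end
    return maximum(Q(a) for a in actions)
end
```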
[Ross, 2008] [Silver, 2010]
*(Partially Observable Monte Carlo Planning)
Fails in Continuous Observation Spaces
POMCP
POMCP-DPW
POMCPOW
[Sunberg and Kochenderfer, ICAPS 2018]
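The device that lets POMCPOW cope with continuous observation spaces is progressive widening, which only adds a new child node while the number of children is small relative to the visit count. A minimal sketch of the widening test (`k` and `α` are the standard DPW hyperparameters; this is not the full POMCPOW algorithm):

```julia
# Allow a new (e.g. observation) child only while the number of children
# grows sublinearly with the node's visit count N.
widen(n_children, N; k = 4.0, α = 0.5) = n_children ≤ k * N^α
```

With \(k = 4\) and \(\alpha = 0.5\), a node visited 100 times keeps at most about 40 children; further simulations revisit existing branches, so the tree stays finite even when no observation is ever sampled twice.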
MDP trained on normal drivers
MDP trained on all drivers
Omniscient
POMCPOW (Ours)
Simulation results
[Sunberg & Kochenderfer, T-ITS Under Review]
Our simplified algorithm is near-optimal
[Lim, Tomlin, & Sunberg, IJCAI 2020]
Conventional 1D POMDP
2D POMDP
Intention-Aware Navigation in Crowds with Extended-Space POMDP Planning. Gupta, H.; Hayes, B.; and Sunberg, Z. AAMAS, 2022.
[Mern, Sunberg, et al. AAAI 2021]
[Lim, Tomlin, & Sunberg CDC 2021]
[Peters, Tomlin, and Sunberg 2020]
(Diagram enumerating elements \(1, 2, \ldots, N\).)
Tyler Becker and Zachary Sunberg. "Imperfect Information Games and Counterfactual Regret Minimization in Space Domain Awareness". Abstract under review for the Advanced Maui Optical and Space Surveillance Technologies conference.
POMDPs.jl - An interface for defining and solving MDPs and POMDPs in Julia
Explicit
Black Box
("Generative" in POMDP lit.)
\((s, a) \;\to\; (s', o, r)\)
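As a sketch of the black-box side of the interface, a model can be defined for POMDPs.jl by implementing the generative function `gen`, which maps \((s, a)\) to \((s', o, r)\). The toy light-dark-style dynamics below are illustrative, not code from the talk:

```julia
using POMDPs
using Random: AbstractRNG

# A simplified 1D POMDP with state Float64, action Int, observation Float64.
struct LightDark1D <: POMDP{Float64, Int, Float64} end

POMDPs.discount(::LightDark1D) = 0.95

function POMDPs.gen(::LightDark1D, s::Float64, a::Int, rng::AbstractRNG)
    sp = s + a                           # transition: move by the action
    o = sp + abs(sp - 5.0) * randn(rng)  # observations accurate near s = 5
    r = a == 0 ? (abs(sp) < 1.0 ? 10.0 : -10.0) : -1.0
    return (sp = sp, o = o, r = r)       # (s′, o, r), matching the diagram
end
```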
Previous C++ framework: APPL
"At the moment, the three packages are independent. Maybe one day they will be merged in a single coherent framework."
Celeste Project
1.54 Petaflops
(Figure: individual infectiousness vs. infection age; incident infections over time.)
Test sensitivity is secondary to frequency and turnaround time for COVID-19 surveillance
Larremore et al.
Viral load is represented by a piecewise-linear hinge function
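A sketch of such a hinge function in Julia (parameter names and default values are assumptions for illustration, not the exact model from Larremore et al.): log viral load rises linearly from the detection limit at onset to a peak, then declines linearly until clearance.

```julia
# Piecewise-linear "hinge" model of log10 viral load over time t (days).
# v_lod: log10 load at the limit of detection (assumed default).
function log10_viral_load(t, t_onset, t_peak, v_peak, t_clear; v_lod = 3.0)
    if t < t_onset || t > t_clear
        return 0.0                                   # undetectable
    elseif t <= t_peak                               # rising segment
        return v_lod + (v_peak - v_lod) * (t - t_onset) / (t_peak - t_onset)
    else                                             # declining segment
        return v_peak + (v_lod - v_peak) * (t - t_peak) / (t_clear - t_peak)
    end
end
```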