### Safe and efficient autonomy in the face of state and interaction uncertainty

Professor Zachary Sunberg

May 12th, 2022

## Mission: deploy autonomy with confidence

Waymo Image By Dllu - Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=64517567

## Two Objectives for Autonomy

• **Safety**: minimize the risk of harm to oneself and others
• **Efficiency**: minimize resource use (especially time)

Safety often opposes Efficiency

Tweet by Nitin Gupta, 29 April 2018

Pareto Optimization

Figure: Pareto frontier with safety and efficiency axes; curves for (Model $$M_1$$, Algorithm $$A_1$$) and (Model $$M_2$$, Algorithm $$A_2$$), with an arrow toward better performance.

$$\underset{\pi}{\mathop{\text{maximize}}} \, V^\pi = V^\pi_\text{E} + \lambda V^\pi_\text{S}$$

where $$V^\pi_\text{E}$$ is the efficiency value, $$V^\pi_\text{S}$$ is the safety value, and $$\lambda$$ is the safety weight.
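Sweeping the weight $$\lambda$$ in this scalarized objective traces out points on the Pareto frontier. A minimal sketch with hypothetical candidate policies, each summarized by an (efficiency, safety) value pair:

```python
# Sweep the safety weight lam to trace a Pareto frontier over candidate
# policies. Each hypothetical policy is summarized by (V_E, V_S):
# its efficiency value and its safety value.
candidates = [
    (10.0, -5.0),  # fast but risky
    (7.0, -2.0),   # balanced
    (3.0, -0.5),   # slow but safe
    (2.0, -3.0),   # dominated: slow AND risky
]

def best_policy(lam):
    """Return the candidate maximizing V_E + lam * V_S."""
    return max(candidates, key=lambda vs: vs[0] + lam * vs[1])

# As lam grows, the maximizer shifts from efficient to safe policies;
# the set of maximizers over lam >= 0 lies on the Pareto frontier.
frontier = {best_policy(lam) for lam in [0.0, 1.0, 2.0, 5.0]}
print(sorted(frontier))
```

Note that the dominated policy (2.0, -3.0) is never selected for any weight, which is exactly the Pareto-optimality property the figure illustrates.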

### Types of Uncertainty

| Uncertainty handled | Framework |
| --- | --- |
| Aleatory | MDP |
| + Epistemic (static) | RL |
| + Epistemic (dynamic) | POMDP |
| + Interaction | Game |

Markov Decision Process (MDP)

• $$\mathcal{S}$$ - State space
• $$\mathcal{A}$$ - Action space
• $$T:\mathcal{S}\times \mathcal{A} \times\mathcal{S} \to \mathbb{R}$$ - Transition probability distribution
• $$R:\mathcal{S}\times \mathcal{A} \to \mathbb{R}$$ - Reward

Handles aleatory uncertainty.

Partially Observable Markov Decision Process (POMDP)

• $$\mathcal{S}$$ - State space
• $$\mathcal{A}$$ - Action space
• $$T:\mathcal{S}\times \mathcal{A} \times\mathcal{S} \to \mathbb{R}$$ - Transition probability distribution
• $$R:\mathcal{S}\times \mathcal{A} \to \mathbb{R}$$ - Reward
• $$\mathcal{O}$$ - Observation space
• $$Z:\mathcal{S} \times \mathcal{A}\times \mathcal{S} \times \mathcal{O} \to \mathbb{R}$$ - Observation probability distribution

Handles aleatory and epistemic (static and dynamic) uncertainty.

## Incomplete Information Game

• Finite set of $$n$$ players, plus the "chance" player
• $$P(h)$$ - player to move at each history
• $$A(h)$$ - set of actions available at each history
• $$I(h)$$ - information set that each history maps to
• $$U(h)$$ - payoff for each leaf node in the game tree

Handles aleatory, epistemic (static and dynamic), and interaction uncertainty.

Image from Russell and Norvig

Solving MDPs - The Value Function

Value = expected sum of future rewards:

$$\underset{\pi:\, \mathcal{S}\to\mathcal{A}}{\mathop{\text{maximize}}} \, V^\pi(s) = E\left[\sum_{t=0}^{\infty} \gamma^t R(s_t, \pi(s_t)) \bigm| s_0 = s \right]$$

This objective involves all future time. The Bellman equation involves only $$t$$ and $$t+1$$:

$$V^*(s) = \underset{a\in\mathcal{A}}{\max} \left\{R(s, a) + \gamma E\Big[V^*\left(s_{t+1}\right) \mid s_t=s, a_t=a\Big]\right\}$$

$$Q(s,a) = R(s, a) + \gamma E\Big[V^* (s_{t+1}) \mid s_t = s, a_t=a\Big]$$

## Value Iteration
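A minimal value-iteration sketch on a small hypothetical chain MDP (the states, rewards, and discount below are illustrative, not from the talk): repeatedly apply the Bellman backup until the value function stops changing.

```python
import numpy as np

# Value iteration on a tiny hypothetical chain MDP: states 0..4,
# actions left (-1) / right (+1), deterministic moves clipped to the
# chain, reward 1 for stepping onto state 4, discount gamma.
n_states, gamma = 5, 0.9

def step(s, a):                     # deterministic transition s' = s + a
    return min(max(s + a, 0), n_states - 1)

def reward(s, a):
    return 1.0 if step(s, a) == n_states - 1 else 0.0

V = np.zeros(n_states)
for _ in range(1000):
    # Bellman backup: V(s) <- max_a [ R(s,a) + gamma * V(s') ]
    V_new = np.array([max(reward(s, a) + gamma * V[step(s, a)]
                          for a in (-1, +1)) for s in range(n_states)])
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new

# Greedy policy extraction from the converged value function
policy = [max((-1, +1), key=lambda a: reward(s, a) + gamma * V[step(s, a)])
          for s in range(n_states)]
print(V, policy)
```

Since the reward can be collected on every step at the right end, the fixed point satisfies $$V(4) = 1/(1-\gamma) = 10$$, and the greedy policy moves right everywhere.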

## POMDP Example: Light-Dark

$$\begin{aligned} & \mathcal{S} = \mathbb{Z} \quad \quad \quad ~~ \mathcal{O} = \mathbb{R} \\ & s' = s+a \quad \quad o \sim \mathcal{N}(s, |s-10|) \\ & \mathcal{A} = \{-10, -1, 0, 1, 10\} \\ & R(s, a) = \begin{cases} 100 & \text{ if } a = 0, s = 0 \\ -100 & \text{ if } a = 0, s \neq 0 \\ -1 & \text{ otherwise} \end{cases} & \\ \end{aligned}$$

Figure: state vs. timestep. Observations are accurate near $$s = 10$$; the goal is to take $$a = 0$$ at $$s = 0$$. The optimal policy first moves toward the light to localize, then returns to the origin and chooses $$a = 0$$.

### POMDP Sense-Plan-Act Loop

Figure: loop between Environment, Belief Updater, and Policy/Planner; the belief $$b$$ flows to the planner, and the action $$a$$ flows to the environment.

$$b_t(s) = P\left(s_t = s \mid a_1, o_1, \ldots, a_{t-1}, o_{t-1}\right)$$

Example: true state $$s = 7$$, observation $$o = -0.21$$.
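The belief updater in this loop is often a particle filter. A bootstrap-filter sketch under an assumed Light-Dark style noise model (the noise function, its small floor constant, and all parameters here are illustrative):

```python
import random, math

# Bootstrap particle filter sketch for a Light-Dark style belief update.
# Assumed noise model: observation std grows with distance from s = 10,
# with a small floor to avoid division by zero.
def obs_std(s):
    return abs(s - 10) + 1e-2

def update(particles, a, o):
    """One belief update: propagate each particle through the dynamics
    s' = s + a, weight by the Gaussian observation likelihood, then
    resample in proportion to the weights."""
    propagated = [s + a for s in particles]
    weights = [math.exp(-0.5 * ((o - s) / obs_std(s)) ** 2) / obs_std(s)
               for s in propagated]
    total = sum(weights)
    probs = [w / total for w in weights]
    return random.choices(propagated, weights=probs, k=len(particles))

random.seed(0)
belief = [random.randint(-20, 20) for _ in range(1000)]  # prior particles
belief = update(belief, a=1, o=10.2)  # near the light: informative
mean = sum(belief) / len(belief)
print(round(mean, 2))
```

A single informative observation near the light shifts the particle cloud toward the measured state; repeated updates concentrate it further.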

## A POMDP is an MDP on the Belief Space

SARSOP can solve some POMDPs with thousands of states offline, but POMDP solving is PSPACE-complete: intractable in general.

Online Tree Search in MDPs

Estimate $$Q(s, a)$$ based on children:

$$Q(s,a) = R(s, a) + \gamma E\Big[V^* (s_{t+1}) \mid s_t = s, a_t=a\Big]$$

$$V(s) = \max_a Q(s,a)$$
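The backup at each tree node can be sketched directly from these two equations (the rewards and child values below are hypothetical):

```python
# Estimating Q(s,a) at a tree node from its sampled children: average
# the one-step reward plus the discounted child value over the sampled
# next states, then take V(s) as the max over actions.
gamma = 0.9

def q_estimate(reward, child_values):
    """Q(s,a) ~= r + gamma * mean(V(s')) over sampled children s'."""
    return reward + gamma * sum(child_values) / len(child_values)

def v_estimate(q_by_action):
    """V(s) = max_a Q(s,a)."""
    return max(q_by_action.values())

# Hypothetical node: two actions with sampled child value estimates
q = {"left": q_estimate(0.0, [1.0, 2.0, 3.0]),
     "right": q_estimate(1.0, [0.0, 0.5])}
print(q, v_estimate(q))
```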

## Sparse Sampling

Expand for all actions ($$\left|\mathcal{A}\right| = 2$$ in this case), then expand only $$C = 3$$ sampled next states per action rather than all $$\left|\mathcal{S}\right|$$ states.


Guarantees [Kearns et al., 2002]:

1. Near-optimal policy: $$\left|V^A(s) - V^*(s) \right|\leq \epsilon$$
2. Running time independent of state space size: $$O \left( ( \left|\mathcal{A} \right|C )^H \right)$$
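The sparse sampling recursion itself is short. A sketch on a hypothetical 1D random-walk generative model (the MDP, constants, and depth are illustrative):

```python
import random

# Sparse sampling sketch [Kearns et al., 2002]: estimate V*(s) to depth
# H by expanding every action but only C sampled next states per action.
# A hypothetical 1D random walk stands in for the generative model.
gamma, C, H = 0.9, 3, 4
actions = (-1, +1)

def gen(s, a):
    """Generative model: sample (s', r) given (s, a)."""
    s2 = s + a + random.choice((-1, 0, 1))
    return s2, 1.0 if s2 == 0 else 0.0   # reward for reaching the origin

def v_hat(s, h):
    if h == 0:
        return 0.0
    return max(q_hat(s, a, h) for a in actions)

def q_hat(s, a, h):
    samples = [gen(s, a) for _ in range(C)]
    return sum(r + gamma * v_hat(s2, h - 1) for s2, r in samples) / C

random.seed(1)
v = v_hat(2, H)   # (|A| * C)^H generative calls, independent of |S|
print(v)
```

The total work is $$(\left|\mathcal{A}\right| C)^H = 6^4$$ generative-model calls here, regardless of how large the state space is.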

• A POMDP is an MDP on the belief space, but belief updates are expensive
• Each belief is implicitly represented by a collection of unweighted particles

[Ross, 2008] [Silver, 2010]

POMCP (Partially Observable Monte Carlo Planning) fails in continuous observation spaces; POMCP-DPW and POMCPOW extend it with progressive widening.
[Sunberg and Kochenderfer, ICAPS 2018]
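The widening test at the heart of these algorithms is simple to state: a new child (e.g., a newly sampled observation branch) is admitted only while the child count stays below $$k N^\alpha$$. A sketch with illustrative constants rather than the paper's tuned values:

```python
# Progressive widening sketch: a new child node is added only while the
# number of children stays below k * N**alpha, where N is the node's
# visit count. Constants k and alpha are illustrative.
def widen(num_children, visits, k=2.0, alpha=0.5):
    """Return True if the search should sample a new child node."""
    return num_children < k * visits ** alpha

# Early on, new observation branches are added; later, visits are
# funneled into existing branches so value estimates can converge.
added = [widen(c, n) for c, n in [(0, 1), (2, 1), (3, 4), (10, 16)]]
print(added)
```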

Simulation results: comparison of an MDP policy trained on normal drivers, an MDP policy trained on all drivers, an omniscient planner, and POMCPOW (ours).

[Sunberg & Kochenderfer, T-ITS Under Review]

### Continuous Observation Analytical Results (POWSS)

Our simplified algorithm is near-optimal

[Lim, Tomlin, & Sunberg, IJCAI 2020]

Figure: conventional 1D POMDP vs. 2D extended-space POMDP.

Intention-Aware Navigation in Crowds with Extended-Space POMDP Planning. Gupta, H.; Hayes, B.; and Sunberg, Z. AAMAS, 2022.

## POMDP Planning with Image Observations

### POMDPs with Continuous...

• PO-UCT (POMCP)
• DESPOT
• POMCPOW
• DESPOT-α
• LABECOP
• GPS-ABT
• VG-MCTS
• BOMCP
• VOMCPOW

## BOMCP

[Mern, Sunberg, et al. AAAI 2021]

## Voronoi Progressive Widening

[Lim, Tomlin, & Sunberg CDC 2021]

# Games

## Interaction Uncertainty

[Peters, Tomlin, and Sunberg 2020]

## Space Domain Awareness Games

Figure: game structure over objects $$1, 2, \ldots, N$$.

Tyler Becker and Zachary Sunberg. "Imperfect Information Games and Counterfactual Regret Minimization in Space Domain Awareness". Abstract under review for the Advanced Maui Optical and Space Surveillance Technologies conference.
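The strategy-update rule inside counterfactual regret minimization is regret matching: play each action with probability proportional to its positive cumulative regret. A minimal sketch:

```python
# Regret matching, the per-information-set update used inside
# counterfactual regret minimization (CFR): probabilities are
# proportional to positive cumulative regret, uniform if none.
def regret_matching(cum_regret):
    positive = [max(r, 0.0) for r in cum_regret]
    total = sum(positive)
    if total == 0.0:
        return [1.0 / len(cum_regret)] * len(cum_regret)
    return [p / total for p in positive]

print(regret_matching([3.0, 1.0, -2.0]))
print(regret_matching([-1.0, -5.0]))
```

Iterating this update in self-play drives the average strategy toward a Nash equilibrium in two-player zero-sum imperfect-information games.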

# Open Source Software

POMDPs.jl - An interface for defining and solving MDPs and POMDPs in Julia

## Challenges for POMDP Software

1. There is a huge variety of
   • Problems
     • Continuous/Discrete
     • Fully/Partially Observable
     • Generative/Explicit
     • Simple/Complex
   • Solvers
     • Online/Offline
     • Alpha Vector/Graph/Tree
     • Exact/Approximate
     • Domain-specific heuristics
2. POMDPs are computationally difficult.

Two ways to specify a POMDP:

• Explicit: the distributions $$T$$ and $$Z$$ are given directly
• Black box ("generative" in the POMDP literature): a simulator maps $$s, a$$ to $$s', o, r$$
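POMDPs.jl expresses the black-box interface in Julia; as an illustrative analogue, here is a Python sketch of a generative function for the Light-Dark dynamics shown earlier (the small observation-noise floor is an assumption):

```python
import random

# Sketch of the black-box ("generative") POMDP interface: a single
# function maps (state, action) to (next state, observation, reward).
# Dynamics follow the Light-Dark problem from earlier slides.
def gen(s, a, rng=random):
    sp = s + a                                  # s' = s + a
    o = rng.gauss(sp, abs(sp - 10) + 1e-2)      # noisy away from s = 10
    if a == 0:
        r = 100.0 if s == 0 else -100.0         # declare at the origin
    else:
        r = -1.0                                # step cost
    return sp, o, r

random.seed(0)
sp, o, r = gen(7, 1)
print(sp, round(o, 2), r)
```

A solver that only calls such a function never needs $$T$$ or $$Z$$ in closed form, which is why the black-box specification supports much more complex simulators.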

Previous C++ framework: APPL

"At the moment, the three packages are independent. Maybe one day they will be merged in a single coherent framework."

Celeste Project: Julia code reached 1.54 petaflops.

# Recent and Current Projects

## COVID POMDP

$$I(t) = \int_0^\infty I(t-\tau)\beta(\tau)\,d\tau$$

where $$I(t)$$ is incident infections and $$\beta(\tau)$$ is individual infectiousness as a function of infection age $$\tau$$.

Motivating result: "Test sensitivity is secondary to frequency and turnaround time for COVID-19 surveillance" (Larremore et al.)

Viral load represented by a piecewise-linear hinge function through $$(t_0, 3)$$, $$(t_{\text{peak}}, V_{\text{peak}})$$, and $$(t_f, 6)$$, with

$$t_0 \sim \mathcal{U}[2.5,3.5], \quad t_\text{peak} - t_0 \sim 0.2 + \text{Gamma}(1.8), \quad V_\text{peak} \sim \mathcal{U}[7,11], \quad t_f - t_\text{peak} \sim \mathcal{U}[5,10]$$

Costs: $$c_I = 100.0$$, $$c_T = 1.0$$, $$c_{TR} = 10.0$$.
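The hinge model can be sampled and evaluated directly. A Python sketch (the Gamma scale of 1 and the zero value outside the hinge window are assumptions, not stated on the slide):

```python
import random

# Sample the piecewise-linear viral-load hinge parameters from the
# distributions above, then evaluate log10 viral load over time by
# linear interpolation between the three hinge points.
def sample_hinge(rng=random):
    t0 = rng.uniform(2.5, 3.5)
    t_peak = t0 + 0.2 + rng.gammavariate(1.8, 1.0)  # scale 1 assumed
    v_peak = rng.uniform(7.0, 11.0)
    t_f = t_peak + rng.uniform(5.0, 10.0)
    return (t0, 3.0), (t_peak, v_peak), (t_f, 6.0)

def viral_load(t, hinge):
    (t0, v0), (tp, vp), (tf, vf) = hinge
    if t <= t0 or t >= tf:
        return 0.0                      # assumed: below detectable level
    if t <= tp:                         # rising segment
        return v0 + (vp - v0) * (t - t0) / (tp - t0)
    return vp + (vf - vp) * (t - tp) / (tf - tp)   # falling segment

random.seed(0)
hinge = sample_hinge()
print([round(viral_load(t, hinge), 2) for t in range(0, 15, 2)])
```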

## MPC for Intermittent Rotor Failures
