### Safe and efficient autonomy in the face of state and interaction uncertainty

Professor Zachary Sunberg

May 12th, 2022

## Mission: deploy autonomy with confidence

Two Objectives for Autonomy

### EFFICIENCY

Minimize resource use

(especially time)

### SAFETY

Minimize the risk of harm to oneself and others

Safety often opposes Efficiency

Pareto Optimization

[Figure: Safety vs. Efficiency trade-off curves for (Model $$M_1$$, Algorithm $$A_1$$) and (Model $$M_2$$, Algorithm $$A_2$$), with an arrow labeled "Better Performance".]

$$\underset{\pi}{\mathop{\text{maximize}}} \, V^\pi = V^\pi_\text{E} + \lambda V^\pi_\text{S}$$

Here $$V^\pi_\text{E}$$ is the efficiency term, $$V^\pi_\text{S}$$ is the safety term, and $$\lambda$$ is the weight.
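
As a rough illustration of the weighted objective above, the sketch below sweeps the weight $$\lambda$$ over a few candidate policies and reports which one maximizes $$V_\text{E} + \lambda V_\text{S}$$. The policy names and their $$(V_\text{E}, V_\text{S})$$ values are made up for illustration; they are not results from the talk.

```python
# Hypothetical (V_E, V_S) pairs for four candidate policies (illustrative values only).
candidates = {
    "aggressive": (10.0, -8.0),
    "balanced": (7.0, -2.0),
    "cautious": (3.0, -0.5),
    "dominated": (2.0, -5.0),  # worse than "balanced" in both objectives
}


def best_policy(lam):
    """Return the candidate that maximizes V_E + lam * V_S."""
    return max(
        candidates,
        key=lambda name: candidates[name][0] + lam * candidates[name][1],
    )


# Larger weights on safety select more conservative policies.
for lam in (0.1, 1.0, 10.0):
    print(lam, best_policy(lam))
```

Under these assumed values, a policy that is worse in both objectives is never selected for any positive weight, which is the sense in which the weighted sum traces out the Pareto frontier.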

### Types of Uncertainty

• Aleatory: MDP
• Epistemic (Static): RL
• Epistemic (Dynamic): POMDP
• Interaction: Game

Markov Decision Process (MDP)

• $$\mathcal{S}$$ - State space
• $$T:\mathcal{S}\times \mathcal{A} \times\mathcal{S} \to \mathbb{R}$$ - Transition probability distribution
• $$\mathcal{A}$$ - Action space
• $$R:\mathcal{S}\times \mathcal{A} \to \mathbb{R}$$ - Reward

Aleatory

Partially Observable Markov Decision Process (POMDP)

• $$\mathcal{S}$$ - State space
• $$T:\mathcal{S}\times \mathcal{A} \times\mathcal{S} \to \mathbb{R}$$ - Transition probability distribution
• $$\mathcal{A}$$ - Action space
• $$R:\mathcal{S}\times \mathcal{A} \to \mathbb{R}$$ - Reward
• $$\mathcal{O}$$ - Observation space
• $$Z:\mathcal{S} \times \mathcal{A}\times \mathcal{S} \times \mathcal{O} \to \mathbb{R}$$ - Observation probability distribution

Aleatory

Epistemic (Static)

Epistemic (Dynamic)

## Incomplete Information Game

Aleatory

Epistemic (Static)

Epistemic (Dynamic)

Interaction

• Finite set of $$n$$ players, plus the "chance" player
• $$P(h)$$ (player at each history)
• $$A(h)$$ (set of actions at each history)
• $$I(h)$$ (information set that each history maps to)
• $$U(h)$$ (payoff for each leaf node in the game tree)

Solving MDPs - The Value Function

$$\underset{\pi:\, \mathcal{S}\to\mathcal{A}}{\mathop{\text{maximize}}} \, V^\pi(s) = E\left[\sum_{t=0}^{\infty} \gamma^t R(s_t, \pi(s_t)) \bigm| s_0 = s \right]$$

Value = expected sum of future rewards. Involves all future time.

$$V^*(s) = \underset{a\in\mathcal{A}}{\max} \left\{R(s, a) + \gamma E\Big[V^*\left(s_{t+1}\right) \mid s_t=s, a_t=a\Big]\right\}$$

Involves only $$t$$ and $$t+1$$

$$Q(s,a) = R(s, a) + \gamma E\Big[V^* (s_{t+1}) \mid s_t = s, a_t=a\Big]$$

## Value Iteration
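
A minimal sketch of value iteration, which repeatedly applies the Bellman backup $$V(s) \leftarrow \max_a \{R(s,a) + \gamma E[V(s')]\}$$ until the values stop changing. The five-state chain used here (its transition probabilities, reward, and discount) is a hypothetical toy problem chosen for illustration, not an example from the talk.

```python
def q_value(s, a, V, T, R, gamma):
    """One-step lookahead: R(s, a) + gamma * E[V(s')]."""
    return R(s, a) + gamma * sum(p * V[sp] for sp, p in T(s, a).items())


def value_iteration(states, actions, T, R, gamma=0.95, tol=1e-8):
    """T(s, a) returns a dict {s': probability}; R(s, a) returns a float."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            best = max(q_value(s, a, V, T, R, gamma) for a in actions)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            break
    policy = {
        s: max(actions, key=lambda a: q_value(s, a, V, T, R, gamma))
        for s in states
    }
    return V, policy


# Hypothetical toy problem: a chain of states 0..4 with reward 1 in state 4.
states = range(5)
actions = (-1, 1)


def T(s, a):
    """Move succeeds with probability 0.8 (clamped to the chain), else stay."""
    sp = min(max(s + a, 0), 4)
    dist = {s: 0.2}
    dist[sp] = dist.get(sp, 0.0) + 0.8
    return dist


def R(s, a):
    return 1.0 if s == 4 else 0.0


V, policy = value_iteration(states, actions, T, R)
print(V, policy)
```

The backup only ever looks one step ahead, which is why the recursive form of the value function is so much easier to compute with than the infinite sum.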

## POMDP Example: Light-Dark

$$\begin{aligned} & \mathcal{S} = \mathbb{Z} \quad \quad \quad ~~ \mathcal{O} = \mathbb{R} \\ & s' = s+a \quad \quad o \sim \mathcal{N}(s, |s-10|) \\ & \mathcal{A} = \{-10, -1, 0, 1, 10\} \\ & R(s, a) = \begin{cases} 100 & \text{ if } a = 0, s = 0 \\ -100 & \text{ if } a = 0, s \neq 0 \\ -1 & \text{ otherwise} \end{cases} \end{aligned}$$

Goal: $$a=0$$ at $$s=0$$

[Figure: State vs. Timestep. Labels: Accurate Observations; Optimal Policy: Localize, then $$a=0$$.]

### POMDP Sense-Plan-Act Loop

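[Diagram: the Environment sends an observation to the Belief Updater, the Belief Updater sends the belief $$b$$ to the Policy/Planner, and the Policy/Planner sends the action $$a$$ back to the Environment.]

$$b_t(s) = P\left(s_t = s \mid a_1, o_1 \ldots a_{t-1}, o_{t-1}\right)$$

Example: True State $$s = 7$$, Observation $$o = -0.21$$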
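
The belief updater in the loop above can be sketched as a discrete Bayes filter. The example below uses the Light-Dark dynamics $$s' = s + a$$ and assumes the observation noise is Gaussian with standard deviation $$|s' - 10|$$, evaluated at the new state. The small floor `min_std`, the prior, and the function names are assumptions added for this illustration.

```python
import math


def normal_pdf(x, mean, std):
    """Density of a normal distribution with the given mean and standard deviation."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))


def update_belief(belief, a, o, min_std=1e-2):
    """Bayes update of a belief {state: probability} after action a and observation o.

    min_std is an assumed floor that avoids a zero standard deviation at s' = 10.
    """
    unnormalized = {}
    for s, p in belief.items():
        sp = s + a                          # deterministic transition s' = s + a
        std = max(abs(sp - 10), min_std)    # observations are accurate near s' = 10
        unnormalized[sp] = unnormalized.get(sp, 0.0) + p * normal_pdf(o, sp, std)
    total = sum(unnormalized.values())
    if total == 0.0:
        raise ValueError("observation has zero likelihood under the current belief")
    return {s: w / total for s, w in unnormalized.items()}


# Hypothetical uniform prior over the integers 0..20.
prior = {s: 1.0 / 21 for s in range(21)}
posterior = update_belief(prior, a=0, o=10.0)
print(max(posterior, key=posterior.get))
```

An observation taken far from $$s = 10$$ barely changes the belief under this model, which is why localizing first is attractive.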

## A POMDP is an MDP on the Belief Space

SARSOP can solve some POMDPs with thousands of states offline, but solving POMDPs is PSPACE-complete in general.

Intractable!

Online Tree Search in MDPs

Time

Estimate $$Q(s, a)$$ based on children

$$Q(s,a) = R(s, a) + \gamma E\Big[V^* (s_{t+1}) \mid s_t = s, a_t=a\Big]$$

$$V(s) = \max_a Q(s,a)$$

## Sparse Sampling

Expand for all actions ($$\left|\mathcal{A}\right| = 2$$ in this case)

Instead of expanding for all $$\left|\mathcal{S}\right|$$ states, sample only $$C=3$$ states

1. Near-optimal policy: $$\left|V^A(s) - V^*(s) \right|\leq \epsilon$$

2. Running time independent of state space size:

$$O \left( ( \left|\mathcal{A} \right|C )^H \right)$$
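
A minimal sketch of sparse sampling under stated assumptions: at each node, every action is expanded and $$C$$ next states are drawn from a generative model, so the number of model calls depends on $$\left|\mathcal{A}\right|$$, $$C$$, and the depth $$H$$ but not on the size of the state space. The toy `step` model and its reward are hypothetical.

```python
import random


def sparse_sampling_q(s, depth, actions, step, C, gamma, rng):
    """Estimate Q(s, a) for each action from C sampled successors per action."""
    if depth == 0:
        return {a: 0.0 for a in actions}
    q = {}
    for a in actions:
        total = 0.0
        for _ in range(C):
            sp, r = step(s, a, rng)  # one call to the generative model
            child_q = sparse_sampling_q(sp, depth - 1, actions, step, C, gamma, rng)
            total += r + gamma * max(child_q.values())
        q[a] = total / C
    return q


def step(s, a, rng):
    """Hypothetical generative model: noisy move on the integers, reward -|s'|."""
    sp = s + a + rng.choice((-1, 0, 1))
    return sp, -abs(sp)


q = sparse_sampling_q(5, 2, (-2, 2), step, C=3, gamma=0.9, rng=random.Random(0))
print(q)
```

With two actions, $$C = 3$$, and depth 2, this makes $$6 + 6^2 = 42$$ calls to `step`, in line with the $$O((\left|\mathcal{A}\right| C)^H)$$ bound.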

POMCP (Partially Observable Monte Carlo Planning)

• A POMDP is an MDP on the Belief Space, but belief updates are expensive
• Each belief is implicitly represented by a collection of unweighted particles

Fails in Continuous Observation Spaces

POMCP

POMCP-DPW

POMCPOW

Simulation results

[Figure legend: MDP trained on normal drivers; MDP trained on all drivers; Omniscient; POMCPOW (Ours)]

### Continuous Observation Analytical Results (POWSS)

Our simplified algorithm is near-optimal

Conventional 1D POMDP

2D POMDP

## POMDP Planning with Image Observations

### POMDPs with Continuous...

• PO-UCT (POMCP)
• DESPOT
• POMCPOW
• DESPOT-α
• LABECOP
• GPS-ABT
• VG-MCTS
• BOMCP
• VOMCPOW

## BOMCP

## Voronoi Progressive Widening

# Games

## Interaction Uncertainty

## Space Domain Awareness Games


# Open Source Software

POMDPs.jl - An interface for defining and solving MDPs and POMDPs in Julia

## Challenges for POMDP Software

1. There is a huge variety of
    • Problems
        • Continuous/Discrete
        • Fully/Partially Observable
        • Generative/Explicit
        • Simple/Complex
    • Solvers
        • Online/Offline
        • Alpha Vector/Graph/Tree
        • Exact/Approximate
    • Domain-specific heuristics
2. POMDPs are computationally difficult.

Explicit

Black Box ("Generative" in POMDP lit.): $$s, a \rightarrow s', o, r$$
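
To make the distinction concrete, here is a small sketch of the two model styles for a hypothetical two-state problem. The explicit model exposes probability distributions, while the black-box model only returns a sampled $$(s', o, r)$$ for a given $$(s, a)$$. All function names and numbers are illustrative assumptions and are not the POMDPs.jl interface.

```python
import random


# Explicit model: the distributions themselves are available.
def transition_probs(s, a):
    """T(s' | s, a) as a dict {s': probability} for states 0 and 1."""
    stay = 0.9 if a == "wait" else 0.2
    return {s: stay, 1 - s: 1.0 - stay}


def observation_probs(a, sp):
    """Z(o | a, s') as a dict {o: probability}; the sensor is right 85% of the time."""
    return {sp: 0.85, 1 - sp: 0.15}


def reward(s, a):
    return 1.0 if s == 1 else 0.0


def sample(dist, rng):
    """Draw one outcome from a dict {outcome: probability}."""
    outcomes = list(dist)
    return rng.choices(outcomes, weights=[dist[x] for x in outcomes])[0]


# Black-box ("generative") model: only samples of (s', o, r) are exposed.
def gen(s, a, rng):
    sp = sample(transition_probs(s, a), rng)
    o = sample(observation_probs(a, sp), rng)
    return sp, o, reward(s, a)


print(gen(0, "wait", random.Random(0)))
```

An explicit model can always be wrapped as a black box by sampling, as `gen` does here, but the reverse is not generally possible, so solvers that only need samples apply to a wider range of problems.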

Previous C++ framework: APPL

"At the moment, the three packages are independent. Maybe one day they will be merged in a single coherent framework."

Celeste Project

1.54 Petaflops

# Recent and Current Projects

## COVID POMDP

Incident infections $$I(t)$$, infection age $$\tau$$, and individual infectiousness $$\beta(\tau)$$ are related by

$$I(t) = \int_0^\infty I(t-\tau)\beta(\tau)\,d\tau$$
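
One way to read the renewal equation above is as a recursion over discrete time steps. The sketch below assumes one-day steps, truncates the integral at the length of the infectiousness profile, and treats incidence before the seed period as zero. These discretization choices and the function name are assumptions made for illustration.

```python
def simulate_incidence(beta, seed_incidence, steps):
    """Iterate I[t] = sum over tau >= 1 of I[t - tau] * beta[tau - 1].

    beta[k] is the infectiousness at infection age k + 1 (in time steps);
    seed_incidence is the list of initial incidence values.
    """
    incidence = list(seed_incidence)
    for _ in range(steps):
        t = len(incidence)
        max_age = min(t, len(beta))
        new_infections = sum(
            incidence[t - tau] * beta[tau - 1] for tau in range(1, max_age + 1)
        )
        incidence.append(new_infections)
    return incidence


# Hypothetical profile: every case infects two others exactly one step later.
print(simulate_incidence([2.0], [1.0], 3))
```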

Need

Test sensitivity is secondary to frequency and turnaround time for COVID-19 surveillance

Larremore et al.

Viral load represented by piecewise-linear hinge function through the points $$(t_0, 3)$$, $$(t_{\text{peak}}, V_{\text{peak}})$$, and $$(t_f, 6)$$, with

$$\begin{aligned} t_0 &\sim \mathcal{U}[2.5,3.5] \\ t_\text{peak} - t_0 &\sim 0.2 + \text{Gamma}(1.8) \\ V_\text{peak} &\sim \mathcal{U}[7,11] \\ t_f - t_\text{peak} &\sim \mathcal{U}[5,10] \end{aligned}$$

$$c_I = 100.0, \quad c_T = 1.0, \quad c_{TR} = 10.0$$
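
A minimal sketch of sampling one viral load trajectory from the hinge model and evaluating it. Two assumptions go beyond the slide: $$\text{Gamma}(1.8)$$ is read as shape 1.8 with scale 1, and the function returns `None` outside $$[t_0, t_f]$$ because the slide does not say what happens there.

```python
import random


def sample_trajectory(rng):
    """Draw (t0, t_peak, v_peak, t_f) from the distributions on the slide."""
    t0 = rng.uniform(2.5, 3.5)
    t_peak = t0 + 0.2 + rng.gammavariate(1.8, 1.0)  # scale of 1.0 is an assumption
    v_peak = rng.uniform(7.0, 11.0)
    t_f = t_peak + rng.uniform(5.0, 10.0)
    return t0, t_peak, v_peak, t_f


def viral_load(t, t0, t_peak, v_peak, t_f):
    """Piecewise-linear interpolation through (t0, 3), (t_peak, v_peak), (t_f, 6)."""
    if t < t0 or t > t_f:
        return None  # behavior outside the hinge is not specified on the slide
    if t <= t_peak:
        return 3.0 + (v_peak - 3.0) * (t - t0) / (t_peak - t0)
    return v_peak + (6.0 - v_peak) * (t - t_peak) / (t_f - t_peak)


params = sample_trajectory(random.Random(0))
print(params, viral_load(params[1], *params))
```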

## MPC for Intermittent Rotor Failures

