Safe and efficient autonomy in the face of uncertainty
Professor Zachary Sunberg
University of Colorado Boulder
Fall 2022
Background
Mission: deploy autonomy with confidence
Waymo Image By Dllu - Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=64517567
Two Objectives for Autonomy
EFFICIENCY
SAFETY
Minimize resource use
(especially time)
Minimize the risk of harm to oneself and others
Safety often opposes Efficiency
Tweet by Nitin Gupta
29 April 2018
https://twitter.com/nitguptaa/status/990683818825736192
Safety, Efficiency, and Uncertainty: A Story
Safety, Efficiency, and Uncertainty: A Story
Why?
Only one safety procedure: Don't approach the vehicle for 10 minutes after a crash (in case it explodes)
Efficiency: Fly planes
Safety: Avoid fires
Solution: gather information (reduce uncertainty) about potentially unsafe situations
Optimization
(Plot: Pareto frontiers in the safety-efficiency plane for Model \(M_1\), Algorithm \(A_1\) and Model \(M_2\), Algorithm \(A_2\); the arrow indicates better performance)
$$\underset{\pi}{\mathop{\text{maximize}}} \, V^\pi = V^\pi_\text{E} + \lambda V^\pi_\text{S}$$
(\(V^\pi_\text{E}\): efficiency, \(\lambda\): weight, \(V^\pi_\text{S}\): safety)
Types of Uncertainty
- MDP: Aleatory
- RL: Aleatory, Epistemic (Static)
- POMDP: Aleatory, Epistemic (Static), Epistemic (Dynamic)
- Game: Aleatory, Epistemic (Static), Epistemic (Dynamic), Interaction
Markov Decision Process (MDP)
- \(\mathcal{S}\) - State space
- \(T(s' \mid s, a)\) - Transition probability distributions
- \(\mathcal{A}\) - Action space
- \(R(s, a)\) - Reward function
Aleatory
Reinforcement Learning
Aleatory
Epistemic (Static)
- \(\mathcal{S}\) - State space
- \(T(s' \mid s, a)\) - Transition probability distributions
- \(\mathcal{A}\) - Action space
- \(R(s, a)\) - Reward function
Partially Observable Markov Decision Process (POMDP)
Aleatory
Epistemic (Static)
Epistemic (Dynamic)
- \(\mathcal{S}\) - State space
- \(T(s' \mid s, a)\) - Transition probability distribution
- \(\mathcal{A}\) - Action space
- \(R(s, a)\) - Reward function
- \(\mathcal{O}\) - Observation space
- \(Z(o \mid a, s')\) - Observation probability distribution
Partially Observable Markov Game
Aleatory
Epistemic (Static)
Epistemic (Dynamic)
Interaction
- \(\mathcal{S}\) - State space
- \(T(s' \mid s, \bm{a})\) - Transition probability distribution
- \(\mathcal{A}^i, \, i \in 1..k\) - Action spaces
- \(R^i(s, \bm{a})\) - Reward functions
- \(\mathcal{O}^i, \, i \in 1..k\) - Observation spaces
- \(Z(o^i \mid \bm{a}, s')\) - Observation probability distributions
POMDPs in Aerospace
1) ACAS
2) Orbital Object Tracking
3) Dual Control
4) Asteroid Navigation
ACAS X
Trusted UAV
Collision Avoidance
[Sunberg, 2016]
[Kochenderfer, 2011]
5) Weather Science
POMDPs in Aerospace
\(\mathcal{S}\): Information space for all objects
\(\mathcal{A}\): Which objects to measure
\(R\): negative entropy (rewards reducing uncertainty)
Approximately 20,000 objects > 10 cm in orbit
[Sunberg, 2016]
1) ACAS
2) Orbital Object Tracking
3) Dual Control
4) Asteroid Navigation
5) Weather Science
POMDPs in Aerospace
State \(x\), parameters \(\theta\)
\(s = (x, \theta)\), \(o = x + v\) where \(v\) is measurement noise
POMDP solution automatically balances exploration and exploitation
[Slade, Sunberg, et al. 2017]
1) ACAS
2) Orbital Object Tracking
3) Dual Control
4) Asteroid Navigation
5) Weather Science
POMDPs in Aerospace
Dynamics: Complex gravity field, regolith
State: Vehicle state, local landscape
Sensor: Star tracker?, camera?, accelerometer?
Action: Hopping actuator
[Hockman, 2017]
1) ACAS
2) Orbital Object Tracking
3) Dual Control
4) Asteroid Navigation
5) Weather Science
POMDPs in Aerospace
1) ACAS
2) Orbital Object Tracking
3) Dual Control
4) Asteroid Navigation
5) Weather Science
Solving MDPs - The Value Function
Value = expected sum of future rewards:
$$\underset{\pi:\, \mathcal{S}\to\mathcal{A}}{\mathop{\text{maximize}}} \, V^\pi(s) = E\left[\sum_{t=0}^{\infty} \gamma^t R(s_t, a_t)\right]$$
(involves all future time)
Bellman equation:
$$V^*(s_t) = \underset{a\in\mathcal{A}}{\max} \left\{R(s_t, a) + \gamma E\Big[V^*\left(s_{t+1}\right) \Big]\right\}$$
(involves only \(t\) and \(t+1\))
$$Q(s,a) = R(s, a) + \gamma E\Big[V^* \left(s'\right)\Big]$$
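A minimal value iteration sketch in Julia, applying the Bellman equation above to a small discrete MDP. The transition representation T[s][a] (a vector of (probability, next state) pairs) and the function name are illustrative assumptions, not a specific package API.

function value_iteration(states, actions, T, R; γ=0.95, iters=100)
    V = Dict(s => 0.0 for s in states)          # value estimate for each state
    for _ in 1:iters
        for s in states
            # Bellman backup: best immediate reward plus discounted expected future value
            V[s] = maximum(R(s, a) + γ * sum(p * V[sp] for (p, sp) in T[s][a]) for a in actions)
        end
    end
    return V
end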
Online Tree Search in MDPs
(Figure: the search tree expands forward in time; estimate \(Q(s, a)\) based on children)
Challenge: Curse of dimensionality
MDP
POMDP
Inc. Info. Game
Curse of Dimensionality
Immediate Reward: \(R(s, a)\)
Value Function: \[Q(s, a) = \text{E}\left[\sum_{t=0}^\infty \gamma^t R(s_t, a_t)\right]\]
\(d\) dimensions, \(k\) segments \(\,\rightarrow \, |\mathcal{S}| = k^d\)
Curse of Dimensionality
\(d\) dimensions, \(k\) segments \(\,\rightarrow \, |\mathcal{S}| = k^d\)
1 dimension, 5 segments
\(|\mathcal{S}| = 5\)
2 dimensions, 5 segments
\(|\mathcal{S}| = 25\)
3 dimensions, 5 segments
\(|\mathcal{S}| = 125\)
Challenge: Curse of dimensionality
Adopted Solution: Online sparse tree search
MDP
POMDP
Inc. Info. Game
Online Tree Search in MDPs
(Figure: the search tree expands forward in time; estimate \(Q(s, a)\) based on children)
Sparse Sampling
Expand for all actions (\(\left|\mathcal{A}\right| = 2\) in this case)
Instead of expanding for all \(\left|\mathcal{S}\right|\) next states, sample only \(C=3\) states
Sparse Sampling
[Kearns, Mansour, & Ng, 2002]
1. Near-optimal policy: \(\left|V^\text{SS}(s) - V^*(s) \right|\leq \epsilon\)
2. Running time independent of state space size:
\(O \left( ( \left|\mathcal{A} \right|C )^H \right) \)
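A minimal sketch of the sparse sampling recursion in Julia, assuming a generative model gen(s, a, rng) that returns a sampled next state and reward; the names and structure are illustrative, not the implementation from the paper.

using Random

function sparse_sampling_Q(gen, actions, s, d; C=3, γ=0.95, rng=Random.default_rng())
    d == 0 && return Dict(a => 0.0 for a in actions)        # horizon reached
    Q = Dict{eltype(actions), Float64}()
    for a in actions                                        # expand every action...
        total = 0.0
        for _ in 1:C                                        # ...but only C sampled next states
            sp, r = gen(s, a, rng)
            Qnext = sparse_sampling_Q(gen, actions, sp, d - 1; C=C, γ=γ, rng=rng)
            total += r + γ * maximum(values(Qnext))
        end
        Q[a] = total / C
    end
    return Q
end

The recursion touches \(O((|\mathcal{A}|C)^H)\) nodes, matching the bound above, with no dependence on \(|\mathcal{S}|\).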
Challenge: Curse of dimensionality
Adopted Solution: Online sparse tree search
Shortcoming: No active information gathering
Additional Challenge: Curse of history
MDP
POMDP
Inc. Info. Game
POMDP Example: Light-Dark
(Figure: state vs. timestep; observations are accurate only in the "light" region)
Goal: take \(a=0\) at \(s=0\)
Optimal policy: localize in the accurate-observation region, then return to \(s=0\) and take \(a=0\)
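A minimal sketch of a 1D light-dark problem using the QuickPOMDPs.jl interface shown later in these slides. The specific numbers (light at \(x = 5\), goal tolerance, rewards) are illustrative assumptions, not the exact parameters behind the figure.

using QuickPOMDPs
using Distributions: Normal
import POMDPModelTools: Deterministic

const TERMINAL = Inf   # sentinel state entered after the agent declares a = 0

lightdark = QuickPOMDP(
    actions = [-1.0, 0.0, 1.0],          # move left, declare "I am at s = 0", move right
    obstype = Float64,
    discount = 0.95,
    initialstate = Normal(2.0, 2.0),     # uncertain initial position
    transition = (s, a) -> a == 0.0 ? Deterministic(TERMINAL) : Deterministic(clamp(s + a, -10.0, 10.0)),
    observation = function (a, sp)
        isfinite(sp) || return Deterministic(0.0)       # dummy observation in the terminal state
        return Normal(sp, abs(sp - 5.0) + 0.1)          # accurate only near the "light" at x = 5
    end,
    reward = (s, a) -> a == 0.0 ? (abs(s) <= 1.0 ? 100.0 : -100.0) : -1.0,
    isterminal = s -> !isfinite(s)
)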
Autonomous Driving POMDP
POMDP Sense-Plan-Act Loop
(Diagram: the environment emits an observation of the true state; the belief updater turns it into a belief \(b\); the policy/planner maps \(b\) to an action \(a\) applied to the environment)
\[b_t(s) = P\left(s_t = s \mid a_1, o_1 \ldots a_{t-1}, o_{t-1}\right)\]
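A minimal sketch of the exact Bayesian belief update implied by the definition above, for a discrete POMDP; T(sp, s, a) and Z(o, a, sp) are assumed to be callable transition and observation probability functions, not a specific package API.

function update_belief(b::Vector{Float64}, a, o, states, T, Z)
    bp = zeros(length(states))
    for (j, sp) in enumerate(states)
        pred = sum(T(sp, s, a) * b[i] for (i, s) in enumerate(states))   # prediction step
        bp[j] = Z(o, a, sp) * pred                                       # measurement update
    end
    return bp ./ sum(bp)                                                 # normalize
end

The cost of this exact update grows with \(|\mathcal{S}|^2\), which is one reason particle approximations appear below.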
A POMDP is an MDP on the Belief Space
SARSOP can solve some POMDPs with thousands of states offline,
but finite-horizon POMDP planning is PSPACE-complete: intractable in general!
- A POMDP is an MDP on the Belief Space but belief updates are expensive
- POMCP* uses simulations of histories instead of full belief updates
- Each belief is implicitly represented by a collection of unweighted particles
[Ross, 2008] [Silver, 2010]
*(Partially Observable Monte Carlo Planning)
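A minimal sketch of the unweighted particle belief update used in POMCP-style planners, assuming a generative model gen(s, a, rng) returning (sp, o, r); the tolerance-based rejection here is an illustrative stand-in for matching discrete observations.

using Random

function particle_update(particles, a, o, gen; rng=Random.default_rng(), tol=1e-2)
    new_particles = similar(particles, 0)
    while length(new_particles) < length(particles)
        s = rand(rng, particles)            # resample a state hypothesis
        sp, osim, _ = gen(s, a, rng)        # simulate one step
        if abs(osim - o) < tol              # keep only particles consistent with the real observation
            push!(new_particles, sp)
        end
    end
    return new_particles
end

With continuous observations the probability of a (near-)exact match vanishes, so this rejection loop degenerates; that is the failure mode POMCPOW addresses below.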
POMCP
POMCP-DPW
POMCPOW
[Sunberg and Kochenderfer, ICAPS 2018]
First scalable algorithm for general POMDPs with continuous observation spaces \(\mathcal{O}\)
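The key trick in POMCP-DPW and POMCPOW is progressive widening: limit how many children a tree node may have as a function of its visit count. A minimal sketch of the criterion (the hyperparameter values are illustrative):

# allow a new child node only while the number of children is at most k * N^α
allow_new_child(num_children, N; k=10.0, α=0.5) = num_children <= k * N^α

POMCPOW additionally weights the particles added to each observation node rather than relying on rejection, which is what makes continuous observation spaces tractable.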
Challenge: Curse of dimensionality
Adopted Solution: Online sparse tree search
Shortcoming: No active information gathering
Additional Challenge: Curse of history
Adopted Solution: Particle filtering
MDP
POMDP
Inc. Info. Game
POMDP = Belief MDP
Efficient POMDP Approximations
\(\mathbf{M}_\mathbf{P}\) = particle belief MDP approximation of POMDP \(\mathbf{P}\)
For any \(\epsilon>0\) and \(\delta>0\), if \(C\) (the number of particles) is high enough,
\[|Q_{\mathbf{P}}^*(b,a) - Q_{\mathbf{M}_{\mathbf{P}}}^*(\bar{b},a)| \leq \epsilon \quad \text{w.p. } 1-\delta\]
No dependence on \(|\mathcal{S}|\) or \(|\mathcal{O}|\)!
[Lim, Becker, Kochenderfer, Tomlin, & Sunberg, 2022 (?)]
Continuous Observation Analytical Results (POWSS)
Our simplified algorithm is near-optimal
[Lim, Tomlin, & Sunberg, IJCAI 2020]
Easy MDP to POMDP Extension
(Plot: simulation results comparing MDP trained on normal drivers, MDP trained on all drivers, Omniscient, and POMCPOW (ours))
[Sunberg & Kochenderfer, T-ITS 2022]
Conventional 1D POMDP
2D POMDP
Pedestrian Navigation
[Gupta, Hayes, & Sunberg, AAAI 2021]
POMDP Planning with Image Observations
[Deglurkar, Lim, Sunberg, & Tomlin, 2023?]
Additional ADCL (PO)MDP Projects
- Motion planning under uncertainty with temporal logic specifications
- Explainability
- Adaptive Stress Testing
[Ho, Sunberg, and Lahijanian, ICRA 2022, ICRA 2023 (?)]
[Tucker, Wagner, and Sunberg, AMOS 22]
POMDPs with Continuous States, Actions, and Observations
- PO-UCT (POMCP)
- DESPOT
- POMCPOW
- DESPOT-α
- LABECOP
- Ada-OPS
- GPS-ABT
- BOMCP
- VOMCPOW
BOMCP
[Mern, Sunberg, et al. AAAI 2021]
Voronoi Progressive Widening
[Lim, Tomlin, & Sunberg CDC 2021]
Motion Planning under Uncertainty with Temporal Logic Specifications
[Ho, Sunberg, and Lahijanian, ICRA 2022, ICRA 2023 (?)]
Key Idea: Simplified Model with Belief Approximation (SiMBA)
Challenge: Curse of dimensionality
Adopted Solution: Online sparse tree search
Shortcoming: No active information gathering
Additional Challenge: Curse of history
Adopted Solution: Particle filtering
Shortcoming: No mixed strategies
MDP
POMDP
Inc. Info. Game
POMDP = Belief MDP
Interaction Uncertainty
[Peters, Tomlin, and Sunberg 2020]
Game Theory
Nash Equilibrium: All players play a best response.
Optimization Problem (MDP or POMDP): \(\text{maximize} \quad f(x)\)
Game: Player 1: \(U_1 (a_1, a_2)\); Player 2: \(U_2 (a_1, a_2)\)
Example: Airborne Collision Avoidance

|                | Player 2: Up       | Player 2: Down     |
|----------------|--------------------|--------------------|
| Player 1: Up   | -6, -6 (collision) | -1, 1              |
| Player 1: Down | 1, -1              | -4, -4 (collision) |
Mixed Strategies

Strategy (\(\pi_i\)): probability distribution over actions
Exploitability (zero sum):
\[\sum_i \max_{\pi_i'} U_i(\pi_i', \pi_{-i})\]
Nash Equilibrium \(\iff\) Zero Exploitability

No Pure Nash Equilibrium!
Instead, there is a mixed Nash equilibrium where each player plays up or down with 50% probability.
If either player plays up or down more than 50% of the time, their strategy can be exploited.

|                | Player 2: Up       | Player 2: Down     |
|----------------|--------------------|--------------------|
| Player 1: Up   | -1, 1 (collision)  | 1, -1              |
| Player 1: Down | 1, -1              | -1, 1 (collision)  |
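A minimal Julia sketch of the exploitability computation above for this 2x2 zero-sum game; the payoff matrix matches the table, and the helper names are illustrative.

U1 = [-1.0  1.0;       # player 1 payoffs (rows: P1 Up/Down, columns: P2 Up/Down)
       1.0 -1.0]       # player 2's payoffs are the negative of these

function exploitability(U1, π1, π2)
    best1 = maximum(U1 * π2)        # player 1's best-response value against π2
    best2 = maximum(-(U1') * π1)    # player 2's best-response value against π1
    return best1 + best2            # zero exactly at a Nash equilibrium
end

exploitability(U1, [0.5, 0.5], [0.5, 0.5])   # 0.0 at the 50/50 mixed Nash
exploitability(U1, [1.0, 0.0], [0.5, 0.5])   # 1.0: a pure strategy can be exploited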
Challenge: Curse of dimensionality
Adopted Solution: Online sparse tree search
Shortcoming: No active information gathering
Additional Challenge: Curse of history
Adopted Solution: Particle filtering
Shortcoming: No mixed strategies
MDP
POMDP
Inc. Info. Game
Additional Challenge: Computing Mixed strategies
Solution: ???
POMDP = Belief MDP
Best response = POMDP
Space Domain Awareness Games
Simplified SDA Game
(Figure: objects \(1, 2, \ldots, N\) in the simplified SDA game)
[Becker & Sunberg CDC 2021]
Counterfactual Regret Minimization Training
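In a one-shot matrix game, CFR training reduces to self-play regret matching. A minimal sketch on the collision game above; the average strategies converge to the 50/50 mixed Nash equilibrium. Names and the iteration count are illustrative.

U1 = [-1.0  1.0;
       1.0 -1.0]

# turn cumulative positive regrets into a strategy (uniform if there are none)
positive_part(r) = max.(r, 0.0)
to_strategy(r) = sum(positive_part(r)) > 0 ? positive_part(r) ./ sum(positive_part(r)) : fill(1/length(r), length(r))

function regret_matching(U1; iters=10_000)
    r1, r2 = zeros(2), zeros(2)          # cumulative regrets
    avg1, avg2 = zeros(2), zeros(2)      # cumulative strategies
    for _ in 1:iters
        σ1, σ2 = to_strategy(r1), to_strategy(r2)
        avg1 .+= σ1; avg2 .+= σ2
        u1 = U1 * σ2                     # player 1 action values vs. σ2
        u2 = -(U1') * σ1                 # player 2 action values vs. σ1 (zero sum)
        r1 .+= u1 .- (σ1' * u1)          # regret for not having played each pure action
        r2 .+= u2 .- (σ2' * u2)
    end
    return avg1 ./ sum(avg1), avg2 ./ sum(avg2)
end

regret_matching(U1)   # ≈ ([0.5, 0.5], [0.5, 0.5])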
Open Source Research Software
Challenges for POMDP Software
- There is a huge variety of
  - Problems
    - Continuous/Discrete
    - Fully/Partially Observable
    - Generative/Explicit
    - Simple/Complex
  - Solvers
    - Online/Offline
    - Alpha Vector/Graph/Tree
    - Exact/Approximate
    - Domain-specific heuristics
- POMDPs are computationally difficult.
Explicit model vs. black box ("generative" in the POMDP literature): the black box maps \((s, a)\) to \((s', o, r)\)
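In POMDPs.jl, the black-box interface is exactly this: a solver only needs to sample \((s', o, r)\) given \((s, a)\). A minimal sketch, assuming a pomdp, a state s, and an action a are already defined:

using POMDPs
using Random

sp, o, r = @gen(:sp, :o, :r)(pomdp, s, a, Random.default_rng())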
Previous C++ framework: APPL
"At the moment, the three packages are independent. Maybe one day they will be merged in a single coherent framework."
Open Source Research Software
- Performant
- Flexible and Composable
- Free and Open
- Easy for a wide range of people to use (for homework)
- Easy for a wide range of people to understand
(Chart, 2013: existing frameworks in C++, Python, and Matlab each trade off between "fast" and "easy")
We love [Matlab, Lisp, Python, Ruby, Perl, Mathematica, and C]; they are wonderful and powerful. For the work we do — scientific computing, machine learning, data mining, large-scale linear algebra, distributed and parallel computing — each one is perfect for some aspects of the work and terrible for others. Each one is a trade-off.
We are greedy: we want more.
"Why We Created Julia" (2012)
Julia - Speed
Celeste Project
1.54 Petaflops
POMDPs.jl - An interface for defining and solving MDPs and POMDPs in Julia
Mountain Car
using QuickPOMDPs
using POMDPModelTools: ImplicitDistribution
using Distributions: Normal

partially_observable_mountaincar = QuickPOMDP(
    actions = [-1., 0., 1.],
    obstype = Float64,
    discount = 0.95,
    initialstate = ImplicitDistribution(rng -> (-0.2*rand(rng), 0.0)),
    isterminal = s -> s[1] > 0.5,
    gen = function (s, a, rng)
        x, v = s
        vp = clamp(v + a*0.001 + cos(3*x)*-0.0025, -0.07, 0.07)
        xp = x + vp
        if xp > 0.5
            r = 100.0
        else
            r = -1.0
        end
        return (sp=(xp, vp), r=r)
    end,
    observation = (a, sp) -> Normal(sp[1], 0.15)  # noisy measurement of the position only
)
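A minimal sketch of simulating the partially observable mountain car above with a particle-filter belief updater. The package choices (ParticleFilters.jl, POMDPSimulators.jl) and the simple velocity-based policy are illustrative assumptions, not part of the original slides.

using Statistics: mean
using ParticleFilters      # BootstrapFilter, particles
using POMDPSimulators      # stepthrough
using POMDPPolicies        # FunctionPolicy

up = BootstrapFilter(partially_observable_mountaincar, 1000)     # 1000-particle belief updater

# accelerate in the direction of the mean velocity of the belief
policy = FunctionPolicy(b -> mean(p[2] for p in particles(b)) < 0.0 ? -1.0 : 1.0)

for (b, a, o, r) in stepthrough(partially_observable_mountaincar, policy, up, "b,a,o,r", max_steps=200)
    @show (a, o, r)
end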
using POMDPs
using QuickPOMDPs
using POMDPPolicies
using Compose
import Cairo
using POMDPGifs
import POMDPModelTools: Deterministic
mountaincar = QuickMDP(
    # generative model: returns the next state and reward
    function (s, a, rng)
        x, v = s
        vp = clamp(v + a*0.001 + cos(3*x)*-0.0025, -0.07, 0.07)
        xp = x + vp
        if xp > 0.5
            r = 100.0
        else
            r = -1.0
        end
        return (sp=(xp, vp), r=r)
    end,
    actions = [-1., 0., 1.],
    initialstate = Deterministic((-0.5, 0.0)),
    discount = 0.95,
    isterminal = s -> s[1] > 0.5,
    render = function (step)
        cx = step.s[1]
        cy = 0.45*sin(3*cx)+0.5
        car = (context(), circle(cx, cy+0.035, 0.035), fill("blue"))
        track = (context(), line([(x, 0.45*sin(3*x)+0.5) for x in -1.2:0.01:0.6]), stroke("black"))
        goal = (context(), star(0.5, 1.0, -0.035, 5), fill("gold"), stroke("black"))
        bg = (context(), rectangle(), fill("white"))
        ctx = context(0.7, 0.05, 0.6, 0.9, mirror=Mirror(0, 0, 0.5))
        return compose(context(), (ctx, car, track, goal), bg)
    end
)

# heuristic policy: accelerate in the direction of the current velocity to build energy
energize = FunctionPolicy(s->s[2] < 0.0 ? -1.0 : 1.0)
makegif(mountaincar, energize; filename="out.gif", fps=20)
Members
- Astrodynamics
- Autonomous Systems (me)
- Bioastronautics
- Fluids, Structures and Materials
- Remote Sensing

Application Deadlines
- Spring 2023: October 1
- Fall 2023: December 1

Applicant Mentoring Program
Thank You!
Recent and Current Projects
Human Behavior Model: IDM and MOBIL
M. Treiber, et al., “Congested traffic states in empirical observations and microscopic simulations,” Physical Review E, vol. 62, no. 2 (2000).
A. Kesting, et al., “General lane-changing model MOBIL for car-following models,” Transportation Research Record, vol. 1999 (2007).
A. Kesting, et al., "Agents for Traffic Simulation." Multi-Agent Systems: Simulation and Applications. CRC Press (2009).
(Plot: simulation results with all drivers normal, comparing Omniscient, Mean MPC, QMDP, and POMCPOW)
COVID POMDP
(Figure: individual infectiousness vs. infection age; incident infections)
Larremore et al., "Test sensitivity is secondary to frequency and turnaround time for COVID-19 surveillance"
Viral load represented by piecewise-linear hinge function
Sparse PFT
Active Information Gathering for Safety
Reward Decomposition for Adaptive Stress Testing
UAV Component Failures
MPC for Intermittent Rotor Failures
Freshman Seminar