Julia 5280

How Julia Makes New Decision-Making AI Possible

Zachary Sunberg, PhD

Assistant Professor

University of Colorado Boulder

Sequential Decision Making under Uncertainty

Waymo Image By Dllu - Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=64517567

Markov Model

$\mathcal{S}$ - State space
$T:\mathcal{S}\times\mathcal{S} \to \mathbb{R}$ - Transition probability distributions
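A finite Markov model can be stored as a matrix whose row $s$ holds the distribution $T(s, \cdot)$. The minimal Julia sketch below uses a made-up two-state chain, and the function names are illustrative. It simulates the chain by sampling each next state from the current row. Over a long run, the fraction of time spent in state 1 should approach the chain's stationary probability, which is $5/6$ for these numbers.

```julia
using Random

# Hypothetical two-state chain (numbers made up): T[s, sp] = T(s, sp),
# so each row of T is the distribution over next states.
T = [0.9 0.1;
     0.5 0.5]

# Sample the next state by walking the cumulative distribution of row s.
function next_state(rng::AbstractRNG, T::AbstractMatrix, s::Int)
    u = rand(rng)
    cumulative = 0.0
    for sp in axes(T, 2)
        cumulative += T[s, sp]
        u < cumulative && return sp
    end
    return last(axes(T, 2))  # guard against floating-point round-off
end

# Simulate n transitions from s0 and return all visited states.
function simulate(rng::AbstractRNG, T::AbstractMatrix, s0::Int, n::Int)
    traj = [s0]
    for _ in 1:n
        push!(traj, next_state(rng, T, traj[end]))
    end
    return traj
end

traj = simulate(MersenneTwister(1), T, 1, 10_000)
frac1 = count(==(1), traj) / length(traj)  # long-run fraction of time in state 1
```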

Markov Decision Process (MDP)

$\mathcal{S}$ - State space
$T:\mathcal{S}\times \mathcal{A} \times\mathcal{S} \to \mathbb{R}$ - Transition probability distribution
$\mathcal{A}$ - Action space
$R:\mathcal{S}\times \mathcal{A} \to \mathbb{R}$ - Reward

Solving MDPs - The Value Function

$$\underset{\pi:\, \mathcal{S}\to\mathcal{A}}{\mathop{\text{maximize}}} \, V^\pi(s) = E\left[\sum_{t=0}^{\infty} \gamma^t R(s_t, \pi(s_t)) \bigm| s_0 = s \right]$$

Involves all future time

$$V^*(s) = \underset{a\in\mathcal{A}}{\max} \left\{R(s, a) + \gamma E\Big[V^*\left(s_{t+1}\right) \mid s_t=s, a_t=a\Big]\right\}$$

Involves only $t$ and $t+1$

$$Q(s,a) = R(s, a) + \gamma E\Big[V^* (s_{t+1}) \mid s_t = s, a_t=a\Big]$$

Value = expected sum of discounted future rewards ($\gamma \in [0, 1)$ is the discount factor)
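The Bellman equation suggests value iteration. Start from $V = 0$, apply the right-hand side repeatedly until $V$ stops changing, and then act greedily with respect to $Q$. This is a minimal sketch on a hypothetical two-state, two-action MDP stored as arrays `T[s, a, sp]` and `R[s, a]`. It does not reflect how any particular solver package is implemented.

```julia
# Hypothetical 2-state, 2-action MDP: T[s, a, sp] = transition probability, R[s, a] = reward.
T = zeros(2, 2, 2)
T[1, 1, 1] = 1.0   # state 1, action 1: stay
T[1, 2, 2] = 1.0   # state 1, action 2: move to state 2
T[2, 1, 2] = 1.0   # state 2, action 1: stay
T[2, 2, 1] = 1.0   # state 2, action 2: move to state 1
R = [0.0 0.0;
     1.0 0.0]      # reward 1 only for staying in state 2
γ = 0.9

# Q(s, a) = R(s, a) + γ Σ_sp T(s, a, sp) V(sp), for every (s, a) pair.
qvalues(T, R, γ, V) = [R[s, a] + γ * sum(T[s, a, sp] * V[sp] for sp in axes(T, 3))
                       for s in axes(T, 1), a in axes(T, 2)]

# Apply the Bellman backup V(s) = max_a Q(s, a) until the largest change is below tol.
function value_iteration(T, R, γ; tol = 1e-8)
    V = zeros(size(T, 1))
    while true
        Vnew = vec(maximum(qvalues(T, R, γ, V); dims = 2))
        maximum(abs.(Vnew - V)) < tol && return Vnew
        V = Vnew
    end
end

V = value_iteration(T, R, γ)
Q = qvalues(T, R, γ, V)
policy = [argmax(Q[s, :]) for s in axes(Q, 1)]   # greedy action in each state
```

For these numbers, the optimal values are $V^*(2) = 1/(1-0.9) = 10$ (stay in state 2 forever) and $V^*(1) = 0.9 \cdot 10 = 9$ (move to state 2 first).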

Online Decision Process Tree Approaches

Estimate $Q(s, a)$ from the values of its children in the tree:

$$Q(s,a) = R(s, a) + \gamma E\Big[V^* (s_{t+1}) \mid s_t = s, a_t=a\Big]$$

$$V(s) = \max_a Q(s,a)$$
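The simplest way to estimate $Q(s,a)$ from children is exhaustive depth-limited forward search. Expand every action and every reachable next state to depth $d$, back up $V(s) = \max_a Q(s,a)$, and treat the leaves as worth zero. The sketch assumes a hypothetical tabular MDP stored as `T[s, a, sp]` and `R[s, a]`. Its cost grows exponentially with depth, which is why practical online planners sample the tree instead of enumerating it.

```julia
# Hypothetical 2-state, 2-action MDP: T[s, a, sp] = transition probability, R[s, a] = reward.
T = zeros(2, 2, 2)
T[1, 1, 1] = 1.0; T[1, 2, 2] = 1.0   # state 1: action 1 stays, action 2 moves to 2
T[2, 1, 2] = 1.0; T[2, 2, 1] = 1.0   # state 2: action 1 stays, action 2 moves to 1
R = [0.0 0.0;
     1.0 0.0]                        # reward 1 only for staying in state 2
γ = 0.9

# Estimate V(s) by expanding the full tree to depth d (leaves are worth 0).
function forward_search(T, R, γ, s, d)
    d == 0 && return 0.0
    best = -Inf
    for a in axes(T, 2)
        # Q(s, a) = R(s, a) + γ Σ_sp T(s, a, sp) V(sp), with V(sp) from the children
        q = R[s, a]
        for sp in axes(T, 3)
            p = T[s, a, sp]
            if p > 0
                q += γ * p * forward_search(T, R, γ, sp, d - 1)
            end
        end
        best = max(best, q)              # V(s) = max_a Q(s, a)
    end
    return best
end

v1 = forward_search(T, R, γ, 1, 3)
v2 = forward_search(T, R, γ, 2, 3)
```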

Partially Observable Markov Decision Process (POMDP)

$\mathcal{S}$ - State space
$T:\mathcal{S}\times \mathcal{A} \times\mathcal{S} \to \mathbb{R}$ - Transition probability distribution
$\mathcal{A}$ - Action space
$R:\mathcal{S}\times \mathcal{A} \to \mathbb{R}$ - Reward
$\mathcal{O}$ - Observation space
$Z:\mathcal{S} \times \mathcal{A}\times \mathcal{S} \times \mathcal{O} \to \mathbb{R}$ - Observation probability distribution

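POMDP Sense-Plan-Act Loop

Environment → Belief Updater → Planner → Environment

$o$ - observation from the environment (example: $x, y, v$)
$b$ - belief produced by the belief updater (example: Aggressive: 63%, Normal: 34%, Timid: 3%)
$a$ - action chosen by the planner (example: Turn Left)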
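The belief updater's job can be sketched with Bayes' rule over a small discrete set. In the hypothetical example below, the hidden state is the other driver's type, as in the example above. The type is assumed not to change between steps, so the transition step is the identity. The observation is whether the driver was seen accelerating or yielding, and the likelihood numbers are invented for illustration.

```julia
# Hidden driver types (assumed fixed during the encounter, so no transition step is needed).
types = [:aggressive, :normal, :timid]

# Hypothetical observation likelihoods P(o | type), one entry per type above.
likelihood = Dict(
    :accelerates => [0.7, 0.4, 0.1],
    :yields      => [0.3, 0.6, 0.9],
)

# Bayes' rule: multiply the prior by the likelihood of o and renormalize.
function update_belief(b::Vector{Float64}, o::Symbol)
    bp = b .* likelihood[o]
    return bp ./ sum(bp)
end

b0 = fill(1/3, 3)                                   # uniform prior over types
observations = (:accelerates, :accelerates, :yields)
b = foldl(update_belief, observations; init = b0)   # apply the updates in order

for (t, p) in zip(types, b)
    println(t, ": ", round(100p; digits = 1), "%")
end
```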

[Figure: timeline marking 2011, 2013, and 2014]

Also ~2011: Improving TCAS

ACAS X

POMDP Models

Optimization

Specification

POMDPs.jl - An interface for defining and solving MDPs and POMDPs in Julia

[Egorov, Sunberg, et al., 2017]

Challenges for POMDP Software

POMDPs are computationally difficult.

Julia - Speed

Celeste Project

1.54 Petaflops

Challenges for POMDP Software (continued)

Besides being computationally difficult, POMDP software must handle a huge variety of:
- Problems
  - Continuous/Discrete
  - Fully/Partially Observable
  - Generative/Explicit
  - Simple/Complex
- Solvers
  - Online/Offline
  - Alpha Vector/Graph/Tree
  - Exact/Approximate
- Domain-specific heuristics
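One way Julia helps with this variety is multiple dispatch. A solver can be written against a few generic functions, and each problem supplies its own methods. The sketch below is a hypothetical miniature interface: the names `Problem`, `actions`, `gen`, and `rollout_value` are made up for illustration and are not the POMDPs.jl API. It shows one generic Monte Carlo rollout "solver" working unchanged with a discrete and a continuous problem.

```julia
using Random

abstract type Problem end

# Hypothetical generic interface: each problem type adds methods to these functions.
function actions end   # actions(p, s) -> collection of available actions
function gen end       # gen(p, s, a, rng) -> (next state, reward)

# Generic "solver": average discounted return of uniformly random rollouts from s.
function rollout_value(p::Problem, s, rng::AbstractRNG; depth = 20, n = 1000, γ = 0.95)
    total = 0.0
    for _ in 1:n
        x, disc = s, 1.0
        for _ in 1:depth
            a = rand(rng, actions(p, x))
            x, r = gen(p, x, a, rng)
            total += disc * r
            disc *= γ
        end
    end
    return total / n
end

# Problem 1: discrete walk on 1..10, reward for being at the right end.
struct Chain <: Problem end
actions(::Chain, s) = [-1, 1]
gen(::Chain, s, a, rng) = (clamp(s + a, 1, 10), s == 10 ? 1.0 : 0.0)

# Problem 2: continuous state with Gaussian noise, penalized for distance from 0.
struct Drift <: Problem end
actions(::Drift, s) = [-0.5, 0.5]
gen(::Drift, s, a, rng) = (s + a + 0.1 * randn(rng), -abs(s))

rng = MersenneTwister(0)
v_chain = rollout_value(Chain(), 1, rng)     # same solver code...
v_drift = rollout_value(Drift(), 0.0, rng)   # ...dispatched on a different problem
```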

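Explicit vs. Black Box ("Generative" in the POMDP literature)

Explicit: the model exposes its probability distributions (e.g. $T$ and $Z$) directly
Black box: a simulator takes $s, a$ and returns sampled $s', o, r$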
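The difference between the two kinds of model can be sketched on a toy problem. The example is hypothetical: states are integer positions, a move succeeds with probability 0.8, and the position is observed with noise, with all numbers invented. It gives an explicit next-state distribution alongside a generative function that only samples $(s', o, r)$ from $(s, a)$. Simulation-based solvers such as POMCP need only the second form.

```julia
using Random

# Hypothetical 1-D problem: integer positions, actions -1 or +1.
# The intended move succeeds with probability 0.8; otherwise the agent stays put.

# Explicit model: the full next-state distribution as (state => probability) pairs.
transition(s::Int, a::Int) = [s + a => 0.8, s => 0.2]

# Generative ("black box") model: only samples (s′, o, r) given (s, a).
function gen(s::Int, a::Int, rng::AbstractRNG)
    sp = rand(rng) < 0.8 ? s + a : s   # sample the next state
    o = sp + rand(rng, -1:1)           # noisy position reading, off by at most 1
    r = sp == 0 ? 1.0 : -0.1           # reward for reaching the origin
    return (sp, o, r)
end

# Monte Carlo check: sampled frequencies should approach the explicit probabilities.
rng = MersenneTwister(2)
samples = [gen(3, -1, rng) for _ in 1:10_000]
frac_moved = count(x -> x[1] == 2, samples) / length(samples)
```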

Previous C++ framework: APPL

"At the moment, the three packages are independent. Maybe one day they will be merged in a single coherent framework."

[Egorov, Sunberg, et al., 2017]

A POMDP is an MDP on the belief space, but belief updates are expensive.
POMCP* uses simulations of histories instead of full belief updates.
Each belief is implicitly represented by a collection of unweighted particles.

[Ross, 2008] [Silver, 2010]

*(Partially Observable Monte Carlo Planning)
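The unweighted-particle idea can be sketched with a simple rejection-style update in the spirit of POMCP. Each particle is simulated forward with a generative model, and only the next states whose simulated observation matches the real one are kept. The 1-D model below is hypothetical. The sketch omits the tree search itself and the particle-reinvigoration step that a full implementation needs when few particles survive.

```julia
using Random

# Hypothetical generative model on integer positions:
# move by a with probability 0.8 (else stay), then observe the position with ±1 noise.
function gen(s::Int, a::Int, rng::AbstractRNG)
    sp = rand(rng) < 0.8 ? s + a : s
    o = sp + rand(rng, -1:1)
    return sp, o
end

# Unweighted particle update by rejection: propagate randomly chosen particles and
# keep the next states whose simulated observation equals the real observation o.
function update_particles(particles::Vector{Int}, a::Int, o::Int, rng::AbstractRNG;
                          n = length(particles), max_tries = 100n)
    new = Int[]
    tries = 0
    while length(new) < n && tries < max_tries
        tries += 1
        s = rand(rng, particles)
        sp, osim = gen(s, a, rng)
        osim == o && push!(new, sp)
    end
    return new   # may hold fewer than n particles if o was unlikely
end

rng = MersenneTwister(3)
b0 = rand(rng, 0:10, 1000)             # initial belief: uniform over positions 0..10
b1 = update_particles(b0, 1, 5, rng)   # moved right, then observed position 5
```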

POMCP

POMCP-DPW

POMCPOW
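POMCP-DPW and POMCPOW use progressive widening to cope with continuous spaces. A node gets a new child only while its number of children $|C|$ satisfies $|C| \le k N^\alpha$, where $N$ is the node's visit count, so the branching factor grows slowly with visits. The minimal sketch below applies the rule to sampled continuous actions. The parameter values and the uniform action sampler are hypothetical, and the same rule can also be applied to observation children.

```julia
using Random

# A tree node whose children are sampled continuous actions.
mutable struct Node
    N::Int                      # visit count
    children::Vector{Float64}   # actions added so far
end
Node() = Node(0, Float64[])

# Progressive widening: add a new child only while |C| ≤ k N^α.
function widen!(node::Node, rng::AbstractRNG; k = 2.0, α = 0.5)
    node.N += 1
    if length(node.children) <= k * node.N^α
        push!(node.children, 2 * rand(rng) - 1)   # hypothetical sampler: uniform on [-1, 1]
    end
    return node
end

rng = MersenneTwister(4)
node = Node()
for _ in 1:10_000
    widen!(node, rng)
end
# With k = 2 and α = 0.5, the node ends up with roughly 2√N ≈ 200 children, not 10,000.
```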

General Sum Differential Games

Joint Dynamics:
$$\dot{x} = f(t, x, u_1, \ldots, u_N)$$

Cost for player $i$:
$$J_i = \int_0^T{g_i(t, x, u_1, \ldots, u_N) dt}$$

Strategy of player $i$:
$$u_i(t) = \gamma_i(t, x)$$
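For intuition, the cost integrals can be approximated by simulating the joint dynamics with forward-Euler steps under fixed feedback strategies. The example below is hypothetical and not taken from the talk. It has two players and a scalar state, with $f = u_1 + u_2$, linear strategies $\gamma_i(t, x) = -k_i x$, and running costs $g_i = x^2 + u_i^2$. It only evaluates the given strategies and does not solve for an equilibrium.

```julia
# Hypothetical two-player scalar game: ẋ = u₁ + u₂, running cost gᵢ = x² + uᵢ².
f(t, x, u1, u2) = u1 + u2
g(t, x, u) = x^2 + u^2

# Evaluate J₁, J₂ for linear feedback strategies γᵢ(t, x) = -kᵢ x
# by forward-Euler integration over [0, T] with step dt.
function evaluate(k1, k2; x0 = 1.0, T = 5.0, dt = 1e-3)
    x, J1, J2 = x0, 0.0, 0.0
    for n in 0:round(Int, T / dt) - 1
        t = n * dt
        u1, u2 = -k1 * x, -k2 * x        # each player's strategy γᵢ(t, x)
        J1 += g(t, x, u1) * dt           # accumulate running costs
        J2 += g(t, x, u2) * dt
        x += f(t, x, u1, u2) * dt        # Euler step of the joint dynamics
    end
    return J1, J2
end

J1, J2 = evaluate(1.0, 0.5)
```

Here $x(t) = e^{-1.5t}$, so the exact costs are about $2/3$ for player 1 and $1.25/3$ for player 2. The Euler estimates should be close to these values.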

Continuous Action Spaces

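using POMDPs, Zygote  # assumed sources: @gen from POMDPs.jl, pullback from Zygote.jl
(sp, r), back = pullback((s,a)->@gen(:sp,:r)(m, s, a, rng), s, a)  # m, s, a, rng defined elsewhere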
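Gradients like this are possible because generic Julia code accepts any number type that supports the operations it uses. The sketch below illustrates that idea in a self-contained way. It uses forward-mode differentiation with a hand-rolled dual number, a different technique from the reverse-mode `pullback` above. It differentiates the reward of a hypothetical deterministic step with respect to a continuous action and runs gradient ascent on that action. The names `Dual`, `step_reward`, `derivative`, and `optimize_action` are made up for illustration.

```julia
import Base: +, -, *

# Minimal forward-mode dual number: a value and its derivative.
struct Dual
    v::Float64   # value
    d::Float64   # derivative with respect to the input being differentiated
end
+(x::Dual, y::Dual) = Dual(x.v + y.v, x.d + y.d)
-(x::Dual, y::Dual) = Dual(x.v - y.v, x.d - y.d)
*(x::Dual, y::Dual) = Dual(x.v * y.v, x.d * y.v + x.v * y.d)   # product rule
-(x::Dual) = Dual(-x.v, -x.d)
+(x::Real, y::Dual) = Dual(x, 0.0) + y
+(x::Dual, y::Real) = x + Dual(y, 0.0)
-(x::Dual, y::Real) = x - Dual(y, 0.0)
*(x::Real, y::Dual) = Dual(x, 0.0) * y

derivative(f, x) = f(Dual(x, 1.0)).d

# Hypothetical step: move by action a; reward for ending near goal, penalty on effort.
function step_reward(s, a; goal = 3.0, c = 0.1)
    sp = s + a
    e = sp - goal
    return -(e * e) - c * (a * a)
end

# Gradient ascent on the action, using derivatives of the generic step_reward.
function optimize_action(s; a = 0.0, lr = 0.1, iters = 200)
    for _ in 1:iters
        a += lr * derivative(x -> step_reward(s, x), a)
    end
    return a
end

a_star = optimize_action(1.0)   # analytic optimum for s = 1 is 4 / 2.2
```

Packages such as ForwardDiff.jl and Zygote.jl provide complete, production-quality versions of this idea.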

Acknowledgements

The content of my research reflects my opinions and conclusions, and is not necessarily endorsed by my funding organizations.

Thank You!

zachary.sunberg.net

POMDP Sense-Plan-Act Loop

Environment → Belief Updater → Policy → Environment

$o$ - observation
$b$ - belief
$a$ - action

$$b_t(s) = P\left(s_t = s \mid a_1, o_1, \ldots, a_{t-1}, o_{t-1}\right)$$
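Applying Bayes' rule to this definition gives the recursive update that a belief updater implements for a discrete state space. This is a standard derivation, written with the transition model $T$ and the observation model $Z:\mathcal{S} \times \mathcal{A}\times \mathcal{S} \times \mathcal{O} \to \mathbb{R}$ defined earlier. After taking action $a_t$ and receiving observation $o_t$:

$$b_{t+1}(s') = \frac{\sum_{s\in\mathcal{S}} Z(s, a_t, s', o_t)\, T(s, a_t, s')\, b_t(s)}{\sum_{s''\in\mathcal{S}} \sum_{s\in\mathcal{S}} Z(s, a_t, s'', o_t)\, T(s, a_t, s'')\, b_t(s)}$$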

Laser Tag POMDP

Types of Uncertainty

ALEATORY

MODEL (Epistemic, Static)

STATE (Epistemic, Dynamic)