Breaking the Curse of Dimensionality in Decision-Making for Autonomous Systems
Assistant Professor Zachary Sunberg
University of Colorado Boulder
September 6th, 2024
PI: Prof. Zachary Sunberg
PhD Students
Postdoc
Two Objectives for Autonomy
Minimize resource use
(especially time)
Minimize the risk of harm to oneself and others
Safety often opposes Efficiency
Tweet by Nitin Gupta
29 April 2018
https://twitter.com/nitguptaa/status/990683818825736192
Video: Eric Frew
Driving: what are the other road users going to do?
Tornado Forecasting: what is going on in the storm?
Search and Rescue: where is the lost person?
All are sequential decision-making problems with uncertainty!
All can be modeled as POMDPs (with very large state and observation spaces).
Aleatory → MDP
+ Epistemic (Static) → RL
+ Epistemic (Dynamic) → POMDP
+ Interaction → Game
Markov Decision Process (MDP)
Aleatory
\([x, y, z,\;\; \phi, \theta, \psi,\;\; u, v, w,\;\; p,q,r]\)
\(\mathcal{S} = \mathbb{R}^{12}\)
\(\mathcal{S} = \mathbb{R}^{12} \times \mathbb{R}^\infty\)
\[\underset{\pi:\, \mathcal{S} \to \mathcal{A}}{\text{maximize}} \quad \text{E}\left[ \sum_{t=0}^\infty R(s_t, a_t) \right]\]
Reinforcement Learning
Aleatory
Epistemic (Static)
\([x, y, z,\;\; \phi, \theta, \psi,\;\; u, v, w,\;\; p,q,r]\)
\(\mathcal{S} = \mathbb{R}^{12}\)
\(\mathcal{S} = \mathbb{R}^{12} \times \mathbb{R}^\infty\)
Partially Observable Markov Decision Process (POMDP)
Aleatory
Epistemic (Static)
Epistemic (Dynamic)
\([x, y, z,\;\; \phi, \theta, \psi,\;\; u, v, w,\;\; p,q,r]\)
\(\mathcal{S} = \mathbb{R}^{12}\)
\(\mathcal{S} = \mathbb{R}^{12} \times \mathbb{R}^\infty\)
[Figure: 1-D example, state vs. timestep. Goal: \(a=0\) at \(s=0\). With accurate observations, the optimal policy simply applies \(a=0\) at the goal; otherwise the optimal policy must first localize, then apply \(a=0\).]
[Diagram: the Environment (true state \(s = 7\)) emits observation \(o = -0.21\); the Belief Updater maintains the belief \(b\); the Planner selects action \(a = +10\).]
\[b_t(s) = P\left(s_t = s \mid b_0, a_0, o_1 \ldots a_{t-1}, o_{t}\right)\]
\[ = P\left(s_t = s \mid b_{t-1}, a_{t-1}, o_{t}\right)\]
The planner must estimate \(Q(b, a)\); each exact belief update costs \(O(|\mathcal{S}|^2)\).
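For reference, a minimal sketch (not from the talk) of this discrete Bayes belief update; the transition and observation arrays `T` and `Z` are hypothetical placeholders, and the nested loops make the \(O(|\mathcal{S}|^2)\) cost per step explicit:

```julia
# One step of the discrete Bayes filter: b_t(s') ∝ Z(o | a, s') Σ_s T(s' | s, a) b_{t-1}(s)
# T[sp, s, a] = P(s' | s, a),  Z[o, a, sp] = P(o | a, s')  (hypothetical arrays)
function update_belief(b::Vector{Float64}, a::Int, o::Int,
                       T::Array{Float64,3}, Z::Array{Float64,3})
    n = length(b)
    bp = zeros(n)
    for sp in 1:n
        pred = 0.0
        for s in 1:n                    # O(|S|^2) work in total
            pred += T[sp, s, a] * b[s]  # prediction step
        end
        bp[sp] = Z[o, a, sp] * pred     # measurement update
    end
    return bp ./ sum(bp)                # normalize
end
```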
Online Tree Search in MDPs
The tree expands forward in time; estimate \(Q(s, a)\) at each node based on its children.
[Diagram: the Environment (true state \(s = 7\)) emits observation \(o = -0.21\); the Belief Updater computes \(b_t(s) = P\left(s_t = s \mid b_{t-1}, a_{t-1}, o_{t}\right)\) at \(O(|\mathcal{S}|^2)\) cost; the Policy/Planner maps the belief \(b\) to an action \(a\).]
[Diagram: without a belief updater, the Policy/Planner maps observations (e.g. \(o = -0.21\) from true state \(s = 7\)) directly to actions \(a\).]
Optimal planners need to consider the entire history
\(h_t = (b_0, a_0, o_1, a_1, o_2 \ldots a_{t-1}, o_{t})\)
POMDP \((S, A, T, R, O, Z)\) is equivalent to MDP \((S', A', T', R')\)
One new continuous state dimension for each state in \(S\)!
Tree size: \(O\left(\left(|A||O|\right)^D\right)\)
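For completeness, the standard belief-MDP construction behind this equivalence (a textbook identity, not a result of this work): the new state space is the set of beliefs over \(S\), so each state contributes one continuous dimension, and

\[\mathcal{S}' = \Delta(\mathcal{S}), \qquad R'(b,a) = \sum_{s \in \mathcal{S}} b(s)\, R(s,a),\]
\[T'(b' \mid b, a) = \sum_{o \in \mathcal{O}} P(o \mid b, a)\, \mathbf{1}\!\left[b' = \tau(b, a, o)\right],\]

where \(\tau(b, a, o)\) is the Bayes belief update above.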
\(d\) dimensions, \(k\) segments \(\,\rightarrow \, |S| = k^d\)
1 dimension
e.g. \(s = x \in S = \{1,2,3,4,5\}\)
\(|S| = 5\)
2 dimensions
e.g. \(s = (x,y) \in S = \{1,2,3,4,5\}^2\)
\(|S| = 25\)
3 dimensions
e.g. \(s = (x,y,x_h) \in S = \{1,2,3,4,5\}^3\)
\(|S| = 125\)
(Discretize each dimension into 5 segments)
Find \(\underset{s\sim b}{E}[f(s)]\)
\[=\sum_{s \in S} f(s) b(s)\]
Monte Carlo Integration
\(Q_N \equiv \frac{1}{N} \sum_{i=1}^N f(s_i)\)
\(s_i \sim b\) i.i.d.
\[\text{Var}(Q_N) = \text{Var}\left(\frac{1}{N} \sum_{i=1}^N f(s_i)\right) = \frac{1}{N^2} \sum_{i=1}^N\text{Var}\left(f(s_i)\right) = \frac{1}{N} \text{Var}\left(f(s_i)\right) \quad \text{(Bienaymé; i.i.d. samples)}\]
\[P(|Q_N - E[f(s_i)]| \geq \epsilon) \leq \frac{\text{Var}(f(s_i))}{N \epsilon^2} \quad \text{(Chebyshev)}\]
The exact sum over \(S\) suffers from the curse of dimensionality, but the Monte Carlo error bound has no direct dependence on \(|S|\)!
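A minimal Monte Carlo integration sketch (illustrative, not from the talk); `f` and `sample_b` are hypothetical stand-ins for the integrand and a belief sampler, and the \(1/N\) variance scaling holds regardless of the dimension of \(s\):

```julia
using Statistics  # mean

# Monte Carlo estimate of E_{s~b}[f(s)] from N i.i.d. samples of b.
montecarlo(f, sample_b, N) = mean(f(sample_b()) for _ in 1:N)

# Example with a hypothetical 12-dimensional belief (standard normal here):
f(s) = sum(abs2, s)        # integrand
sample_b() = randn(12)     # draw s ~ b
Q100    = montecarlo(f, sample_b, 100)
Q10_000 = montecarlo(f, sample_b, 10_000)  # error shrinks like 1/sqrt(N)
```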
\[b(s) \approx \sum_{i=1}^N w_i\, \delta_{s_i}(s)\]
[Sunberg and Kochenderfer, ICAPS 2018, T-ITS 2022]
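A minimal sketch of one weighted-particle belief update consistent with this approximation, assuming a generative transition model and an explicit observation density (both `gen` and `obs_pdf` are hypothetical placeholders):

```julia
# b(s) ≈ Σ_i w_i δ_{s_i}(s): propagate particles with the generative model,
# then reweight by the observation likelihood and normalize.
function particle_update(particles, weights, a, o, gen, obs_pdf, rng)
    new_particles = [gen(s, a, rng) for s in particles]      # s_i' ~ T(· | s_i, a)
    new_weights = [w * obs_pdf(a, sp, o)                     # w_i' ∝ w_i Z(o | a, s_i')
                   for (w, sp) in zip(weights, new_particles)]
    return new_particles, new_weights ./ sum(new_weights)
end
```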
How do we prove convergence?
Continuous \(S\), \(O\); Discrete \(A\)
No Dirac-delta observation densities
Bounded Reward
Generative model for \(T\); Explicit model for \(Z\)
Finite Horizon
Only reasonable beliefs
\(\mathcal{P}\): state distribution conditioned on observations (belief)
\(\mathcal{Q}\): marginal state distribution (proposal)
Expand for all actions (\(\left|\mathcal{A}\right| = 2\) in this case)
...
Expand for all \(\left|\mathcal{S}\right|\) states, or only \(C=3\) sampled states
For any \(\epsilon>0\) and \(\delta>0\), if \(C\) (the number of particles) is high enough,
\[|Q_{\mathbf{P}}^*(b,a) - Q_{\mathbf{M}_{\mathbf{P}}}^*(\bar{b},a)| \leq \epsilon \quad \text{w.p. } 1-\delta\]
[Lim, Becker, Kochenderfer, Tomlin, & Sunberg, JAIR 2023]
No direct dependence on \(|\mathcal{S}|\) or \(|\mathcal{O}|\)!
The required \(C\) is too large for direct safety guarantees, but in practice the approach works extremely well for improving efficiency.
[Lim, Becker, Kochenderfer, Tomlin, & Sunberg, JAIR 2023]
Tree size: \(O\left(\left(|A|C\right)^D\right)\)
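To make the \(O\left(\left(|A|C\right)^D\right)\) tree size concrete, here is a bare-bones sparse-sampling recursion for a generative model (a sketch only, with states in place of particle beliefs; `gen`, `actions`, and `γ` are hypothetical placeholders, and this is not the algorithm from the paper): every action is expanded, but only \(C\) children are sampled per action.

```julia
# Sparse-sampling estimate of Q(s, a) to depth d with C samples per action.
# gen(s, a, rng) -> (sp, r) is a hypothetical generative model; γ is the discount.
function sparse_Q(s, a, d, C, gen, actions, γ, rng)
    q = 0.0
    for _ in 1:C
        sp, r = gen(s, a, rng)
        v = d <= 1 ? 0.0 :
            maximum(sparse_Q(sp, ap, d - 1, C, gen, actions, γ, rng) for ap in actions)
        q += (r + γ * v) / C
    end
    return q  # tree size: O((|A| C)^d)
end
```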
Solve simplified surrogate problem for policy deep in the tree
[Lim, Tomlin, and Sunberg, 2021]
POMDP Formulation
State: \(s=\left(x, y, \dot{x}, \left\{(x_c,y_c,\dot{x}_c,l_c,\theta_c)\right\}_{c=1}^{n}\right)\) (ego external state; external and internal states of the other cars)
Observation: \(o=\left\{(x_c,y_c,\dot{x}_c,l_c)\right\}_{c=1}^{n}\) (external states of the other cars only)
Action: \(a = (\ddot{x}, \dot{y})\), \(\ddot{x} \in \{0, \pm 1 \text{ m/s}^2\}\), \(\dot{y} \in \{0, \pm 0.67 \text{ m/s}\}\)
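As an illustration of this structure only (hypothetical type names, not the implementation from the paper), the key modeling point is that the internal behavior parameters \(\theta_c\) appear in the state but are dropped from the observation:

```julia
# Hypothetical types illustrating the lane-changing POMDP state/observation split.
struct CarState
    x::Float64; y::Float64; xdot::Float64
    l::Int                # lane
    θ::Vector{Float64}    # internal behavior parameters (hidden)
end

struct CarObservation
    x::Float64; y::Float64; xdot::Float64
    l::Int                # external quantities only; θ is not observed
end

struct DrivingState
    ego_x::Float64; ego_y::Float64; ego_xdot::Float64
    cars::Vector{CarState}
end

observe(s::DrivingState) = [CarObservation(c.x, c.y, c.xdot, c.l) for c in s.cars]  # drops θ
```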
[Figure: simulation results, safety vs. efficiency curves for an MDP trained on normal drivers, an MDP trained on all drivers, an omniscient planner, and POMCPOW (ours).]
[Sunberg & Kochenderfer, T-ITS 2023]
[Gupta, Hayes, & Sunberg, AAMAS 2022]
Previous solution: conventional 1-DOF POMDP (92 s avg). Our solution: multi-DOF POMDP (65 s avg).
[Gupta, Hayes, & Sunberg, AAMAS 2021]
[Figure: baseline vs. our POMDP planner (result for a simplified dynamical system).]
[Ray, Laouar, Sunberg, & Ahmed, ICRA 2023]
Innovation: Large language models allow analysts to quickly specify anomaly hypotheses
Catalog Maintenance Plan
Three Contributions
[Ho et al., UAI 2024], [Ho, Feather, Rossi, Sunberg, & Lahijanian, UAI 2024], [Ho, Sunberg, & Lahijanian, ICRA 2022]
[Diagram: estimating reward weights for the linear reward \(R(s,a) = \alpha \cdot \boldsymbol{\phi}(s,a)\): calculate outcomes (\(a_h\) outcomes: \(\mu_h\); \(a_a\) outcomes: \(\mu_a\)), calculate the weight update \(\frac{\epsilon - \alpha_a \cdot \Delta \mu_{h-a}}{\Delta\mu_{h-j} \cdot \Delta \mu_{h-a}} \Delta\mu_{h-j}\), and estimate the weight \(\hat{\alpha}_{h}\) with this update in the \((\alpha[1], \alpha[2])\) weight space, where \(a_h\) and \(a_a\) are the actions optimal under \(\alpha_{h}\) and \(\alpha_{a}\).]
[Kraske, Saksena, Buczak, & Sunberg, ICAA 2024]
Partially Observable Markov Decision Process (POMDP)
Aleatory
Epistemic (Static)
Epistemic (Dynamic)
Aleatory
Epistemic (Static)
Epistemic (Dynamic)
Interaction
Nash Equilibrium: All players play a best response.
Optimization Problem
(MDP or POMDP)
\(\text{maximize} \quad f(x)\)
Game
Player 1: \(U_1 (a_1, a_2)\)
Player 2: \(U_2 (a_1, a_2)\)
Example: Airborne Collision Avoidance

|                | Player 2: Up        | Player 2: Down      |
|----------------|---------------------|---------------------|
| Player 1: Up   | -6, -6 (Collision)  | -1, 1               |
| Player 1: Down | 1, -1               | -4, -4 (Collision)  |
Strategy (\(\pi_i\)): probability distribution over actions
Exploitability (zero sum):
\[\sum_i \max_{\pi_i'} U_i(\pi_i', \pi_{-i})\]
Nash Equilibrium \(\iff\) Zero Exploitability
The zero-sum game below has no pure Nash equilibrium! Instead, there is a mixed Nash where each player plays Up or Down with 50% probability.
If either player plays Up or Down more than 50% of the time, their strategy can be exploited.
|                | Player 2: Up        | Player 2: Down      |
|----------------|---------------------|---------------------|
| Player 1: Up   | -1, 1 (Collision)   | 1, -1               |
| Player 1: Down | 1, -1               | -1, 1 (Collision)   |
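A small numerical check of these claims (illustrative code, not from the talk), using the exploitability expression above on the zero-sum payoff matrix in this table:

```julia
# Player 1's payoffs (rows: P1 Up/Down, columns: P2 Up/Down); zero sum, so U2 = -U1.
U1 = [-1.0  1.0;
       1.0 -1.0]
U2 = -U1

# Exploitability of a mixed strategy profile (p for Player 1, q for Player 2):
#   Σ_i max_{π_i'} U_i(π_i', π_{-i})
exploitability(p, q) = maximum(U1 * q) + maximum(U2' * p)

exploitability([0.5, 0.5], [0.5, 0.5])  # 0.0: the 50/50 mixed Nash equilibrium
exploitability([0.7, 0.3], [0.5, 0.5])  # 0.4: Player 1 leaning Up can be exploited
```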
POMDP Solution: a shrewd missile operator will use different actions, invalidating our belief.
Nash equilibrium: all players play a best response to the other players. It may include stochastic behavior (bluffing), which is fundamentally impossible for POMDP solvers to compute.
Every maneuver involves tradeoffs.
[Becker & Sunberg, AMOS 2022]
Image: Russell & Norvig, Artificial Intelligence: A Modern Approach
[Figure: game tree over private card deals (P1 dealt A or K; P2 dealt A or K).]
[Becker & Sunberg, NeurIPS 2024 (Under Review)]
Funding organizations (all opinions are my own)
VADeR
PI: Prof. Zachary Sunberg
PhD Students
Postdoc
Good Examples
Explicit vs. Black Box ("generative" in the POMDP literature): the black-box model maps \((s, a) \mapsto (s', o, r)\)
Previous C++ framework: APPL
"At the moment, the three packages are independent. Maybe one day they will be merged in a single coherent framework."
[Table: existing frameworks and their implementation languages: C++; Python, C++; Python, Matlab; Python, Matlab; Python, C++.]
We love [Matlab, Lisp, Python, Ruby, Perl, Mathematica, and C]; they are wonderful and powerful. For the work we do — scientific computing, machine learning, data mining, large-scale linear algebra, distributed and parallel computing — each one is perfect for some aspects of the work and terrible for others. Each one is a trade-off.
We are greedy: we want more.
POMDPs.jl - An interface for defining and solving MDPs and POMDPs in Julia
using QuickPOMDPs
using Distributions: Normal
import POMDPModelTools: ImplicitDistribution

partially_observable_mountaincar = QuickPOMDP(
    actions = [-1., 0., 1.],
    obstype = Float64,
    discount = 0.95,
    initialstate = ImplicitDistribution(rng -> (-0.2*rand(rng), 0.0)),
    isterminal = s -> s[1] > 0.5,
    # generative transition and reward model (mountain car dynamics)
    gen = function (s, a, rng)
        x, v = s
        vp = clamp(v + a*0.001 + cos(3*x)*-0.0025, -0.07, 0.07)
        xp = x + vp
        if xp > 0.5
            r = 100.0
        else
            r = -1.0
        end
        return (sp=(xp, vp), r=r)
    end,
    # noisy observation of the position only
    observation = (a, sp) -> Normal(sp[1], 0.15)
)
using POMDPs
using QuickPOMDPs
using POMDPPolicies
using Compose
import Cairo
using POMDPGifs
import POMDPModelTools: Deterministic
mountaincar = QuickMDP(
    function (s, a, rng)
        x, v = s
        vp = clamp(v + a*0.001 + cos(3*x)*-0.0025, -0.07, 0.07)
        xp = x + vp
        if xp > 0.5
            r = 100.0
        else
            r = -1.0
        end
        return (sp=(xp, vp), r=r)
    end,
    actions = [-1., 0., 1.],
    initialstate = Deterministic((-0.5, 0.0)),
    discount = 0.95,
    isterminal = s -> s[1] > 0.5,
    render = function (step)
        cx = step.s[1]
        cy = 0.45*sin(3*cx)+0.5
        car = (context(), circle(cx, cy+0.035, 0.035), fill("blue"))
        track = (context(), line([(x, 0.45*sin(3*x)+0.5) for x in -1.2:0.01:0.6]), stroke("black"))
        goal = (context(), star(0.5, 1.0, -0.035, 5), fill("gold"), stroke("black"))
        bg = (context(), rectangle(), fill("white"))
        ctx = context(0.7, 0.05, 0.6, 0.9, mirror=Mirror(0, 0, 0.5))
        return compose(context(), (ctx, car, track, goal), bg)
    end
)
energize = FunctionPolicy(s->s[2] < 0.0 ? -1.0 : 1.0)
makegif(mountaincar, energize; filename="out.gif", fps=20)
[Deglurkar, Lim, Sunberg, & Tomlin, 2023]
[Mern, Sunberg, et al., AAAI 2021]
[Lim, Tomlin, & Sunberg, CDC 2021]
Human Behavior Model: IDM and MOBIL
M. Treiber, et al., “Congested traffic states in empirical observations and microscopic simulations,” Physical Review E, vol. 62, no. 2 (2000).
A. Kesting, et al., “General lane-changing model MOBIL for car-following models,” Transportation Research Record, vol. 1999 (2007).
A. Kesting, et al., "Agents for Traffic Simulation." Multi-Agent Systems: Simulation and Applications. CRC Press (2009).
[Figure: simulation results with all drivers normal, comparing Omniscient, Mean MPC, QMDP, and POMCPOW.]