Breaking the Curse of Dimensionality in Planning Under Uncertainty
Assistant Professor Zachary Sunberg
University of Colorado Boulder
Fall 2024
PI: Prof. Zachary Sunberg
PhD Students
Postdoc
Tweet by Nitin Gupta
29 April 2018
https://twitter.com/nitguptaa/status/990683818825736192
Videos: Eric Frew
Driving: what are the other drivers going to do?
Tornado Forecasting: what is going on in the storm?
Europa: what is the system and environment status?
All are sequential decision-making problems with uncertainty!
All can be modeled as a POMDP.
Types of uncertainty handled by each framework:

|        | Aleatory | Epistemic (Static) | Epistemic (Dynamic) | Interaction |
| MDP    | ✓        |                    |                     |             |
| RL     | ✓        | ✓                  |                     |             |
| POMDP  | ✓        | ✓                  | ✓                   |             |
| Game   | ✓        | ✓                  | ✓                   | ✓           |
Markov Decision Process (MDP)
Aleatory
\([x, y, z,\;\; \phi, \theta, \psi,\;\; u, v, w,\;\; p,q,r]\)
\(\mathcal{S} = \mathbb{R}^{12}\)
\[\underset{\pi:\, \mathcal{S} \to \mathcal{A}}{\text{maximize}} \quad \text{E}\left[ \sum_{t=0}^\infty R(s_t, a_t) \right]\]
\(\mathcal{S} = \mathbb{R}^{12} \times \mathbb{R}^\infty\)
Reinforcement Learning
Aleatory
Epistemic (Static)
\([x, y, z,\;\; \phi, \theta, \psi,\;\; u, v, w,\;\; p,q,r]\)
\(\mathcal{S} = \mathbb{R}^{12}\)
\(\mathcal{S} = \mathbb{R}^{12} \times \mathbb{R}^\infty\)
Partially Observable Markov Decision Process (POMDP)
Aleatory
Epistemic (Static)
Epistemic (Dynamic)
\([x, y, z,\;\; \phi, \theta, \psi,\;\; u, v, w,\;\; p,q,r]\)
\(\mathcal{S} = \mathbb{R}^{12}\)
\(\mathcal{S} = \mathbb{R}^{12} \times \mathbb{R}^\infty\)
[Plot: state vs. timestep]
Environment
Belief Updater
True State
\(s = 7\)
Observation \(o = -0.21\)
\(b\)
\[b_t(s) = P\left(s_t = s \mid b_0, a_0, o_1 \ldots a_{t-1}, o_{t}\right)\]
\[ = P\left(s_t = s \mid b_{t-1}, a_{t-1}, o_{t}\right)\]
\(a\)
\[b_{t+1}(s') \propto Z(o \mid a, s')\int_{s \in \mathcal{S}} T(s' \mid s, a) b_t(s) ds\]
\(O(|\mathcal{S}|^2)\) for finite \(\mathcal{S}\)
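As a concrete sketch of this exact update for a finite state space (the array names and shapes below are illustrative, not from any particular library): the sum over all current states for every next state is what makes the update \(O(|\mathcal{S}|^2)\).

```python
import numpy as np

def update_belief(b, a, o, T, Z):
    """Exact Bayes-filter belief update for a finite-state POMDP (illustrative sketch).

    b : (|S|,) array, current belief b_t
    T : (|S|, |A|, |S|) array, T[s, a, s'] = T(s' | s, a)
    Z : (|A|, |S|, |O|) array, Z[a, s', o] = Z(o | a, s')
    Returns the normalized next belief b_{t+1}.
    """
    predicted = b @ T[:, a, :]             # predicted[s'] = sum_s T(s'|s,a) b(s)  -- O(|S|^2) work
    unnormalized = Z[a, :, o] * predicted  # weight by the observation likelihood Z(o | a, s')
    return unnormalized / unnormalized.sum()
```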
[Plots: state vs. timestep (accurate observations)]
Goal: \(a=0\) at \(s=0\)
Optimal policy: localize first, then apply \(a=0\)
Environment
Belief Updater
Planner
\(a = +10\)
True State
\(s = 7\)
Observation \(o = -0.21\)
\(b\)
\[b_t(s) = P\left(s_t = s \mid b_0, a_0, o_1 \ldots a_{t-1}, o_{t}\right)\]
\[ = P\left(s_t = s \mid b_{t-1}, a_{t-1}, o_{t}\right)\]
\(Q(b, a)\)
Tree size: \(O\left(\left(|A||O|\right)^D\right)\)
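For a sense of scale with illustrative numbers: even a modest problem with \(|A| = 3\) actions and \(|O| = 10\) observations expanded to depth \(D = 10\) yields \((3 \cdot 10)^{10} \approx 6 \times 10^{14}\) leaf nodes, and with a continuous observation space the branching factor is unbounded.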
\(d\) dimensions, \(k\) segments \(\,\rightarrow \, |S| = k^d\)
1 dimension
e.g. \(s = x \in S = \{1,2,3,4,5\}\)
\(|S| = 5\)
2 dimensions
e.g. \(s = (x,y) \in S = \{1,2,3,4,5\}^2\)
\(|S| = 25\)
3 dimensions
e.g. \(s = (x,y,x_h) \in S = \{1,2,3,4,5\}^3\)
\(|S| = 125\)
(Discretize each dimension into 5 segments)
Find \(\underset{s\sim b}{E}[f(s)]\)
\[=\sum_{s \in S} f(s) b(s)\]
Monte Carlo Integration
\(Q_N \equiv \frac{1}{N} \sum_{i=1}^N f(s_i)\)
\(s_i \sim b\) i.i.d.
\(\text{Var}(Q_N) = \text{Var}\left(\frac{1}{N} \sum_{i=1}^N f(s_i)\right)\)
\(= \frac{1}{N^2} \sum_{i=1}^N\text{Var}\left(f(s_i)\right)\) (Bienaymé)
\(= \frac{1}{N} \text{Var}\left(f(s_i)\right)\)
\[P(|Q_N - E[f(s_i)]| \geq \epsilon) \leq \frac{\text{Var}(f(s_i))}{N \epsilon^2} \quad \text{(Chebyshev)}\]
Exhaustive summation suffers from the curse of dimensionality; the Monte Carlo bound has no direct dependence on the size or dimension of \(\mathcal{S}\).
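A small sketch of this estimator (the belief sampler and \(f\) below are illustrative placeholders): nothing in the code or in the Chebyshev bound changes as the state dimension grows; only \(N\) and \(\text{Var}(f)\) matter.

```python
import numpy as np

rng = np.random.default_rng(0)

def monte_carlo_expectation(sample_belief, f, N):
    """Estimate E_{s~b}[f(s)] by averaging f over N i.i.d. samples from the belief b."""
    return np.mean([f(sample_belief(rng)) for _ in range(N)])

# Illustrative 12-dimensional belief (standard Gaussian) and f(s) = ||s||^2;
# E[f] = 12, and the estimator's variance is Var(f)/N regardless of the dimension.
sample_belief = lambda rng: rng.normal(size=12)
f = lambda s: float(s @ s)
print(monte_carlo_expectation(sample_belief, f, N=10_000))  # approx. 12
```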
\[b(s) \approx \sum_{i=1}^N \delta_{s}(s_i)\; w_i\]
\[b_{t+1}(s') \propto Z(o \mid a, s')\int_{s \in \mathcal{S}} T(s' \mid s, a) b_t(s)\, ds\]
\(\implies\) Sample \(s'_i\) from \(T(s' | s_i, a)\),
\(w'_i \propto w_i \times Z(o \mid a, s'_i)\)
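A minimal sketch of this weighted-particle update (the `sample_transition` and `obs_likelihood` callables stand in for the POMDP's \(T\) and \(Z\) models and are assumptions of this sketch, not a specific library API):

```python
import numpy as np

def particle_update(particles, weights, a, o, sample_transition, obs_likelihood, rng):
    """One particle-filter belief update: propagate through T, then reweight by Z.

    particles : list of states s_i whose weighted sum approximates b_t
    weights   : array of importance weights w_i
    """
    new_particles = [sample_transition(s, a, rng) for s in particles]      # s'_i ~ T(. | s_i, a)
    new_weights = weights * np.array([obs_likelihood(o, a, sp) for sp in new_particles])
    new_weights /= new_weights.sum()   # normalize; resampling can be added if weights degenerate
    return new_particles, new_weights
```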
POMDP Formulation
\(s=\left(x, y, \dot{x}, \left\{(x_c,y_c,\dot{x}_c,l_c,\theta_c)\right\}_{c=1}^{n}\right)\): ego external state plus the external states \((x_c,y_c,\dot{x}_c,l_c)\) and internal states \(\theta_c\) of the other cars
\(o=\left\{(x_c,y_c,\dot{x}_c,l_c)\right\}_{c=1}^{n}\): external states of the other cars only
\(a = (\ddot{x}, \dot{y})\), \(\ddot{x} \in \{0, \pm 1 \text{ m/s}^2\}\), \(\dot{y} \in \{0, \pm 0.67 \text{ m/s}\}\)
Simulation results
[Figure: safety vs. efficiency curves for MDP trained on normal drivers, MDP trained on all drivers, Omniscient, and POMCPOW (Ours)]
[Sunberg & Kochenderfer, T-ITS 2023]
Convergence?
\(\mathcal{P}\): state distribution conditioned on observations (belief)
\(\mathcal{Q}\): marginal state distribution (proposal)
Expand for all actions (\(\left|\mathcal{A}\right| = 2\) in this case)
...
Instead of expanding for all \(\left|\mathcal{S}\right|\) states, sample only \(C=3\) states
For any \(\epsilon>0\) and \(\delta>0\), if \(C\) (the number of particles) is high enough,
\[|Q_{\mathbf{P}}^*(b,a) - Q_{\mathbf{M}_{\mathbf{P}}}^*(\bar{b},a)| \leq \epsilon \quad \text{w.p. } 1-\delta\]
[Lim, Becker, Kochenderfer, Tomlin, & Sunberg, JAIR 2023]
No direct dependence on \(|\mathcal{S}|\) or \(|\mathcal{O}|\)!
\(C\) is too large for any direct safety guarantees, but in practice this works extremely well for improving efficiency.
[Lim, Becker, Kochenderfer, Tomlin, & Sunberg, JAIR 2023]
Tree size: \(O\left(\left(|A|C\right)^D\right)\)
Solve simplified surrogate problem for policy deep in the tree
[Lim, Tomlin, and Sunberg, 2021]
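To make the \(O\left(\left(|A|C\right)^D\right)\) tree size concrete, here is a sparse-sampling-style sketch in Python; it is an illustration under an assumed generative interface (`step`, `rollout_value`), not the POMCPOW or JAIR algorithm itself. Each belief node expands every action but only \(C\) sampled children per action, and a surrogate rollout value is used at the depth limit (cf. the surrogate-policy idea above).

```python
import numpy as np

def sparse_value(b, depth, actions, step, rollout_value, C, gamma, rng):
    """Sparse-sampling value estimate on a sampled (particle) belief tree.

    step(b, a, rng) -> (r, b_next) : samples one reward and next belief   (assumed interface)
    rollout_value(b)               : heuristic value used at the depth limit
    Tree size is O((|A| * C)^depth): every action times C sampled children per action.
    """
    if depth == 0:
        return rollout_value(b)          # simplified surrogate / rollout deep in the tree
    best = -np.inf
    for a in actions:
        total = 0.0
        for _ in range(C):               # only C sampled successors, not all |S| or |O|
            r, b_next = step(b, a, rng)
            total += r + gamma * sparse_value(b_next, depth - 1, actions,
                                              step, rollout_value, C, gamma, rng)
        best = max(best, total / C)
    return best
```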
Three Contributions
State:
[Ho et al., UAI 24], [Ho, Feather, Rossi, Sunberg, & Lahijanian, UAI 24], [Ho, Sunberg, & Lahijanian, ICRA 22]
Evader strategy:
Move away from pursuer
Embedded in \(T(s' \mid s, a)\)
POMDP: Aleatory, Epistemic (Static), Epistemic (Dynamic)
Game: Aleatory, Epistemic (Static), Epistemic (Dynamic), Interaction
Nash Equilibrium: All players play a best response.
Optimization Problem
\(\text{maximize} \quad f(x)\)
\(\text{subject to} \quad g(x) \geq 0\)
Game
Player 1: \(U_1 (a_1, a_2)\)
Player 2: \(U_2 (a_1, a_2)\)
Example: Airborne Collision Avoidance

|                | Player 2: Up       | Player 2: Down     |
| Player 1: Up   | -6, -6 (Collision) | -1, 1              |
| Player 1: Down | 1, -1              | -4, -4 (Collision) |

(Payoffs listed as Player 1, Player 2.)
Exploitability (zero sum):
\[\sum_i \max_{\pi_i'} U_i(\pi_i', \pi_{-i})\]
Nash Equilibrium \(\iff\) Zero Exploitability

Hypersonic Missile Defense (simplified)

|                | Defender: Up      | Defender: Down    |
| Attacker: Up   | -1, 1 (Collision) | 1, -1             |
| Attacker: Down | 1, -1             | -1, 1 (Collision) |

(Payoffs listed as Attacker, Defender.)

No pure Nash equilibrium! Instead, there is a mixed Nash equilibrium where each player plays Up or Down with 50% probability. If either player plays Up or Down more than 50% of the time, their strategy can be exploited.
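A short Python sketch of the zero-sum exploitability above for a two-player matrix game (the payoffs are the illustrative ones from the defense table; the function and variable names are my own): at the 50/50 mixed strategy it is zero, and any other strategy is exploitable.

```python
import numpy as np

def exploitability(U1, U2, p1, p2):
    """Zero-sum exploitability: sum_i max_{pi_i'} U_i(pi_i', pi_{-i}).

    U1, U2 : payoff matrices (rows = player 1's actions, columns = player 2's actions)
    p1, p2 : mixed strategies as probability vectors
    A best response can always be taken to be a pure action, so the max is over rows/columns.
    """
    return np.max(U1 @ p2) + np.max(p1 @ U2)

# Matching-pennies-style defense game from the table above (row player = attacker).
U_att = np.array([[-1.0, 1.0], [1.0, -1.0]])
U_def = -U_att                                    # zero sum
uniform = np.array([0.5, 0.5])
print(exploitability(U_att, U_def, uniform, uniform))               # 0.0  (mixed Nash)
print(exploitability(U_att, U_def, np.array([0.6, 0.4]), uniform))  # ~0.2 (exploitable)
```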
Strategy (\(\pi_i\)): probability distribution over actions
Image: Russell & Norvig, Artificial Intelligence: A Modern Approach
[Figure: game tree with card deals (P1: A or K; P2: A or K)]
[Becker & Sunberg, In prep. for AAMAS '25]
Regret Matching
(External Sampling Counterfactual Regret Minimization)
Incentive to deviate makes a policy suboptimal
For a single agent:
For multiple agents:
Regret Matching: average regret bounds the deviation incentive
[Becker & Sunberg, in prep for AAMAS '25]
True game \(A\):

|                | \(\sigma^1_1\) | \(\sigma^1_2\) | \(\ldots\) |
| \(\sigma^2_1\) | -1, -1         | -10, 0         |            |
| \(\sigma^2_2\) | 0, -10         | -5, -5         |            |
| \(\vdots\)     |                |                |            |

Approximate game \(\hat{A}\):

|                | \(\sigma^1_1\) | \(\sigma^1_2\) | \(\ldots\) |
| \(\sigma^2_1\) | -1.01, -1.20   | -9.82, 0.12    |            |
| \(\sigma^2_2\) | -0.10, -10.5   | -4.89, -5.02   |            |
| \(\vdots\)     |                |                |            |
Incentive to deviate:
Incentive to deviate in approximate game \(\hat{A}\)
Maximum value approximation error (\(E^i = A^i - \hat{A}^i\))
Funding orgs: (all opinions are my own)
VADeR
[Deglurkar, Lim, Sunberg, & Tomlin, 2023]