Towards Tree Search in Partially Observable Stochastic Games
Assistant Professor Zachary Sunberg
University of Colorado Boulder
October 8th, 2025
PI: Prof. Zachary Sunberg
PhD Students
Postdoc
Aleatory
Epistemic (Static)
Epistemic (Dynamic)
Interaction
MDP
RL
POMDP
Game
Markov Decision Process (MDP)
Aleatory
\([x, y, z,\;\; \phi, \theta, \psi,\;\; u, v, w,\;\; p,q,r]\)
\(\mathcal{S} = \mathbb{R}^{12}\)
\(\mathcal{S} = \mathbb{R}^{12} \times \mathbb{R}^\infty\)
\[\underset{\pi:\, \mathcal{S} \to \mathcal{A}}{\text{maximize}} \quad \text{E}\left[ \sum_{t=0}^\infty R(s_t, a_t) \right]\]
Reinforcement Learning
Aleatory
Epistemic (Static)
\([x, y, z,\;\; \phi, \theta, \psi,\;\; u, v, w,\;\; p,q,r]\)
\(\mathcal{S} = \mathbb{R}^{12}\)
\(\mathcal{S} = \mathbb{R}^{12} \times \mathbb{R}^\infty\)
Partially Observable Markov Decision Process (POMDP)
Aleatory
Epistemic (Static)
Epistemic (Dynamic)
\([x, y, z,\;\; \phi, \theta, \psi,\;\; u, v, w,\;\; p,q,r]\)
\(\mathcal{S} = \mathbb{R}^{12}\)
\(\mathcal{S} = \mathbb{R}^{12} \times \mathbb{R}^\infty\)
Aleatory
Epistemic (Static)
Epistemic (Dynamic)
Interaction
State
Timestep
Accurate Observations
Goal: \(a=0\) at \(s=0\)
Optimal Policy
Localize
\(a=0\)
Environment
Belief Updater
Planner
\(a = +10\)
True State
\(s = 7\)
Observation \(o = -0.21\)
\(b\)
\[b_t(s) = P\left(s_t = s \mid b_0, a_0, o_1 \ldots a_{t-1}, o_{t}\right)\]
\[ = P\left(s_t = s \mid b_{t-1}, a_{t-1}, o_{t}\right)\]
\(Q(b, a)\)
\(O(|\mathcal{S}|^2)\)
Tree size: \(O\left(\left(|A||O|\right)^D\right)\)
\(d\) dimensions, \(k\) segments \(\,\rightarrow \, |S| = k^d\)
1 dimension
e.g. \(s = x \in S = \{1,2,3,4,5\}\)
\(|S| = 5\)
2 dimensions
e.g. \(s = (x,y) \in S = \{1,2,3,4,5\}^2\)
\(|S| = 25\)
3 dimensions
e.g. \(s = (x,y,x_h) \in S = \{1,2,3,4,5\}^3\)
\(|S| = 125\)
(Discretize each dimension into 5 segments)
\(x\)
\(y\)
\(x_h\)
Find \(\underset{s\sim b}{E}[f(s)]\)
\[=\sum_{s \in S} f(s) b(s)\]
Monte Carlo Integration
\(Q_N \equiv \frac{1}{N} \sum_{i=1}^N f(s_i)\)
\(s_i \sim b\) i.i.d.
\(\text{Var}(Q_N) = \text{Var}\left(\frac{1}{N} \sum_{i=1}^N f(s_i)\right)\)
\(= \frac{1}{N^2} \sum_{i=1}^N\text{Var}\left(f(s_i)\right)\)
\(= \frac{1}{N} \text{Var}\left(f(s_i)\right)\)
\[P(|Q_N - E[f(s_i)]| \geq \epsilon) \leq \frac{\text{Var}(f(s_i))}{N \epsilon^2}\]
(Bienayme)
(Chebyshev)
Curse of dimensionality!
Expand for all actions (\(\left|\mathcal{A}\right| = 2\) in this case)
...
Expand for all \(\left|\mathcal{S}\right|\) states
\(C=3\) states
\[b(s) \approx \sum_{i=1}^N \delta_{s}(s_i)\; w_i\]
[Sunberg and Kochenderfer, ICAPS 2018, T-ITS 2022]
How do we prove convergence?
\[|Q_{\mathbf{P}}^*(b,a) - Q_{\mathbf{M}_{\mathbf{P}}}^*(\bar{b},a)| \leq \epsilon \quad \text{w.p. } 1-\delta\]
For any \(\epsilon>0\) and \(\delta>0\), if \(C\) (number of particles) is high enough,
[Lim, Becker, Kochenderfer, Tomlin, & Sunberg, JAIR 2023]
No direct dependence on \(|\mathcal{S}|\) or \(|\mathcal{O}|\)!
\(C\) is too large for any direct safety guarantees. But, in practice, works extremely well for improving efficiency.
[Lim, Becker, Kochenderfer, Tomlin, & Sunberg, JAIR 2023]
Tree size: \(O\left(\left(|A|C\right)^D\right)\)
Solve simplified surrogate problem for policy deep in the tree
[Lim, Tomlin, and Sunberg, 2021]
Policy Network
Value Network
Alpha Zero
1. Exploration
2. Selection
20 steps of regret matching on \(\tilde{A}\)
3. Networks
Policy trained to match solution to \(\bar{A}\)
Value distribution trained on sim outcomes
[Becker & Sunberg, AMOS 2025]
Image: Russel & Norvig, AI, a modern approach
P1: A
P1: K
P2: A
P2: A
P2: K
POMDP
Extensive Form Game
Image: Russel & Norvig, AI, a modern approach
P1: A
P1: K
P2: A
P2: A
P2: K
Conditional Distribution InfoSet Tree (CDIT)
Read Martin Schmid's thesis! https://arxiv.org/pdf/2111.05884
Finding Policies for Zero-Sum Games
Counterfactual Action Utilities
Estimate regret through external sampling
Update policies with regret matching (ESCFR)
[Becker & Sunberg, AAMAS 2025]
[Becker & Sunberg, AAMAS 2025]
Will CDITs help solve for the best LaserTag evasion policy?
POMDP
POSG
Environment
Belief Updater
Planner
Environment
Planner
P2
What goes here?
Martin Schmid's thesis: https://arxiv.org/pdf/2111.05884
Funding orgs: (all opinions are my own)
VADeR
PI: Prof. Zachary Sunberg
PhD Students
Postdoc