Hypothesis-Driven and Game-Theoretic Sensor Tasking
SURI IRT 3 Year 1 Report
Assistant Professor Zachary Sunberg
University of Colorado Boulder
Fall, 2024
PI: Prof. Zachary Sunberg
PhD Students: Tyler Becker, Doctoral Student
Postdoc: Dr. Ofer Dagan, Postdoctoral Associate
Deliverable: 2 Papers Submitted to AAMAS
Object Of Interest (OOI)
Example Hypotheses:
Given an existing catalog maintenance plan and a set of hypotheses, how do we task sensors to gather data that distinguishes between the hypotheses with minimal disruption?
\(X_{ijt}\): Observer \(j\) measures object \(i\) at time \(t\)
\(O_{ijt}\): Observability of object \(i\) from observer \(j\) at time \(t\)
\(r\): Number of measurements of the least-measured object
Optimal control in the information space
Solve with Monte Carlo Tree Search
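For illustration only (not the project's formulation, which plans in the information space with MCTS), a greedy sketch of the decision variables: choose \(X_{ijt}\) subject to observability \(O_{ijt}\) so that \(r\), the number of measurements of the least-measured object, grows as quickly as possible. The array layout O[i, j, t] is an assumption.

function greedy_tasking(O::Array{Bool,3})
    n_obj, n_obs, n_t = size(O)
    counts = zeros(Int, n_obj)                  # measurements per object
    X = falses(n_obj, n_obs, n_t)               # X[i,j,t]: observer j measures object i at time t
    for t in 1:n_t, j in 1:n_obs
        visible = [i for i in 1:n_obj if O[i, j, t]]
        isempty(visible) && continue
        i = argmin(i -> counts[i], visible)     # task the least-measured visible object
        X[i, j, t] = true
        counts[i] += 1
    end
    return X, minimum(counts)                   # second output is r
end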
Website: https://www.cu-adcl.org/SURI/
Fulfills "Basic Tasking Algorithm Implementation" deliverable requirement
Markov Decision Process (MDP)
Aleatory
\[\underset{\pi:\, \mathcal{S} \to \mathcal{A}}{\text{maximize}} \quad \text{E}\left[ \sum_{t=0}^\infty \gamma^t R(s_t, a_t) \right]\]
\(s = (x, y, z, \dot{x}, \dot{y}, \dot{z})\)
\(\mathcal{S} = \mathbb{R}^6\)
Additional state variables:
Partially Observable Markov Decision Process (POMDP)
\(s = (x, y, z, \dot{x}, \dot{y}, \dot{z})\)
\(\mathcal{S} = \mathbb{R}^6\)
Additional state variables:
Belief MDP
\(s = (x, y, z, \dot{x}, \dot{y}, \dot{z})\)
\(\mathcal{S} = \mathbb{R}^6\)
Additional state variables:
Two choices for info-gathering reward:
1. Minimize entropy
2. Minimize time to decision
[Figure: resolution time and entropy results]
[Dagan, Becker & Sunberg, AAMAS '25 (under review)]
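A minimal sketch of how the two reward choices above can be scored over a discrete belief \(b\) about the hypotheses (function names and the 0.95 threshold are illustrative assumptions, not the paper's definitions):

# 1. Entropy-based reward: negative entropy of the hypothesis belief,
#    so maximizing reward minimizes entropy.
entropy_reward(b) = sum(p -> p > 0 ? p * log(p) : 0.0, b)

# 2. Time-to-decision reward: a -1 step penalty until some hypothesis
#    exceeds a confidence threshold and the question is resolved.
resolved(b; threshold = 0.95) = maximum(b) >= threshold
time_penalty(b) = resolved(b) ? 0.0 : -1.0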
Natural Language Hypothesis
"The satellite deployed a 1m\(^2\) solar panel"
Code that simulates all effects of panel
(drag, change in natural frequency, etc.)
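As one concrete example of such an effect, a minimal sketch (our assumptions, not the generated hypothesis code) of the added acceleration from drag on a 1 m\(^2\) panel using the standard cannonball drag model:

using LinearAlgebra

# Drag acceleration from a deployed panel; Cd, area, and mass are illustrative.
function panel_drag_accel(v_rel, rho; Cd = 2.2, area = 1.0, mass = 100.0)
    # v_rel: velocity relative to the atmosphere [m/s], rho: density [kg/m^3]
    return -0.5 * rho * Cd * (area / mass) * norm(v_rel) * v_rel
end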
Agent descriptor
RESEARCH_AGENT_DESCRIPTION =
"""You are a helpful AI assistant, collaborating with other assistants.\n
The task at hand is {task}\n
Use the provided tools to progress towards fulfilling the task\n
If you are unable to fully answer, that's OK, another assistant with different tools \n
will help where you left off. Make as much progress as you can, but DO NOT WRITE CODE!\n
When you receive code from a code-generating agent, examine it carefully to check whether
it answers the full task instruction;
if it doesn't, reflect on the missing parts and return it to the code_generator agent.
If you or any of the other assistants have the final answer or deliverable,
prefix your response with REQUEST FEEDBACK so the team knows to wait for user input.\n
You have access to the following tools: {tool_names}.
Before calling a tool, ask yourself if you already know the answer to what you are looking for;
if the answer is yes, don't call the tool. \n{system_message}"""
General prompt:
Add a Dubins car model to the code given in the prompt.
The model needs to be simple, with only x,y derivatives.
Once you code it up, finish.
Followed by the code
# The Dubins car dynamic model in x-y plane is described
# by the following differential equations:
# \[
# \begin{align*}
# \dot{x} &= v \cdot \cos(\theta)
# \\
# \dot{y} &= v \cdot \sin(\theta)
# \end{align*}
# \]
module gptMultiDynamicsFunction

using ..MDHPOMDP
using StaticArrays

export dubins_car_dynamics

struct dubins_car_dynamics <: Function
    v::Float64
    theta::Float64
end

function (dyn::dubins_car_dynamics)(pos::Vec2)
    x = pos[1]
    y = pos[2]
    return SVector(dyn.v * cos(dyn.theta), dyn.v * sin(dyn.theta))
end

end # module gptMultiDynamicsFunction
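A hypothetical usage sketch, with the module above in scope and assuming MDHPOMDP's Vec2 is a 2-D static vector type such as SVector{2,Float64}:

using StaticArrays
dyn = dubins_car_dynamics(1.0, π / 4)   # constant speed 1.0, heading π/4
xdot = dyn(SVector(0.0, 0.0))           # returns SVector(v*cos(θ), v*sin(θ))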
What if an adversary is actively trying to deceive you?
Evader strategy:
Move away from pursuer
Embedded in \(T(s' \mid s, a)\)
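To make the idea concrete, a sketch of how a "move away from the pursuer" evader strategy can be baked into the transition function; the dynamics, noise model, and names are illustrative assumptions:

using StaticArrays, LinearAlgebra, Random

# T(s' | s, a): pursuer moves by its action; evader moves away from the pursuer
# at fixed speed with small Gaussian noise.
function pursuit_transition(s, a, rng; evader_speed = 1.0, σ = 0.1)
    pursuer = s.pursuer + a
    away = s.evader - pursuer
    dir = away / max(norm(away), 1e-6)          # unit vector away from pursuer
    evader = s.evader + evader_speed * dir + σ * SVector(randn(rng), randn(rng))
    return (pursuer = pursuer, evader = evader)
end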
Image: Russell & Norvig, Artificial Intelligence: A Modern Approach
[Game tree diagram: private card deals, P1 holds A or K, P2 holds A or K]
Regret Matching
(External Sampling Counterfactual Regret Minimization)
[Becker & Sunberg, AAMAS '25 (under review)]
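For reference, a minimal sketch of the regret-matching rule that underlies counterfactual regret minimization (standard algorithm; this is illustrative code, not the implementation under review):

# Current strategy: probabilities proportional to positive cumulative regret.
function regret_matching_strategy(regrets::Vector{Float64})
    pos = max.(regrets, 0.0)
    s = sum(pos)
    return s > 0 ? pos ./ s : fill(1.0 / length(regrets), length(regrets))
end

# After playing σ and observing each action's utility u, accumulate regret.
function update_regrets!(regrets::Vector{Float64}, σ::Vector{Float64}, u::Vector{Float64})
    expected = sum(σ .* u)
    regrets .+= u .- expected
    return regrets
end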
Incentive to deviate:
|                | \(\sigma^2_1\) | \(\sigma^2_2\) | \(\ldots\) |
| \(\sigma^1_1\) | -1, -1         | -10, 0         |            |
| \(\sigma^1_2\) | 0, -10         | -5, -5         |            |
| \(\vdots\)     |                |                |            |
|                | \(\sigma^2_1\) | \(\sigma^2_2\) | \(\ldots\) |
| \(\sigma^1_1\) | -1.01, -1.20   | -9.82, 0.12    |            |
| \(\sigma^1_2\) | -0.10, -10.5   | -4.89, -5.02   |            |
| \(\vdots\)     |                |                |            |
Incentive to deviate in approximate game \(\hat{A}\)
Maximum value approximation error (\(E^i = A^i - \hat{A}^i \))
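A sketch of how the row player's incentive to deviate can be computed in a bimatrix game, and how a payoff approximation error propagates to it (illustrative code and notation, not the paper's implementation):

using LinearAlgebra

# Incentive to deviate for player 1 with payoff matrix A1 (rows = player 1
# pure strategies) under the mixed profile (σ1, σ2).
function deviation_incentive(A1::Matrix{Float64}, σ1::Vector{Float64}, σ2::Vector{Float64})
    pure_payoffs = A1 * σ2              # payoff of each pure row strategy vs σ2
    return maximum(pure_payoffs) - dot(σ1, pure_payoffs)
end
# If every entry of an approximate payoff matrix is within E of the true one,
# the incentive computed in the approximate game is within 2E of the true value.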
Challenges
Dr. Ofer Dagan
Postdoctoral Associate
Tyler Becker
Doctoral Student
Regret Matching
Average regret bounds deviation incentive
Incentive to deviate makes a policy suboptimal
For a single agent:
For multiple agents:
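One standard way to write these quantities (our notation, offered as a sketch):
\[\text{single agent:}\quad \max_{\pi'} V^{\pi'}(b_0) - V^{\pi}(b_0)\]
\[\text{multiple agents:}\quad \delta^i(\sigma) = \max_{\hat{\sigma}^i} U^i(\hat{\sigma}^i, \sigma^{-i}) - U^i(\sigma)\]
Regret matching keeps the average regret small, which in turn bounds the incentive to deviate.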
Aleatory: MDP
Epistemic (Static): RL
Epistemic (Dynamic): POMDP
Interaction: Game
Videos: Eric Frew
Markov Decision Process (MDP)
Aleatory
\([x, y, z,\;\; \phi, \theta, \psi,\;\; u, v, w,\;\; p,q,r]\)
\(\mathcal{S} = \mathbb{R}^{12}\)
\(\mathcal{S} = \mathbb{R}^{12} \times \mathbb{R}^\infty\)
\[\underset{\pi:\, \mathcal{S} \to \mathcal{A}}{\text{maximize}} \quad \text{E}\left[ \sum_{t=0}^\infty R(s_t, a_t) \right]\]
[Diagram: environment and belief updater over one timestep; true state \(s = 7\), observation \(o = -0.21\), belief \(b\)]
\[b_t(s) = P\left(s_t = s \mid b_0, a_0, o_1 \ldots a_{t-1}, o_{t}\right)\]
\[ = P\left(s_t = s \mid b_{t-1}, a_{t-1}, o_{t}\right)\]
\[b_{t+1}(s') \propto Z(o \mid a, s')\int_{s \in \mathcal{S}} T(s' \mid s, a) b_t(s) ds\]
\(O(|\mathcal{S}|^2)\) for finite \(\mathcal{S}\)
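A minimal sketch of this exact update for finite \(\mathcal{S}\) (the array layouts T[sp, s, a] and Z[o, a, sp] are assumptions):

# Exact discrete belief update: O(|S|^2) per step.
function update_belief(b::Vector{Float64}, a::Int, o::Int, T::Array{Float64,3}, Z::Array{Float64,3})
    nS = length(b)
    bp = zeros(nS)
    for sp in 1:nS
        pred = sum(T[sp, s, a] * b[s] for s in 1:nS)   # prediction step
        bp[sp] = Z[o, a, sp] * pred                    # measurement update
    end
    return bp ./ sum(bp)                               # normalize
end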
[Diagram: environment, belief updater, and planner over one timestep; the planner computes \(Q(b, a)\) from the belief \(b\) (defined as above) and selects action \(a = +10\); true state \(s = 7\), observation \(o = -0.21\)]
Tree size: \(O\left(\left(|A||O|\right)^D\right)\)
\(d\) dimensions, \(k\) segments \(\,\rightarrow \, |S| = k^d\)
1 dimension
e.g. \(s = x \in S = \{1,2,3,4,5\}\)
\(|S| = 5\)
2 dimensions
e.g. \(s = (x,y) \in S = \{1,2,3,4,5\}^2\)
\(|S| = 25\)
3 dimensions
e.g. \(s = (x,y,x_h) \in S = \{1,2,3,4,5\}^3\)
\(|S| = 125\)
(Discretize each dimension into 5 segments)
Find \(\underset{s\sim b}{E}[f(s)]\)
\[=\sum_{s \in S} f(s) b(s)\]
Monte Carlo Integration
\(Q_N \equiv \frac{1}{N} \sum_{i=1}^N f(s_i)\)
\(s_i \sim b\) i.i.d.
\(\text{Var}(Q_N) = \text{Var}\left(\frac{1}{N} \sum_{i=1}^N f(s_i)\right)\)
\(= \frac{1}{N^2} \sum_{i=1}^N\text{Var}\left(f(s_i)\right)\) (Bienaymé, since the \(s_i\) are independent)
\(= \frac{1}{N} \text{Var}\left(f(s_i)\right)\)
\[P(|Q_N - E[f(s_i)]| \geq \epsilon) \leq \frac{\text{Var}(f(s_i))}{N \epsilon^2} \quad \text{(Chebyshev)}\]
Curse of dimensionality!
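For intuition, a small numerical sketch (the integrand, dimension, and sample size are illustrative): the Monte Carlo error depends on \(\text{Var}(f)/N\) with no \(k^d\) term, in contrast to the exponential growth of a discretized sum.

using Statistics, Random

rng = MersenneTwister(1)
d, N = 10, 10_000                              # 10-D state, 10^4 samples
f(s) = sum(abs2, s)                            # E[f(s)] = d for s ~ N(0, I_d)
Q_N = mean(f(randn(rng, d)) for _ in 1:N)      # Monte Carlo estimate of E[f(s)]
# |Q_N - d| is typically about sqrt(2d/N) ≈ 0.045 here, while a 5-segment grid
# in 10 dimensions would already have 5^10 ≈ 9.8 million cells.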
\[b(s) \approx \sum_{i=1}^N \delta_{s}(s_i)\; w_i\]
\[b_{t+1}(s') \propto Z(o \mid a, s')\int_{s \in \mathcal{S}} T(s' \mid s, a) b_t(s)\, ds\]
\(\implies\) Sample \(s'_i\) from \(T(s' | s_i, a)\),
\(w'_i \propto w_i \times Z(o \mid a, s'_i)\)
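A minimal sketch of this particle (importance-sampling) update; transition_sample and obs_likelihood stand in for problem-specific \(T\) and \(Z\):

# Propagate each particle through T and reweight by the observation likelihood.
function particle_update(particles, weights, a, o, rng; transition_sample, obs_likelihood)
    new_particles = [transition_sample(s, a, rng) for s in particles]        # s'_i ~ T(· | s_i, a)
    new_weights = [w * obs_likelihood(o, a, sp) for (w, sp) in zip(weights, new_particles)]
    return new_particles, new_weights ./ sum(new_weights)                    # normalized w'_i
end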
POMDP Formulation
\(s=\left(x, y, \dot{x}, \left\{(x_c,y_c,\dot{x}_c,l_c,\theta_c)\right\}_{c=1}^{n}\right)\): ego external state, external states of other cars, and internal states of other cars (\(\theta_c\))
\(o=\left\{(x_c,y_c,\dot{x}_c,l_c)\right\}_{c=1}^{n}\): external states of other cars
\(a = (\ddot{x}, \dot{y})\), \(\ddot{x} \in \{0, \pm 1 \text{ m/s}^2\}\), \(\dot{y} \in \{0, \pm 0.67 \text{ m/s}\}\)
Simulation results
[Plots: efficiency and safety for MDP trained on normal drivers, MDP trained on all drivers, Omniscient, and POMCPOW (ours)]
[Sunberg & Kochenderfer, T-ITS 2023]
Convergence?
\(\mathcal{P}\): state distribution conditioned on observations (belief)
\(\mathcal{Q}\): marginal state distribution (proposal)
[Tree diagram: expand all actions (\(\left|\mathcal{A}\right| = 2\) in this case); rather than expanding all \(\left|\mathcal{S}\right|\) next states, expand only \(C = 3\) sampled states]
For any \(\epsilon>0\) and \(\delta>0\), if \(C\) (the number of particles) is high enough,
\[|Q_{\mathbf{P}}^*(b,a) - Q_{\mathbf{M}_{\mathbf{P}}}^*(\bar{b},a)| \leq \epsilon \quad \text{w.p. } 1-\delta\]
[Lim, Becker, Kochenderfer, Tomlin, & Sunberg, JAIR 2023]
No direct dependence on \(|\mathcal{S}|\) or \(|\mathcal{O}|\)!
The required \(C\) is too large to give direct safety guarantees, but in practice the approach works extremely well for improving efficiency.
[Lim, Becker, Kochenderfer, Tomlin, & Sunberg, JAIR 2023]
Tree size: \(O\left(\left(|A|C\right)^D\right)\)
Solve simplified surrogate problem for policy deep in the tree
[Lim, Tomlin, and Sunberg, 2021]
[Deglurkar, Lim, Sunberg, & Tomlin, 2023]