Himanshu Gupta
Zachary Sunberg
University of Colorado Boulder
Types of Uncertainty
OUTCOME
MODEL
STATE
Markov Model
Markov Decision Process (MDP)
Solving MDPs - The Value Function
$$V^*(s) = \underset{a\in\mathcal{A}}{\max} \left\{R(s, a) + \gamma E\Big[V^*\left(s_{t+1}\right) \mid s_t=s, a_t=a\Big]\right\}$$
Involves all future time
Involves only \(t\) and \(t+1\)
$$\underset{\pi:\, \mathcal{S}\to\mathcal{A}}{\mathop{\text{maximize}}} \, V^\pi(s) = E\left[\sum_{t=0}^{\infty} \gamma^t R(s_t, \pi(s_t)) \bigm| s_0 = s \right]$$
$$Q(s,a) = R(s, a) + \gamma E\Big[V^* (s_{t+1}) \mid s_t = s, a_t=a\Big]$$
Value = expected sum of future rewards
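A minimal value-iteration sketch that implements the Bellman recursion above for a small tabular MDP (the transition model `T`, reward matrix `R`, and discount `γ` here are illustrative placeholders, not from the slides):

```julia
# Minimal value-iteration sketch for a tabular MDP (illustrative placeholders only).
# T[a][s, s′] = P(s′ | s, a), R[s, a] = immediate reward, γ = discount factor.
function value_iteration(T, R, γ; tol=1e-6)
    nS, nA = size(R)
    V = zeros(nS)
    while true
        # Q(s, a) = R(s, a) + γ E[V(s′) | s, a]
        Q = [R[s, a] + γ * sum(T[a][s, s′] * V[s′] for s′ in 1:nS) for s in 1:nS, a in 1:nA]
        Vnew = vec(maximum(Q, dims=2))   # V(s) = max_a Q(s, a)
        maximum(abs.(Vnew .- V)) < tol && return Vnew
        V = Vnew
    end
end
```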
Online Decision Process Tree Approaches
Time
Estimate \(Q(s, a)\) based on children
$$Q(s,a) = R(s, a) + \gamma E\Big[V^* (s_{t+1}) \mid s_t = s, a_t=a\Big]$$
\[V(s) = \max_a Q(s,a)\]
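A sketch of estimating \(Q(s,a)\) from sampled children in a depth-limited tree, as in the online approaches above; `step` and `acts` are placeholder names for a generative model and the action set:

```julia
# Depth-limited sketch: estimate Q(s, a) from sampled child nodes.
# step(s, a) is a placeholder generative model returning (s′, r); acts is the action set.
function estimate_Q(step, acts, s, a, γ, depth; n=20)
    depth == 0 && return 0.0
    total = 0.0
    for _ in 1:n
        s′, r = step(s, a)                                               # sample a child state
        V′ = maximum(estimate_Q(step, acts, s′, a′, γ, depth-1; n=n) for a′ in acts)
        total += r + γ * V′                                              # r + γ V(s′)
    end
    return total / n                                                     # Q(s, a) ≈ sample mean
end
```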
$$Q(b,a) = \rho(b, a) + \gamma \mathop{\mathbb{E}}_{b' \sim \mathbb{B}}\Big[V(b') \mid b, a\Big]$$
\[V(b) = \max_{a \in A} Q(b,a)\]
$$Q(b,a) = \rho(b, a) + \gamma \sum_{o \in \mathbb{O}} P(o \mid b,a)\, V\big(\tau(b,a,o)\big)$$
$$Q(b,a) = \rho(b, a) + \gamma \mathop{\mathbb{E}}_{o \sim Z(b,a)}\Big[V\big(\tau(b,a,o)\big)\Big]$$
$$\rho(b,a) = -H(b) + R(b,a) $$
\[V(b) = \max_{a \in A} Q(b,a)\]
$$\rho(b,a) = -H(b) + R(b,a) $$
\[V(b) = \max_{a \in A} Q(b,a)\]
$$\rho(b,a) = -H(b) + R(b,a) $$
$$Q(b,a) = \rho(b, a) + \gamma \mathop{\mathbb{E}}_{o \sim Z(b,a)}\Big[V\big(\tau(b,a,o)\big)\Big]$$
\[V(b) = \max_{a \in A} Q(b,a)\]
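A short sketch of the belief-dependent reward \(\rho(b,a) = -H(b) + R(b,a)\) above for a discrete belief vector; `R(s, a)` is a placeholder state reward:

```julia
# ρ(b, a) = -H(b) + R(b, a) for a discrete belief vector b (probabilities over states).
entropy(b) = -sum(p * log(p) for p in b if p > 0)                      # H(b)
ρ(b, a, R) = -entropy(b) + sum(b[s] * R(s, a) for s in eachindex(b))   # R(b,a) = E_{s∼b}[R(s,a)]
```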
Partially Observable Markov Decision Process (POMDP)
State
Timestep
Accurate Observations
Goal: \(a=0\) at \(s=0\)
Optimal Policy
Localize
\(a=0\)
Environment
Belief Updater
Policy
\(b\)
\(a\)
\[b_t(s) = P\left(s_t = s \mid a_1, o_1, \ldots, a_{t-1}, o_{t-1}\right)\]
True State
\(s = 7\)
Observation \(o = -0.21\)
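A discrete Bayes-filter sketch of the belief update \(b' = \tau(b, a, o)\) performed by the belief updater above; `T` and `Z` are placeholder transition and observation models:

```julia
# Bayes-filter sketch of b′ = τ(b, a, o) for a discrete state space.
# T[s, a, s′] = P(s′ | s, a) and Z(o, a, s′) = P(o | a, s′) are placeholder models.
function update_belief(b, a, o, T, Z)
    nS = length(b)
    b′ = [Z(o, a, s′) * sum(T[s, a, s′] * b[s] for s in 1:nS) for s′ in 1:nS]
    return b′ ./ sum(b′)   # normalize so b′ is a probability distribution
end
```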
OSSE is similar to a 1-step POMDP
[1] Christos H. Papadimitriou and John N. Tsitsiklis. 1987. The Complexity of Markov Decision Processes. Mathematics of Operations Research 12, 3 (1987), 441–450.
Action Nodes
Belief Nodes
[Ross, 2008] [Silver, 2010]
*(Partially Observable Monte Carlo Planning)
PROPOSED SOLUTION: Both of these problems can be addressed by seeding the search with an initial set of "sensitive" regions to explore, which can be obtained from ESA.
\(E \sum R(s, a)\)
\(O(|A|^D |Z|^D)\)
\(O\)
\(O(|A|^D |Z|^D)\)
OSSE is just a 1-step POMDP
\(b_0\)
\(b_1\)
\(b_2\)
Consider MART data
Don't Consider MART data
Low \(\rho\)
High \(\rho\)
PROPOSED SOLUTION: Both of these problems can be addressed by seeding the search with an initial set of "sensitive" regions to explore, which can be obtained from ESA.
SR1
SR2
SR3
SR4
SR3
UAV
SR1
SR2
SR3
SR4
SR3
UAV
Autorotation
Driving
POMDPs
POMCPOW
POMDPs.jl
Future
Environment
Belief Updater
Policy
\(o\)
\(b\)
\(a\)
[Ross, 2008] [Silver, 2010]
*(Partially Observable Monte Carlo Planning)
[ ] An infinite number of child nodes must be visited
[ ] Each node must be visited an infinite number of times
Solving continuous POMDPs - POMCP fails
POMCP
Double Progressive Widening (DPW): Gradually grow the tree by limiting the number of children to \(k N^\alpha\)
Necessary Conditions for Consistency
[Coutoux, 2011]
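A sketch of the DPW rule above: a node visited \(N\) times may have at most \(kN^\alpha\) children, so new children are added only occasionally; `sample_new_child` and `select_existing_child` are hypothetical placeholders for the solver's logic:

```julia
# Progressive-widening sketch: limit a node's children to k N^α.
# sample_new_child and select_existing_child are hypothetical placeholders.
function pw_child(children, N; k=10.0, α=0.5, sample_new_child, select_existing_child)
    if length(children) < k * N^α
        c = sample_new_child()                  # widen: create a brand-new child
        push!(children, c)
        return c
    end
    return select_existing_child(children)      # otherwise revisit an existing child
end
```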
POMCP
POMCP-DPW
[Sunberg, 2018]
\[\underset{\pi: \mathcal{B} \to \mathcal{A}}{\mathop{\text{maximize}}} \, V^\pi(b)\]
\[\underset{a \in \mathcal{A}}{\mathop{\text{maximize}}} \, \underset{s \sim b}{E}\Big[Q_{MDP}(s, a)\Big]\]
Same as full observability on the next step
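A sketch of the QMDP approximation above for a discrete belief: score each action by \(E_{s \sim b}[Q_{MDP}(s,a)]\) and take the best; the `Q` table stands in for a fully observable MDP solution:

```julia
# QMDP sketch: act as if the state becomes fully observable after one step.
# b is a discrete belief vector; Q[s, a] holds the fully observable MDP's Q-values.
function qmdp_action(b, Q, acts)
    scores = [sum(b[s] * Q[s, a] for s in eachindex(b)) for a in acts]  # E_{s∼b}[Q_MDP(s,a)]
    return acts[argmax(scores)]
end
```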
POMCP-DPW converges to QMDP
Proof Outline:
Observation space is continuous with finite density → w.p. 1, no two trajectories have matching observations
(1) → One state particle in each belief, so each belief is merely an alias for that state
(2) → POMCP-DPW = MCTS-DPW applied to fully observable MDP + root belief state
Solving this MDP is equivalent to finding the QMDP solution → POMCP-DPW converges to QMDP
[Sunberg, 2018]
POMCP-DPW
[ ] An infinite number of child nodes must be visited
[ ] Each node must be visited an infinite number of times
[ ] An infinite number of particles must be added to each belief node
Necessary Conditions for Consistency
Use \(Z\) to insert weighted particles
[Sunberg, 2018]
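A sketch of the weighted-particle idea above: when a simulated state \(s'\) reaches a belief node, it is inserted with weight proportional to the observation density \(Z(o \mid a, s')\). The `WeightedBelief` container is an illustrative structure, not POMCPOW's internal one:

```julia
using StatsBase: sample, Weights

struct WeightedBelief{S}          # illustrative container, not POMCPOW's internal type
    particles::Vector{S}
    weights::Vector{Float64}
end

# Insert a simulated state s′ weighted by the observation density Z(o | a, s′).
function insert_weighted_particle!(b::WeightedBelief, s′, o, a, Z)
    push!(b.particles, s′)
    push!(b.weights, Z(o, a, s′))
    return b
end

# Rollouts can then resample states in proportion to these weights.
sample_state(b::WeightedBelief) = sample(b.particles, Weights(b.weights))
```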
POMCP
POMCP-DPW
POMCPOW
[Sunberg, 2018]
Ours
Suboptimal
State of the Art
Discretized
[Ye, 2017] [Sunberg, 2018]
[Sunberg, 2018]
Ours
Suboptimal
State of the Art
Discretized
[Sunberg, 2018]
Ours
Suboptimal
State of the Art
Discretized
[Sunberg, 2018]
Ours
Suboptimal
State of the Art
Discretized
[Sunberg, 2018]
Ours
Suboptimal
State of the Art
Discretized
[Sunberg, 2018]
Autorotation
Driving
POMDPs
POMCPOW
POMDPs.jl
Future
POMDPs.jl - An interface for defining and solving MDPs and POMDPs in Julia
[Egorov, Sunberg, et al., 2017]
Celeste Project
1.54 Petaflops
Explicit
Black Box
("Generative" in POMDP lit.)
\(s,a\)
\(s', o, r\)
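A minimal sketch of a black-box ("generative") model in POMDPs.jl: given \((s, a)\), return a sampled \((s', o, r)\). Exact method names and signatures vary across POMDPs.jl versions, so treat this as illustrative rather than definitive:

```julia
# Illustrative black-box ("generative") POMDP in POMDPs.jl; signatures may vary by version.
using POMDPs, Random

struct TinyPOMDP <: POMDP{Float64, Int, Float64} end   # state, action, observation types

POMDPs.discount(::TinyPOMDP) = 0.95

# Given (s, a), return a sampled next state, observation, and reward.
function POMDPs.gen(::TinyPOMDP, s, a, rng::AbstractRNG)
    sp = s + a + 0.1 * randn(rng)   # next state
    o  = sp + 0.5 * randn(rng)      # noisy observation of the next state
    r  = -abs(sp)                   # reward for staying near the origin
    return (sp=sp, o=o, r=r)
end
```

With a model like this, an online solver such as POMCPOW can in principle be constructed with `POMCPOWSolver()` and applied via `solve(solver, pomdp)`, again assuming the current package API.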
Previous C++ framework: APPL
"At the moment, the three packages are independent. Maybe one day they will be merged in a single coherent framework."
[Egorov, Sunberg, et al., 2017]
Autorotation
Driving
POMDPs
POMCPOW
POMDPs.jl
Future
Deploying autonomous agents with confidence
Practical Safety Guarantees
Trusting Visual Sensors
Algorithms for Physical Problems
Physical Vehicles
Environment
Belief State
Convolutional Neural Network
Control System
Architecture for Safety Assurance
1. Continuous multi-dimensional action spaces
2. Data-driven models on modern parallel hardware
CPU Image By Eric Gaba, Wikimedia Commons user Sting, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=68125990
3. Better algorithms for existing POMDPs
Texas A&M (HUSL)
RC Car with assured visual sensing
Optimized Autorotation
Active Sensing
Project-Centric
1. Intro to Probabilistic Models
2. Markov Decision Processes
3. Reinforcement Learning
4. POMDPs
(More focus on online POMDP solutions than Stanford course)
The content of my research reflects my opinions and conclusions, and is not necessarily endorsed by my funding organizations.
POMDP Formulation
\(s=\left(x, y, \dot{x}, \left\{(x_c,y_c,\dot{x}_c,l_c,\theta_c)\right\}_{c=1}^{n}\right)\)
\(o=\left\{(x_c,y_c,\dot{x}_c,l_c)\right\}_{c=1}^{n}\)
\(a = (\ddot{x}, \dot{y})\), \(\ddot{x} \in \{0, \pm 1 \text{ m/s}^2\}\), \(\dot{y} \in \{0, \pm 0.67 \text{ m/s}\}\)
Ego physical state
Physical states of other cars
Internal states of other cars
Physical states of other cars
Efficiency
Safety
$$R(s, a, s') = \text{in\_goal}(s') - \lambda \left(\text{any\_hard\_brakes}(s, s') + \text{any\_too\_slow}(s')\right)$$
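A minimal sketch of the reward above; `in_goal`, `any_hard_brakes`, and `any_too_slow` are hypothetical indicator functions (returning 0 or 1) passed in rather than defined here:

```julia
# Minimal sketch of R(s, a, s′) above; the indicator functions are hypothetical placeholders.
driving_reward(s, a, s′; in_goal, any_hard_brakes, any_too_slow, λ=1.0) =
    in_goal(s′) - λ * (any_hard_brakes(s, s′) + any_too_slow(s′))
```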
\(s=\left(x, y, \dot{x}, \left\{(x_c,y_c,\dot{x}_c,l_c,\theta_c)\right\}_{c=1}^{n}\right)\)
\(o=\left\{(x_c,y_c,\dot{x}_c,l_c)\right\}_{c=1}^{n}\)
\(a = (\ddot{x}, \dot{y})\)
Ego physical state
Physical states of other cars
Internal states of other cars
Physical states
Efficiency
Safety
\( - \lambda \left(\text{any\_hard\_brakes}(s, s') + \text{any\_too\_slow}(s')\right)\)
\(R(s, a, s') = \text{in\_goal}(s')\)
[Sunberg, 2017]
"[The autonomous vehicle] performed perfectly, except when it had to merge onto I-395 South and swing across three lanes of traffic"
- Bloomberg
http://bloom.bg/1Qw8fjB
Monte Carlo Tree Search
Image by Dicksonlaw583 (CC 4.0)
Autorotation
Driving
POMDPs
POMCPOW
POMDPs.jl
Future
Environment
Belief Updater
Policy
\(o\)
\(b\)
\(a\)
\[b_t(s) = P\left(s_t = s \mid a_1, o_1, \ldots, a_{t-1}, o_{t-1}\right)\]
Online Decision Process Tree Approaches
State Node
Action Node
(Estimate \(Q(s,a)\) here)