Online Algorithms for POMDPs with Continuous State, Action, and Observation Spaces

Zachary Sunberg

June 28, 2018

Partially Observable Markov Decision Process (POMDP)

  • $\mathcal{S}$ - State space
  • $T:\mathcal{S}\times \mathcal{A} \times\mathcal{S} \to \mathbb{R}$ - Transition probability distribution
  • $\mathcal{A}$ - Action space
  • $R:\mathcal{S}\times \mathcal{A} \times\mathcal{S} \to \mathbb{R}$ - Reward
  • $\mathcal{O}$ - Observation space
  • $Z:\mathcal{S} \times \mathcal{A}\times \mathcal{S} \times \mathcal{O} \to \mathbb{R}$ - Observation probability distribution
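A minimal sketch (not from the talk) of how this tuple maps onto a generative model in plain Julia; the action set, dynamics, and noise values below are purely illustrative.

```julia
# A toy POMDP as a named tuple of sampling functions (illustrative only).
using Distributions

pomdp = (
    A = [-1, 0, 1],                               # action space 𝒜
    T = (s, a) -> s + a + rand([-1, 0, 1]),       # sample s' ~ T(⋅ | s, a)
    Z = (s, a, sp) -> rand(Normal(sp, 1.0)),      # sample o ~ Z(⋅ | s, a, s')
    R = (s, a, sp) -> -abs(sp),                   # reward R(s, a, s')
)

s  = 0
a  = rand(pomdp.A)
sp = pomdp.T(s, a)        # next state
o  = pomdp.Z(s, a, sp)    # observation generated from the next state
r  = pomdp.R(s, a, sp)
```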

Solving MDPs and POMDPs - Offline vs Online

OFFLINE: Est. Value at Every State

ONLINE: Sequential Decision Trees

Monte Carlo Tree Search

Image by Dicksonlaw583 (CC 4.0)

QMDP

Equivalent to assuming full observability on the next step

Will not take costly exploratory actions

$$Q_{MDP}(s,a)$$

$$Q_{MDP}(b, a) = \sum_{s \in \mathcal{S}}Q_{MDP}(s,a) b(s) \geq Q^*(b,a)$$

\[Q_\pi (b,a) = E \left[\sum_{t=0}^{\infty} \gamma^t R(s_t, a_t) \mid s_0 \sim b, a_0 = a \right]\]
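A hedged sketch of how the QMDP values above can be computed for a small discrete POMDP: value iteration on the fully observable MDP, followed by the belief-weighted sum. The array-indexed `T`, `R`, and the iteration count are assumptions of this sketch, not the talk's implementation.

```julia
# T[s, a, sp]: transition probability, R[s, a]: expected reward, γ: discount factor.
function qmdp_values(T, R, γ; iters=500)
    nS, nA = size(R)
    Q = zeros(nS, nA)
    for _ in 1:iters
        V = [maximum(Q[s, :]) for s in 1:nS]          # V(s) = max_a Q(s, a)
        for s in 1:nS, a in 1:nA
            Q[s, a] = R[s, a] + γ * sum(T[s, a, sp] * V[sp] for sp in 1:nS)
        end
    end
    return Q
end

# Q_MDP(b, a) = Σ_s b(s) Q_MDP(s, a); acting greedily on this ignores the value of
# information gathering, which is why QMDP will not take costly exploratory actions.
q_belief(Q, b) = [sum(b[s] * Q[s, a] for s in eachindex(b)) for a in 1:size(Q, 2)]
best_action(Q, b) = argmax(q_belief(Q, b))
```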

Monte Carlo Tree Search for POMDPs

  • POMCP uses simulations of histories instead of full belief updates

 

  • Each belief is implicitly represented by a collection of unweighted particles

 

Silver, David, and Joel Veness. "Monte-Carlo planning in large POMDPs." Advances in neural information processing systems. 2010.

Ross, Stéphane, et al. "Online planning algorithms for POMDPs." Journal of Artificial Intelligence Research 32 (2008): 663-704.
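The POMCP recursion described in the bullets above can be sketched as follows. The generative model `gen(s, a) -> (sp, o, r)`, the node layout, and the omission of the rollout step are simplifying assumptions of this sketch, not the authors' code.

```julia
mutable struct Node
    N::Int                       # visit count
    Q::Float64                   # value estimate (used at action nodes)
    children::Dict{Any, Node}    # keyed by action (history nodes) or observation (action nodes)
    particles::Vector{Any}       # unweighted state particles (history nodes only)
end
Node() = Node(0, 0.0, Dict{Any, Node}(), Any[])

function simulate!(h::Node, s, depth, gen, acts, γ, c)
    depth == 0 && return 0.0
    push!(h.particles, s)                            # grow this history's particle belief
    for a in acts                                    # make sure every action has a node
        haskey(h.children, a) || (h.children[a] = Node())
    end
    ucb(a) = h.children[a].Q + c * sqrt(log(h.N + 1) / (h.children[a].N + 1))
    a = acts[argmax([ucb(a) for a in acts])]         # UCB1 action selection
    ha = h.children[a]
    sp, o, r = gen(s, a)
    hao = get!(ha.children, o, Node())               # a new observation opens a new history node
    q = r + γ * simulate!(hao, sp, depth - 1, gen, acts, γ, c)
    h.N += 1; ha.N += 1
    ha.Q += (q - ha.Q) / ha.N                        # incremental mean of simulated returns
    return q
end
```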

Light-Dark Problem

\[\begin{aligned}
& \mathcal{S} = \mathbb{Z} \quad\quad\quad~~ \mathcal{O} = \mathbb{R} \\
& s' = s + a \quad\quad o \sim \mathcal{N}(s, |s - 10|) \\
& \mathcal{A} = \{-10, -1, 0, 1, 10\} \\
& R(s, a) = \begin{cases} 100 & \text{if } a = 0, s = 0 \\ -100 & \text{if } a = 0, s \neq 0 \\ -1 & \text{otherwise} \end{cases}
\end{aligned}\]
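A hedged sketch of a generative model for the problem as written above. Treating $a=0$ as a terminal guess, generating the observation from the post-transition state, and the small noise floor are assumptions of this sketch.

```julia
using Distributions

const LD_ACTIONS = [-10, -1, 0, 1, 10]

function lightdark_gen(s::Int, a::Int)
    if a == 0
        return s, 0.0, (s == 0 ? 100.0 : -100.0), true    # guess: ±100 reward, episode ends
    end
    sp = s + a
    o = rand(Normal(sp, abs(sp - 10) + 0.01))             # observations are accurate near s = 10
    return sp, o, -1.0, false                             # -1 step cost otherwise
end
```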

[Figure: state trajectory vs. timestep. Observations are accurate near the light region; the goal is to take $a=0$ at $s=0$. Optimal policy: localize first, then take $a=0$.]

 

[  ] An infinite number of child nodes must be visited

[  ] Each node must be visited an infinite number of times

Solving continuous POMDPs - POMCP fails

[1] A. Couëtoux, J.-B. Hoock, N. Sokolovska, O. Teytaud, and N. Bonnard. "Continuous Upper Confidence Trees." LION'11: Proceedings of the 5th International Conference on Learning and Intelligent Optimization, Italy, January 2011. <hal-00542673v2>

POMCP

Limit number of children to $k N^\alpha$
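The widening rule can be written in one line; the parameter values below are illustrative, and POMCP-DPW applies the same test at both the action and the observation branching.

```julia
# A node with N visits may only receive a new child while its child count is at most k·N^α.
allow_new_child(n_children, N; k = 10.0, α = 0.5) = n_children ≤ k * N^α
```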

Necessary Conditions for Consistency [1]

 

POMCP

POMCP-DPW

POMCP-DPW converges to QMDP

Proof Outline:

  1. The observation space is continuous → sampled observations are unique w.p. 1.

  2. A state particle can only be inserted into a belief node if the sampled observation exactly matches that node's observation.

  3. (1) and (2) → each belief node contains only one state particle, so each belief is merely an alias for that state.

  4. (3) → POMCP-DPW behaves like MCTS-DPW applied to the fully observable MDP (plus the root belief state).

  5. Solving this MDP is equivalent to finding the QMDP solution → POMCP-DPW converges to QMDP.







Sunberg, Z. N. and Kochenderfer, M. J. "Online Algorithms for POMDPs with Continuous State, Action, and Observation Spaces", ICAPS (2018)

POMCP-DPW

 

[  ] An infinite number of child nodes must be visited

[  ] Each node must be visited an infinite number of times

[  ] An infinite number of particles must be added to each belief node

Necessary Conditions for Consistency

 

Use $Z$ to insert weighted particles

POMCP

POMCP-DPW

POMCPOW
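A sketch of the step named above ("use $Z$ to insert weighted particles"): when a simulated transition $(s, a) \to (s', o)$ reaches the observation node for $o$, the state $s'$ is added to that node's belief with weight $Z(s, a, s', o)$. The data structure and names are illustrative, not the paper's notation.

```julia
mutable struct WeightedBelief
    states::Vector{Any}
    weights::Vector{Float64}
end
WeightedBelief() = WeightedBelief(Any[], Float64[])

function insert_weighted!(b::WeightedBelief, s, a, sp, o, Z)
    push!(b.states, sp)
    push!(b.weights, Z(s, a, sp, o))     # observation likelihood of o given (s, a, s')
    return b
end

# Later visits to this node resample a state in proportion to the weights, so the belief
# approaches a properly weighted particle approximation as more simulations are run.
function sample_state(b::WeightedBelief)
    c = cumsum(b.weights)
    i = searchsortedfirst(c, rand() * c[end])
    return b.states[min(i, length(b.states))]
end
```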

Alternative Methods

  • Particle Filter Tree (with DPW): fixed number of particles, $m$
  • DESPOT: fixed number of scenarios, $K$
  • Discretization


Ye, Nan, et al. "DESPOT: Online POMDP planning with regularization." Journal of Artificial Intelligence Research 58 (2017): 231-266.
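A hedged sketch of the belief update inside a particle filter tree node with a fixed number of particles $m$: propagate, weight by the observation likelihood $Z$, resample. The assumption that `gen(s, a)` returns the next state first is mine; this is illustrative, not any package's implementation.

```julia
function pf_update(b::Vector, a, o, gen, Z, m::Int)
    parents  = [b[rand(1:length(b))] for _ in 1:m]              # draw m particles with replacement
    children = [first(gen(s, a)) for s in parents]              # propagate through the model
    w = [Z(s, a, sp, o) for (s, sp) in zip(parents, children)]  # weight by Z(s, a, s', o)
    c = cumsum(w)
    return [children[min(searchsortedfirst(c, rand() * c[end]), m)] for _ in 1:m]  # resample to size m
end
```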

[Results: Light Dark and Sub Hunt benchmarks, including discretization baselines]

Sadigh, Dorsa, et al. "Information gathering actions over human internal state." Intelligent Robots and Systems (IROS), 2016 IEEE/RSJ International Conference on. IEEE, 2016.

Schmerling, Edward, et al. "Multimodal Probabilistic Model-Based Planning for Human-Robot Interaction." arXiv preprint arXiv:1710.09483 (2017).

Sadigh, Dorsa, et al. "Planning for Autonomous Cars that Leverage Effects on Human Actions." Robotics: Science and Systems. 2016.

Tweet by Nitin Gupta

29 April 2018

https://twitter.com/nitguptaa/status/990683818825736192

Human Behavior Model: IDM and MOBIL

\[\ddot{x}_\text{IDM} = a \left[ 1 - \left( \frac{\dot{x}}{\dot{x}_0} \right)^{\delta} - \left(\frac{g^*(\dot{x}, \Delta \dot{x})}{g}\right)^2 \right]\]

\[g^*(\dot{x}, \Delta \dot{x}) = g_0 + T \dot{x} + \frac{\dot{x}\Delta \dot{x}}{2 \sqrt{a b}}\]

M. Treiber, et al., “Congested traffic states in empirical observations and microscopic simulations,” Physical Review E, vol. 62, no. 2 (2000).

A. Kesting, et al., “General lane-changing model MOBIL for car-following models,” Transportation Research Record, vol. 1999 (2007).

A. Kesting, et al., "Agents for Traffic Simulation." Multi-Agent Systems: Simulation and Applications. CRC Press (2009).

POMDP Formulation

\[s = \left(x, y, \dot{x}, \left\{(x_c, y_c, \dot{x}_c, l_c, \theta_c)\right\}_{c=1}^{n}\right)\]

\[o = \left\{(x_c, y_c, \dot{x}_c, l_c)\right\}_{c=1}^{n}\]

$a = (\ddot{x}, \dot{y})$, $\ddot{x} \in \{0, \pm 1 \text{ m/s}^2\}$, $\dot{y} \in \{0, \pm 0.67 \text{ m/s}\}$

\[R(s, a, s') = \text{in\_goal}(s') - \lambda \left(\text{any\_hard\_brakes}(s, s') + \text{any\_too\_slow}(s')\right)\]

State $s$: ego physical state, plus the physical and internal (behavior) states of the other cars. Observation $o$: only the physical states of the other cars; their internal states are hidden.

  • Actions filtered so they can never cause crashes
  • Braking action always available
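An illustrative sketch of the two bullets above: candidate actions that fail a safety check are filtered out, and a braking action is always kept so the set is never empty. The `is_safe` helper and the braking value are assumptions, not the talk's implementation.

```julia
function safe_actions(s, candidates, is_safe; brake = (-1.0, 0.0))
    filtered = Any[a for a in candidates if is_safe(s, a)]
    brake in filtered || push!(filtered, brake)     # braking is always available
    return filtered
end
```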

[Results: safety vs. efficiency trade-off]

POMDPs.jl - An interface for defining and solving MDPs and POMDPs in Julia
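A minimal sketch of the POMDPs.jl interface using the Light-Dark problem from earlier. The noise floor, discount, the `Deterministic` utility (assumed from POMDPTools), the missing terminal-state handling, and the POMCPOW.jl solver call are assumptions of this sketch, not the exact model or code from the talk.

```julia
using POMDPs
using POMDPTools: Deterministic      # assumed utility distribution from JuliaPOMDP
using Distributions
using POMCPOW                        # assumed solver package (JuliaPOMDP/POMCPOW.jl)

struct LightDark <: POMDP{Int, Int, Float64} end   # state, action, and observation types

POMDPs.discount(::LightDark) = 0.95
POMDPs.actions(::LightDark) = [-10, -1, 0, 1, 10]
POMDPs.transition(::LightDark, s, a) = Deterministic(s + a)
POMDPs.observation(::LightDark, a, sp) = Normal(sp, abs(sp - 10) + 0.01)
POMDPs.reward(::LightDark, s, a) = a == 0 ? (s == 0 ? 100.0 : -100.0) : -1.0

# Any solver written against the interface plugs in the same way:
planner = solve(POMCPOWSolver(), LightDark())
# action(planner, b) then selects an action online for the current belief b.
```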

Thank You!

Markov Model

  • $\mathcal{S}$ - State space
  • $T:\mathcal{S}\times\mathcal{S} \to \mathbb{R}$ - Transition probability distribution

Markov Decision Process (MDP)

  • $\mathcal{S}$ - State space
  • $T:\mathcal{S}\times \mathcal{A} \times\mathcal{S} \to \mathbb{R}$ - Transition probability distribution
  • $\mathcal{A}$ - Action space
  • $R:\mathcal{S}\times \mathcal{A} \times\mathcal{S} \to \mathbb{R}$ - Reward

Previous C++ framework: APPL

"At the moment, the three packages are independent. Maybe one day they will be merged in a single coherent framework."

[Results figure: "all drivers normal" and "outcome only" conditions; curves for the Omniscient, Mean MPC, QMDP, and POMCPOW planners]

Simulation results

[Results figure: all drivers normal; curves for the Omniscient, Mean MPC, QMDP, and POMCPOW planners]

