Kinodynamic Rollouts and Action Shielding for Intention-Aware Planning
William Pope
M.S. Thesis Defense
14 Nov 2022
Intention-Aware Planning
- Environments with other agents pose a challenge for path planning
- Agents act with intention, but their objectives are not known to others
- By observing past behavior, we can reason about likely future behavior
Pedestrian Problem
- Vehicle needs to navigate around pedestrians/obstacles to reach a goal
- Assumptions:
- Receives measurements of human positions
- Dynamics are deterministic; vehicle knows its own state
- Has emergency brake
- Not-at-fault for collisions if stopped
POMDPs
- Partially-observable Markov decision process (POMDP)
- Sequential decision-making framework
- State contains hidden elements
- Belief – probability over possible true states, informed by observations
- For the pedestrian problem:
s = \begin{bmatrix} \text{vehicle state} \\ \text{human positions} \\ \text{human intentions} \end{bmatrix}
o = \begin{bmatrix} \text{vehicle state} \\ \text{human positions} \end{bmatrix}
b = \begin{bmatrix} \text{intention probabilities} \end{bmatrix}
a = \begin{bmatrix} \text{vehicle controls} \end{bmatrix}
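For illustration only (not the thesis implementation), these quantities could be written as plain Julia containers; the type and field names here are placeholders:

```julia
# Illustrative sketch of the pedestrian-problem POMDP quantities (placeholder names).
struct PedestrianState
    vehicle::Vector{Float64}            # vehicle state [x, y, θ, v]
    humans::Vector{Vector{Float64}}     # human positions [x, y]
    intentions::Vector{Int}             # hidden: index of each human's goal
end

struct PedestrianObservation
    vehicle::Vector{Float64}            # vehicle state is observed
    humans::Vector{Vector{Float64}}     # human positions are observed (intentions are not)
end

struct PedestrianBelief
    intention_probs::Vector{Vector{Float64}}   # per-human probability over candidate goals
end

struct VehicleAction
    ϕ::Float64                          # steering angle
    a::Float64                          # acceleration
end
```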
Online POMDP Planning
- State/belief space is infinite, so solving for a global policy is intractable
- Online planning:
- Search local area only with forward tree search
- Execute best action, take observation, update belief, repeat search
- POMDP solver is time-constrained by planning frequency
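A minimal sketch of this receding-horizon loop, with the solver, environment, and belief updater passed in as generic functions (so no particular solver API is assumed):

```julia
# Conceptual sketch of online POMDP planning: search locally, act, observe, update belief.
# `search`, `execute`, `update`, and `done` are generic callables supplied by the caller.
function online_planning_loop(b0, search, execute, update, done; max_steps = 100)
    b = b0
    for _ in 1:max_steps
        a = search(b)         # time-constrained forward tree search from the current belief
        o = execute(a)        # execute the best action, receive an observation
        b = update(b, a, o)   # update the belief over hidden intentions
        done(b) && break      # repeat at the planning frequency until the goal is reached
    end
    return b
end
```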
Preceding Work
- Baseline: POMDP speed control [Bai, H. (2015)]
- Tree searches 1D space along pre-planned path
- Rollout: adjust speed along path to avoid humans
- Improved: POMDP speed/steering control [Gupta, H. (2022)]
- Tree searches full state space
- Rollout: adjust speed to avoid humans and find the goal
- Uses multi-query motion planning to guide rollout
 | Holonomic | Nonholonomic |
---|---|---|
Speed | Bai, H. | Bai, H. |
Speed/Steering | Gupta, H. | |
Planning under Differential Constraints
- Vehicle trajectories must obey equations of motion for dynamical system
- In nonholonomic systems, state evolution is path-dependent; the vehicle cannot move in arbitrary directions from its current state
(Figure: holonomic vs. nonholonomic planning)
\dot{x} = f(x, u, t)
Dynamics Model
- Bicycle model with velocity (\(v \geq 0\))
- Inputs: steering, acceleration (discrete set)
x = \begin{bmatrix} x \\ y \\ \theta \\ v \end{bmatrix}
u = \begin{bmatrix} \phi \\ a \end{bmatrix}
\dot{x} = f(x,u) = \begin{bmatrix} \dot{x} \\ \dot{y} \\ \dot{\theta} \\ \dot{v} \end{bmatrix} = \begin{bmatrix} v \cos{\theta} \\ v \sin{\theta} \\ v \frac{1}{l}\tan{\phi} \\ a \end{bmatrix}
- Steering angle \(\phi\) is limited based on velocity; \(l\) is the vehicle wheelbase
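A direct transcription of this model in Julia (the wheelbase value and the Euler integration step are illustrative):

```julia
# Bicycle-model dynamics ẋ = f(x, u); l is the wheelbase (illustrative value).
function bicycle_dynamics(x::AbstractVector, u::AbstractVector; l = 0.3)
    _, _, θ, v = x                 # state: [x, y, θ, v], with v ≥ 0
    ϕ, a = u                       # input: steering angle, acceleration
    return [v * cos(θ),
            v * sin(θ),
            v * tan(ϕ) / l,
            a]
end

# Simple forward-Euler propagation over one time step (placeholder integrator)
propagate(x, u, dt; l = 0.3) = x .+ dt .* bicycle_dynamics(x, u; l = l)
```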
Contributions and Overview
Contributions
- Extended POMDP planner to nonholonomic vehicles with new rollout policy
- Improved planner safety around pedestrians with online reachability analysis
- Demonstrated planner on new autonomous robot
Overview
- Chapter I: Hamilton-Jacobi-Bellman Rollouts
- Chapter II: Sparse Tree Shielding
- Chapter III: Implementation/Experiments
Chapter I:
Hamilton-Jacobi-Bellman Rollouts
Tree Search Value Estimates
- In online POMDP solvers, need to estimate value of each new node
- Collects information beyond reach of tree
- Value estimates:
- Influence macro scale behavior of planner
- Need to be representative of true POMDP execution/reward
DESPOT Rollout Policies
- In DESPOT, estimate upper/lower bound on value for each node
- UB: simulate path to goal (no humans)
- LB: simulate path to goal (with humans, brake to avoid)
- Requirements:
- Reach goal near-optimally from any state
- Avoid pedestrians along path (lower bound)
- Generate path near instantly
- Follow vehicle dynamics
Goal: create UB/LB rollout policies that meet requirements
Possible Policies
- Random rollout
- Will not find sparse reward in navigation problem
- Graph-based methods
- Difficult to satisfy differential constraints when connecting nodes
- Forward tree search
- Single-query, too slow for online rollouts
- Potential field methods
- Subject to local minima issues, not guaranteed to reach goal
- Need global optimal control policy
- Hamilton-Jacobi-Bellman equation
Hamilton-Jacobi-Bellman Equation
- Value: optimal reward-to-go from given state/time
V(x(t),t) = \max_{u}\{\int_{t}^{t_{f}} R(x(\tau),u(\tau))\, d\tau\}
- Reformulated with dynamic programming:
V(x(t),t) = \max_{u}\{\int_{t}^{t+dt} R(x(\tau),u(\tau))\, d\tau + V(x(t+dt),t+dt)\}
- Uses dynamic programming to solve Hamilton-Jacobi PDE for optimal control (HJB PDE):
0 = \frac{\partial V}{\partial t} + \max_{u}\{\frac{\partial V}{\partial x}\cdot f(x,u) + R(x,u)\}
- With optimal value function, can quickly get optimal control at any state:
u^{*}(x(t),t) = \argmax_{u}(Q(x(t),u,t))
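The step from the dynamic-programming form to the PDE is the standard first-order expansion of \(V(x(t+dt), t+dt)\) about \((x(t), t)\), followed by subtracting \(V(x,t)\) and dividing by \(dt\) (reproduced here for clarity):

```latex
\begin{aligned}
V(x(t),t) &\approx \max_{u}\Big\{ R(x,u)\,dt + V(x(t+dt),\,t+dt) \Big\} \\
          &\approx \max_{u}\Big\{ R(x,u)\,dt + V(x,t)
             + \frac{\partial V}{\partial t}\,dt
             + \frac{\partial V}{\partial x}\cdot f(x,u)\,dt \Big\} \\
0 &= \frac{\partial V}{\partial t}
   + \max_{u}\Big\{ \frac{\partial V}{\partial x}\cdot f(x,u) + R(x,u) \Big\}
\end{aligned}
```

Since \(V(x,t)\) and \(\partial V / \partial t\) do not depend on \(u\), they can be pulled out of the maximization before dividing by \(dt\).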
Discrete Time HJB
- For discrete time systems, HJB is equivalent to value iteration
V(s) = \max_{a}\{R(s,a) + V(s')\}
- To solve over continuous state space, need to discretize into grid of states \((s)\)
- Next state \((s')\) usually not on grid, so value needs to be interpolated
- Generally requires discrete action set to optimize
Solving for Value Function
- Reward function (same as POMDP)
- if goal: \(R(s) = +1000.0\)
- if obstacle: \(R(s) = -10^{6}\)
- all actions: \(R(a) = -1.0\)
- Semi-Lagrangian PDE method: compute one-step lookahead at every node until values converge
- Produces \(n\)-dimensional value array
- Can be discontinuous
- Expensive due to dimensionality
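A reduced, self-contained sketch of this sweep on a coarse 2-D position grid with simplified holonomic dynamics (the thesis solver runs on the full 4-D bicycle-model state; the grid size, action set, and goal/obstacle regions here are illustrative), assuming Interpolations.jl for the grid interpolation:

```julia
# Semi-Lagrangian value-iteration sketch on a coarse (x, y) grid; rewards mirror
# the slide (+1000 goal, -1e6 obstacle, -1 per action). Illustrative only.
using Interpolations

xs = range(0.0, 10.0, length = 21)
ys = range(0.0, 10.0, length = 21)
actions = [(dx, dy) for dx in (-1.0, 0.0, 1.0), dy in (-1.0, 0.0, 1.0)]
dt = 0.5

in_goal(x, y)     = x > 9.0 && y > 9.0
in_obstacle(x, y) = 4.0 < x < 6.0 && 4.0 < y < 6.0
reward(x, y)      = in_goal(x, y) ? 1000.0 : in_obstacle(x, y) ? -1.0e6 : -1.0

V = zeros(length(xs), length(ys))

for _ in 1:500
    Vitp = interpolate((xs, ys), V, Gridded(Linear()))   # off-grid values are interpolated
    Vnew = similar(V)
    for (i, x) in enumerate(xs), (j, y) in enumerate(ys)
        if in_goal(x, y) || in_obstacle(x, y)
            Vnew[i, j] = reward(x, y)                    # absorbing states keep terminal reward
        else
            best = -Inf
            for a in actions                             # one-step lookahead over the action set
                x2 = clamp(x + a[1] * dt, first(xs), last(xs))
                y2 = clamp(y + a[2] * dt, first(ys), last(ys))
                best = max(best, reward(x, y) + Vitp(x2, y2))
            end
            Vnew[i, j] = best
        end
    end
    converged = maximum(abs.(Vnew .- V)) < 1e-3
    V .= Vnew
    converged && break
end
```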
Planning with Value Function
- For any state in continuous space, take action that maximizes interpolated Q-value
Faster Planning
- Rollout has sizable impact on POMDP computation cost
- Can speed up policy with offline information
- Exact: interpolates new Q-values for exact continuous state
- Approx: uses stored Q-values from nearest neighbor in state grid
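A sketch of the two lookup modes, assuming a precomputed value array on the same kind of grid as above (2-D and holonomic here purely for illustration):

```julia
# Exact vs. approximate policy lookup from a stored value array (illustrative 2-D setup).
using Interpolations

xs = range(0.0, 10.0, length = 21)
ys = range(0.0, 10.0, length = 21)
V  = [-hypot(10.0 - x, 10.0 - y) for x in xs, y in ys]   # stand-in for a solved value array
actions = [(dx, dy) for dx in (-1.0, 0.0, 1.0), dy in (-1.0, 0.0, 1.0)]
dt = 0.5

Vitp = interpolate((xs, ys), V, Gridded(Linear()))
qval(x, y, a) = Vitp(clamp(x + a[1] * dt, 0.0, 10.0), clamp(y + a[2] * dt, 0.0, 10.0))

# Exact: interpolate fresh Q-values at the exact continuous state
action_exact(x, y) = argmax(a -> qval(x, y, a), actions)

# Approx: precompute the best action at every grid node offline, then use the
# nearest grid neighbor online (cheaper, slightly less accurate)
best_at_node = [action_exact(x, y) for x in xs, y in ys]
nearest(r, v) = argmin(abs.(collect(r) .- v))
action_approx(x, y) = best_at_node[nearest(xs, x), nearest(ys, y)]
```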
Reactive-HJB Policy
- To simulate a rollout path with humans present, apply a reactive acceleration based on the distance to the nearest human in the scenario
- Rollout still needs to reach goal, so reactive controller is coupled with HJB
- RC chooses acceleration
- HJB chooses steering
- If RC accel is unsafe, rollout reverts to pure HJB policy
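A hedged sketch of this coupling; `hjb_action` and `is_safe` stand in for the offline HJB policy and the safety check, and the distance thresholds and accelerations are illustrative:

```julia
# Reactive-HJB rollout action: HJB supplies steering, a distance-based reactive
# controller supplies acceleration, and the policy reverts to pure HJB if the
# reactive choice is unsafe. `hjb_action` and `is_safe` are placeholder callables.
function reactive_hjb_action(x, human_positions, hjb_action, is_safe;
                             d_stop = 1.5, d_slow = 4.0, a_brake = -1.0, a_cruise = 1.0)
    ϕ_hjb, a_hjb = hjb_action(x)                           # steering (and fallback accel) from HJB
    d_min = minimum(hypot(x[1] - h[1], x[2] - h[2]) for h in human_positions)
    a_rc = d_min < d_stop ? a_brake :                      # brake when very close to a human
           d_min < d_slow ? 0.0 : a_cruise                 # coast when moderately close, else accelerate
    a = (ϕ_hjb, a_rc)
    return is_safe(x, a) ? a : (ϕ_hjb, a_hjb)              # unsafe reactive accel → pure HJB action
end
```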
HJB as Pedestrian Problem Rollout
- Upper bound:
- Use exact HJB policy to simulate optimal human-free path
- Runs once per node
- Lower bound:
- Use approximate reactive policy to simulate path avoiding humans
- Runs once per scenario per node
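Both bounds reduce to simulating a policy forward and accumulating discounted reward. A generic sketch, with the policy and step functions passed in and an assumed discount factor (so nothing here is tied to the solver's interface):

```julia
# Estimate a node's value bound by simulating a rollout policy forward.
# For the upper bound, `policy` would be the exact HJB policy with humans ignored;
# for the lower bound, the approximate reactive-HJB policy per sampled scenario.
function rollout_value(s0, policy, step; γ = 0.99, horizon = 100, is_terminal = s -> false)
    value, discount, s = 0.0, 1.0, s0
    for _ in 1:horizon
        a = policy(s)
        s, r = step(s, a)          # propagate the state and collect reward
        value += discount * r
        discount *= γ
        is_terminal(s) && break
    end
    return value
end
```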
Rollout Policy Comparison
- Lower bound policies reach goal, regardless of reactive acceleration input
Policy | Runtime [μs] |
---|---|
HJB | 18.064 |
Reactive-HJB | 6.083 |
Approx Reactive-HJB | 1.915 |
\(|A| = 21\), \(|A_{RC}| = 7\)
POMDP w/ HJB Rollouts
Performance by Rollout Choice
- No significant difference observed for faster lower bound policy
- Any improvement in tree size may be negated by looser bound
Lower Bound Policy | Average Time-to-Goal [s] | Average Node Rate [1/s] |
---|---|---|
Reactive-HJB | 17.9 | 759.4 |
Approx Reactive-HJB | 16.5 | 820.0 |
N = 10
Chapter II:
Sparse Tree Shielding
Pedestrian Collision Avoidance
- Future human collisions are predicted/penalized in POMDP tree
- However:
- DESPOT explores a finite number of observation branches
- Collision penalties may be washed out by nearby safe nodes
- Vehicle assumed to have instant emergency brake
Goal: add safety mechanism to handle low probability collisions and finite braking ability
[Ye, N., et al.]
Reach-Avoid Games
- Players:
- Vehicle (P1) wants to reach goal
- Humans and obstacles (P2) "want" to cause collisions
- At a given state:
- Vehicle wins if there exists a safe path to goal in finite time
- Obstacles win if all of vehicle's future paths can be blocked
- To ensure vehicle safety, must only visit states with known safe future paths
[Chen, M.]
Human Reachable Set
- Vehicle path is not safe if it intersects human's reachable set in space/time
- Approximating human FRS:
- From observation, propagate human toward all possible goals
- Define FRS as convex hull of states at each time step
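A sketch of this approximation, assuming LazySets.jl for the convex hull; the candidate goal list, walking speed, and horizon are illustrative values:

```julia
# Approximate a human's forward reachable set (FRS): from the observed position,
# propagate toward every candidate goal at walking speed, then take the convex hull
# of the propagated points at each future time step.
using LinearAlgebra: norm
using LazySets: convex_hull

function human_frs(p0::Vector{Float64}, goals::Vector{Vector{Float64}};
                   v_human = 1.4, dt = 0.5, steps = 10)
    frs = Vector{Vector{Vector{Float64}}}(undef, steps)
    for k in 1:steps
        pts = map(goals) do g
            d = g .- p0
            n = norm(d)
            n < 1e-9 ? copy(p0) : p0 .+ min(v_human * k * dt, n) .* d ./ n
        end
        frs[k] = convex_hull(pts)       # hull vertices bounding the human at step k
    end
    return frs
end
```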
Vehicle Safe Set
- Vehicle has two known invariant safe sets:
- In goal
- In free space with zero velocity (not at fault)
- For any other state to be safe, must have a path to these sets that does not intersect obstacles or human FRS
- Only need to prove existence of one such path
- More simply: vehicle must be able to safely stop from any visited state
Online Action Shielding
- DESPOT produces Q-values for all actions at \(t_{k+1}\)
- Shield filters out actions with human FRS intersections prior to \(t_{stop}\)
- Agent executes best action from remaining safe action set
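A sketch of the shielding step; `braking_trajectory` and `intersects_frs` are placeholder callables for the stopping-trajectory rollout and the FRS intersection test:

```julia
# Filter the solver's actions: keep only those whose braking trajectory stays clear
# of every human FRS before the vehicle can stop, then execute the best survivor.
function shielded_action(q_values::Dict, x, frs_list, braking_trajectory, intersects_frs)
    safe = [a for a in keys(q_values) if
            !any(intersects_frs(braking_trajectory(x, a), frs) for frs in frs_list)]
    isempty(safe) && return nothing     # no certified-safe action (e.g. fall back to emergency brake)
    return argmax(a -> q_values[a], safe)
end
```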
POMDP w/ Shield
Performance
Humans in Environment (Shield) | Average Time-to-Goal [s] | Interventions / Actions | Collisions / Trials |
---|---|---|---|
10 (off) | 15.4 | n/a | 0.1 |
10 (on) | 23.0 | 0.207 | 0 |
20 (off) | 17.2 | n/a | 0.2 |
20 (on) | 31.3 | 0.280 | 0 |
- Shield is able to prevent collisions, but at significant time cost
N = 10
Chapter III.a:
Experimental Robot
Marmot
- Built car-like robot to test planners in real-time/real-world environment
- Runs POMDP planner on-board
- Collects state data from motion capture system
Hardware
- Based on F1Tenth open-source design
- Modified for ADCL uses
- Larger computer, dual batteries, more sensor mounts
(Figure: hardware stack, with Intel NUC i5 on-board computer)
Software
- Using ROS1
- POMDP implemented in Julia (RobotOS.jl)
- Planner and belief updater in separate nodes
- Minimal integration effort for new planners
- Online decision loop runs at \(f_{loop} = 2 \text{ Hz}\)
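A minimal sketch of the node's 2 Hz timing loop with RobotOS.jl's rospy-style API (the planning and publishing calls are omitted; this only shows the loop-rate pattern):

```julia
# 2 Hz decision loop skeleton using RobotOS.jl (planning and publishing omitted).
using RobotOS

init_node("pomdp_planner")
loop_rate = Rate(2.0)            # f_loop = 2 Hz

while !is_shutdown()
    # one planning cycle would run here: search the tree, publish the chosen action
    rossleep(loop_rate)          # sleep to maintain the planning frequency
end
```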
POMDP-ROS Demonstration
Chapter III.b:
BellmanPDEs.jl
BellmanPDEs.jl
- Created Julia package for HJB solver/planner
- Intended for standalone planning or POMDP rollouts
Conclusion
- HJB rollouts enable use of the planner in broader applications, including nonholonomic vehicles
- Action shielding provides safety while preserving planner behavior
Future work
- Implement more efficient HJB solvers for larger environments
- Add disturbances and low level controller to vehicle dynamics
- Include more social context in humans/planner
Questions
William Pope