(Essentially AlphaZero but the states are now beliefs)
Presented By: Himanshu Gupta
Date: 1/10/2024
Authors: Robert J. Moss, Anthony Corso, Jef Caers, Mykel J. Kochenderfer
#1 - They don't work well for problems with large or continuous action spaces.
#2 - They need good rollouts to guide the tree search when planning time is limited.
#3 - Building the tree gets more difficult as the problem gets more complex.
This is the idea behind AlphaZero
Step 1: Policy Evaluation
1) Perform tree search.
2) At every node, sample actions from your learned policy network.
3) Replace rollouts with value estimates from your learned value network.
Step 2: Policy Improvement
1) Compute the cross-entropy loss for the policy network and the MSE loss for the value network.
2) Do backpropagation and update the weights.
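A minimal sketch of the policy-improvement step, assuming a PyTorch-style two-headed network; the function, network, and batch names are placeholders for illustration, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def policy_improvement_step(network, optimizer, batch):
    """One gradient update on data collected during policy evaluation.

    batch is assumed to hold:
      beliefs  - belief-state inputs, shape (B, belief_dim)
      mcts_pi  - MCTS policy targets (e.g. visit counts), shape (B, num_actions)
      returns  - observed returns used as value targets, shape (B,)
    """
    beliefs, mcts_pi, returns = batch
    policy_logits, value = network(beliefs)  # two heads: policy logits, scalar value

    # Cross-entropy between the MCTS policy target and the network policy.
    policy_loss = -(mcts_pi * F.log_softmax(policy_logits, dim=-1)).sum(dim=-1).mean()

    # MSE between the observed return and the predicted value.
    value_loss = F.mse_loss(value.squeeze(-1), returns)

    loss = policy_loss + value_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```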
Basically AlphaZero for POMDPs
MDP
Belief MDP aka POMDP
For discrete-state problems, b is just a vector of size |S|, where each entry is the probability of the environment being in the corresponding state.
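As a concrete illustration of the discrete case, here is a standard Bayes-filter belief update over a finite state space; the transition and observation model arrays are generic placeholders.

```python
import numpy as np

def update_belief(b, a, o, T, Z):
    """Discrete Bayes-filter belief update.

    b : belief vector of size |S|, b[s] = P(state = s)
    T : transition model, T[a][s, s'] = P(s' | s, a)
    Z : observation model, Z[a][s', o] = P(o | s', a)
    """
    predicted = T[a].T @ b            # predict: P(s') = sum_s P(s'|s,a) b(s)
    updated = Z[a][:, o] * predicted  # correct: weight by observation likelihood
    return updated / updated.sum()    # normalize back to a probability vector
```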
MDP
Belief MDP aka POMDP
For continuous-state problems, b can be represented using either a Gaussian distribution or a particle filter.
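For the continuous case, a minimal bootstrap particle-filter update; the generative models step_fn and obs_likelihood are assumed placeholders for the problem's dynamics and observation model.

```python
import numpy as np

def particle_filter_update(particles, a, o, step_fn, obs_likelihood, rng):
    """Bootstrap particle filter: propagate, weight, resample.

    particles      : array of shape (N, state_dim) representing the belief
    step_fn        : s' = step_fn(s, a, rng), the generative transition model
    obs_likelihood : obs_likelihood(o, s', a) = p(o | s', a)
    """
    propagated = np.array([step_fn(s, a, rng) for s in particles])
    weights = np.array([obs_likelihood(o, s, a) for s in propagated])
    weights /= weights.sum()
    idx = rng.choice(len(propagated), size=len(propagated), p=weights)
    return propagated[idx]  # equally weighted particles after resampling
```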
$$V^*(s) = \underset{a\in\mathcal{A}}{\max} \left\{R(s, a) + \gamma E\Big[V^*\left(s_{t+1}\right) \mid s_t=s, a_t=a\Big]\right\}$$
MDP
Belief MDP aka POMDP
$$V^*(b) = \underset{a\in\mathcal{A}}{\max} \left\{R(b, a) + \gamma E\Big[V^*\left(b_{t+1}\right) \mid b_t=b, a_t=a\Big]\right\}$$
Value function for MDPs
Value function for POMDPs
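To connect the two columns: the belief-MDP quantities expand in terms of the underlying POMDP (these are the standard belief-MDP identities; the Update(b, a, o) notation for the Bayesian belief update is introduced here for illustration):

$$R(b, a) = \sum_{s\in\mathcal{S}} b(s)\,R(s, a), \qquad E\Big[V^*\left(b_{t+1}\right) \mid b_t=b, a_t=a\Big] = \sum_{o\in\mathcal{O}} P(o \mid b, a)\, V^*\big(\mathrm{Update}(b, a, o)\big)$$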
Loss Function:
Problem 1: LaserTag (5) and LaserTag (10)
1. Belief representation and how it is input to the network.
2. Can it be used for continuous-action problems?
\(\pi_t\) is the policy distribution from MCTS
\(p_t\) is the policy distribution from the trained network
(For discrete actions, these are probability values)
(For continuous actions, these are density values)
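A sketch of the combined objective in the AlphaZero style described earlier, using the cross-entropy and MSE terms from the training step; the notation \(z_t\) for the observed return and \(v_t\) for the network's value estimate is assumed here:

$$\mathcal{L} = \left(z_t - v_t\right)^2 - \pi_t^\top \log p_t$$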
Our Fix: Use MGM instead
Our Fix: Use the Wasserstein distance, or use Dirac delta distribution math to justify sampling?