BetaZero: Belief-State Planning for Long-Horizon POMDPs using Learned Approximations

(Essentially AlphaZero but the states are now beliefs)

Presented By: Himanshu Gupta

Date: 1/10/2024

Authors: Robert J. Moss, Anthony Corso, Jef Caers, Mykel J. Kochenderfer

MOTIVATION

  • The majority of real-life robotics problems are POMDPs.
    • So POMDPs should be the preferred mathematical framework for solving these problems.
  • But solving POMDPs exactly is intractable (PSPACE-hard in general) :'(


  • So, we solve them approximately
     
    • Using tree search techniques

       
    • Using Deep RL

Why use Tree Search?

  • We specifically use Monte Carlo Tree Search (MCTS) techniques

  • MCTS finds a solution online for complex problems
    • e.g., MDPs with large or continuous state spaces
       
  • Better than full tree expansion using BFS or DFS
     
  • It has good/decent convergence guarantees
     
  • It has been empirically shown to work well
    • Used to solve games like Chess, Go, etc.  

Problems with Tree Search

#1 - Doesn't work for large or continuous action space problems.

#2 - Needs good rollouts to guide the tree search when planning time is limited.

#3 - Building the tree gets more difficult as the problem gets more complex.

Why use Deep RL?

  • Addresses limitations of tree search techniques
    • Can work for problems with large/continuous action spaces; no rollouts are needed.

Problems with Deep RL?

  • Learning is hard.
    • Especially difficult for long horizon problems with sparse rewards.

So, obvious solution?

  • Use the best of both worlds
    • Use learned policy and value functions to guide tree search

This is the idea behind AlphaZero

AlphaZero for MDPs - RECAP

Step 1: Policy Evaluation

1) Perform tree search.

2) At every node, sample actions from your learned policy network.

3) Replace rollouts with estimates from your learned value network (see the sketch after this recap).

Step 2: Policy Improvement

1) Compute the cross-entropy loss for the policy network and the MSE loss for the value network.

2) Do backpropagation and update the weights.
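A minimal Python sketch of the tree-search side, assuming AlphaZero-style PUCT selection with a learned policy prior and a value-network bootstrap at the leaves; `policy_net`, `value_net`, and the class layout are illustrative placeholders, not the paper's code.

```python
import math

class Node:
    """One tree node; `prior` is the policy network's probability for the
    action that led here."""
    def __init__(self, prior):
        self.prior = prior
        self.visit_count = 0
        self.value_sum = 0.0
        self.children = {}  # action -> Node

    def value(self):
        return self.value_sum / self.visit_count if self.visit_count else 0.0

def select_action(node, c_puct=1.0):
    """PUCT rule: exploit the mean value, explore in proportion to the prior."""
    total = sum(child.visit_count for child in node.children.values())
    def score(action, child):
        u = c_puct * child.prior * math.sqrt(total + 1) / (1 + child.visit_count)
        return child.value() + u
    return max(node.children.items(), key=lambda ac: score(*ac))[0]

def expand_and_evaluate(node, state, policy_net, value_net):
    """Expand a leaf with priors from the policy network and return the
    value network's estimate instead of running a random rollout."""
    for action, p in policy_net(state).items():  # assumed: dict action -> prob
        node.children[action] = Node(prior=p)
    return value_net(state)
```

BetaZero reuses this machinery with beliefs b in place of states s.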

BetaZero 

Basically, AlphaZero for POMDPs

POMDPs are Belief MDPs

MDP

  • State: s ~ S
  • Action: a ~ A
  • s' ~ T(s,a)
  • Reward: R(s,a)

Belief MDP aka POMDP

  • State: b ~ B
  • Action: a ~ A
  • b' ~ T(b,a)
  • Reward: R(b,a)

For discrete state problems, b is simply a vector of length |S| whose entries give the probability of the environment being in each state.
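As a concrete illustration of how such a belief vector evolves, here is a minimal discrete Bayes-filter update in Python; the matrix layout of `T` and `O` is an assumption for this sketch, not notation from the paper.

```python
import numpy as np

def update_belief(b, a, o, T, O):
    """Discrete Bayes filter over an |S|-dimensional belief vector.

    b: current belief, shape (|S|,)
    T: T[a][s, s'] = P(s' | s, a)
    O: O[a][s', o] = P(o | s', a)
    """
    predicted = T[a].T @ b            # predict: P(s') = sum_s P(s'|s,a) b(s)
    b_new = O[a][:, o] * predicted    # correct: weight by observation likelihood
    return b_new / b_new.sum()        # normalize back to a probability vector
```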


For continuous state problems, b can be represented using either a Gaussian distribution or a particle filter.
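A minimal bootstrap particle-filter update, assuming a generative transition model and an observation-likelihood function are available; all names here are illustrative placeholders.

```python
import numpy as np

def particle_filter_update(particles, a, o, transition, obs_likelihood,
                           rng=np.random):
    """Represent b by samples: propagate, weight by the observation, resample.

    particles: array of sampled states (the current belief b)
    transition(s, a): samples a successor state s'
    obs_likelihood(o, s_next, a): likelihood of observing o in s_next
    """
    propagated = np.array([transition(s, a) for s in particles])
    weights = np.array([obs_likelihood(o, s, a) for s in propagated])
    weights = weights / weights.sum()
    idx = rng.choice(len(propagated), size=len(propagated), p=weights)
    return propagated[idx]
```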

Value function for MDPs:

$$V^*(s) = \underset{a\in\mathcal{A}}{\max} \left\{R(s, a) + \gamma \mathbb{E}\Big[V^*\left(s_{t+1}\right) \mid s_t=s, a_t=a\Big]\right\}$$

Value function for POMDPs:

$$V^*(b) = \underset{a\in\mathcal{A}}{\max} \left\{R(b, a) + \gamma \mathbb{E}\Big[V^*\left(b_{t+1}\right) \mid b_t=b, a_t=a\Big]\right\}$$
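For reference, the belief-MDP quantities used above are induced from the underlying POMDP in the standard way (these are textbook definitions, not anything specific to BetaZero):

$$R(b, a) = \sum_{s \in \mathcal{S}} b(s)\, R(s, a)$$

$$b'(s') \propto O(o \mid s', a) \sum_{s \in \mathcal{S}} T(s' \mid s, a)\, b(s)$$

So sampling b' ~ T(b, a) amounts to sampling an observation o and then applying the Bayes update above.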

BetaZero Algorithm

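A rough Python sketch of the BetaZero offline loop, alternating policy evaluation (run the MCTS policy over the belief MDP and collect belief, tree-policy, and return targets) and policy improvement (retrain the network); the function names are placeholders, not the authors' API.

```python
def betazero_offline(run_mcts_episode, train_network, network,
                     n_iterations=10, n_episodes=100):
    """Alternate data collection with belief-space MCTS and network retraining.

    run_mcts_episode(network): runs one episode of MCTS over beliefs guided by
        the current network and returns (belief, mcts_policy, return) tuples.
    train_network(network, data): retrains the network (cross-entropy on the
        MCTS policy targets, MSE on the observed returns).
    """
    for _ in range(n_iterations):
        # Policy evaluation: execute the tree-search policy to gather targets.
        data = []
        for _ in range(n_episodes):
            data.extend(run_mcts_episode(network))
        # Policy improvement: fit the value/policy network to the new data.
        network = train_network(network, data)
    return network
```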

Loss Function:
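Based on the cross-entropy and MSE terms mentioned in the AlphaZero recap, the combined objective has the standard AlphaZero form below (the exact loss used in BetaZero may differ in details such as weighting or regularization; treat this as an assumption):

$$\ell(\theta) = \big(v_\theta(b) - g\big)^2 \;-\; \pi_{\text{MCTS}}^\top \log p_\theta(b) \;+\; \lambda \lVert \theta \rVert^2$$

where g is the observed return used as the value target.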

Experiments

Problem 1: LightDark(5) and LightDark(10)

Results


Limitations?

1. Belief representation and input to the network.

2. Can't use this for continuous action problems?

   • \(\pi_t\) is the policy distribution from MCTS (discrete; its values are probabilities; see the visit-count formula below).

   • \(p_t\) is the policy distribution from the trained network (for continuous actions it is a density, so its values are not probabilities).

   • So the cross-entropy loss between \(\pi_t\) and \(p_t\) is not directly well-defined in the continuous-action case.

   • Our fix: use MGM instead?

   • Our fix: use the Wasserstein distance, or the Dirac delta distribution math to justify sampling?
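For reference, the MCTS policy target is typically the normalized visit-count distribution (as in AlphaZero; the paper's exact definition may differ), which only has support on the finitely many actions expanded in the tree:

$$\pi_t(a \mid b) \;=\; \frac{N(b, a)^{1/\tau}}{\sum_{a'} N(b, a')^{1/\tau}}$$

Here N(b, a) is the root visit count of action a and τ is a temperature parameter; with continuous actions only a sampled subset of actions ever receives nonzero mass, which is the mismatch noted above.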

THE END

BetaZero Paper Presentation - CAIRO Lab Meeting

By Himanshu Gupta
