BetaZero: Belief-State Planning for Long-Horizon POMDPs using Learned Approximations
(Essentially AlphaZero but the states are now beliefs)
Presented By: Himanshu Gupta
Date: 1/10/2024
Authors: Robert J. Moss, Anthony Corso, Jef Caers, Mykel J. Kochenderfer
MOTIVATION
- The majority of real-life robotics problems are POMDPs.
- POMDPs should therefore be the preferred mathematical framework for solving these problems.
- But solving POMDPs exactly is intractable :'(
- So, we solve them approximately:
  - Using tree search techniques
  - Using deep RL
Why use Tree Search?
- We specifically use Monte Carlo Tree Search (MCTS) techniques.
- Finds a solution online for a complex problem.
  - Ex: MDPs with big and continuous state spaces
- Better than full tree expansion using BFS or DFS.
- It has decent convergence guarantees.
- It has been empirically shown to work well.
  - Used to solve games like Chess, Go, etc.
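To make the MCTS idea concrete, here is a minimal sketch of UCB-guided tree search on a toy MDP. Everything here (the `Node` class, the toy two-action problem, the constants) is a hypothetical illustration, not code from the paper:

```python
import math

class Node:
    """One node per state, with per-action visit counts and mean values."""
    def __init__(self, state, actions):
        self.state = state
        self.actions = actions
        self.n = {a: 0 for a in actions}      # visit counts
        self.q = {a: 0.0 for a in actions}    # mean action values
        self.children = {}                    # action -> child Node

def ucb_action(node, c=1.4):
    """UCB1: exploit high-value actions, but explore rarely tried ones."""
    total = sum(node.n.values()) + 1
    def score(a):
        if node.n[a] == 0:
            return float("inf")               # try unvisited actions first
        return node.q[a] + c * math.sqrt(math.log(total) / node.n[a])
    return max(node.actions, key=score)

def simulate(node, step, depth, gamma=0.95):
    """One simulation: select with UCB, expand, back up the return."""
    if depth == 0:
        return 0.0
    a = ucb_action(node)
    next_state, reward = step(node.state, a)
    if a not in node.children:
        node.children[a] = Node(next_state, node.actions)
    r = reward + gamma * simulate(node.children[a], step, depth - 1)
    node.n[a] += 1
    node.q[a] += (r - node.q[a]) / node.n[a]  # incremental mean update
    return r

def mcts(state, step, actions, n_sims=500, depth=5):
    root = Node(state, actions)
    for _ in range(n_sims):
        simulate(root, step, depth)
    return max(actions, key=lambda a: root.n[a])  # most-visited action

# Hypothetical toy MDP: action 1 always pays +1, action 0 pays 0.
def toy_step(s, a):
    return s, float(a)

print(mcts(0, toy_step, [0, 1]))  # prints 1
```

Note that this sketch anytime-improves with more simulations, which is exactly why it beats exhaustive BFS/DFS expansion on big problems.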
Problems with Tree Search
#1: Doesn't work for large or continuous action space problems.
#2: Needs good rollouts to guide tree search when planning time is limited.
#3: Building the tree gets more difficult as the problem gets more complex.
Why use Deep RL?
- Addresses the limitations of tree search techniques.
- Can work for problems with large/continuous action spaces; no rollouts needed.
Problems with Deep RL?
- Learning is hard.
- Especially difficult for long horizon problems with sparse rewards.
So, obvious solution?
- Use the best of both worlds
- Use learned policy and value functions to guide tree search
This is the idea behind AlphaZero
AlphaZero for MDPs - RECAP
Step 1: Policy Evaluation
1) Perform tree search.
2) At every node, sample actions from your learned policy.
3) Replace rollouts with value estimates from your learned network.
Step 2: Policy Improvement
1) Compute the cross-entropy loss for the policy network and the MSE loss for the value network.
2) Do backpropagation and update the weights.
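The two steps above can be sketched as a single gradient-descent loop on hypothetical targets. The tiny "network" here (a softmax over raw logits plus a scalar value) and all the numbers are illustrative stand-ins, not the paper's architecture:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def loss_and_grads(logits, v, pi_mcts, z):
    """Cross-entropy to the MCTS visit distribution + MSE to return z.
    For softmax logits, d(CE)/d(logits) = p - pi_mcts."""
    p = softmax(logits)
    ce = -np.sum(pi_mcts * np.log(p + 1e-12))
    mse = (z - v) ** 2
    return ce + mse, p - pi_mcts, 2.0 * (v - z)

# Policy evaluation (hypothetical outputs): tree search produced a visit
# distribution pi_mcts and the episode returned z.
pi_mcts = np.array([0.7, 0.2, 0.1])   # normalized visit counts from MCTS
z = 1.0                               # observed return

# Policy improvement: gradient steps pull the network toward the targets.
logits = np.zeros(3)                  # uniform initial policy
v = 0.0                               # initial value estimate
for _ in range(1000):
    loss, g_logits, g_v = loss_and_grads(logits, v, pi_mcts, z)
    logits -= 0.5 * g_logits
    v -= 0.5 * g_v

print(np.round(softmax(logits), 2))   # ≈ [0.7, 0.2, 0.1]
print(round(v, 2))                    # ≈ 1.0
```

The point of the sketch: the network converges toward the MCTS visit distribution and the observed return, so the next round of tree search starts from a better prior.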
BetaZero
Basically AlphaZero for POMDPs
POMDPs are Belief MDPs

MDP
- State: s ∈ S
- Action: a ∈ A
- Transition: s' ~ T(s, a)
- Reward: R(s, a)

Belief MDP (aka POMDP)
- State: b ∈ B
- Action: a ∈ A
- Transition: b' ~ T(b, a)
- Reward: R(b, a)

- For discrete state problems, b is just a vector of size |S| whose entries give the probability of the environment being in each state.
- For continuous state problems, b can be represented either with a Gaussian distribution or with a particle filter.

Value function for MDPs:
$$V^*(s) = \underset{a\in\mathcal{A}}{\max} \left\{R(s, a) + \gamma \, \mathbb{E}\Big[V^*\left(s_{t+1}\right) \mid s_t=s, a_t=a\Big]\right\}$$

Value function for Belief MDPs:
$$V^*(b) = \underset{a\in\mathcal{A}}{\max} \left\{R(b, a) + \gamma \, \mathbb{E}\Big[V^*\left(b_{t+1}\right) \mid b_t=b, a_t=a\Big]\right\}$$
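The particle-filter belief representation mentioned above can be sketched as a bootstrap filter. The 1-D localization problem below (drift dynamics, Gaussian observation noise, all constants) is a hypothetical example, not a domain from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def pf_update(particles, action, observation, step, obs_lik):
    """Bootstrap particle filter: propagate each particle through the
    transition model, weight by observation likelihood, resample."""
    proposed = np.array([step(s, action) for s in particles])
    w = np.array([obs_lik(observation, s) for s in proposed])
    w /= w.sum()
    idx = rng.choice(len(proposed), size=len(proposed), p=w)
    return proposed[idx]

# Hypothetical 1-D system: the state drifts by the action plus noise,
# and we observe the state corrupted by Gaussian noise.
def step(s, a):
    return s + a + rng.normal(0.0, 0.1)

def obs_lik(o, s):
    return np.exp(-0.5 * ((o - s) / 0.5) ** 2)

true_x = 0.0
belief = rng.normal(0.0, 2.0, size=1000)   # broad initial belief particles
for _ in range(10):
    true_x = step(true_x, 1.0)
    obs = true_x + rng.normal(0.0, 0.5)
    belief = pf_update(belief, 1.0, obs, step, obs_lik)

print(belief.mean())  # belief mean tracks the (hidden) true state
```

Each such updated particle set is one "state" b of the belief MDP; BetaZero's tree search branches over these updates rather than over raw states.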
BetaZero Algorithm
Loss Function:
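The loss itself is not reproduced in this text export. Assuming it follows the AlphaZero-style combined objective described in the recap above (cross-entropy for the policy head, MSE for the value head), it would read:

$$\ell(\theta) = \underbrace{\left(z - v_\theta(b)\right)^2}_{\text{value: MSE}} \;-\; \underbrace{\pi_t^\top \log p_\theta(b)}_{\text{policy: cross-entropy}} \;+\; \lambda \lVert \theta \rVert^2$$

where \(z\) is the observed return, \(\pi_t\) the MCTS visit distribution at belief \(b\), and \(\lambda \lVert \theta \rVert^2\) the usual AlphaZero weight-regularization term.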
Experiments
Problem 1: LaserTag (5) and LaserTag (10)
Results
Limitations?
1. Belief representation and input to the network.
2. Can't use this for continuous action problems?
   - \(\pi_t\) is the policy distribution from MCTS (discrete; entries are probability values).
   - \(p_t\) is the policy distribution from the trained network (for continuous actions, its outputs are density values).
   - A cross-entropy loss between probabilities and densities is not well-defined, so the policy loss breaks down.
   - Our fix: use MGM instead, or use the Wasserstein distance, or use Dirac delta distribution math to justify sampling?
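As a concrete illustration of the Wasserstein suggestion above: the distance is well-defined between a discrete distribution and samples from a continuous one, unlike cross-entropy. The action values, probabilities, and Gaussian policies below are hypothetical numbers, and this uses SciPy's 1-D implementation:

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)

# Discrete MCTS policy over three action values (hypothetical numbers):
# visit distribution [0.7, 0.2, 0.1] over actions -1.0, 0.0, 1.0.
mcts_actions = np.array([-1.0, 0.0, 1.0])
mcts_probs = np.array([0.7, 0.2, 0.1])

# Samples from two candidate continuous policies (e.g. Gaussian heads):
good_fit = rng.normal(-0.6, 0.4, size=5000)   # mass near the MCTS mean
bad_fit = rng.normal(2.0, 0.4, size=5000)     # mass far from it

d_good = wasserstein_distance(mcts_actions, good_fit, u_weights=mcts_probs)
d_bad = wasserstein_distance(mcts_actions, bad_fit, u_weights=mcts_probs)
print(d_good < d_bad)  # True: the better-matched policy scores closer
```

This is only a sketch of why the metric is attractive for the continuous-action case; turning it into a differentiable training loss is the open question the slide raises.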
THE END
BetaZero Paper Presentation - CAIRO Lab Meeting
By Himanshu Gupta