(Essentially AlphaZero but the states are now beliefs)
Presented By: Himanshu Gupta
Date: 1/10/2024
Authors: Robert J. Moss, Anthony Corso, Jef Caers, Mykel J. Kochenderfer
#1 - They don't work well for problems with large or continuous action spaces.
#2 - They need good rollouts to guide the tree search when planning time is limited.
#3 - Building the tree gets more difficult as the problem gets more complex.
This is the idea behind AlphaZero
Step 1: Policy Evaluation
1) Perform tree search.
2) At every node, sample actions from your learned policy network.
3) Replace rollouts with value estimates from your learned value network.
Step 2: Policy Improvement
1) Compute the cross-entropy loss for the policy network and the MSE loss for the value network.
2) Do backpropagation and update the weights.
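A minimal sketch of the policy-improvement step, assuming a PyTorch-style two-headed network; the function, network, and batch names are placeholders for illustration, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def policy_improvement_step(network, optimizer, batch):
    """One gradient update on data collected during policy evaluation.

    batch is assumed to hold:
      beliefs  - belief-state inputs, shape (B, belief_dim)
      mcts_pi  - MCTS policy targets (e.g. visit counts), shape (B, num_actions)
      returns  - observed returns used as value targets, shape (B,)
    """
    beliefs, mcts_pi, returns = batch
    policy_logits, value = network(beliefs)  # two heads: policy logits, scalar value

    # Cross-entropy between the MCTS policy target and the network policy.
    policy_loss = -(mcts_pi * F.log_softmax(policy_logits, dim=-1)).sum(dim=-1).mean()

    # MSE between the observed return and the predicted value.
    value_loss = F.mse_loss(value.squeeze(-1), returns)

    loss = policy_loss + value_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```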
Basically AlphaZero for POMDPs
MDP
Belief MDP aka POMDP
For discrete-state problems, b is just a vector of size |S|, where each entry is the probability of the environment being in the corresponding state.
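As a concrete illustration of the discrete case, here is a standard Bayes-filter belief update over a finite state space; the transition and observation model arrays are generic placeholders.

```python
import numpy as np

def update_belief(b, a, o, T, Z):
    """Discrete Bayes-filter belief update.

    b : belief vector of size |S|, b[s] = P(state = s)
    T : transition model, T[a][s, s'] = P(s' | s, a)
    Z : observation model, Z[a][s', o] = P(o | s', a)
    """
    predicted = T[a].T @ b            # predict: P(s') = sum_s P(s'|s,a) b(s)
    updated = Z[a][:, o] * predicted  # correct: weight by observation likelihood
    return updated / updated.sum()    # normalize back to a probability vector
```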
MDP
Belief MDP aka POMDP
For continuous-state problems, b can be represented using either a Gaussian distribution or a particle filter.
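For the continuous case, a minimal bootstrap particle-filter update; the generative models step_fn and obs_likelihood are assumed placeholders for the problem's dynamics and observation model.

```python
import numpy as np

def particle_filter_update(particles, a, o, step_fn, obs_likelihood, rng):
    """Bootstrap particle filter: propagate, weight, resample.

    particles      : array of shape (N, state_dim) representing the belief
    step_fn        : s' = step_fn(s, a, rng), the generative transition model
    obs_likelihood : obs_likelihood(o, s', a) = p(o | s', a)
    """
    propagated = np.array([step_fn(s, a, rng) for s in particles])
    weights = np.array([obs_likelihood(o, s, a) for s in propagated])
    weights /= weights.sum()
    idx = rng.choice(len(propagated), size=len(propagated), p=weights)
    return propagated[idx]  # equally weighted particles after resampling
```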
$$V^*(s) = \underset{a\in\mathcal{A}}{\max} \left\{R(s, a) + \gamma E\Big[V^*\left(s_{t+1}\right) \mid s_t=s, a_t=a\Big]\right\}$$
MDP
Belief MDP aka POMDP
$$V^*(b) = \underset{a\in\mathcal{A}}{\max} \left\{R(b, a) + \gamma E\Big[V^*\left(b_{t+1}\right) \mid b_t=b, a_t=a\Big]\right\}$$
Value function for MDPs
Value function for POMDPs
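To connect the two columns: the belief-MDP quantities expand in terms of the underlying POMDP (these are the standard belief-MDP identities; the Update(b, a, o) notation for the Bayesian belief update is introduced here for illustration):

$$R(b, a) = \sum_{s\in\mathcal{S}} b(s)\,R(s, a), \qquad E\Big[V^*\left(b_{t+1}\right) \mid b_t=b, a_t=a\Big] = \sum_{o\in\mathcal{O}} P(o \mid b, a)\, V^*\big(\mathrm{Update}(b, a, o)\big)$$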
Loss Function:
Problem 1: LaserTag (5) and LaserTag (10)
1. Belief representation and how it is input to the network.
2. Can it be used for continuous-action problems?
\(\pi_t\) is the policy distribution from MCTS
\(p_t\) is the policy distribution from the trained network
(For discrete actions, these are probability values)
(For continuous actions, these are density values)
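A sketch of the combined objective in the AlphaZero style described earlier, using the cross-entropy and MSE terms from the training step; the notation \(z_t\) for the observed return and \(v_t\) for the network's value estimate is assumed here:

$$\mathcal{L} = \left(z_t - v_t\right)^2 - \pi_t^\top \log p_t$$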
Our Fix: Use MGM instead
Our Fix: Use the Wasserstein distance, or use Dirac delta distribution math to justify sampling?