Challenges:
\[\nabla U(\theta) = E_\tau \left[\sum_{k=0}^d \nabla_\theta \log \pi_\theta (a_k \mid s_k) \gamma^{k} (r_{k,\text{to-go}}-r_\text{base}(s_k)) \right]\]
Advantage Function: \(A(s, a) = Q(s, a) - V(s)\)
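As a concrete illustration (not from the slides), a minimal PyTorch sketch of the baselined reward-to-go estimator above for one sampled trajectory; `pg_loss`, its arguments, and the tensor shapes are assumed names:

```python
import torch

def pg_loss(log_probs, rewards, baselines, gamma=0.99):
    """log_probs: log pi_theta(a_k | s_k); rewards, baselines: 1-D tensors of the same length."""
    d = len(rewards)
    to_go = torch.zeros(d)                      # discounted reward-to-go at each step k
    running = 0.0
    for k in reversed(range(d)):
        running = rewards[k] + gamma * running
        to_go[k] = running
    weights = to_go - baselines                 # (r_to_go - r_base(s_k))
    discounts = gamma ** torch.arange(d, dtype=torch.float32)
    # minimizing this surrogate ascends the gradient estimator above
    return -(log_probs * discounts * weights.detach()).sum()
```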
Can we combine value-based and policy-based methods?
Alternate between training Actor and Critic
Problem: Instability
Which should we learn? \(A\), \(Q\), or \(V\)?
\[\nabla U(\theta) = E_\tau \left[\sum_{k=0}^d \nabla_\theta \log \pi_\theta (a_k \mid s_k) \gamma^{k} \left(r_k + \gamma V_\phi (s_{k+1}) - V_\phi (s_k)\right) \right]\]
\(l(\phi) = E\left[\left(V_\phi(s) - V^{\pi_\theta}(s)\right)^2\right]\)
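A minimal sketch of one alternating Actor/Critic update built from the two expressions above, assuming `actor(s)` returns a `torch.distributions` object and `critic(s)` returns \(V_\phi(s)\); all module and optimizer names are placeholders, not from the slides:

```python
import torch

def actor_critic_step(actor, critic, actor_opt, critic_opt,
                      s, a, r, s_next, gamma=0.99):
    v = critic(s)                                     # V_phi(s_k)
    with torch.no_grad():
        v_next = critic(s_next)                       # V_phi(s_{k+1})
        advantage = r + gamma * v_next - v.detach()   # one-step TD-error advantage

    # Critic step: regress V_phi(s_k) toward the TD target r_k + gamma * V_phi(s_{k+1})
    critic_loss = ((v - (r + gamma * v_next)) ** 2).mean()
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor step: policy gradient weighted by the (detached) advantage estimate
    actor_loss = -(actor(s).log_prob(a) * advantage).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
```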
\(A(s_k, a_k) \approx r_k + \gamma V_\phi (s_{k+1}) - V_\phi (s_k)\)
\(A(s_k, a_k) \approx \sum_{t=k}^\infty \gamma^{t-k} r_t - V_\phi (s_k)\)
\(A(s_k, a_k) \approx \sum_{t=k}^{d-1} \gamma^{t-k} r_t + \gamma^{d-k} \left(r_d + \gamma V_\phi (s_{d+1})\right) - V_\phi (s_k)\)
let \(\delta_t = r_t + \gamma V_\phi (s_{t+1}) - V_\phi (s_t)\)
\[A_\text{GAE}(s_k, a_k) \approx \sum_{t=k}^\infty (\gamma \lambda)^{t-k} \delta_t\]
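A short sketch of computing \(A_\text{GAE}\) for one trajectory by accumulating the \(\delta_t\) backward in time; the function and array names are illustrative:

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """rewards: length-T array; values: length-(T+1) array of V_phi estimates."""
    T = len(rewards)
    adv = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]   # delta_t
        running = delta + gamma * lam * running                  # A_t = delta_t + gamma*lambda*A_{t+1}
        adv[t] = running
    return adv
```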
https://www.youtube.com/watch?v=tlOIHko8ySg
"As a general rule, it is better to design performance measures according to what one actually wants in the environment, rather than according to how one thinks the agent should behave." - Stuart Russell
Reward
Value
\(B(s, a) \approx \frac{1}{\sqrt{\hat{N}(s)}}\) where \(\hat{N}(s)\) is a learned function approximation
Bellemare, et al. 2016 "Unifying Count-Based Exploration..."
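As a simplified stand-in (not the learned pseudo-count model of Bellemare et al.), a sketch that approximates \(\hat{N}(s)\) with an exact count over coarsely discretized states; the hashing scheme and class name are assumptions:

```python
from collections import defaultdict
import numpy as np

class CountBonus:
    """Tabular stand-in for the learned pseudo-count N_hat(s)."""
    def __init__(self, scale=1.0, precision=1):
        self.counts = defaultdict(int)
        self.scale = scale
        self.precision = precision

    def __call__(self, state):
        key = tuple(np.round(np.asarray(state, dtype=float), self.precision))  # coarse state hashing
        self.counts[key] += 1
        return self.scale / np.sqrt(self.counts[key])   # B(s) ~ 1 / sqrt(N_hat(s))
```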
\(B(s, a) = \lVert \hat{f}_\theta (s, a) - f^*(s, a) \rVert^2\)
What should \(f^*\) be?
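One common answer is a fixed, randomly initialized network (Random Network Distillation, Burda et al. 2018): the predictor \(\hat{f}_\theta\) is trained to match it, and the squared prediction error is the bonus. A minimal sketch over a generic input \(x\); layer sizes and names are placeholders:

```python
import torch
import torch.nn as nn

class RNDBonus(nn.Module):
    def __init__(self, in_dim, feat_dim=64):
        super().__init__()
        # f*: fixed, randomly initialized target network
        self.target = nn.Sequential(nn.Linear(in_dim, feat_dim), nn.ReLU(),
                                    nn.Linear(feat_dim, feat_dim))
        # f_hat: predictor trained to imitate f*
        self.predictor = nn.Sequential(nn.Linear(in_dim, feat_dim), nn.ReLU(),
                                       nn.Linear(feat_dim, feat_dim))
        for p in self.target.parameters():
            p.requires_grad_(False)

    def forward(self, x):
        # B = || f_hat(x) - f*(x) ||^2; the same quantity is the predictor's training loss
        return ((self.predictor(x) - self.target(x)) ** 2).sum(dim=-1)
```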
"First return, then explore"
(Uber AI Labs)
\[U(\pi) = E \left[\sum_{t=0}^\infty \gamma^t \left(r_t + \alpha \mathcal{H}(\pi(\cdot \mid s_t))\right)\right]\]
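A minimal sketch of folding the entropy term into the per-step reward before computing returns, assuming the policy yields a `torch.distributions` object at each state; `alpha` is just a placeholder temperature:

```python
import torch

def entropy_augmented_rewards(rewards, dists, alpha=0.2):
    """rewards: list of floats r_t; dists: policy distribution pi(.|s_t) at each step."""
    # r_t + alpha * H(pi(.|s_t)), as in the objective above
    return [r + alpha * d.entropy().item() for r, d in zip(rewards, dists)]
```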
Advantages:
Disadvantages:
Why not?
"simply multiplying the rewards generated from an environment by some scalar"
"Unfortunately, in recent reported results, it is not uncommon for the top-N trials to be selected from among several trials (Wu et al. 2017; Mnih et al. 2016)"
(According to Sergey Levine)
Model-Based RL
Model-Based Deep RL
Off-Policy Q-Learning
Actor Critic
On-Policy Policy Gradient
Evolutionary/Gradient Free
(Most people use SAC or PPO)