DQN and Advanced Policy Gradient

Map

Challenges:

Exploration vs Exploitation
Credit Assignment
Generalization

Part I

DQN

Q-Learning with Neural Networks

Q-Learning:

\(Q(s, a) \gets Q(s, a) + \alpha\, \left(r + \gamma \max_{a'} Q(s', a') - Q(s, a)\right)\)
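
As a concrete reference point, the tabular update above might be sketched as follows. The table shape, the function name `q_learning_update`, and the numbers in the usage lines are illustrative assumptions, not part of the lecture.

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.95):
    """Apply one tabular Q-learning update in place and return the TD error."""
    td_error = r + gamma * np.max(Q[s_next]) - Q[s, a]
    Q[s, a] += alpha * td_error
    return td_error

# Hypothetical problem with 3 states and 2 actions.
Q = np.zeros((3, 2))
Q[2] = [1.0, 4.0]
delta = q_learning_update(Q, s=0, a=1, r=1.0, s_next=2, alpha=0.5, gamma=0.9)
```

Here the TD error is 1 + 0.9 · 4 − 0, and the entry `Q[0, 1]` moves half of that distance because `alpha=0.5`.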

Neural Networks

\[\theta^* = \operatorname*{argmin}_\theta \sum_{(x,y) \in \mathcal{D}} l(f_\theta(x), y)\]

\[\theta \gets \theta - \alpha \, \nabla_\theta\, l (f_\theta(x), y)\]

Deep Q learning:

Approximate \(Q\) with \(Q_\theta\)
What should \((x, y)\) be?
What should \(l\) be?

Candidate Algorithm:

loop

\(a \gets \text{argmax}_{a} \, Q_\theta(s, a) \, \text{w.p.} \, 1-\epsilon, \quad \text{rand}(A) \, \text{o.w.}\)

\(r \gets \text{act!}(\text{env}, a)\)

\(s' \gets \text{observe}(\text{env})\)

\(\theta \gets \theta - \alpha\, \nabla_\theta \left( r + \gamma \max_{a'} Q_\theta (s', a') - Q_\theta (s, a)\right)^2\)

\(s \gets s'\)
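
One step of the candidate algorithm might look like the sketch below. It assumes a linear \(Q_\theta\) with one weight row per action (a stand-in for a neural network) and takes the gradient of the squared TD error exactly as written, including through the \(\max_{a'}\) term. The helper names and the feature vectors are hypothetical.

```python
import numpy as np

def q_values(theta, x):
    """Linear Q function: Q_theta(s, .) = theta @ x, one weight row per action."""
    return theta @ x

def td_error(theta, x, a, r, x_next, gamma):
    return r + gamma * np.max(q_values(theta, x_next)) - q_values(theta, x)[a]

def epsilon_greedy(theta, x, epsilon, rng):
    """Greedy action w.p. 1 - epsilon, uniformly random action otherwise."""
    if rng.random() < epsilon:
        return int(rng.integers(theta.shape[0]))
    return int(np.argmax(q_values(theta, x)))

def naive_update(theta, x, a, r, x_next, alpha, gamma):
    """One gradient step on the squared TD error, differentiating the target too."""
    delta = td_error(theta, x, a, r, x_next, gamma)
    a_star = int(np.argmax(q_values(theta, x_next)))
    grad = np.zeros_like(theta)
    grad[a_star] += 2.0 * delta * gamma * x_next  # through max_a' Q_theta(s', a')
    grad[a] -= 2.0 * delta * x                    # through Q_theta(s, a)
    return theta - alpha * grad

rng = np.random.default_rng(0)
theta = np.zeros((2, 3))  # hypothetical: 2 actions, 3 features
x, x_next = np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])
a = epsilon_greedy(theta, x, epsilon=0.1, rng=rng)
before = td_error(theta, x, a, 1.0, x_next, 0.9) ** 2
theta = naive_update(theta, x, a, 1.0, x_next, alpha=0.1, gamma=0.9)
after = td_error(theta, x, a, 1.0, x_next, 0.9) ** 2
```

Each update uses a single transition taken in the order it was experienced, and the target changes whenever \(\theta\) changes.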

DQN: The Atari Benchmark

DQN: Problems with Naive Approach

Problems:

Samples Highly Correlated
Size-1 batches
Moving target

DQN

Q Network Structure:

Experience Tuple: \((s, a, r, s')\)

Loss:

\[l(s, a, r, s') = \left(r+\gamma \max_{a'} Q_{\theta'}(s', a') - Q_\theta (s, a)\right)^2\]
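
A minimal sketch of how this loss could be evaluated on a minibatch drawn from a replay buffer, with a separate target parameter set \(\theta'\). The linear Q function, the `ReplayBuffer` class, and the batch averaging are assumptions made for illustration; terminal states are not handled.

```python
import random
from collections import deque
import numpy as np

class ReplayBuffer:
    """Fixed-capacity store of (s, a, r, s') tuples; the oldest are dropped first."""
    def __init__(self, capacity):
        self.data = deque(maxlen=capacity)

    def push(self, s, a, r, s_next):
        self.data.append((s, a, r, s_next))

    def sample(self, batch_size, rng):
        batch = rng.sample(list(self.data), batch_size)
        s, a, r, s_next = zip(*batch)
        return np.array(s), np.array(a), np.array(r), np.array(s_next)

def dqn_loss(theta, theta_target, batch, gamma):
    """Mean squared TD error; the target uses theta_target, not theta."""
    s, a, r, s_next = batch
    q_sa = (s @ theta.T)[np.arange(len(a)), a]
    target = r + gamma * np.max(s_next @ theta_target.T, axis=1)
    return float(np.mean((target - q_sa) ** 2))

rng = random.Random(0)
buffer = ReplayBuffer(capacity=100)
for i in range(10):  # hypothetical one-hot states, 2 actions, reward 1
    buffer.push(np.eye(3)[i % 3], i % 2, 1.0, np.eye(3)[(i + 1) % 3])
theta = np.zeros((2, 3))
theta_target = np.ones((2, 3))
loss = dqn_loss(theta, theta_target, buffer.sample(4, rng), gamma=0.9)
```

Random minibatches break up the correlation between consecutive samples. Copying `theta` into `theta_target` only occasionally keeps the regression target fixed between copies.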

https://www.youtube.com/watch?v=SuZVyOlgVek

Rainbow

Double Q Learning
Prioritized Replay (priority proportional to last TD error)
Dueling networks: value network + advantage network, \(Q(s, a) = V(s) + A(s, a)\)
Multi-step learning: \(\left(r_t + \gamma r_{t+1} + \ldots + \gamma^{n-1} r_{t+n-1} + \gamma^n \max_{a'} Q_\theta(s_{t+n}, a') - Q_\theta(s_t, a_t)\right)^2\)
Distributional RL: predict the entire distribution of returns instead of only its mean \(Q\)
Noisy Nets
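
Two of these components reduce to small changes in how the regression target is computed. The sketch below assumes the common Double DQN form (select the action with the online network, evaluate it with the target network) and an n-step target whose bootstrap term is discounted by \(\gamma^n\). Function names and inputs are hypothetical.

```python
import numpy as np

def n_step_target(rewards, q_next, gamma):
    """r_t + gamma r_{t+1} + ... + gamma^(n-1) r_{t+n-1} + gamma^n max_a' Q(s_{t+n}, a')."""
    n = len(rewards)
    discounts = gamma ** np.arange(n)
    return float(discounts @ np.asarray(rewards, dtype=float) + gamma ** n * np.max(q_next))

def double_q_target(r, q_online_next, q_target_next, gamma):
    """Pick the action with the online network, evaluate it with the target network."""
    a_star = int(np.argmax(q_online_next))
    return r + gamma * q_target_next[a_star]

y_n = n_step_target([1.0, 1.0, 1.0], q_next=np.array([0.0, 2.0]), gamma=0.5)
y_double = double_q_target(1.0, np.array([3.0, 1.0]), np.array([0.5, 10.0]), gamma=0.9)
```

In the second call, the target network rates action 1 at 10, but the online network selects action 0, so the target uses 0.5. This decoupling is what reduces overestimation.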

Actual Learning Curves

Paper: For SALE: State-Action Representation Learning for Deep Reinforcement Learning

Part II

Improved Policy Gradients

Restricted Gradient Update

\[\widehat{\nabla U}(\theta) = \sum_{k=0}^d \nabla_\theta \log \pi_\theta (a_k \mid s_k) \gamma^{k} (r_{k,\text{to-go}}-r_\text{base}(s_k)) \]

\[\theta' = \theta + \alpha \widehat{\nabla U}(\theta)\]
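
For a single trajectory, the estimate above could be computed as in this sketch. It assumes a tabular softmax policy with logits `theta[s]`, for which \(\nabla \log \pi_\theta(a \mid s)\) is the one-hot vector of \(a\) minus \(\pi_\theta(\cdot \mid s)\), and a per-state baseline array. All names and numbers are illustrative.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def rewards_to_go(rewards, gamma):
    """r_{k,to-go} = sum over t >= k of gamma^(t-k) r_t, computed backward."""
    out = np.zeros(len(rewards))
    running = 0.0
    for k in reversed(range(len(rewards))):
        running = rewards[k] + gamma * running
        out[k] = running
    return out

def policy_gradient_estimate(theta, states, actions, rewards, gamma, baseline):
    """Single-trajectory estimate of grad U(theta) for a tabular softmax policy."""
    grad = np.zeros_like(theta)
    rtg = rewards_to_go(rewards, gamma)
    for k, (s, a) in enumerate(zip(states, actions)):
        grad_log = -softmax(theta[s])  # gradient of log pi w.r.t. the logits of state s
        grad_log[a] += 1.0
        grad[s] += grad_log * gamma ** k * (rtg[k] - baseline[s])
    return grad

theta = np.zeros((2, 2))  # hypothetical: 2 states, 2 actions
grad = policy_gradient_estimate(theta, states=[0, 1], actions=[1, 0],
                                rewards=[0.0, 1.0], gamma=0.5, baseline=np.zeros(2))
theta_new = theta + 0.1 * grad  # theta' = theta + alpha * estimate
```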

Linearize the objective around \(\theta\):

\[U(\theta') \approx U(\theta) + \nabla U(\theta)^\top (\theta' - \theta)\]

Restrict the step:

\[\underset{\theta'}{\text{maximize}} \quad U(\theta) + \nabla U(\theta)^\top (\theta' - \theta)\]

\[\text{subject to} \quad g(\theta, \theta') \leq \epsilon\]

\[g(\theta, \theta') = \frac{1}{2}\lVert\theta'-\theta\rVert^2_2 = \frac{1}{2}(\theta' - \theta)^\top (\theta' - \theta)\]

The maximizer moves along \(\mathbf{u} = \nabla U(\theta)\), estimated by \(\widehat{\nabla U}(\theta)\), with \(\alpha\) set so that \(g(\theta, \theta') = \epsilon\):

\[\theta' = \theta + \alpha \widehat{\nabla U}(\theta)\]
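
Taking \(g\) to be half the squared Euclidean distance, a linear objective over the ball \(g \leq \epsilon\) is maximized on the boundary in the direction of the gradient, which gives \(\alpha = \sqrt{2\epsilon} / \lVert \mathbf{u} \rVert_2\). A small sketch under that assumption, with a hypothetical function name and made-up numbers:

```python
import numpy as np

def restricted_step(theta, u, epsilon):
    """Maximize the linearized objective subject to 0.5 * ||theta' - theta||^2 <= epsilon."""
    norm = np.linalg.norm(u)
    if norm == 0.0:
        return theta.copy()  # zero gradient: no direction improves the linear model
    return theta + np.sqrt(2.0 * epsilon) * u / norm

theta = np.array([1.0, -2.0])
u = np.array([3.0, 4.0])  # stand-in for the gradient estimate
theta_new = restricted_step(theta, u, epsilon=0.5)
g = 0.5 * np.sum((theta_new - theta) ** 2)
```

The step length depends only on \(\epsilon\), not on the magnitude of the gradient estimate, so a noisy, very large gradient cannot produce a very large parameter change.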

Natural Gradient

TRPO and PPO

TRPO = Trust Region Policy Optimization

(Natural gradient + line search)

PPO = Proximal Policy Optimization

(Use clamped surrogate objective to remove the need for line search)
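
The clamped surrogate can be sketched as below, following the form in the PPO paper: the probability ratio between new and old policy is clamped to \([1 - c, 1 + c]\) and the more pessimistic of the clamped and unclamped terms is kept. The argument names and the clamp width `clip=0.2` are assumptions for illustration.

```python
import numpy as np

def clipped_surrogate(log_prob_new, log_prob_old, advantages, clip=0.2):
    """Mean of min(ratio * A, clamp(ratio, 1 - clip, 1 + clip) * A) over samples."""
    ratio = np.exp(log_prob_new - log_prob_old)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip, 1.0 + clip) * advantages
    return float(np.mean(np.minimum(unclipped, clipped)))

# Two hypothetical samples with probability ratios 1.5 and 0.5.
log_new = np.log(np.array([1.5, 0.5]))
log_old = np.zeros(2)
objective = clipped_surrogate(log_new, log_old, advantages=np.array([1.0, -1.0]))
```

For the first sample the gain is capped at 1.2 instead of 1.5. For the second, the clamped term −0.8 is kept over −0.5. Neither sample gives the optimizer an incentive to push the ratio further from 1.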

https://openai.com/blog/openai-baselines-ppo/

Part III

Actor-Critic

\[\nabla U(\theta) = E_\tau \left[\sum_{k=0}^d \nabla_\theta \log \pi_\theta (a_k \mid s_k) \gamma^{k} (r_{k,\text{to-go}}-r_\text{base}(s_k)) \right]\]

Advantage Function: \(A(s, a) = Q(s, a) - V(s)\)

Actor: \(\pi_\theta\)
Critic: \(Q_\phi\) and/or \(A_\phi\) and/or \(V_\phi\)

Can we combine value-based and policy-based methods?

Alternate between training Actor and Critic

Problem: Instability

Actor-Critic

Which should we learn? \(A\), \(Q\), or \(V\)?

\[\nabla U(\theta) = E_\tau \left[\sum_{k=0}^d \nabla_\theta \log \pi_\theta (a_k \mid s_k) \gamma^{k} \left(r_k + \gamma V_\phi (s_{k+1}) - V_\phi (s_k)\right) \right]\]

\(l(\phi) = E\left[\left(V_\phi(s) - V^{\pi_\theta}(s)\right)^2\right]\)
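
Since \(V^{\pi_\theta}\) is not known, one common choice is to regress \(V_\phi\) onto sampled discounted returns. The sketch below assumes that choice, together with a linear critic \(V_\phi(s) = \phi^\top x(s)\). The feature matrix and returns are made up.

```python
import numpy as np

def critic_loss(phi, features, returns):
    """Sample version of l(phi), with sampled returns standing in for V^pi(s)."""
    return float(np.mean((features @ phi - returns) ** 2))

def critic_gradient_step(phi, features, returns, lr):
    """One gradient descent step on the critic loss for a linear critic."""
    residual = features @ phi - returns
    grad = 2.0 * features.T @ residual / len(returns)
    return phi - lr * grad

features = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])  # hypothetical state features
returns = np.array([1.0, 2.0, 3.0])                        # hypothetical sampled returns
phi = np.zeros(2)
loss_before = critic_loss(phi, features, returns)
phi = critic_gradient_step(phi, features, returns, lr=0.1)
loss_after = critic_loss(phi, features, returns)
```

In an actor-critic loop, steps like this would alternate with policy gradient steps that use the critic's TD residual as the advantage estimate.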

Generalized Advantage Estimation

\(A(s_k, a_k) \approx r_k + \gamma V_\phi (s_{k+1}) - V_\phi (s_k)\)

\(A(s_k, a_k) \approx \sum_{t=k}^\infty \gamma^{t-k} r_t - V_\phi (s_k) \)

\(A(s_k, a_k) \approx r_k + \gamma r_{k+1} + \ldots + \gamma^d r_{k+d} + \gamma^{d+1} V_\phi (s_{k+d+1}) - V_\phi (s_k)\)

let \(\delta_t = r_t + \gamma V_\phi (s_{t+1}) - V_\phi (s_t)\)

\[A_\text{GAE}(s_k, a_k) \approx \sum_{t=k}^\infty (\gamma \lambda)^{t-k} \delta_t\]
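
On a finite trajectory the sum can be accumulated backward with the recursion \(A_k = \delta_k + \gamma \lambda A_{k+1}\). The sketch below assumes the sum is truncated at the end of the trajectory and that `values` carries one extra entry for the final state (zero if terminal). Names and numbers are illustrative.

```python
import numpy as np

def gae(rewards, values, gamma, lam):
    """Generalized advantage estimates; values has len(rewards) + 1 entries."""
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    deltas = rewards + gamma * values[1:] - values[:-1]  # TD residuals delta_t
    adv = np.zeros_like(deltas)
    running = 0.0
    for t in reversed(range(len(deltas))):
        running = deltas[t] + gamma * lam * running
        adv[t] = running
    return adv

rewards = [1.0, 1.0, 1.0]
values = [0.5, 0.5, 0.5, 0.0]  # last entry: value of the terminal state
adv_td = gae(rewards, values, gamma=0.9, lam=0.0)  # one-step TD residuals
adv_mc = gae(rewards, values, gamma=0.9, lam=1.0)  # discounted return minus V
```

Setting \(\lambda = 0\) recovers the one-step estimate and \(\lambda = 1\) recovers the return-minus-baseline estimate. Values in between trade bias for variance.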

Recap

AlphaZero: Actor-Critic with MCTS

Use \(\pi_\theta\) and \(U_\phi\) in MCTS
Learn \(\pi_\theta\) and \(U_\phi\) from tree

140 DQN and Advanced Policy Gradient

By Zachary Sunberg
