Policy Gradient

Last Time

 

  • Bandits

Guiding Questions

 

  • What is Policy Optimization?
  • What is Policy Gradient?
  • What tricks are needed for it to work effectively?

Map

Challenges in RL

  • Exploration and Exploitation
  • Credit Assignment
  • Generalization

Policy Optimization

\[\underset{\theta}{\text{maximize}} \quad U(\pi_\theta) = U(\theta)\]

trajectory:

\(\tau = (s_0, a_0, r_0, s_1, a_1, r_1, \ldots s_d, a_d, r_d)\)

\[\underset{\pi}{\text{maximize}} \underset{s \sim b}{E} \left[\sum_{t=0}^\infty \gamma^t R(s_t, a_t) \mid s_0 = s, a_t = \pi(s_t) \right] \]

\[\underset{\pi}{\text{maximize}}\, U(\pi) = \underset{s \sim b}{E} \left[ U^\pi (s) \right]\]

Two approximations:

1. Parameterized stochastic policies: \(a \sim \pi_\theta(a \mid s)\)

2. Monte Carlo estimation of the utility \(U(\theta)\) (sketched below)
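
A minimal sketch of the Monte Carlo utility estimate implied by the second approximation (the sample count \(m\) is illustrative, not from the slides):

\[U(\theta) \approx \frac{1}{m} \sum_{i=1}^{m} R\!\left(\tau^{(i)}\right), \qquad \tau^{(i)} \sim p_\theta(\tau), \qquad R(\tau) = \sum_{k=0}^{d} \gamma^k r_k\]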

Two classes of optimization algorithms: 

1. Zeroth order (use only \(U(\theta)\))

2. First order (use \(U(\theta)\) and \(\nabla_\theta U(\theta)\))

1. Zeroth-Order Optimization

Common zeroth-order approaches:

  1. Genetic Algorithms
  2. Pattern Search
  3. Cross-Entropy

Cross Entropy:

Initialize \(d\)

loop:

    population \(\gets\) sample(\(d\))

    elite \(\gets\) the \(m\) samples with the highest \(U(\theta)\)

    \(d\) \(\gets\) fit(elite)
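
A minimal Python sketch of this loop, assuming a Gaussian search distribution for \(d\) and a user-supplied utility function `U`; the names `U`, `n_pop`, and `n_elite` are illustrative, not from the slides:

```python
import numpy as np

def cross_entropy_search(U, mu, sigma, n_pop=100, n_elite=10, n_iters=50):
    """Zeroth-order policy search: refit a Gaussian to the elite samples each iteration."""
    for _ in range(n_iters):
        # population <- sample(d)
        population = np.random.normal(mu, sigma, size=(n_pop, len(mu)))
        # elite <- the n_elite samples with the highest U(theta)
        utilities = np.array([U(theta) for theta in population])
        elite = population[np.argsort(utilities)[-n_elite:]]
        # d <- fit(elite)
        mu, sigma = elite.mean(axis=0), elite.std(axis=0) + 1e-6
    return mu

# Example: maximize a simple quadratic utility over a 3-dimensional theta
best = cross_entropy_search(lambda th: -np.sum((th - 2.0) ** 2),
                            mu=np.zeros(3), sigma=np.ones(3))
```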

 

2. First-Order Optimization

  • Definition of Gradient
  • Gradient Ascent
  • Stochastic Gradient Ascent
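
For reference, the updates these bullets refer to are the standard gradient ascent and stochastic gradient ascent steps (not spelled out above):

\[\theta \gets \theta + \alpha\, \nabla_\theta U(\theta) \qquad \text{and} \qquad \theta \gets \theta + \alpha\, \widehat{\nabla_\theta U}(\theta),\]

where \(\widehat{\nabla_\theta U}\) is an estimate of the gradient computed from sampled trajectories and \(\alpha\) is the step size.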

Tricks

For policy gradient to work effectively, three tricks are needed:

  • Likelihood Ratio/Log Derivative
  • Reward to go
  • Baseline Subtraction

Log Derivative

\[U(\theta) = \text{E}[R(\tau)]\]

\[= \int p_\theta(\tau) R(\tau)\, d\tau\]

\[\nabla U(\theta) = \nabla_\theta \int p_\theta(\tau) R(\tau)\, d\tau\]

\[= \int \nabla_\theta \, p_\theta(\tau) R(\tau) \, d\tau\]

\[= \int p_\theta(\tau) \nabla_\theta \log p_\theta (\tau) R(\tau)\, d\tau\]

\[\nabla_\theta\, \log p_\theta (\tau) = \nabla_\theta \, p_\theta(\tau) / p_\theta (\tau)\]

\[\therefore \quad \nabla_\theta \, p_\theta(\tau) = p_\theta (\tau)\, \nabla_\theta\, \log p_\theta (\tau)\]

\[= \text{E}\left[ \nabla_\theta \log p_\theta (\tau) R(\tau) \right]\]

Trajectory Probability Gradient

\[\nabla_\theta \log p_\theta (\tau)\]

\[p_\theta (\tau) = p(s_0)\, \prod_{k=0}^{d-1} T(s_{k+1} \mid s_k, a_k) \, \prod_{k=0}^{d} \pi_\theta(a_k \mid s_k) \]

\(\tau = (s_0, a_0, r_0, s_1, a_1, r_1, \ldots s_d, a_d, r_d)\)

\[\log p_\theta (\tau)\]

\[= \log p(s_0) + \sum_{k=0}^{d-1} \log T(s_{k+1} \mid s_k, a_k) + \sum_{k=0}^{d} \log \pi_\theta(a_k \mid s_k) \]

\[\nabla_\theta \log p_\theta (\tau)\]

\[= \sum_{k=0}^d \nabla_\theta \log \pi_\theta(a_k \mid s_k) \]

\[\nabla U(\theta) = \text{E} \left[ \sum_{k=0}^d \nabla_\theta \log \pi_\theta(a_k \mid s_k) R(\tau) \right]\]

Example

\[\pi_\theta (a=L \mid s=1) = \text{clamp}(\theta, 0, 1)\]

\[\pi_\theta (a=R \mid s=1) = \text{clamp}(1-\theta, 0, 1)\]

Given \(\theta = 0.2\) calculate \(\sum_{k=0}^d \nabla_\theta \log \pi_\theta(a_k \mid s_k) R(\tau) \) for two cases, (a) where \(a_0 = L\) and (b) where \(a_0 = R\)
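
As a sanity check (assuming \(0 < \theta < 1\) so the clamp is inactive, and leaving \(R(\tau)\) symbolic since the rewards are not specified here), the \(k = 0\) term of the sum is

\[\text{(a)}\quad \nabla_\theta \log \pi_\theta(L \mid 1) = \frac{1}{\theta} = 5 \;\Rightarrow\; 5\, R(\tau), \qquad \text{(b)}\quad \nabla_\theta \log \pi_\theta(R \mid 1) = \frac{-1}{1-\theta} = -1.25 \;\Rightarrow\; -1.25\, R(\tau)\]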

\[\nabla U(\theta) = \text{E} \left[ \sum_{k=0}^d \nabla_\theta \log \pi_\theta(a_k \mid s_k) R(\tau) \right]\]

Policy Gradient

loop

    \(\tau \gets \text{simulate}(\pi_\theta)\)

    \(\theta \gets \theta + \alpha \sum_{k=0}^d \nabla_\theta \log \pi_\theta(a_k \mid s_k) R(\tau) \)
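
A minimal Python sketch of this loop; `simulate` is a hypothetical stand-in for the environment (not specified above) that returns a trajectory as a list of `(s, a, r)` tuples, and `grad_log_pi` is shown for the two-action example policy:

```python
def policy_gradient(simulate, grad_log_pi, theta, alpha=0.01, gamma=0.95, n_iters=1000):
    """theta <- theta + alpha * sum_k grad log pi(a_k | s_k) * R(tau)."""
    for _ in range(n_iters):
        tau = simulate(theta)                                      # tau <- simulate(pi_theta)
        R = sum(gamma**k * r for k, (_, _, r) in enumerate(tau))   # discounted return R(tau)
        score = sum(grad_log_pi(theta, s, a) for (s, a, _) in tau)
        theta = theta + alpha * score * R                          # gradient ascent step
    return theta

# Score-function gradient for the clamp policy above (valid while 0 < theta < 1):
def grad_log_pi(theta, s, a):
    return 1.0 / theta if a == "L" else -1.0 / (1.0 - theta)
```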

Causality

\[\nabla U(\theta) = \text{E} \left[ \sum_{k=0}^d \nabla_\theta \log \pi_\theta(a_k \mid s_k) R(\tau) \right]\]

\[= \text{E} \left[ \left(\sum_{k=0}^d \nabla_\theta \log \pi_\theta(a_k \mid s_k)\right) \left(\sum_{l=0}^d \gamma^l r_l \right)\right]\]

\[= \text{E} \left[ \left(f_0 + \ldots + f_d\right) \left( \gamma^0 r_0 + \ldots + \gamma^d r_d \right)\right], \qquad f_k \equiv \nabla_\theta \log \pi_\theta(a_k \mid s_k)\]

Cross terms pairing \(f_k\) with earlier rewards \(r_l\), \(l < k\), have zero expectation because the action at step \(k\) cannot influence rewards already received, so

\[= \text{E} \left[ \sum_{k=0}^d \nabla_\theta \log \pi_\theta(a_k \mid s_k) \left(\sum_{l=k}^d \gamma^l r_l \right)\right]\]

\[= \text{E} \left[ \sum_{k=0}^d \nabla_\theta \log \pi_\theta(a_k \mid s_k)  \gamma^k r_{k, \text{to-go}} \right]\]
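
A minimal sketch of the reward-to-go computation for a single trajectory, where \(r_{k, \text{to-go}} = \sum_{l=k}^d \gamma^{l-k} r_l\) (function and variable names are illustrative):

```python
def rewards_to_go(rewards, gamma):
    """Backward pass computing r_{k,to-go} = sum_{l=k}^{d} gamma**(l-k) * r_l."""
    rtg = [0.0] * len(rewards)
    running = 0.0
    for k in reversed(range(len(rewards))):
        running = rewards[k] + gamma * running
        rtg[k] = running
    return rtg

# Each score term is then weighted by gamma**k * rtg[k] instead of the full return R(tau).
```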

Baseline Subtraction

\[\nabla U(\theta)= \text{E} \left[ \sum_{k=0}^d \nabla_\theta \log \pi_\theta(a_k \mid s_k)  \gamma^k r_{k, \text{to-go}} \right]\]

\[\nabla U(\theta)= \text{E} \left[ \sum_{k=0}^d \nabla_\theta \log \pi_\theta(a_k \mid s_k)  \gamma^k \left(r_{k, \text{to-go}} - r_\text{base}(s_k) \right) \right]\]
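
Subtracting a baseline that depends only on the state does not bias the estimator; a standard one-line argument (sketched here for a continuous action space) is

\[\underset{a_k \sim \pi_\theta}{\text{E}}\left[ \nabla_\theta \log \pi_\theta(a_k \mid s_k)\, r_\text{base}(s_k) \right] = r_\text{base}(s_k) \int \pi_\theta(a \mid s_k)\, \nabla_\theta \log \pi_\theta(a \mid s_k)\, da = r_\text{base}(s_k)\, \nabla_\theta \int \pi_\theta(a \mid s_k)\, da = 0,\]

since \(\int \pi_\theta(a \mid s_k)\, da = 1\), but it can greatly reduce the variance of the gradient estimate.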

Guiding Questions

 

  • What is Policy Gradient?
  • What tricks are needed for it to work effectively?

110-Policy-Gradient

By Zachary Sunberg