Challenges in RL
\[\underset{\theta}{\text{maximize}} \quad U(\pi_\theta) = U(\theta)\]
Trajectory:
\(\tau = (s_0, a_0, r_0, s_1, a_1, r_1, \ldots s_d, a_d, r_d)\)
\[\underset{\pi}{\text{maximize}} \underset{s \sim b}{E} \left[\sum_{t=0}^\infty \gamma^t R(s_t, a_t) \mid s_0 = s, a_t = \pi(s_t) \right] \]
\[\underset{\pi}{\text{maximize}}\, U(\pi) = \underset{s \sim b}{E} \left[ U^\pi (s) \right]\]
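Here \(U^\pi(s)\) is the expected discounted return when following \(\pi\) from state \(s\), i.e. the bracketed expectation above:
\[U^\pi(s) = \text{E}\left[\sum_{t=0}^\infty \gamma^t R(s_t, a_t) \mid s_0 = s,\ a_t = \pi(s_t)\right]\]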
Two approximations:
1. Parameterized stochastic policies: \(a \sim \pi_\theta(a \mid s)\)
2. Monte Carlo estimates of the utility (see the sketch after this list)
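A minimal sketch of approximation 2, assuming a hypothetical environment interface (`env.reset() -> s`, `env.step(s, a) -> (s', r)`) and a sampler `policy(theta, s)` for \(a \sim \pi_\theta(a \mid s)\); none of these names come from the notes.

```python
import numpy as np

def monte_carlo_utility(env, policy, theta, m=100, d=50, gamma=0.95):
    """Estimate U(theta) as the average discounted return over m rollouts."""
    returns = []
    for _ in range(m):
        s, total = env.reset(), 0.0
        for t in range(d + 1):                # depth-d trajectory: t = 0..d
            a = policy(theta, s)              # a ~ pi_theta(a | s)
            s, r = env.step(s, a)
            total += gamma**t * r
        returns.append(total)
    return np.mean(returns)                   # U(theta) ~ (1/m) sum_i R(tau_i)
```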
Two classes of optimization algorithms:
1. Zeroth order (use only \(U(\theta)\))
2. First order (use \(U(\theta)\) and \(\nabla_\theta U(\theta)\))
Common zeroth-order approaches:
Cross entropy method:
Initialize a distribution \(d\) over policy parameters \(\theta\)
loop:
population \(\gets\) sample(\(d\))
elite \(\gets\) the \(m\) samples with highest \(U(\theta)\)
\(d\) \(\gets\) fit(elite)
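A minimal sketch of the loop above, assuming a Gaussian search distribution over \(\theta\) (so fit reduces to refitting a mean and standard deviation) and a utility estimator `U` such as `monte_carlo_utility`; `mu` and `sigma` are arrays, and the population size `n`, elite count `m`, and iteration budget are illustrative.

```python
import numpy as np

def cross_entropy_method(U, mu, sigma, n=100, m=10, iters=50):
    """Zeroth-order search: only U(theta) values are used, no gradients."""
    for _ in range(iters):
        population = np.random.normal(mu, sigma, size=(n, len(mu)))  # sample(d)
        scores = np.array([U(theta) for theta in population])
        elite = population[np.argsort(scores)[-m:]]   # m highest U(theta)
        mu = elite.mean(axis=0)                       # d <- fit(elite)
        sigma = elite.std(axis=0) + 1e-8              # jitter avoids collapse
    return mu
```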
For the policy gradient, three tricks: the log-derivative trick, reward-to-go, and baseline subtraction. First, write the objective as an expectation over trajectories:
\[U(\theta) = \text{E}[R(\tau)]\]
\[= \int p_\theta(\tau) R(\tau)\, d\tau\]
\[\nabla U(\theta) = \nabla_\theta \int p_\theta(\tau) R(\tau)\, d\tau\]
\[= \int \nabla_\theta \, p_\theta(\tau) R(\tau) \, d\tau\]
The log-derivative trick turns \(\nabla_\theta p_\theta(\tau)\) into something we can sample: since \(\nabla_\theta\, \log p_\theta (\tau) = \nabla_\theta \, p_\theta(\tau) / p_\theta (\tau)\), we have \(\nabla_\theta \, p_\theta(\tau) = p_\theta (\tau)\, \nabla_\theta\, \log p_\theta (\tau)\), so
\[= \int p_\theta(\tau)\, \nabla_\theta \log p_\theta (\tau)\, R(\tau)\, d\tau\]
\[= \text{E}\left[ \nabla_\theta \log p_\theta (\tau)\, R(\tau) \right]\]
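A quick numerical check of this identity on a toy problem (my construction, not from the notes): take \(\tau \sim \mathcal{N}(\theta, 1)\) and \(R(\tau) = \tau^2\), so \(U(\theta) = \theta^2 + 1\) and the true gradient is \(2\theta\).

```python
import numpy as np

theta, n = 1.5, 200_000
rng = np.random.default_rng(0)
tau = rng.normal(theta, 1.0, size=n)    # sample "trajectories"

score = tau - theta                     # grad_theta log N(tau; theta, 1)
estimate = np.mean(score * tau**2)      # E[grad log p(tau) * R(tau)]

print(estimate, 2 * theta)              # both close to 3.0
```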
To evaluate \(\nabla_\theta \log p_\theta (\tau)\), recall \(\tau = (s_0, a_0, r_0, s_1, a_1, r_1, \ldots, s_d, a_d, r_d)\) and factor the trajectory density (the transition product stops at \(d-1\), since \(s_d\) is the last state in \(\tau\)):
\[p_\theta (\tau) = p(s_0)\, \prod_{k=0}^{d} \pi_\theta(a_k \mid s_k) \prod_{k=0}^{d-1} T(s_{k+1} \mid s_k, a_k) \]
\[\log p_\theta (\tau)\]
\[= \log p(s_0) + \sum_{k=0}^{d-1} \log T(s_{k+1} \mid s_k, a_k) + \sum_{k=0}^d \log \pi_\theta(a_k \mid s_k) \]
The initial-state and transition terms do not depend on \(\theta\), so they vanish under the gradient:
\[\nabla_\theta \log p_\theta (\tau) = \sum_{k=0}^d \nabla_\theta \log \pi_\theta(a_k \mid s_k) \]
\[\nabla U(\theta) = \text{E} \left[ \sum_{k=0}^d \nabla_\theta \log \pi_\theta(a_k \mid s_k) R(\tau) \right]\]
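In practice this expectation is approximated by sampling; a sketch of the standard Monte Carlo estimator with \(m\) rollouts (superscripts indexing trajectories are my notation):
\[\nabla U(\theta) \approx \frac{1}{m} \sum_{i=1}^m \sum_{k=0}^d \nabla_\theta \log \pi_\theta\!\left(a_k^{(i)} \mid s_k^{(i)}\right) R\!\left(\tau^{(i)}\right)\]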
\[\pi_\theta (a=L \mid s=1) = \text{clamp}(\theta, 0, 1)\]
\[\pi_\theta (a=R \mid s=1) = \text{clamp}(1-\theta, 0, 1)\]
Given \(\theta = 0.2\), calculate \(\sum_{k=0}^d \nabla_\theta \log \pi_\theta(a_k \mid s_k)\, R(\tau)\) for two cases: (a) \(a_0 = L\) and (b) \(a_0 = R\).
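A worked version for the single decision at \(s = 1\), leaving \(R(\tau)\) symbolic since the notes do not fix the rewards; with \(\theta = 0.2\) the clamp is inactive, so \(\pi_\theta(L \mid 1) = \theta\) and \(\pi_\theta(R \mid 1) = 1 - \theta\):
\[\text{(a)} \quad \nabla_\theta \log \pi_\theta(L \mid 1)\, R(\tau) = \frac{1}{\theta}\, R(\tau) = 5\, R(\tau)\]
\[\text{(b)} \quad \nabla_\theta \log \pi_\theta(R \mid 1)\, R(\tau) = \frac{-1}{1 - \theta}\, R(\tau) = -1.25\, R(\tau)\]
The less likely action receives the larger-magnitude weight, which is part of what makes single-sample gradient estimates noisy.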
Ascending this gradient with one sampled trajectory per update gives a simple algorithm:
loop:
\(\tau \gets \text{simulate}(\pi_\theta)\)
\(\theta \gets \theta + \alpha \sum_{k=0}^d \nabla_\theta \log \pi_\theta(a_k \mid s_k) R(\tau) \)
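A minimal sketch of this loop under the same assumed interface as before, plus a hypothetical `grad_log_pi(theta, s, a)` returning \(\nabla_\theta \log \pi_\theta(a \mid s)\):

```python
def policy_gradient_step(env, policy, grad_log_pi, theta,
                         alpha=0.01, d=50, gamma=0.95):
    """One update: theta <- theta + alpha * sum_k grad log pi(a_k|s_k) * R(tau)."""
    # tau <- simulate(pi_theta)
    s, traj = env.reset(), []
    for _ in range(d + 1):
        a = policy(theta, s)
        s_next, r = env.step(s, a)
        traj.append((s, a, r))
        s = s_next
    R = sum(gamma**k * r for k, (_, _, r) in enumerate(traj))   # R(tau)
    g = sum(grad_log_pi(theta, s, a) for s, a, _ in traj)       # score sum
    return theta + alpha * g * R
```

Repeating this step is the loop above; averaging the update over several trajectories per step reduces its variance.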
Trick two, reward-to-go. Expand \(R(\tau)\) inside
\[\nabla U(\theta) = \text{E} \left[ \sum_{k=0}^d \nabla_\theta \log \pi_\theta(a_k \mid s_k) R(\tau) \right]\]
\[= \text{E} \left[ \left(\sum_{k=0}^d \nabla_\theta \log \pi_\theta(a_k \mid s_k)\right) \left(\sum_{k=0}^d \gamma^k r_k \right)\right]\]
\[= \text{E} \left[ \left(f_0 + \ldots + f_d\right) \left( \gamma^0 r_0 + \ldots + \gamma^d r_d \right)\right], \qquad f_k = \nabla_\theta \log \pi_\theta(a_k \mid s_k)\]
Rewards earned before step \(k\) cannot be influenced by \(a_k\), and the expected score \(\text{E}[f_k]\) is zero, so the cross terms with \(l < k\) vanish:
\[= \text{E} \left[ \sum_{k=0}^d \nabla_\theta \log \pi_\theta(a_k \mid s_k) \left(\sum_{l=k}^d \gamma^l r_l \right)\right]\]
\[= \text{E} \left[ \sum_{k=0}^d \nabla_\theta \log \pi_\theta(a_k \mid s_k)\, \gamma^k r_{k, \text{to-go}} \right], \qquad r_{k, \text{to-go}} = \sum_{l=k}^d \gamma^{l-k} r_l\]
Trick three, baseline subtraction. Subtracting any baseline \(r_\text{base}(s_k)\) that depends only on the state leaves the expectation unchanged but can substantially reduce variance:
\[\nabla U(\theta)= \text{E} \left[ \sum_{k=0}^d \nabla_\theta \log \pi_\theta(a_k \mid s_k)\, \gamma^k \left(r_{k, \text{to-go}} - r_\text{base}(s_k) \right) \right]\]
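A sketch of this final estimator for one sampled trajectory; `baseline` stands in for \(r_\text{base}(s_k)\) (a constant function like `lambda s: 0.0` recovers the plain reward-to-go form), and the interface is the same assumed one as above.

```python
def reward_to_go_step(traj, grad_log_pi, theta, baseline,
                      alpha=0.01, gamma=0.95):
    """One update using gamma^k * (r_to_go - r_base(s_k)) weights.

    traj: list of (s_k, a_k, r_k) from one rollout.
    """
    rewards = [r for _, _, r in traj]
    # Reverse pass: r_to_go[k] = sum_{l=k}^d gamma^(l-k) * r_l
    r_to_go, acc = [0.0] * len(rewards), 0.0
    for k in reversed(range(len(rewards))):
        acc = rewards[k] + gamma * acc
        r_to_go[k] = acc
    g = sum(gamma**k * grad_log_pi(theta, s, a) * (r_to_go[k] - baseline(s))
            for k, (s, a, _) in enumerate(traj))
    return theta + alpha * g
```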