What Matters for On-Policy Deep Actor-Critic Methods? A Large-Scale Study
Marcin Andrychowicz et al., Google Brain Team; ICLR 2021
Presented by Professor Zachary Sunberg, August 26th, 2025
AI used to confirm connections between ideas and to create images
Proximal Policy Optimization Algorithms
John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, Oleg Klimov; 2017
https://arxiv.org/pdf/1707.06347 (Not Peer Reviewed!)
Advantage \(A_t = Q(s_t, a_t) - V(s_t)\)
CPI = "Conservative Policy Iteration"
Problem: maximizing this objective can produce excessively large policy updates
\(\implies\) only one gradient step before collecting new data
(TRPO solved this with a KL constraint/penalty)
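For reference, the objectives from the paper, with probability ratio \(r_t(\theta)\) and clip parameter \(\epsilon\) (0.2 in the paper):

\[ r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_\text{old}}(a_t \mid s_t)}, \qquad L^{CPI}(\theta) = \hat{\mathbb{E}}_t\left[ r_t(\theta)\, \hat{A}_t \right] \]

\[ L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\left[ \min\!\left( r_t(\theta)\, \hat{A}_t,\ \mathrm{clip}(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\, \hat{A}_t \right) \right] \]

Clipping removes the incentive to move \(r_t\) outside \([1-\epsilon,\, 1+\epsilon]\), so multiple minibatch epochs per batch of data become safe.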
Value function loss (coefficient \(c_1\))
Entropy bonus (coefficient \(c_2\))
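These enter the paper's combined objective (maximized), where \(L_t^{VF}\) is a squared-error value loss and \(S\) is an entropy bonus:

\[ L_t^{CLIP+VF+S}(\theta) = \hat{\mathbb{E}}_t\left[ L_t^{CLIP}(\theta) - c_1 L_t^{VF}(\theta) + c_2\, S[\pi_\theta](s_t) \right] \]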
"We don’t share parameters between the policy and value function (so coefficient c1 is irrelevant), and we don’t use an entropy bonus."
Story:
Cited 29,262 times (as of last night)
(avg ~10 citations per day)
Implementation Matters in Deep Policy Gradients: A Case Study on PPO and TRPO
Logan Engstrom, Andrew Ilyas, Shibani Santurkar, Dimitris Tsipras, Firdaus Janoos, Larry Rudolph, and Aleksander Madry; ICLR 2020
Implementation Details
"PPO’s marked improvement over TRPO (and even stochastic gradient descent) can be largely attributed to these optimizations."
AAI = Average Algorithmic Improvement
ACLI = Average Code-Level Improvement
Thematic Groups:
Goal: Evaluate which design choices are most effective
s, r, done, info = env.step(a)  # classic Gym API: next state, reward, termination flag, diagnostics
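For context, a minimal interaction loop around this call, assuming the pre-0.26 gym API that matches this step signature (environment id is arbitrary):

import gym

env = gym.make("CartPole-v1")
s = env.reset()
done = False
while not done:
    a = env.action_space.sample()   # stand-in for the trained policy's action
    s, r, done, info = env.step(a)
env.close()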
68 choices with ~3 values each \(\implies 3^{68} \approx 2.8 \times 10^{32}\) configurations.
~250,000 agents trained
(C24: Learning Rate included in some other groups)
Sample Uniformly for Choices within Group (see the sketch after the optimizer group below)
Baseline: \(\approx\) ppo2 (OpenAI Baselines implementation)
Group: Policy Losses
Group: Regularizers
Group: Optimizers
C23: Adam/RMSProp
C24: Learning Rate
C26: Momentum
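A sketch of sampling within the optimizer group above (choice names from the slides; the candidate value lists are my illustrative assumptions, not the paper's exact search space):

import random

OPTIMIZER_GROUP = {
    "C23_optimizer": ["adam", "rmsprop"],
    "C24_learning_rate": [3e-5, 1e-4, 3e-4, 1e-3],
    "C26_momentum": [0.0, 0.9],
}

def sample_config(group, baseline):
    # Uniformly sample the studied group's choices; keep every other choice at baseline
    config = dict(baseline)    # baseline: defaults for all 68 choices (ppo2-like)
    for choice, values in group.items():
        config[choice] = random.choice(values)
    return config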
Uniform sampling \(\implies\) sometimes bad combinations \(\implies\) report 95th percentile
[Figure: 95th percentile of performance, with a 95% confidence interval for the 95th percentile]
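A numpy sketch of that statistic; the bootstrap is my assumption of one reasonable way to obtain such an interval, not necessarily the paper's procedure:

import numpy as np

def percentile_with_ci(scores, q=95, n_boot=10_000, alpha=0.05, seed=0):
    # q-th percentile of scores, plus a bootstrap (1 - alpha) confidence interval
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores, dtype=float)
    point = np.percentile(scores, q)
    resamples = rng.choice(scores, size=(n_boot, scores.size), replace=True)
    boot = np.percentile(resamples, q, axis=1)
    lo, hi = np.percentile(boot, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return point, (lo, hi)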
(Very selective sample; the paper has 92 figures)
Observation Normalization
Value Function Normalization
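Both amount to normalizing with running statistics (of observations, or of value-function targets). A minimal sketch of the standard running mean/std normalizer (class name and details are mine; ppo2-style code does something similar):

import numpy as np

class RunningNormalizer:
    # Tracks a running mean/variance (parallel-update formula) and normalizes inputs
    def __init__(self, shape, eps=1e-8):
        self.mean = np.zeros(shape)
        self.var = np.ones(shape)
        self.count = eps

    def update(self, batch):
        batch_mean, batch_var, n = batch.mean(0), batch.var(0), batch.shape[0]
        delta = batch_mean - self.mean
        total = self.count + n
        self.mean = self.mean + delta * n / total
        # combine the two groups' variances (Chan et al. parallel formula)
        m2 = self.var * self.count + batch_var * n + delta**2 * self.count * n / total
        self.var = m2 / total
        self.count = total

    def __call__(self, x):
        return (x - self.mean) / np.sqrt(self.var + 1e-8)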
"Most surprising results"
Vindication!
Commendations
Criticisms
The 37 Implementation Details of Proximal Policy Optimization
Shengyi Huang, Rousslan Fernand Julien Dossa, Antonin Raffin, Anssi Kanervisto, Weixun Wang; ICLR Blog Track 2022