Imitation and Inverse Reinforcement Learning
Today:
- What if you don't know the reward function and just want to act like an expert?
- Imitation Learning
- Inverse Reinforcement Learning
Trivia: When was the first car driven with a Neural Network?
1995: 2797/2849 miles (98.2%)
Behavioral Cloning
\[\underset{\theta}{\text{maximize}} \prod_{(s, a) \in D} \pi_\theta (a \mid s)\]
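Concretely, behavioral cloning is ordinary supervised learning: maximizing the product of \(\pi_\theta(a \mid s)\) over the dataset is equivalent to minimizing the cross-entropy between the policy's action distribution and the expert's actions. A minimal PyTorch sketch (the network, data sizes, and hyperparameters below are placeholders, not from the slides):

```python
# Behavioral cloning as maximum likelihood / supervised classification (sketch).
import torch
import torch.nn as nn

n_states, n_actions = 8, 4          # hypothetical problem sizes
policy = nn.Sequential(nn.Linear(n_states, 32), nn.ReLU(), nn.Linear(32, n_actions))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

# D: expert demonstrations as (state, action) pairs (random placeholders here)
states = torch.randn(256, n_states)
actions = torch.randint(0, n_actions, (256,))

for _ in range(100):
    logits = policy(states)
    # minimizing cross-entropy == maximizing the sum of log pi_theta(a | s) over D
    loss = nn.functional.cross_entropy(logits, actions)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```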
Problem: Cascading Errors
Small mistakes push the learner into states the expert never demonstrated, where further errors compound.
How did ALVINN do it?
How did NVIDIA do it in 2016?
Dataset Aggregation (DAgger)
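A sketch of the DAgger loop on a hypothetical one-dimensional toy problem (the environment, expert, and logistic-regression learner below are stand-ins, not from the slides). The key difference from plain behavioral cloning is that the learner's own rollouts determine which states get labeled by the expert, so the training distribution tracks the states the learner actually visits:

```python
# DAgger sketch: roll out the learner, label visited states with expert actions,
# aggregate into D, and retrain.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def expert_action(s):
    # hypothetical expert: move right (action 1) when left of the origin
    return int(s[0] < 0.0)

def step(s, a):
    # toy dynamics: the action nudges the position up or down, plus noise
    return s + (0.1 if a == 1 else -0.1) + rng.normal(0.0, 0.01, size=s.shape)

states, actions = [], []               # aggregated dataset D
policy = LogisticRegression()

for iteration in range(5):
    s = rng.uniform(-0.5, 0.5, size=(1,))
    for _ in range(20):
        # label every state the *learner* visits with the expert's action
        states.append(s.copy())
        actions.append(expert_action(s))
        if iteration == 0:
            a = expert_action(s)       # no trained policy yet: follow the expert
        else:
            a = int(policy.predict(s.reshape(1, -1))[0])
        s = step(s, a)
    policy.fit(np.array(states), np.array(actions))   # retrain on aggregated D
```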
Stochastic Mixing Iterative Learning (SMILe)
\((1-\beta)^k\)
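In SMILe the learner acts according to a stochastic mixture of the expert and the previously trained policies. One common form of the mixture (with \(\pi_E\) denoting the expert policy and \(\hat{\pi}_i\) the policy trained at iteration \(i\); these symbols are assumptions, not defined on the slide) is
\[\pi_k = (1-\beta)^k \pi_E + \beta \sum_{i=1}^{k} (1-\beta)^{i-1} \hat{\pi}_i,\]
so \((1-\beta)^k\) is the probability that the mixture still queries the expert after \(k\) iterations, which decays geometrically as training proceeds.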
Generative Adversarial Imitation Learning (GAIL)
GANs are frighteningly good at generating believable synthetic things
Generator: the policy \(\pi_\theta\)
Discriminator: the classifier/cost \(C_\phi\), trained to distinguish expert state-action pairs from the policy's
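One way to write the resulting minimax objective, following Ho and Ermon's GAIL formulation (treating \(C_\phi\) as the discriminator here is an assumption, and the entropy weight \(\lambda\) does not appear on the slides):
\[\underset{\theta}{\text{minimize}} \; \underset{\phi}{\text{maximize}} \;\; \mathbb{E}_{\pi_\theta}\!\left[\log C_\phi(s, a)\right] + \mathbb{E}_{\pi_E}\!\left[\log\big(1 - C_\phi(s, a)\big)\right] - \lambda H(\pi_\theta)\]
The discriminator is trained to separate policy samples from expert samples, and the policy is updated with a standard RL algorithm using a surrogate reward such as \(-\log C_\phi(s, a)\).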
Inverse Reinforcement Learning
What if we know the dynamics, but not the reward?
|        | Reinforcement Learning        | Inverse Reinforcement Learning |
|--------|-------------------------------|--------------------------------|
| Input  | Environment \((S, A, T, R)\)  | \(S, A, T, \{\tau\}\)          |
| Output | \(\pi^*\)                     | \(R\)                          |
Exercise
What is the reward function?
Maximum Margin Inverse Reinforcement Learning
Binary reward features: \(\beta(s, a) \in \{0,1\}^n\)
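With a linear reward \(R_\phi(s, a) = \phi^\top \beta(s, a)\) over these binary features, one standard maximum-margin formulation (the feature-expectation notation \(\mu\) below is an assumption, not from the slides) chooses weights that separate the expert from a set of comparison policies by the largest possible margin:
\[\underset{\phi,\, t}{\text{maximize}} \;\; t \quad \text{subject to} \quad \phi^\top \mu(\pi_E) \ge \phi^\top \mu(\pi) + t \;\; \text{for all comparison policies } \pi, \qquad \|\phi\|_2 \le 1,\]
where \(\mu(\pi) = \mathbb{E}_\pi\!\left[\sum_t \gamma^t \beta(s_t, a_t)\right]\) is the vector of discounted feature expectations under \(\pi\).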
Principle of Maximum Entropy
\(H(X) = -\sum_x P(x) \log P(x)\)
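For example, with no constraints other than normalization, entropy is maximized by the uniform distribution:
\[\underset{P}{\text{maximize}} \; H(X) \;\; \text{subject to} \;\; \sum_x P(x) = 1 \quad \Longrightarrow \quad P(x) = \frac{1}{|\mathcal{X}|}, \qquad H(X) = \log |\mathcal{X}|\]
Adding constraints (e.g., matching observed statistics) yields the least informative distribution consistent with what is known.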
Maximum Entropy Inverse Reinforcement Learning
Least informative trajectory distribution
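In the standard maximum entropy IRL derivation, the least informative trajectory distribution consistent with the demonstrations takes an exponential form (notation assumed, not from the slides):
\[P(\tau \mid \phi) = \frac{\exp\big(R_\phi(\tau)\big)}{Z(\phi)}, \qquad R_\phi(\tau) = \sum_t R_\phi(s_t, a_t),\]
so higher-reward trajectories are exponentially more likely, and \(Z(\phi)\) normalizes over all trajectories.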
Maximum Entropy Inverse Reinforcement Learning
Discounted visitation probability
Optimal policy under \(R_\phi\)
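These two quantities come together in the likelihood gradient. For a linear reward \(R_\phi(s, a) = \phi^\top \beta(s, a)\), a standard result (written with assumed notation, where \(\rho_\phi\) is the discounted visitation probability under the policy that is optimal for \(R_\phi\)) is
\[\nabla_\phi \, \frac{1}{|\{\tau\}|} \log P(\{\tau\} \mid \phi) \;=\; \underbrace{\frac{1}{|\{\tau\}|} \sum_{\tau} \sum_t \beta(s_t, a_t)}_{\text{expert feature counts}} \;-\; \underbrace{\sum_{s, a} \rho_\phi(s, a)\, \beta(s, a)}_{\text{model feature counts}},\]
so gradient ascent adjusts \(\phi\) until the feature counts induced by the optimal policy under \(R_\phi\) match those of the expert.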
Recap
- Behavioral cloning is supervised learning to match the actions of an expert
- A critical problem is cascading errors, which can be addressed by gathering more data with DAgger or SMILe
- Inverse reinforcement learning is the process of learning a reward function from trajectories in an MDP
- IRL is an underspecified problem
- Maximum entropy IRL addresses this underspecification by choosing the reward function that maximizes the entropy of the resulting trajectory distribution while still matching the expert demonstrations
By Zachary Sunberg