Imitation and Inverse Reinforcement Learning

Last time:
 Turn-taking zero-sum games
 Markov Games
 Incomplete Information Games

Today:
 What if you don't know the reward function and just want to act like an expert?
 Imitation Learning
 Inverse Reinforcement Learning
Trivia: When was the first car driven with a Neural Network?
1995: 2797/2849 miles (98.2%)
Behavioral Cloning
\[\underset{\theta}{\text{maximize}} \prod_{(s, a) \in D} \pi_\theta (a \mid s)\]
Problem: Cascading Errors
How did ALVINN do it?
How did NVIDIA do it in 2016?
Dataset Aggregation (DAgger)
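A toy sketch of the DAgger loop, assuming an illustrative chain MDP, a hand-coded expert, and a simple tabular learner (none of these specifics come from the slides):

```python
# DAgger sketch on a toy chain MDP. The expert, dynamics, and
# majority-vote "learner" are all hypothetical.
n_states = 5

def expert(s):
    # Hypothetical expert: move right (a=1) until the last state.
    return 1 if s < n_states - 1 else 0

def step(s, a):
    return min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)

def train(D):
    """Fit a tabular policy: majority expert label per state."""
    labels = {}
    for s, a in D:
        labels.setdefault(s, []).append(a)
    return {s: max(set(acts), key=acts.count) for s, acts in labels.items()}

# Start from a dataset covering only the expert's initial state.
D = [(0, expert(0))]
policy = train(D)

for _ in range(5):                    # DAgger iterations
    s, visited = 0, []
    for _ in range(10):               # roll out the CURRENT learner
        a = policy.get(s, 0)          # arbitrary default off-distribution
        visited.append(s)
        s = step(s, a)
    # Key DAgger step: the expert labels the states the LEARNER visited;
    # aggregate those labels into D and retrain.
    D += [(s, expert(s)) for s in visited]
    policy = train(D)

print(policy)  # the learner now has expert labels along its own state distribution
```

The point is that the training distribution tracks the learner's own mistakes, which is what behavioral cloning alone cannot do.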
Stochastic Mixing Iterative Learning (SMILe)
\((1-\beta)^k\)
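A small numeric sketch of SMILe's geometric mixing (the values of \(\beta\) and \(k\) are arbitrary examples): after \(k\) iterations the expert is executed with probability \((1-\beta)^k\), and the learned policy from iteration \(i\) with probability \(\beta(1-\beta)^{i-1}\), so the mixture is a valid distribution over policies.

```python
# SMILe mixing-weight sketch; beta and k are arbitrary example values.
beta, k = 0.1, 5

# Weight on the expert policy after k iterations:
expert_weight = (1 - beta) ** k

# Weight on the learned policy from iteration i (1-indexed):
learner_weights = [beta * (1 - beta) ** (i - 1) for i in range(1, k + 1)]

total = expert_weight + sum(learner_weights)
print(expert_weight, total)  # expert influence decays geometrically; weights sum to 1
```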
Generative Adversarial Imitation Learning (GAIL)
GANs are frighteningly good at generating believable synthetic things
Inverse Reinforcement Learning
What if we know the dynamics, but not the reward?
 Reinforcement Learning: Input: environment \((S, A, T, R)\); Output: optimal policy \(\pi^*\)
 Inverse Reinforcement Learning: Input: \(S, A, T, \{\tau\}\) (expert trajectories); Output: reward \(R\)
Exercise
What is the reward function?
Maximum Margin Inverse Reinforcement Learning
\(\beta(s, a) \in \{0,1\}^n\)
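A sketch of the feature-expectation quantity that max-margin IRL works with, assuming a toy binary feature map \(\beta(s,a)\) and made-up expert trajectories (all names and values here are illustrative):

```python
import numpy as np

# Feature-expectation sketch for max-margin IRL.
def beta(s, a):
    """Hypothetical binary features, e.g. 'in goal state', 'moved right', 'at start'."""
    return np.array([s == 2, a == 1, s == 0], dtype=float)

gamma = 0.9
expert_trajs = [[(0, 1), (1, 1), (2, 0)], [(0, 1), (1, 0), (0, 1)]]

# Empirical discounted feature expectations:
# mu_E = E[ sum_t gamma^t beta(s_t, a_t) ] over expert trajectories.
mu_E = np.mean(
    [sum(gamma**t * beta(s, a) for t, (s, a) in enumerate(traj))
     for traj in expert_trajs],
    axis=0,
)
print(mu_E)
# With a linear reward R(s, a) = w . beta(s, a), max-margin IRL looks for w
# under which mu_E beats every other policy's feature expectations by a margin.
```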
Principle of Maximum Entropy
\(H(X) = -\sum_x P(x) \log P(x)\)
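A quick numeric check of the principle, using arbitrary example distributions: on a fixed support, the uniform distribution has the maximum entropy.

```python
import numpy as np

# Entropy H(X) = -sum_x P(x) log P(x); terms with P(x)=0 contribute 0.
def entropy(p):
    p = np.asarray(p, dtype=float)
    return -np.sum(p[p > 0] * np.log(p[p > 0]))

uniform = [0.25, 0.25, 0.25, 0.25]
peaked = [0.85, 0.05, 0.05, 0.05]
print(entropy(uniform), entropy(peaked))  # log(4) for uniform, smaller for peaked
```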
Maximum Entropy Inverse Reinforcement Learning
Least informative trajectory distribution
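A tiny sketch of what "least informative" means here, assuming a small enumerable set of trajectories with made-up returns: the maximum-entropy distribution consistent with the data weights trajectories exponentially by their return, \(P(\tau) \propto \exp(R(\tau))\).

```python
import numpy as np

# MaxEnt IRL trajectory-distribution sketch; returns are toy values.
returns = np.array([1.0, 2.0, 0.5])   # R(tau) for three hypothetical trajectories

# Softmax over trajectory returns (shifted for numerical stability):
z = np.exp(returns - returns.max())
p = z / z.sum()                       # P(tau) = exp(R(tau)) / Z
print(p)  # higher-return trajectories are exponentially more likely
```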
Recap
 Behavioral cloning is supervised learning to match the actions of an expert
 A critical problem is cascading errors, which can be addressed by gathering more data with DAgger or SMILe
 Inverse reinforcement learning is the process of learning a reward function from trajectories in an MDP
 IRL is an underspecified problem
 Maximum entropy IRL resolves this by choosing the reward function under which the resulting trajectory distribution has maximum entropy while still matching the expert data
260 Imitation and Inverse Reinforcement Learning
By Zachary Sunberg