Imitation and Inverse Reinforcement Learning

 

 

  • Today:

    • What if you don't know the reward function and just want to act like an expert?
      • Imitation Learning
      • Inverse Reinforcement Learning

Trivia: When was the first car driven with a Neural Network?

1995: 2797 of 2849 miles (98.2%) driven autonomously

Behavioral Cloning

\[\underset{\theta}{\text{maximize}} \prod_{(s, a) \in D} \pi_\theta (a \mid s)\]
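
Maximizing this product of action likelihoods is the same as minimizing the average negative log-likelihood of the expert's actions, i.e. a standard supervised cross-entropy loss. Below is a minimal PyTorch sketch under assumed settings (a 4-dimensional state, 3 discrete actions, and a randomly generated stand-in for the expert dataset \(D\)); the network and hyperparameters are illustrative, not from the slides.

```python
import torch
import torch.nn as nn

# Stand-in expert dataset D: states (N, 4) and discrete expert actions (N,)
states = torch.randn(1024, 4)
actions = torch.randint(0, 3, (1024,))

# Policy pi_theta(a | s): outputs logits over the 3 actions
policy = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 3))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

# Maximizing prod pi_theta(a | s) over D  <=>  minimizing cross-entropy
for epoch in range(100):
    logits = policy(states)
    loss = nn.functional.cross_entropy(logits, actions)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```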

Problem: Cascading Errors

Small mistakes take the agent to states the expert never visited; there the cloned policy is even less reliable, so errors compound over the length of the trajectory.

How did ALVINN do it?

How did NVIDIA do it in 2016?

Dataset Aggregation (DAgger)
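
DAgger addresses cascading errors by collecting expert labels on the states the learner itself visits. Below is a minimal sketch of the loop; `reset_env`, `step_env`, `expert_action`, and `train_policy` are hypothetical placeholders, not interfaces from the slides.

```python
from typing import Callable, List, Tuple

def dagger(
    reset_env: Callable[[], object],                    # hypothetical: returns an initial state
    step_env: Callable[[object, object], Tuple[object, bool]],  # hypothetical: (state, action) -> (next state, done)
    expert_action: Callable[[object], object],          # hypothetical expert oracle
    train_policy: Callable[[List[Tuple[object, object]]], Callable[[object], object]],  # e.g. behavioral cloning
    n_iterations: int = 10,
    horizon: int = 100,
):
    """Schematic DAgger loop: roll out the current policy, have the expert
    label every visited state, aggregate the labels, and retrain."""
    dataset: List[Tuple[object, object]] = []
    policy = None  # no learned policy yet; the first rollout follows the expert

    for _ in range(n_iterations):
        state = reset_env()
        for _ in range(horizon):
            # The expert labels every state the learner actually visits,
            # including states off the expert's own distribution.
            dataset.append((state, expert_action(state)))
            # Act with the learned policy once it exists (expert on iteration 0).
            action = expert_action(state) if policy is None else policy(state)
            state, done = step_env(state, action)
            if done:
                break
        # Retrain on the aggregated dataset (e.g., behavioral cloning as above).
        policy = train_policy(dataset)
    return policy
```

The original algorithm also allows following the expert with some probability \(\beta_i\) during rollouts; this sketch uses the common simplification of rolling out only the learned policy after the first iteration.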

Stochastic Mixing Iterative Learning (SMILe)

At iteration \(k\), the mixture policy assigns probability \((1-\beta)^k\) to the expert, so reliance on the expert decays geometrically across iterations.
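
As a sketch (using this slide's \(\beta\) for the mixing rate, in the style of Ross and Bagnell's formulation), the iteration-\(k\) mixture can be written as

\[\pi_k = (1-\beta)^k \, \pi^* + \beta \sum_{i=1}^{k} (1-\beta)^{i-1} \, \hat{\pi}_i\]

where \(\pi^*\) is the expert policy and \(\hat{\pi}_i\) are the policies trained in earlier iterations; the mixture weights sum to one.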

Generative Adversarial Imitation Learning (GAIL)

GANs are frighteningly good at generating believable synthetic things

In GAIL, the policy \(\pi_\theta\) plays the role of the generator, producing state–action pairs, while a classifier \(C_\phi\) is trained to distinguish the policy's behavior from the expert's.
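
One common way to write the adversarial objective (a sketch; sign conventions vary, and the original formulation also adds a causal-entropy bonus for the policy):

\[\underset{\theta}{\text{minimize}}\;\; \underset{\phi}{\text{maximize}}\;\; \mathbb{E}_{(s,a) \sim \pi_\theta}\!\left[\log C_\phi(s, a)\right] + \mathbb{E}_{(s,a) \sim \pi_{\text{expert}}}\!\left[\log\!\big(1 - C_\phi(s, a)\big)\right]\]

The classifier is trained to separate policy and expert state–action pairs, while \(\pi_\theta\) is updated with reinforcement learning using the classifier's output as a surrogate cost.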

Inverse Reinforcement Learning

What if we know the dynamics, but not the reward?

  • Reinforcement learning: input is the full environment \((S, A, T, R)\); output is an optimal policy \(\pi^*\)
  • Inverse reinforcement learning: input is \(S, A, T\) together with expert trajectories \(\{\tau\}\); output is the reward function \(R\)

Exercise

What is the reward function?

Maximum Margin Inverse Reinforcement Learning

Assume a reward that is linear in binary features, \(R_\phi(s, a) = \phi^\top \beta(s, a)\) with \(\beta(s, a) \in \{0,1\}^n\), and find weights under which the expert outperforms all other policies by the largest margin.
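
One common way to pose the max-margin search (a sketch following feature-expectation matching; \(\mu(\pi) = \mathbb{E}\big[\sum_t \gamma^t \beta(s_t, a_t)\big]\) is notation introduced here for discounted feature expectations):

\[\underset{\phi,\, t}{\text{maximize}} \quad t \qquad \text{subject to} \qquad \|\phi\|_2 \le 1, \qquad \phi^\top \mu(\pi_{\text{expert}}) \ge \phi^\top \mu(\pi) + t \quad \text{for all } \pi\]

In practice the constraint is enforced over a growing set of candidate policies rather than over all policies at once.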

Principle of Maximum Entropy

\(H(X) = -\sum_x P(x) \log P(x)\)
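
For example, a uniform distribution over \(n\) outcomes attains the maximum entropy \(\log n\), while a distribution that puts all probability on a single outcome has entropy 0.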

Maximum Entropy Inverse Reinforcement Learning

Choose the least informative trajectory distribution that is still consistent with the expert demonstrations; under the maximum entropy model, a trajectory's probability grows exponentially with its total reward, \(P(\tau) \propto \exp\!\big(\sum_{(s,a) \in \tau} R_\phi(s, a)\big)\).


Computing the gradient requires alternating between two steps: solving for the optimal policy under the current reward \(R_\phi\), and computing the discounted visitation probability of each state–action pair under that policy.
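
A sketch of the resulting gradient under the linear reward \(R_\phi(s,a) = \phi^\top \beta(s,a)\) assumed earlier, where \(\rho_\phi(s, a)\) is notation introduced here for the discounted visitation probability of \((s, a)\) under the optimal policy for \(R_\phi\), and \(D\) is the set of expert trajectories:

\[\nabla_\phi \log P(D \mid \phi) = \sum_{\tau \in D} \sum_{(s,a) \in \tau} \beta(s, a) \;-\; |D| \sum_{s, a} \rho_\phi(s, a) \, \beta(s, a)\]

Gradient ascent therefore raises the weight on features the expert visits more often than the current optimal policy does, and lowers it otherwise.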

Recap

  • Behavioral cloning is supervised learning to match the actions of an expert
  • A critical problem is cascading errors, which can be addressed by gathering more data with DAgger or SMILe
  • Inverse reinforcement learning is the process of learning a reward function from trajectories in an MDP
  • IRL is an underspecified problem
  • Maximum entropy IRL resolves this by choosing the reward function whose induced trajectory distribution has maximum entropy while remaining consistent with the expert demonstrations