Imitation and Inverse Reinforcement Learning

  • Last time:

    • Turn-taking zero-sum games
    • Markov Games
    • Incomplete Information Games
  • Today:

    • What if you don't know the reward function and just want to act like an expert?
      • Imitation Learning
      • Inverse Reinforcement Learning

Trivia: When was the first car driven with a Neural Network?

1995: 2797/2849 miles (98.2%)

Behavioral Cloning

\[\underset{\theta}{\text{maximize}} \prod_{(s, a) \in D} \pi_\theta (a \mid s)\]
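
In practice the product above is usually maximized through its logarithm, which turns it into a sum of log-probabilities. The following is a minimal sketch of that idea, assuming a tabular softmax policy with one logit per (state, action) pair; the function names, the learning rate, and the dataset `D` are hypothetical and chosen only for illustration.

```python
import numpy as np

def log_likelihood(theta, D):
    """Sum of log pi_theta(a | s) over the demonstration pairs in D."""
    logits = theta - theta.max(axis=1, keepdims=True)  # stabilize the softmax
    log_pi = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return sum(log_pi[s, a] for s, a in D)

def behavioral_cloning(D, n_states, n_actions, lr=0.5, iters=200):
    """Gradient ascent on the log-likelihood of a tabular softmax policy."""
    theta = np.zeros((n_states, n_actions))
    for _ in range(iters):
        logits = theta - theta.max(axis=1, keepdims=True)
        pi = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
        grad = np.zeros_like(theta)
        for s, a in D:
            # Gradient of log-softmax w.r.t. the logits: one_hot(a) - pi(. | s)
            grad[s] -= pi[s]
            grad[s, a] += 1.0
        theta += lr * grad / len(D)
    return theta

# Hypothetical demonstrations as (state index, action index) pairs
D = [(0, 1), (0, 1), (1, 0), (2, 2), (2, 2), (2, 0)]
theta = behavioral_cloning(D, n_states=3, n_actions=3)
greedy = theta.argmax(axis=1)  # most likely action in each state
```

A neural-network policy would replace the logit table, but the objective stays the same.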

Problem: Cascading Errors

How did ALVINN do it?

How did NVIDIA do it in 2016?

Dataset Aggregation (DAgger)
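
The core DAgger loop can be sketched as follows: roll out the current learner, ask the expert to label the states the learner actually visits, add those labels to one growing dataset, and retrain. This is a rough sketch under stated assumptions, not a reference implementation: the `expert` and `step` callables, the majority-vote "learner", and the toy chain environment are all hypothetical stand-ins.

```python
import random
from collections import Counter, defaultdict

def dagger(expert, step, initial_state, actions=(-1, 1),
           n_iters=5, horizon=10, seed=0):
    """DAgger-style loop: roll out the learner, label visited states with the expert."""
    rng = random.Random(seed)
    dataset = []  # aggregated (state, expert_action) pairs

    def fit(data):
        # Stand-in supervised learner: majority expert label per state
        votes = defaultdict(Counter)
        for s, a in data:
            votes[s][a] += 1
        table = {s: c.most_common(1)[0][0] for s, c in votes.items()}
        # Unseen states fall back to a random action
        return lambda s: table.get(s, rng.choice(actions))

    policy = expert  # the first iteration collects data under the expert
    for _ in range(n_iters):
        s = initial_state
        for _ in range(horizon):
            dataset.append((s, expert(s)))  # expert labels the state the learner visits
            s = step(s, policy(s))          # but the learner's action drives the rollout
        policy = fit(dataset)               # retrain on everything gathered so far
    return policy, dataset

# Hypothetical toy problem: a 10-state chain where the expert heads toward state 5
expert = lambda s: 1 if s < 5 else -1
step = lambda s, a: max(0, min(9, s + a))
policy, data = dagger(expert, step, initial_state=0)
```

The key design choice is that the rollout follows the learner while the labels come from the expert, so the dataset covers the states the learner's own mistakes lead to.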

Stochastic Mixing Iterative Learning (SMILe)

\((1-\beta)^k\)
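
One reading of the expression above, assuming the usual SMILe construction in which each iteration mixes a newly trained component policy into the current policy with probability \(\beta\): after \(k\) iterations the mixture still follows the original expert with probability \((1-\beta)^k\), and follows the \(i\)-th trained component with probability \(\beta(1-\beta)^{i-1}\). The helper name below is hypothetical.

```python
def smile_mixture_weights(beta, k):
    """Mixture weights after k SMILe iterations (assumed standard construction)."""
    expert_weight = (1 - beta) ** k
    component_weights = [beta * (1 - beta) ** (i - 1) for i in range(1, k + 1)]
    return expert_weight, component_weights

expert_w, component_w = smile_mixture_weights(beta=0.5, k=3)
total = expert_w + sum(component_w)  # the weights form a probability distribution
```

Because the expert weight decays geometrically, the learned components take over quickly unless \(\beta\) is small.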

Generative Adversarial Imitation Learning (GAIL)

GANs are frighteningly good at generating believable synthetic things

Inverse Reinforcement Learning

What if we know the dynamics, but not the reward?

Reinforcement Learning vs. Inverse Reinforcement Learning

  • Reinforcement Learning
    • Input: Environment \((S, A, T, R)\)
    • Output: \(\pi^*\)
  • Inverse Reinforcement Learning
    • Input: \(S, A, T, \{\tau\}\)
    • Output: \(R\)

Exercise

What is the reward function?

Maximum Margin Inverse Reinforcement Learning

\(\beta(s, a) \in \{0,1\}^n\)
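
Binary features like these are typically paired with a reward that is linear in the features, so that a policy's expected discounted return is a dot product between the reward weights and its discounted feature expectations. The sketch below estimates those feature expectations from trajectories; the linear-reward assumption, the discount factor, and the example feature function are assumptions for illustration and are not taken from the slides.

```python
import numpy as np

def feature_expectations(trajectories, beta, gamma=0.95):
    """Average discounted sum of binary feature vectors beta(s, a) over trajectories."""
    mu = None
    for tau in trajectories:
        total = sum(gamma**t * np.asarray(beta(s, a), dtype=float)
                    for t, (s, a) in enumerate(tau))
        mu = total if mu is None else mu + total
    return mu / len(trajectories)

# Hypothetical features: [state is the goal, action is "left"]
beta = lambda s, a: [s == "goal", a == "left"]
expert_trajectories = [[("start", "right"), ("goal", "left")]]
mu_expert = feature_expectations(expert_trajectories, beta, gamma=0.5)

# Under hypothetical reward weights phi, the estimated return is phi . mu
phi = np.array([1.0, -0.2])
expert_return = phi @ mu_expert
```

Maximum margin methods compare such feature expectations between the expert and candidate policies when searching for reward weights.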

Principle of Maximum Entropy

\(H(X) = -\sum_x P(x) \log P(x)\)
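
As a quick numerical check of the definition, the sketch below computes the entropy (in nats) of a few discrete distributions. It follows the common convention that terms with \(P(x) = 0\) contribute zero; the example distributions are made up.

```python
import math

def entropy(p):
    """Shannon entropy H(X) = -sum_x P(x) log P(x), in nats; 0 log 0 is taken as 0."""
    return -sum(px * math.log(px) for px in p if px > 0)

h_uniform = entropy([0.25, 0.25, 0.25, 0.25])  # least informative over 4 outcomes
h_skewed = entropy([0.7, 0.1, 0.1, 0.1])
h_certain = entropy([1.0, 0.0, 0.0, 0.0])      # no uncertainty at all
```

The uniform distribution has the largest entropy of the three, which is why maximizing entropy selects the distribution that commits to the least beyond the given constraints.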

Maximum Entropy Inverse Reinforcement Learning

Least informative trajectory distribution
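
In the standard maximum entropy formulation, the least informative trajectory distribution consistent with the reward assigns each trajectory a probability proportional to the exponential of its return. The sketch below illustrates that form under simplifying assumptions (deterministic dynamics and a small, fully enumerated set of trajectories); the function name and the example returns are hypothetical.

```python
import numpy as np

def maxent_trajectory_distribution(returns):
    """P(tau) proportional to exp(R(tau)) over an enumerated set of trajectories."""
    r = np.asarray(returns, dtype=float)
    w = np.exp(r - r.max())  # subtract the max for numerical stability
    return w / w.sum()

# Hypothetical returns of three candidate trajectories
p = maxent_trajectory_distribution([1.0, 1.0, 0.0])
```

Trajectories with equal return are equally likely, and lower-return trajectories are exponentially less likely rather than ruled out, which lets the model explain imperfect expert demonstrations.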

Maximum Entropy Inverse Reinforcement Learning

Recap

  • Behavioral cloning is supervised learning to match the actions of an expert
  • A critical problem is cascading errors, which can be addressed by gathering more data with DAgger or SMILe
  • Inverse reinforcement learning is the process of learning a reward function from trajectories in an MDP
  • IRL is an underspecified problem
  • Maximum entropy IRL addresses this by choosing the reward function that maximizes the entropy of the trajectories of the resulting policy

By Zachary Sunberg
