Transfer and Meta Learning

  • Last time:

    • Imitation Learning
    • Inverse Reinforcement Learning
  • Today:

    • How do we transfer knowledge from one domain to another? (e.g. simulated to real-world)
    • How do we learn how to learn? (Meta learning)

(Lecture content from Sergey Levine's CS 285 at Berkeley)

Could an RL agent be better at Montezuma's Revenge after watching Indiana Jones?

Transfer Learning and Montezuma's Revenge

Transfer Learning

Transfer Learning: Use experience from one set of tasks for faster learning and better performance on a new task

In RL, task=MDP

Source domain \(\rightarrow\) target domain

  • "shot" = number of attempts in the target domain
  • "0-shot" = run policy in target domain
  • "1-shot" = try task once
  • "few shot"

Transfer Learning

How should prior knowledge be stored?

  • Q-function
  • Policy
  • Model
  • Features/hidden states

Representation Bottleneck

Pretraining + Finetuning

Pretrain: reward speed in any direction

Fine-tune: reward speed in a specific direction
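
A toy illustration of this recipe is sketched below: a 2-D "velocity command" is pretrained on a reward for speed in any direction, then the same parameters are fine-tuned on a reward for speed in the +x direction. The objectives and gradients are made-up differentiable surrogates, not the actual locomotion rewards from the slide.

```python
import numpy as np

# Toy pretrain-then-finetune sketch. Assumption: the "policy" is just a 2-D
# velocity command theta, and the rewards are simple differentiable surrogates
# standing in for the locomotion rewards mentioned on the slide.

def grad_ascent(theta, grad_fn, steps=200, lr=0.05):
    for _ in range(steps):
        theta = theta + lr * grad_fn(theta)
    return theta

# Pretraining reward: speed in any direction (regularized so the optimum is finite):
# J_pre(theta) = ||theta|| - 0.5 ||theta||^2, maximized by any unit-norm velocity.
def grad_pre(theta):
    return theta / (np.linalg.norm(theta) + 1e-8) - theta

# Fine-tuning reward: speed in a specific (+x) direction:
# J_fine(theta) = theta[0] - 0.5 ||theta||^2, maximized at theta = (1, 0).
def grad_fine(theta):
    return np.array([1.0, 0.0]) - theta

theta0 = np.array([0.3, 0.4])                              # arbitrary initialization
theta_pre = grad_ascent(theta0, grad_pre)                  # pretrained: unit speed, arbitrary heading
theta_fine = grad_ascent(theta_pre, grad_fine, steps=100)  # reuse pretrained parameters as the start

print("pretrained velocity:", theta_pre)   # approx. unit norm
print("fine-tuned velocity:", theta_fine)  # approx. [1, 0]
```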

CAD2RL

Key: Diversity

Actor Mimic

Successor Features

All domains have same \(S, A, T, \gamma\)

Difference: \(R\)

\[Q^\pi(s, a) = E\left[ \sum_{t=0}^\infty \gamma^t R(s_t, a_t) \mid s_0 = s, a_0 = a \right]\]

Let \(R(s, a) = w^\top \phi(s, a)\) where \(\phi\) is a feature vector.

\[= E\left[ \sum_{t=0}^\infty \gamma^t w^\top \phi(s_t, a_t) \mid s_0 = s, a_0 = a \right]\]

\[= w^\top E\left[ \sum_{t=0}^\infty \gamma^t \phi(s_t, a_t) \mid s_0 = s, a_0 = a \right]\]

Successor Feature:

\[\psi^\pi(s, a) \equiv E\left[ \sum_{t=0}^\infty \gamma^t \phi(s_t, a_t) \mid s_0 = s, a_0 = a \right]\]

\[Q^\pi(s, a) = w^\top \psi^\pi(s, a)\]

Using successor features

Given \(\psi^\pi\), one can easily calculate \(Q'^\pi\) for a new reward function \(R' = w'^\top \phi\).

\(Q'^\pi = w'^\top \psi^\pi\)

\[\psi^\pi(s, a) \equiv E\left[ \sum_{t=0}^\infty \gamma^t \phi(s_t, a_t) \mid s_0 = s, a_0 = a \right]\]

\[Q^\pi(s, a) = w^\top \psi^\pi(s, a)\]

Important: Does this yield the optimal policy for \(R'\)?

No!

\(\psi^\pi\) is tied to the fixed policy \(\pi\), so \(w'^\top \psi^\pi\) only evaluates that policy under the new reward (first equation below). The optimal value for \(R'\) requires the max over actions in the Bellman optimality equation (second), which generally selects different actions than \(\pi\).

\[Q^\pi(s, a) = R(s, a) + \gamma E\left[Q^\pi(s', \pi(s'))\right]\]

\[Q^*(s, a) = R(s, a) + \gamma E\left[\max_{a'} Q^*(s', a')\right]\]

How to use this in practice:

  • Keep a family of good policies and their associated successor features for a variety of reward weights \(w\).
  • In the target domain, start from the best policy in this set and fine-tune or plan online (a minimal sketch follows below).
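
A minimal sketch of this reuse step, assuming we already have successor features \(\psi^{\pi_i}\) tabulated for a small library of policies (the names `psi_table` and `w_new` are illustrative, not from the slides):

```python
import numpy as np

# Reusing successor features for a new task. Assumption: psi_table[i][s, a] holds
# pre-computed successor features psi^{pi_i}(s, a) for a small library of policies.

n_states, n_actions, d = 4, 2, 3
rng = np.random.default_rng(1)

# Hypothetical library: successor features for 3 previously learned policies.
psi_table = [rng.random((n_states, n_actions, d)) for _ in range(3)]

w_new = np.array([1.0, -0.5, 0.2])   # new task: R'(s, a) = w_new @ phi(s, a)

# Q'^{pi_i}(s, a) = w_new^T psi^{pi_i}(s, a) for every stored policy.
q_values = np.stack([psi @ w_new for psi in psi_table])   # shape (n_policies, S, A)

# In each state, act greedily with respect to the best stored policy's value.
greedy_action = q_values.max(axis=0).argmax(axis=1)
print("initial action per state:", greedy_action)
```

As noted above, this only provides a warm start for \(R'\): the greedy policy over the library is not optimal for the new reward, so fine-tuning or online planning is still needed.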

Meta Learning: Motivation

https://www.youtube.com/watch?v=1eYqV_vGlJY

Meta Learning

Machine Learning Data Set

Meta Reinforcement Learning

RL:

\[\theta^* = \underset{\theta}{\text{argmax}}\, \text{E}_{\pi_\theta} [R(\tau)] = f_\text{RL}(M)\]

Meta RL:

\[\theta^* = \underset{\theta}{\text{argmax}}\, \sum_{i=1}^n \text{E}_{\pi_{\phi_i}} [R(\tau)], \quad \text{where } \phi_i = f_\theta (M_i)\]

Image: Sergey Levine CS285 slides

Important: Exploration can speed up Meta RL

Meta Reinforcement Learning

Approach 1: Pose as POMDP

  • State: \(S = S_M \times \{1,\ldots,n\}\), \(s = (s_M, i)\) — the task index \(i\) is part of the hidden state
  • Observation: \(O = S_M\), \(o = s_M\) — the agent never observes \(i\) directly and must infer it from experience
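
A small sketch of this construction, assuming a hypothetical gym-style interface (`reset`/`step`); `MetaPOMDP` and `Bandit` are illustrative names, not an existing library:

```python
import random

# POMDP view of meta-RL: the hidden state is (s_M, i); the observation reveals
# only s_M, never the latent task index i.

class MetaPOMDP:
    def __init__(self, tasks, seed=0):
        self.tasks = tasks
        self.rng = random.Random(seed)

    def reset(self):
        self.i = self.rng.randrange(len(self.tasks))   # latent task index, never observed
        self.s = self.tasks[self.i].reset()
        return self.s                                  # o = s_M

    def step(self, a):
        self.s, r, done = self.tasks[self.i].step(self.s, a)
        return self.s, r, done                         # reward depends on the hidden i

# Example task family: two-armed bandits that differ only in which arm pays off.
class Bandit:
    def __init__(self, good_arm):
        self.good_arm = good_arm
    def reset(self):
        return 0                                       # single dummy state
    def step(self, s, a):
        return 0, float(a == self.good_arm), False

env = MetaPOMDP([Bandit(0), Bandit(1)])
obs = env.reset()
print(env.step(0))   # reward depends on which bandit was sampled
```

Because the observation omits \(i\), a good policy has to gather information and adapt its behavior within an episode, which is why exploration can speed up meta RL.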

Meta Reinforcement Learning

Approach 2: Gradient-Based Meta-RL (MAML)

RL: Policy Gradient

\[\theta^{k+1} \gets \theta^k + \alpha \nabla_\theta J(\theta^k)\]

Model Agnostic Meta Learning (MAML) for RL

\[f_\theta (M_i) = \theta + \alpha \nabla_\theta J_i (\theta)\]

\[\theta \gets \theta + \beta \sum_i \nabla_\theta J_i \left(\theta + \alpha \nabla_\theta J_i (\theta)\right)\]
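
A minimal numerical sketch of this two-level update, assuming each task objective \(J_i\) is a simple quadratic standing in for the RL return so that the gradients, including the chain rule through the inner step, can be written by hand:

```python
import numpy as np

# Toy MAML sketch with hand-derived gradients. Assumption: each "task" is the
# quadratic objective J_i(theta) = -||theta - c_i||^2, a stand-in for the true
# RL return; this illustrates the update rule, not an RL implementation.

rng = np.random.default_rng(0)
task_centers = rng.normal(size=(5, 2))   # c_i for 5 hypothetical tasks
theta = np.zeros(2)                      # meta-parameters
alpha, beta = 0.1, 0.05                  # inner and outer step sizes

def grad_J(theta, c):
    """Gradient of J_i(theta) = -||theta - c||^2."""
    return -2.0 * (theta - c)

for step in range(200):
    meta_grad = np.zeros_like(theta)
    for c in task_centers:
        # Inner adaptation: phi_i = f_theta(M_i) = theta + alpha * grad J_i(theta)
        phi = theta + alpha * grad_J(theta, c)
        # Chain rule through the inner step: d phi / d theta = (1 - 2*alpha) * I
        meta_grad += (1.0 - 2.0 * alpha) * grad_J(phi, c)
    # Outer (meta) update: theta <- theta + beta * sum_i grad_theta J_i(phi_i)
    theta += beta * meta_grad

print("meta-parameters:  ", theta)                       # ends up near the mean of the c_i
print("task optima mean: ", task_centers.mean(axis=0))
```

The factor `(1 - 2*alpha)` is the Jacobian of the inner adaptation step for this particular quadratic; in practice, differentiating through the inner policy-gradient update is handled by automatic differentiation.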

Meta Policy Gradient

Image: MAML parameter-space diagram with "meta learning" and "learning/adaptation" directions

Recap

  • In Transfer Learning, the goal is to use experience from one or more source domains to learn faster or perform better in one or more target domains.
  • Various methods exist to transfer via knowledge stored in the policy, model, value function, or other features.
  • In Meta Learning, the goal is to learn how to master a new environment quickly.
  • A meta learning problem can be posed as a POMDP.
  • In model agnostic meta learning (MAML), the policy parameters are trained so that one gradient step in a new environment produces a good policy.

270 Transfer and Meta Learning

By Zachary Sunberg
