Transfer and Meta Learning
Last time:
- Imitation Learning
- Inverse Reinforcement Learning
Today:
- How do we transfer knowledge from one domain to another? (e.g. simulated to real-world)
- How do we learn how to learn? (Meta learning)
(Lecture content from Sergey Levine's CS 285 at Berkeley)
Could an RL agent be better at Montezuma's Revenge after watching Indiana Jones?
Transfer Learning and Montezuma's Revenge
Transfer Learning
Transfer Learning: Use experience from one set of tasks for faster learning and better performance on a new task
In RL, task=MDP
Source domain \(\rightarrow\) target domain
- "shot" = number of attempts in the target domain
- "0-shot" = run policy in target domain
- "1-shot" = try task once
- "few shot"
Transfer Learning
How should prior knowledge be stored?
- Q-function
- Policy
- Model
- Features/hidden states
Representation Bottleneck
Transfer Learning
How should prior knowledge be stored?
- Q-function
- Policy
- Model
- Features/hidden states
Pretraining + Finetuning
Pretrain: reward speed in any direction
Fine-tune: reward speed in a specific direction
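A minimal, self-contained sketch of this recipe on a toy one-step problem (the Gaussian "velocity" policy, the reward functions, and the finite-difference trainer are illustrative stand-ins, not the lecture's setup): pretrain on a reward for speed in any direction, then finetune the same parameter on a reward for speed in one specific direction.

```python
import numpy as np

rng = np.random.default_rng(0)

def avg_return(theta, reward_fn, n=256):
    """Average reward of a Gaussian 'velocity' policy N(theta, 1) on a one-step toy task."""
    v = theta + rng.standard_normal(n)
    return reward_fn(v).mean()

def train(theta, reward_fn, iters, lr=0.1, eps=0.5):
    """Toy stand-in for an RL algorithm: finite-difference ascent on the expected reward."""
    for _ in range(iters):
        grad = (avg_return(theta + eps, reward_fn) - avg_return(theta - eps, reward_fn)) / (2 * eps)
        theta += lr * grad
    return theta

# Pretrain: reward speed (magnitude ~2) in any direction -- the broad source task.
theta = train(0.2, lambda v: -np.abs(np.abs(v) - 2.0), iters=200)

# Finetune: reuse the pretrained parameter; reward speed +2 in one specific direction.
theta = train(theta, lambda v: -np.abs(v - 2.0), iters=20)
print(theta)
```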
CAD2RL
Key: Diversity
Actor Mimic
Transfer Learning
How should prior knowledge be stored?
- Q-function
- Policy
- Model
- Features/hidden states
Successor Features
All domains have same \(S, A, T, \gamma\)
Difference: \(R\)
\[Q^\pi(s, a) = E\left[ \sum_{t=0}^\infty \gamma^t R(s_t, a_t) \mid s_0 = s, a_0 = a \right]\]
Let \(R(s, a) = w^\top \phi(s, a)\) where \(\phi\) is a feature vector.
\[= E\left[ \sum_{t=0}^\infty \gamma^t w^\top \phi(s_t, a_t) \mid s_0 = s, a_0 = a \right]\]
\[= w^\top E\left[ \sum_{t=0}^\infty \gamma^t \phi(s_t, a_t) \mid s_0 = s, a_0 = a \right]\]
Successor Feature:
\[\psi^\pi(s, a) \equiv E\left[ \sum_{t=0}^\infty \gamma^t \phi(s_t, a_t) \mid s_0 = s, a_0 = a \right]\]
\[Q^\pi(s, a) = w^\top \psi^\pi(s, a)\]
Using successor features
Given \(\psi^\pi\), one can easily calculate \(Q'^\pi\) for a new reward function \(R' = w'^\top \phi\).
\(Q'^\pi = w'^\top \psi^\pi\)
\[\psi^\pi(s, a) \equiv E\left[ \sum_{t=0}^\infty \gamma^t \phi(s_t, a_t) \mid s_0 = s, a_0 = a \right]\]
\[Q^\pi(s, a) = w^\top \psi^\pi(s, a)\]
Important: Does this yield the optimal policy for \(R'\)?
No! \(w'^\top \psi^\pi\) only evaluates the fixed policy \(\pi\) under the new reward; it satisfies the Bellman equation for \(\pi\), not the Bellman optimality equation:
\[Q^\pi(s, a) = R(s, a) + \gamma E[Q^\pi(s', \pi(s'))]\]
\[Q^*(s, a) = R(s, a) + \gamma E\left[\max_{a'} Q^*(s', a')\right]\]
How to use this in practice:
- Keep a family of good policies and their associated successor features, trained with a variety of reward weights.
- In the target domain, start with the best policy from this set (see the sketch below) and finetune/plan online.
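A minimal sketch of that recipe, assuming tabular successor features \(\psi^{\pi_k}\) have already been computed for a small stored family of policies (the array shapes and random placeholder values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stored successor features for a family of policies pi_1, ..., pi_K:
# psi[k, s, a] is the feature vector psi^{pi_k}(s, a).
num_policies, num_states, num_actions, feature_dim = 2, 5, 3, 4
psi = rng.random((num_policies, num_states, num_actions, feature_dim))  # placeholder values

# New target task: same S, A, T, gamma, but a new reward R'(s, a) = w_new . phi(s, a).
w_new = rng.random(feature_dim)

# Value of every stored policy under the new reward, with no further learning:
# Q'^{pi_k}(s, a) = w_new^T psi^{pi_k}(s, a)
q_new = psi @ w_new                    # shape (num_policies, num_states, num_actions)

# Generalized policy improvement: in each state, act greedily w.r.t. the best stored policy.
# This gives a (often very good) starting point, not the optimal policy for R'.
q_best = q_new.max(axis=0)             # shape (num_states, num_actions)
start_policy = q_best.argmax(axis=1)   # action index per state; finetune/plan online from here
print(start_policy)
```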
Meta Learning: Motivation
https://www.youtube.com/watch?v=1eYqV_vGlJY
Meta Learning
(Figure: a machine learning data set)
Meta Reinforcement Learning
RL:
\[\theta^* = \underset{\theta}{\text{argmax}}\, \text{E}_{\pi_\theta} [R(\tau)] = f_\text{RL}(M)\]
Meta RL:
\[\theta^* = \underset{\theta}{\text{argmax}}\, \sum_{i=1}^n \text{E}_{\pi_{\phi_i}} [R(\tau)] \quad \text{where } \phi_i = f_\theta (M_i)\]
Image: Sergey Levine CS285 slides
Important: Exploration can speed up Meta RL
Meta Reinforcement Learning
Approach 1: Pose as POMDP
\(S = S_M \times \{1,\ldots,n\}\)
\(s = (s_M, i)\)
\(O = S_M\)
\(o = s_M\)
The task index \(i\) is hidden: the agent observes only the MDP state \(s_M\) and must infer which task it is in from its history.
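Because the task index is never observed, a natural policy class for this POMDP has memory: the agent conditions on its history of observations, actions, and rewards (e.g., with a recurrent network) and implicitly infers which MDP it is in. A small sketch of that idea with a hand-rolled recurrent policy in NumPy; all names and sizes are illustrative, and the weights here are untrained:

```python
import numpy as np

rng = np.random.default_rng(0)

class RecurrentPolicy:
    """Memory-based policy for meta-RL posed as a POMDP: the hidden state h summarizes
    the (observation, action, reward) history, which is what allows the agent to
    implicitly infer the unobserved task index i."""

    def __init__(self, obs_dim, n_actions, hidden_dim=16):
        in_dim = obs_dim + n_actions + 1    # observation, previous action (one-hot), previous reward
        self.W_h = rng.normal(0.0, 0.1, (hidden_dim, hidden_dim))
        self.W_x = rng.normal(0.0, 0.1, (hidden_dim, in_dim))
        self.W_out = rng.normal(0.0, 0.1, (n_actions, hidden_dim))
        self.h = np.zeros(hidden_dim)
        self.n_actions = n_actions

    def reset(self):
        self.h[:] = 0.0                     # new task draw: forget what was inferred so far

    def act(self, obs, prev_action, prev_reward):
        x = np.concatenate([obs, np.eye(self.n_actions)[prev_action], [prev_reward]])
        self.h = np.tanh(self.W_h @ self.h + self.W_x @ x)      # recurrent belief-like update
        logits = self.W_out @ self.h
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        return rng.choice(self.n_actions, p=probs)

policy = RecurrentPolicy(obs_dim=3, n_actions=2)
policy.reset()
action = policy.act(obs=np.zeros(3), prev_action=0, prev_reward=0.0)
```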
Meta Reinforcement Learning
Approach 2: Gradient-Based Meta-RL (MAML)
RL: Policy Gradient
\[\theta^{k+1} \gets \theta^k + \alpha \nabla_\theta J(\theta^k)\]
Model-Agnostic Meta Learning (MAML) for RL
\[f_\theta (M_i) = \theta + \alpha \nabla_\theta J_i (\theta)\]
Meta Policy Gradient
\[\theta \gets \theta + \beta \sum_i \nabla_\theta J_i \left[\theta + \alpha \nabla_\theta J_i (\theta)\right]\]
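A self-contained toy sketch of the MAML update above, with simple quadratic surrogate objectives \(J_i\) standing in for each task's expected return and finite-difference gradients standing in for policy-gradient estimates (the task centers, step sizes, and all names are illustrative):

```python
# Each "task" i prefers a different parameter value c_i; J_i is a stand-in for
# the expected return of policy pi_theta in MDP M_i.
task_centers = [-1.0, 0.5, 2.0]
alpha, beta, eps = 0.1, 0.05, 1e-4   # inner (adaptation) step, meta step, finite-difference width

def J(theta, c):
    return -(theta - c) ** 2

def grad_J(theta, c):
    # Finite-difference gradient; a policy-gradient estimator would go here in real meta-RL.
    return (J(theta + eps, c) - J(theta - eps, c)) / (2 * eps)

def adapted_objective(theta, c):
    """J_i evaluated after one inner step: J_i(theta + alpha * grad J_i(theta)) = J_i(f_theta(M_i))."""
    return J(theta + alpha * grad_J(theta, c), c)

theta = 0.0
for _ in range(200):
    # Meta update: theta <- theta + beta * sum_i grad_theta J_i(theta + alpha * grad_theta J_i(theta))
    meta_grad = sum(
        (adapted_objective(theta + eps, c) - adapted_objective(theta - eps, c)) / (2 * eps)
        for c in task_centers
    )
    theta += beta * meta_grad

# At adaptation time, a single inner gradient step specializes theta to one task (phi_i = f_theta(M_i)):
phi_0 = theta + alpha * grad_J(theta, task_centers[0])
print(theta, phi_0)
```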
(Figure: MAML diagram in parameter space, contrasting the meta-learning direction with the learning/adaptation steps for individual tasks)
Recap
- In Transfer Learning, the goal is to use experience from one or more source domains to learn faster or perform better in one or more target domains.
- Various methods exist to transfer via knowledge stored in the policy, model, value function, or other features.
- In Meta Learning, the goal is to learn how to master a new environment quickly.
- A meta learning problem can be posed as a POMDP.
- In model-agnostic meta learning (MAML), the initial policy parameters are trained so that one gradient step in a new environment produces a good policy.
By Zachary Sunberg