(Lecture content from Sergey Levine's CS 285 at Berkeley)
Could an RL agent be better at Montezuma's Revenge after watching Indiana Jones?
Transfer Learning: Use experience from one set of tasks for faster learning and better performance on a new task
In RL, a task is an MDP
Source domain \(\rightarrow\) target domain
How should prior knowledge be stored?
Pretrain: reward speed in any direction
Fine-tune: reward speed in a specific direction
Key: Diversity
How should prior knowledge be stored?
All domains have the same \(S, A, T, \gamma\)
Difference: \(R\)
\[Q^\pi(s, a) = E\left[ \sum_{t=0}^\infty \gamma^t R(s_t, a_t) \mid s_0 = s, a_0 = a \right]\]
Let \(R(s, a) = w^\top \phi(s, a)\) where \(\phi\) is a feature vector.
\[= E\left[ \sum_{t=0}^\infty \gamma^t w^\top \phi(s_t, a_t) \mid s_0 = s, a_0 = a \right]\]
\[= w^\top E\left[ \sum_{t=0}^\infty \gamma^t \phi(s_t, a_t) \mid s_0 = s, a_0 = a \right]\]
Successor Feature:
\[\psi^\pi(s, a) \equiv E\left[ \sum_{t=0}^\infty \gamma^t \phi(s_t, a_t) \mid s_0 = s, a_0 = a \right]\]
\[Q^\pi(s, a) = w^\top \psi^\pi(s, a)\]
Given \(\psi^\pi\), one can easily calculate \(Q'^\pi\) for a new reward function \(R' = w'^\top \phi\).
\(Q'^\pi = w'^\top \psi^\pi\)
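As a concrete illustration, here is a minimal tabular sketch of this transfer step; the shapes, the random values, and the idea that \(\psi^\pi\) has already been learned are assumptions for the example, not anything from the lecture.

```python
import numpy as np

# Tabular successor-feature transfer: psi[s, a] holds the d-dimensional discounted
# feature expectations under a fixed policy pi (random numbers here stand in for a
# learned psi^pi).  A new reward R'(s, a) = w_new . phi(s, a) immediately gives
# Q'^pi without any further learning.
num_states, num_actions, d = 5, 3, 4
rng = np.random.default_rng(0)
psi = rng.normal(size=(num_states, num_actions, d))   # pretend: learned for pi
w_new = rng.normal(size=d)                            # weights of the new reward R'

Q_new = psi @ w_new                    # Q'^pi(s, a) = w_new^T psi^pi(s, a); shape (S, A)
greedy_actions = Q_new.argmax(axis=1)  # greedy w.r.t. Q'^pi: one step of policy
                                       # improvement over pi, not the optimal policy for R'
```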
Important: Does this yield optimal policy for \(R'\)?
No! \(\psi^\pi\) evaluates the fixed policy \(\pi\), so \(w'^\top \psi^\pi\) is \(Q'^\pi\), not \(Q'^*\); acting greedily with respect to it gives only one step of policy improvement over \(\pi\). Compare the two Bellman equations:
\[Q^\pi(s, a) = R(s, a) + \gamma E[Q^\pi(s', \pi(s'))]\]
\[Q^*(s, a) = R(s, a) + \gamma E\left[\max_{a'} Q^*(s', a')\right]\]
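\(\psi^\pi\) itself satisfies a Bellman equation of the same form as the \(Q^\pi\) backup above, with \(\phi\) in place of \(R\): \(\psi^\pi(s, a) = \phi(s, a) + \gamma E[\psi^\pi(s', \pi(s'))]\), so it can be learned with a TD-style update. A minimal sketch on an invented toy MDP (the states, dynamics, features, and step sizes are all made up for illustration):

```python
import numpy as np

# TD learning of successor features psi^pi for a fixed policy pi on a toy 2-state MDP.
rng = np.random.default_rng(0)
gamma, eta = 0.9, 0.1
num_states, num_actions, d = 2, 2, 3
phi = rng.normal(size=(num_states, num_actions, d))   # fixed feature map phi(s, a)
psi = np.zeros((num_states, num_actions, d))
pi = np.array([0, 1])                                  # fixed deterministic policy being evaluated

def step(s, a):
    return s if a == 0 else 1 - s                      # toy deterministic dynamics

s = 0
for _ in range(5000):
    a = rng.integers(num_actions)                      # exploratory behavior action
    s_next = step(s, a)
    # Backup mirrors the Q^pi Bellman equation, with phi in place of R:
    td_target = phi[s, a] + gamma * psi[s_next, pi[s_next]]
    psi[s, a] += eta * (td_target - psi[s, a])
    s = s_next
```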
How to use this in practice:
https://www.youtube.com/watch?v=1eYqV_vGlJY
In ordinary machine learning, the learner is given a data set; in RL, it is given an MDP \(M\); in meta-RL, it is given a set of MDPs \(M_1, \ldots, M_n\).
RL:
\[\theta^* = \underset{\theta}{\text{argmax}}\, \text{E}_{\pi_\theta} [R(\tau)] = f_\text{RL}(M)\]
Meta-RL:
\[\theta^* = \underset{\theta}{\text{argmax}}\, \sum_{i=1}^n \text{E}_{\pi_{\phi_i}} [R(\tau)] \quad \text{where } \phi_i = f_\theta (M_i)\]
Image: Sergey Levine CS285 slides
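To make the contrast concrete, here is a structural sketch in code; the function names and the stub returning 0.0 are placeholders of mine, standing in for Monte Carlo estimates of \(\text{E}_{\pi_\phi}[R(\tau)]\).

```python
# Structural sketch of the two objectives above.
def expected_return(policy_params, mdp):
    return 0.0   # would roll out the policy with these parameters in mdp and average returns

def rl_objective(theta, mdp):
    # ordinary RL: a single MDP M, so theta* = f_RL(M)
    return expected_return(theta, mdp)

def meta_rl_objective(theta, mdps, adapt):
    # meta-RL: sum_i E_{pi_{phi_i}}[R(tau)]  with  phi_i = f_theta(M_i) = adapt(theta, M_i)
    return sum(expected_return(adapt(theta, m), m) for m in mdps)
```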
Important: Exploration can speed up Meta RL
Approach 1: Pose meta-RL as a POMDP. The task index \(i\) is part of the state but never observed, so the agent must infer which task it is in from experience (see the sketch after these definitions):
\(S = S_M \times \{1,\ldots,n\}\)
\(s = (s_M, i)\)
\(O = S_M\)
\(o = s_M\)
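Since the task index is hidden, one common way to act in this POMDP is a policy conditioned on the history of observations, actions, and rewards, e.g. a recurrent network that infers the task implicitly. A minimal sketch (the architecture and all sizes are assumptions, not the lecture's exact construction):

```python
import torch
import torch.nn as nn

# History-conditioned policy for the meta-RL POMDP: the input at each step is
# (o_t, a_{t-1}, r_{t-1}), and the GRU's hidden state acts as a belief over the
# hidden task index i.  Sizes are arbitrary.
obs_dim, act_dim, hidden = 4, 2, 32

class HistoryPolicy(nn.Module):
    def __init__(self):
        super().__init__()
        self.rnn = nn.GRU(obs_dim + act_dim + 1, hidden, batch_first=True)
        self.head = nn.Linear(hidden, act_dim)             # action logits

    def forward(self, obs, prev_act, prev_rew, h=None):
        x = torch.cat([obs, prev_act, prev_rew], dim=-1)   # (batch, time, obs+act+1)
        out, h = self.rnn(x, h)
        return self.head(out), h                           # logits per step + carried belief

policy = HistoryPolicy()
logits, h = policy(torch.zeros(1, 1, obs_dim),
                   torch.zeros(1, 1, act_dim),
                   torch.zeros(1, 1, 1))
```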
Approach 2: Gradient-Based Meta-RL (MAML)
RL: Policy Gradient
Model-Agnostic Meta-Learning (MAML) for RL
\[\theta^{k+1} \gets \theta^k + \alpha \nabla_\theta J(\theta^k)\]
\[f_\theta (M_i) = \theta + \alpha \nabla_\theta J_i (\theta)\]
\[\theta \gets \theta + \beta \sum_i \nabla_\theta J_i \left(\theta + \alpha \nabla_\theta J_i (\theta)\right)\]
The inner term \(\theta + \alpha \nabla_\theta J_i(\theta)\) is the learning/adaptation step; the outer gradient step on \(\theta\) is the meta-learning step; together this update is the meta policy gradient.
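A minimal sketch of this inner/outer update structure, using toy differentiable objectives in place of policy-gradient estimates of \(J_i\); everything task-specific here (the quadratic objectives, targets, step sizes) is invented, and only the update pattern matches the equations above.

```python
import torch

# Toy stand-in for the per-task objectives J_i(theta): each "task" prefers parameters
# near a task-specific target c_i.  In MAML for RL, this would be a policy-gradient
# estimate of E_{pi_theta}[R(tau)] from rollouts in task i.
def J(theta, c):
    return -((theta - c) ** 2).sum()

task_targets = [torch.tensor([1.0, 0.0]), torch.tensor([0.0, 1.0])]
theta = torch.zeros(2, requires_grad=True)
alpha, beta = 0.1, 0.05          # inner (adaptation) and outer (meta) step sizes

for _ in range(200):
    meta_grad = torch.zeros_like(theta)
    for c in task_targets:
        # Learning/adaptation: phi_i = theta + alpha * grad_theta J_i(theta)
        inner_grad = torch.autograd.grad(J(theta, c), theta, create_graph=True)[0]
        phi = theta + alpha * inner_grad
        # Meta policy gradient: grad_theta J_i(phi_i), differentiated through phi_i
        meta_grad = meta_grad + torch.autograd.grad(J(phi, c), theta)[0]
    with torch.no_grad():
        theta += beta * meta_grad    # theta <- theta + beta * sum_i grad_theta J_i(phi_i)
```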