Himanshu Gupta
Date - 11 October 2023
Jin et al., "Is Q-Learning Provably Efficient?" (NeurIPS 2018)
Should I use Deep RL to solve complex sequential decision-making problems?
Model-free RL approaches are more prevalent than model-based RL approaches for deep RL.
Online
Requires less space
More expressive and flexible
People I met at a robotics conference over the summer:
"Hell to the Yeah!"
No theoretical result or analysis to support or explain this empirical observation.
That leads to the theoretical question,
“Can we design model-free algorithms that are sample efficient?”
Answer unknown (until this paper came out)
Question was unanswered even for problems with finitely many states and actions
They showed that model-free algorithms that are sample-efficient can be designed.
The policy obtained using vanilla Q-learning with a UCB exploration policy and an additional bonus term is sample efficient.
Q-learning with an \(\epsilon\)-greedy exploration policy is not sample efficient
This paper presented the first theoretical analysis of sample complexity and regret for model-free algorithms:
\(\sqrt T\) regret or
\(O(1/\epsilon^2)\) samples for an \(\epsilon\)-optimal policy
They considered a tabular episodic Markov Decision Process
MDP \((S, A, H, \mathbb{P}, r)\)
\(S\): Set of states; \(A\): Set of actions
\(H\): Number of steps in each episode
\(\mathbb{P}\): Collection of state transition kernels
\(\mathbb{P}_h(\cdot \mid x,a)\): distribution over next states when action \(a\) is taken in state \(x\) at step \(h\)
\(r_h: S \times A \rightarrow [0,1]\): deterministic reward function at step \(h\)
Policy \( \pi \) - Collection of \(H\) functions
\( \{ \pi_h : S \rightarrow A \}_{h \in [H]} \)
State value function at step h under policy \(\pi\)
\(V_h^{\pi}: S \rightarrow \mathbb{R}\)
\(V_h^{\pi}(x) = E\Big[\sum_{h'=h}^H r_{h'}(x_{h'}, \pi_{h'}(x_{h'})) \mid x_h=x\Big]\)
State-Action value function at step h under policy \(\pi\)
\(Q_h^{\pi}: S \times A \rightarrow \mathbb{R}\)
\(Q_h^{\pi}(x,a) = r_h(x,a) + E\Big[\sum_{h'=h+1}^H r_{h'}(x_{h'}, \pi_{h'}(x_{h'})) \mid x_h=x, a_h=a\Big]\)
Since the state space, action space, and horizon are all finite, there exists an optimal policy \(\pi^\star\) attaining the optimal value \(V_h^\star(x) = \sup_\pi V_h^\pi(x)\) for all \(x\) and \(h\)
Bellman Equation and Optimality Equation
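Restating the standard Bellman equations for a policy \(\pi\) (with the convention \(V_{H+1}^\pi \equiv 0\)):
\[ V_h^\pi(x) = Q_h^\pi(x, \pi_h(x)), \qquad Q_h^\pi(x,a) = r_h(x,a) + [\mathbb{P}_h V_{h+1}^\pi](x,a) \]
and the Bellman optimality equations (with \(V_{H+1}^\star \equiv 0\)):
\[ V_h^\star(x) = \max_{a \in A} Q_h^\star(x,a), \qquad Q_h^\star(x,a) = r_h(x,a) + [\mathbb{P}_h V_{h+1}^\star](x,a) \]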
where \([\mathbb{P}_h V_{h+1}](x,a) := E_{x' \sim \mathbb{P}_h(\cdot \mid x,a)}\big[V_{h+1}(x')\big]\)
The agent plays the game for K episodes
So, the total number of steps is \(T = KH\)
For each episode, an adversary picks the starting state \(x_1^k\) and the agent picks a policy \(\pi_k\)
The total expected regret is defined as
\(Regret(K) = \sum_{k=1}^K \Big[ V_1^\star(x_1^k) - V_1^{\pi_k}(x_1^k) \Big] \)
Reinforcement Learning (RL) is a control-theoretic problem.
The agent tries to maximize its cumulative rewards by interacting with an unknown environment.
Two main approaches to RL:
Model-based algorithms: Learn a model for the environment by interactions, and generate a control policy based on this learned model.
Model-free algorithms: Don’t learn the model. Instead, directly update the value function or the policy.
(Define Value function and policy?)
Probably approximately correct (PAC) learning theory
helps analyze whether and under what conditions a learner L will probably output an approximately correct classifier.
Approximate?
A learner is approximately correct if \(error_D(L) < \epsilon \) where D is the distribution over inputs.
Probably?
If L outputs an approximately correct classifier with probability at least \(1-\delta\), where \(0 \le \delta \le 0.5\), then L is probably approximately correct.
So, what is the relation with the sample size \(m\)?
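As a classic reference point (standard PAC theory, not from this paper): for a finite hypothesis class \(\mathcal{H}\) and a learner that outputs a hypothesis consistent with the training data,
\[ m \ge \frac{1}{\epsilon}\Big(\ln|\mathcal{H}| + \ln\frac{1}{\delta}\Big) \]
i.i.d. samples suffice for the output to be \(\epsilon\)-accurate with probability at least \(1-\delta\).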
What is the analogous PAC question in RL?
Sample complexity - intuitively, how many samples are required to guarantee a probably approximately correct (PAC) solution
How does this relate to regret?
Intuitively:
Compute the regret; it will be a function of the number of samples.
A sublinear regret bound then converts into a PAC-style sample-complexity guarantee: require the average per-episode regret to be at most \(\epsilon\) and solve for the number of samples needed.
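A back-of-the-envelope version of this conversion, using this paper's headline bound \(\mathrm{Regret}(K) \le c\sqrt{H^3SAT}\) with \(T = KH\):
\[ \frac{\mathrm{Regret}(K)}{K} \le c\sqrt{\frac{H^4SA}{K}} \le \epsilon \quad\Longrightarrow\quad K \ge \frac{c^2H^4SA}{\epsilon^2}, \]
so the policy of a uniformly sampled episode is \(\epsilon\)-optimal in expectation after \(O(1/\epsilon^2)\) episodes (treating \(H\), \(S\), \(A\) as constants).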
Prior work in the multi-armed bandit setting has shown that the choice of exploration policy plays an essential role in the efficiency of a learning algorithm.
In the episodic MDP setting, Q-learning with the commonly used \(\epsilon\)-greedy exploration strategy can be very inefficient, taking exponentially many episodes to learn in the worst case
In contrast, the best existing sample-efficiency guarantees in this setting belonged to model-based RL algorithms.
This work’s main theoretical result: a sample-efficiency (regret) guarantee for variants of Q-learning that incorporate UCB-based exploration!
The two proposed algorithms are:
Q-learning with UCB-Hoeffding
Q-learning with UCB-Bernstein
Hoeffding inequality: a concentration bound that uses only the range of the random variables
Bernstein inequality: a sharper concentration bound that also uses their variance
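To make the UCB-Hoeffding variant concrete, here is a minimal Python sketch of the algorithm as described in the paper; the environment interface (`env.reset`, `env.step`) and the constant `c` are illustrative assumptions, not the paper's code.

```python
import numpy as np

def q_learning_ucb_hoeffding(env, S, A, H, K, p=0.05, c=1.0):
    """Tabular Q-learning with UCB-Hoeffding bonuses (after Jin et al. 2018)."""
    iota = np.log(S * A * H * K / p)       # log factor: iota = log(SAT/p), T = KH
    Q = np.full((H, S, A), float(H))       # optimistic initialization: Q_h(x, a) = H
    V = np.zeros((H + 1, S))               # V_{H+1} = 0 by convention
    V[:H] = H                              # consistent with the optimistic Q
    N = np.zeros((H, S, A), dtype=int)     # visit counts N_h(x, a)

    for k in range(K):
        x = env.reset()                    # episode start state x_1^k
        for h in range(H):
            a = int(np.argmax(Q[h, x]))    # act greedily w.r.t. optimistic Q
            r, x_next = env.step(h, x, a)  # assumed interface: (reward, next state)
            N[h, x, a] += 1
            t = N[h, x, a]
            alpha = (H + 1) / (H + t)      # the paper's learning rate
            b = c * np.sqrt(H**3 * iota / t)   # Hoeffding-style exploration bonus
            target = r + V[h + 1, x_next] + b
            Q[h, x, a] = (1 - alpha) * Q[h, x, a] + alpha * target
            V[h, x] = min(H, Q[h, x].max())    # value truncated at H
            x = x_next
    return Q, V
```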
Theorem 1 shows that under a rather simple choice of exploration bonus, Q-learning can be made very efficient, enjoying an \(O(\sqrt{T})\) regret.
Its regret has a slightly worse dependency on \(H\). However:
the algorithm is online
does not store additional data besides the table of Q values
Regret of Q-learning (UCB-H) is as good as the best model-based one
Same dependency on S, A and T.
In Q-learning (UCB-H), the exploration bonus is \(b_t = c\sqrt{H^3\iota/t}\) for an absolute constant \(c\), where \(\iota := \log(SAT/p)\) and \(p\) is the failure probability,
and the learning rate is \(\alpha_t = \frac{H+1}{H+t}\)
Theorem 2 shows that for Q-learning with UCB-B exploration, the leading term in regret scales as \(\sqrt T\).
It also improves by a factor of \(\sqrt H\) over UCB-H exploration, at the cost of computing a more complicated exploration bonus term.
Theorem 2 has an additive term in its regret, which dominates the total regret when T is not very large compared with S, A and H.
Regret of UCB-B is only one \(\sqrt H\) factor worse than the best regret achieved by model-based algorithms. However,
the algorithm is online
does not store additional data besides the table of Q values
Regret of Q-learning (UCB-B) is as good as the best model-based one
Same dependency on S, A and T.
Theorem 3 shows that both variants of their algorithm are nearly optimal!
They differ from the optimal regret only by factors of \(H\) and \(\sqrt{H}\), respectively
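For reference, the bounds being compared, up to logarithmic factors (as reported in the paper's comparison table):
\[ \text{Q-learning (UCB-H)}: \tilde{O}(\sqrt{H^4SAT}), \qquad \text{Q-learning (UCB-B)}: \tilde{O}(\sqrt{H^3SAT}), \]
\[ \text{best model-based}: \tilde{O}(\sqrt{H^2SAT}), \qquad \text{lower bound}: \Omega(\sqrt{H^2SAT}) \]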
Step 1: Define new terms and intermediate variables
Why this particular value for \(\alpha_t\)? To ensure the regret is not exponential in \(H\) (as it would be with the standard \(\alpha_t = 1/t\) rate)
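Concretely, the paper defines the weights
\[ \alpha_t^0 := \prod_{j=1}^{t}(1-\alpha_j), \qquad \alpha_t^i := \alpha_i \prod_{j=i+1}^{t}(1-\alpha_j), \]
so that after \(t\) updates of \((x,a)\) at step \(h\),
\[ Q_h(x,a) = \alpha_t^0 H + \sum_{i=1}^{t} \alpha_t^i \Big[ r_h(x,a) + V_{h+1}^{k_i}(x_{h+1}^{k_i}) + b_i \Big], \]
a weighted average in which the choice \(\alpha_t = \frac{H+1}{H+t}\) keeps too much weight from concentrating on early, inaccurate updates.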
Step 2: Define Lemmas
(Intuition: these just establish properties of the weights and intermediate variables that are reused later for simplification.)
Step 2: Define Lemmas
(Intuition: a lemma that gives a recursive formula for \(Q - Q^\star\) as a weighted average of previous updates.)
Proof?
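For reference, the recursion has roughly the following form (Lemma 4.2 in the paper), where \(k_1 < \dots < k_t\) are the episodes in which \((x,a)\) was previously taken at step \(h\):
\[ (Q_h^k - Q_h^\star)(x,a) = \alpha_t^0\big(H - Q_h^\star(x,a)\big) + \sum_{i=1}^{t} \alpha_t^i \Big[ (V_{h+1}^{k_i} - V_{h+1}^\star)(x_{h+1}^{k_i}) + \big[(\hat{\mathbb{P}}_h^{k_i} - \mathbb{P}_h)V_{h+1}^\star\big](x,a) + b_i \Big], \]
with \(\hat{\mathbb{P}}_h^{k_i}\) the empirical single-sample transition.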
Step 2: Define Lemmas
(Intuition: the lemma shows that \(Q^k\) is always an upper bound on \(Q^\star\) at any episode \(k\), and that \(Q - Q^\star\) can be bounded by quantities from the next step.)
Proof?
Step 3: Use Lemmas and other definitions to prove the Theorem
Proof? The main idea of the rest of the proof is to upper bound \(\delta_h^k := (V_h^k - V_h^{\pi_k})(x_h^k)\) by values from the next step, \(\delta_{h+1}^k\).
They did not discuss how this analysis can be extended to other MDP settings.
Why choose an episodic MDP?
No mention of any future work in the paper.
Nor any intuition for extending the results to problems with continuous states and actions
Establishing the connection between regret and sample efficiency should have been done earlier in the paper.
Reading:
If you found this interesting and want to read similar work, then check this out - https://sites.google.com/view/cjin/publications?authuser=0
Also, this paper - "When Is Partially Observable Reinforcement Learning Not Scary?"
Future Work:
I am interested in seeing whether the same theoretical results can be extended to deep RL techniques.
They showed that model-free algorithms that are sample-efficient can be designed.
Q-learning, when equipped with a UCB exploration policy that incorporates estimates of the confidence of Q values and assigns exploration bonuses, achieves total regret \(\tilde{O}(\sqrt{H^3SAT})\).
\(S\) and \(A\) are the numbers of states and actions, \(H\) is the number of steps per episode, and \(T\) is the total number of steps.
Q-learning is online and has a significant advantage over model-based algorithms in terms of time and space complexities.
This is the first theoretical analysis for model-free algorithms, featuring \(\sqrt{T}\) regret or, equivalently, \(O(1/\epsilon^2)\) samples for an \(\epsilon\)-optimal policy.