ASEN 6519 - DMU ++ Paper Presentation
Himanshu Gupta
Date - 11 October 2023
Jin et al. (NeurIPS 2018)
Is Q-learning Provably Efficient?
MOTIVATION
- Should I use Deep RL to solve complex sequential decision-making problems?
- Model-free RL approaches are more prevalent than model-based RL approaches in deep RL:
  - online
  - requires less space
  - more expressive and flexible
- People I met at a robotics conference over the summer: "Hell to the Yeah!"
MOTIVATION
- However, it has been shown "empirically" that model-free algorithms suffer from a higher sample complexity than model-based approaches.
- To train a physical robot for a simple task, a model-based method may take about 20 minutes, while a policy-gradient method may take weeks.
MOTIVATION
- No theoretical result or analysis existed to support or explain this empirical observation.
- That leads to the theoretical question:
  "Can we design model-free algorithms that are sample efficient?"
- Should we combine model-free approaches with model-based approaches?
- The answer was unknown (until this paper came out).
- The question was unanswered even for problems with finitely many states and actions.
CONTRIBUTION
- They showed that sample-efficient model-free algorithms can be designed.
  - The policy obtained using vanilla Q-learning with a UCB exploration policy and an additional bonus term is sample efficient.
  - Q-learning with an \(\epsilon\)-greedy exploration policy is not.
- This paper presented the first theoretical analysis of sample complexity and regret for model-free algorithms:
  - \(\sqrt T\) regret, or
  - \(O(1/\epsilon^2)\) samples for an \(\epsilon\)-optimal policy

Problem Setting \(\rightarrow\) Background
PROBLEM SETTING
- They considered a tabular episodic Markov Decision Process, MDP\((S, A, H, \mathbb{P}, r)\)
  - \(H\): number of steps in each episode
  - \(\mathbb{P}\): state transition matrix, where \(\mathbb{P}_h(\cdot|x,a)\) gives the distribution over next states if action \(a\) is taken in state \(x\) at step \(h\)
  - \(r_h: S \times A \rightarrow [0,1]\): deterministic reward function at step \(h\)
- Policy \(\pi\): a collection of \(H\) functions \(\{\pi_h : S \rightarrow A\}_{h \in [H]}\)
PROBLEM SETTING
- State value function at step \(h\) under policy \(\pi\), \(V_h^{\pi}: S \rightarrow \mathbb{R}\):
  \(V_h^{\pi}(x) = E\Big[\sum_{h'=h}^H r_{h'}(x_{h'}, \pi_{h'}(x_{h'})) \mid x_h=x\Big]\)
- State-action value function at step \(h\) under policy \(\pi\), \(Q_h^{\pi}: S \times A \rightarrow \mathbb{R}\):
  \(Q_h^{\pi}(x,a) = r_h(x,a) + E\Big[\sum_{h'=h+1}^H r_{h'}(x_{h'}, \pi_{h'}(x_{h'})) \mid x_h=x,\, a_h=a\Big]\)
PROBLEM SETTING
- Since the state space, action space, and horizon are all finite, there exists an optimal policy \(\pi^\star\) that attains the best possible value \(V_h^{\star}(x) = \sup_{\pi} V_h^{\pi}(x)\) for all \(x \in S\) and \(h \in [H]\).
PROBLEM SETTING
- Bellman equation and Bellman optimality equation:
  \(V_h^{\pi}(x) = Q_h^{\pi}(x, \pi_h(x)), \qquad Q_h^{\pi}(x,a) = r_h(x,a) + [\mathbb{P}_h V_{h+1}^{\pi}](x,a), \qquad V_{H+1}^{\pi}(x) = 0\)
  \(V_h^{\star}(x) = \max_{a \in A} Q_h^{\star}(x,a), \qquad Q_h^{\star}(x,a) = r_h(x,a) + [\mathbb{P}_h V_{h+1}^{\star}](x,a), \qquad V_{H+1}^{\star}(x) = 0\)
  where \([\mathbb{P}_h V_{h+1}](x,a) = E_{x' \sim \mathbb{P}_h(\cdot|x,a)}\big[V_{h+1}(x')\big]\)
PROBLEM SETTING
- The agent plays the game for \(K\) episodes, so the total number of steps is \(T = KH\).
- For each episode \(k\), an adversary picks the starting state \(x_1^k\), and the agent picks a policy \(\pi_k\).
- The total expected regret is defined as
  \(\text{Regret}(K) = \sum_{k=1}^K \Big[ V_1^\star(x_1^k) - V_1^{\pi_k}(x_1^k) \Big]\)
BACKGROUND
- Reinforcement learning (RL) is a control-theoretic problem: the agent tries to maximize its cumulative reward by interacting with an unknown environment.
- Two main approaches to RL:
  - Model-based algorithms: learn a model of the environment through interaction, and generate a control policy based on this learned model.
  - Model-free algorithms: don't learn the model; instead, directly update the value function or the policy.
BACKGROUND
- Probably approximately correct (PAC) learning theory helps analyze whether, and under what conditions, a learner \(L\) will probably output an approximately correct classifier.
- Approximately correct?
  - A learner is approximately correct if \(error_D(L) < \epsilon\), where \(D\) is the distribution over inputs.
- Probably?
  - If \(L\) outputs an approximately correct classifier with probability at least \(1-\delta\), where \(0 \leq \delta \leq 0.5\), then \(L\) is probably approximately correct.
BACKGROUND
- Probably approximately correct (PAC) learning theory helps analyze whether, and under what conditions, a learner \(L\) will probably output an approximately correct classifier.
- So, what is the relation with the sample size \(m\)?
- What is the analogous question for PAC learning in RL?
BACKGROUND
- Sample complexity: intuitively, "how many examples are required to guarantee a probably approximately correct solution" (a PAC solution).
- How does this relate to regret? Intuitively:
  - Compute the regret: it will be a function of the number of samples.
  - If the total regret grows as \(\sqrt T\), the average per-episode suboptimality shrinks as \(1/\sqrt K\); it drops below \(\epsilon\) after \(K = O(1/\epsilon^2)\) episodes, which is exactly the \(O(1/\epsilon^2)\) sample complexity claimed for an \(\epsilon\)-optimal policy.
THEORY
- Prior work in the multi-armed bandit setting has shown that the choice of exploration policy plays an essential role in the efficiency of a learning algorithm.
- To achieve good sample efficiency, an algorithm must manage the tradeoff between exploration and exploitation.
THEORY
- In episodic MDPs, Q-learning with the commonly used \(\epsilon\)-greedy exploration strategy can be very inefficient: it can take exponentially many episodes to learn.
THEORY
Model-based RL: the best model-based algorithms in this setting, e.g., UCBVI (Azar et al., 2017), achieve \(\tilde{O}(\sqrt{H^2 SAT})\) regret, matching the \(\Omega(\sqrt{H^2 SAT})\) lower bound.
THEORY
- This work's main theoretical result: a sample complexity and regret analysis for variants of Q-learning that incorporate UCB-based exploration!
- The two proposed algorithms are:
  - Q-learning with UCB-Hoeffding
  - Q-learning with UCB-Bernstein
THEORY
- Hoeffding's inequality
  - Intuition: provides an upper bound on the probability that the sum of bounded independent random variables deviates from its expected value by more than a certain amount (see the statement below).
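For reference, a standard form of the inequality (not shown on the slide): if \(X_1, \dots, X_n\) are independent with \(X_i \in [a_i, b_i]\) and \(S_n = \sum_{i=1}^n X_i\), then

\[
P\big(|S_n - E[S_n]| \ge t\big) \le 2\exp\!\left(\frac{-2t^2}{\sum_{i=1}^n (b_i - a_i)^2}\right).
\]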
THEORY
- Bernstein's inequality
  - Intuition: also bounds the probability that a sum of bounded independent random variables deviates from its expected value, but additionally exploits the variance of the summands, yielding a tighter bound when the variance is small (see the statement below).
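For reference, a standard form (not shown on the slide): if \(X_1, \dots, X_n\) are independent with \(|X_i - E[X_i]| \le M\) almost surely, then

\[
P\left(\Big|\sum_{i=1}^n \big(X_i - E[X_i]\big)\Big| \ge t\right) \le 2\exp\!\left(\frac{-t^2/2}{\sum_{i=1}^n \mathrm{Var}(X_i) + Mt/3}\right).
\]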
ALGORITHM 1: Q-learning with UCB-Hoeffding
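The algorithm appeared on the slide as a figure. Below is a minimal Python sketch of Q-learning with UCB-Hoeffding as described in the paper; the environment interface (`env.reset`, `env.step`) and the constant `c` are assumptions for illustration, since the paper leaves the absolute constant unspecified.

```python
import numpy as np

def q_learning_ucb_hoeffding(env, S, A, H, K, p=0.05, c=1.0):
    """Sketch of Q-learning with UCB-Hoeffding (Jin et al., 2018).

    Assumed environment interface (not from the paper):
    env.reset() -> initial state x_1, and
    env.step(h, x, a) -> (reward, next_state).
    c stands in for the paper's unspecified absolute constant.
    """
    T = K * H
    iota = np.log(S * A * T / p)            # log factor iota := log(SAT/p)
    Q = np.full((H, S, A), float(H))        # optimistic initialization Q_h(x,a) = H
    V = np.zeros((H + 1, S))                # V_{H+1}(x) = 0
    N = np.zeros((H, S, A), dtype=int)      # visit counts N_h(x,a)

    for _ in range(K):
        x = env.reset()
        for h in range(H):
            a = int(np.argmax(Q[h, x]))     # act greedily w.r.t. the optimistic Q
            r, x_next = env.step(h, x, a)
            N[h, x, a] += 1
            t = N[h, x, a]
            alpha = (H + 1) / (H + t)       # learning rate alpha_t = (H+1)/(H+t)
            b = c * np.sqrt(H**3 * iota / t)  # Hoeffding-style exploration bonus
            Q[h, x, a] = (1 - alpha) * Q[h, x, a] + alpha * (r + V[h + 1, x_next] + b)
            V[h, x] = min(H, Q[h, x].max())   # truncated optimistic value
            x = x_next
    return Q, V
```

Note how each \((x, a, h)\) entry keeps its own visit count, so its exploration bonus shrinks as \(1/\sqrt t\) as that entry is visited more often.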
THEORY
- Theorem 1 shows that, under a rather simple choice of exploration bonus, Q-learning can be made very efficient, enjoying \(O(\sqrt T)\) regret (statement paraphrased below).
- This is the first analysis of a model-free procedure that features a \(\sqrt T\) regret without requiring access to a “simulator.”
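Paraphrasing the theorem from the paper: there exists an absolute constant \(c > 0\) such that, for any \(p \in (0,1)\), with probability at least \(1-p\), the total regret of Q-learning with UCB-Hoeffding is at most \(O(\sqrt{H^4 SAT \iota})\), where \(\iota := \log(SAT/p)\).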
THEORY
- Their regret bound has a slightly worse dependency on \(H\). However,
  - the algorithm is online
  - it does not store additional data besides the table of Q values
- Otherwise, the regret of Q-learning (UCB-H) is as good as the best model-based ones:
  - same dependency on \(S\), \(A\), and \(T\)
ALGORITHM 2: Q-learning with UCB-Bernstein
In Q-learning (UCB-H), the bonus is \(b_t = c\sqrt{H^3 \iota / t}\) with \(\iota := \log(SAT/p)\), and in UCB-B the \(H^3\) factor is replaced by an empirical estimate of the variance of the next-step value (a Bernstein-style bonus), which is tighter when that variance is small.
THEORY
- Theorem 2 shows that for Q-learning with UCB-B exploration, the leading term in the regret scales as \(\sqrt T\) (statement paraphrased below).
- It also improves on UCB-H exploration by a factor of \(\sqrt H\), at the cost of computing a more complicated exploration bonus term.
- Theorem 2 has an additive term in its regret, which dominates the total regret when \(T\) is not very large compared with \(S\), \(A\), and \(H\).
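Paraphrasing from the paper: with high probability, the total regret of Q-learning with UCB-Bernstein is \(O(\sqrt{H^3 SAT \iota})\) plus an additive lower-order term that depends polynomially on \(S\), \(A\), and \(H\) and on \(T\) only through log factors (its exact form is omitted here).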
THEORY
- The regret of UCB-B is only one \(\sqrt H\) factor worse than the best regret achieved by model-based algorithms. However,
  - the algorithm is online
  - it does not store additional data besides the table of Q values
- Otherwise, the regret of Q-learning (UCB-B) is as good as the best model-based ones:
  - same dependency on \(S\), \(A\), and \(T\)
THEORY
- Theorem 3 shows that both variants of their algorithm are nearly optimal!
  - They differ from the optimal regret only by a factor of \(H\) (UCB-H) or \(\sqrt H\) (UCB-B) (lower bound paraphrased below).
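Paraphrasing the lower bound from the paper: for any algorithm, there exists an episodic MDP on which the expected total regret is at least \(\Omega(\sqrt{H^2 SAT})\).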
RESULTS
RESULTS - THEOREM 1 PROOF
Step 1: Define new terms and intermediate variables
RESULTS - THEOREM 1 PROOF
Step 1: Define new terms and intermediate variables
Why this particular value for \(\alpha_t\)? To ensure the regret is not exponential in \(H\) (see the definitions below).
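Reconstructing the definitions from the paper (the slide showed them as a figure): the learning rate and its induced weights are

\[
\alpha_t = \frac{H+1}{H+t}, \qquad \alpha_t^0 := \prod_{j=1}^{t}(1-\alpha_j), \qquad \alpha_t^i := \alpha_i \prod_{j=i+1}^{t}(1-\alpha_j).
\]

A key property is \(\sum_{t=i}^{\infty} \alpha_t^i \le 1 + 1/H\), so estimation errors are amplified by at most \((1+1/H)^H \le e\) across the \(H\)-step recursion; with the standard linear rate \(\alpha_t = 1/t\), early and highly inaccurate estimates of \(V_{h+1}\) would retain too much weight, and the regret can grow exponentially in \(H\).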
RESULTS - THEOREM 1 PROOF
Step 2: Define Lemmas
(Intuition: these are simply intermediate properties of the variables just defined, to be used for simplification in later steps.)
RESULTS - THEOREM 1 PROOF
Step 2: Define Lemmas
(Intuition: a lemma that gives a recursive formula for \(Q - Q^\star\) as a weighted average of previous updates; a reconstruction follows below.)
Proof?
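A reconstruction of the recursion from the paper (the slide showed it as a figure): with \(t = N_h^k(x,a)\) and \(k_1 < \dots < k_t\) the episodes in which \((x,a)\) was previously taken at step \(h\),

\[
(Q_h^k - Q_h^\star)(x,a) = \alpha_t^0\big(H - Q_h^\star(x,a)\big) + \sum_{i=1}^{t} \alpha_t^i \Big[\big(V_{h+1}^{k_i} - V_{h+1}^\star\big)\big(x_{h+1}^{k_i}\big) + \big[(\hat{\mathbb{P}}_h^{k_i} - \mathbb{P}_h)V_{h+1}^\star\big](x,a) + b_i\Big],
\]

where \([\hat{\mathbb{P}}_h^k V](x,a) := V(x_{h+1}^k)\) denotes the single-sample empirical transition estimate.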
RESULTS - THEOREM 1 PROOF
Step 2: Define Lemmas
(Intuition: this lemma shows that \(Q^k\) is always an upper bound on \(Q^\star\) at any episode \(k\) (optimism), and that \(Q - Q^\star\) can be bounded by quantities from the next step; a rough reconstruction follows below.)
Proof?
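A rough reconstruction from the paper, suppressing constants: with high probability, for all \((x,a,h,k)\) and \(t = N_h^k(x,a)\),

\[
0 \le (Q_h^k - Q_h^\star)(x,a) \le \alpha_t^0 H + \sum_{i=1}^{t}\alpha_t^i \big(V_{h+1}^{k_i} - V_{h+1}^\star\big)\big(x_{h+1}^{k_i}\big) + \beta_t, \qquad \beta_t = \Theta\big(\sqrt{H^3\iota/t}\big).
\]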
RESULTS - THEOREM 1 PROOF
Step 3: Use Lemmas and other definitions to prove the Theorem
Proof? The main idea of the rest of the proof is to upper bound \(\delta_h^k\) by values from the next step, \(\delta_{h+1}^k\) (notation reconstructed below).
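Following the paper's notation (reconstructed, since the slide content was a figure): \(\delta_h^k := (V_h^k - V_h^{\pi_k})(x_h^k)\). By the optimism lemma \(V_1^k \ge V_1^\star\), so \(\text{Regret}(K) \le \sum_{k=1}^K \delta_1^k\), and the proof unrolls the bound on \(\delta_h^k\) from \(h = 1\) down to \(h = H\).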
RESULTS - THEOREM 1 PROOF
Step 3: Use Lemmas and other definitions to prove the Theorem
RESULTS - THEOREM 1 PROOF
Step 3: Use Lemmas and other definitions to prove the Theorem
CRITIQUE
- They did not discuss how this analysis could be extended to other MDP settings.
  - Why choose an episodic MDP?
- No mention of any future work in the paper,
  - nor any intuition for extending the results to problems with continuous states and actions.
- The connection between regret and sample efficiency should have been established earlier in the paper.
IMPACT AND LEGACY
FUTURE WORK and ADDITIONAL READING
- Reading:
  - If you found this interesting and want to read similar work, check out https://sites.google.com/view/cjin/publications?authuser=0
  - Also, this paper: "When Is Partially Observable Reinforcement Learning Not Scary?"
- Future Work:
  - I am interested in seeing whether the same theoretical results can be generalized to deep RL techniques.
CONTRIBUTIONS (RECAP)
- They showed that sample-efficient model-free algorithms can be designed.
  - Q-learning, when equipped with a UCB exploration policy that incorporates estimates of the confidence of Q values and assigns exploration bonuses, achieves total regret \(O(\sqrt{H^3 SAT})\).
  - \(S\) and \(A\) are the numbers of states and actions, \(H\) is the number of steps per episode, and \(T\) is the total number of steps.
  - Q-learning is online and has a significant advantage over model-based algorithms in terms of time and space complexity.
- This is the first theoretical analysis for model-free algorithms featuring \(\sqrt T\) regret or, equivalently, \(O(1/\epsilon^2)\) samples for an \(\epsilon\)-optimal policy.
CONTRIBUTIONS (RECAP)
- They showed that sample-efficient model-free algorithms can be designed.
  - The policy obtained using vanilla Q-learning with a UCB exploration policy and an additional bonus term is sample efficient.
  - Q-learning with an \(\epsilon\)-greedy exploration policy is not.
- This paper presented the first theoretical analysis of sample complexity and regret for model-free algorithms:
  - \(\sqrt T\) regret, or
  - \(O(1/\epsilon^2)\) samples for an \(\epsilon\)-optimal policy