ASEN 6519 - DMU ++ Paper Presentation

Himanshu Gupta

Date - 11 October 2023

Jin et al. (NeurIPS 2018)

Is Q-learning Provably Efficient?

MOTIVATION

  • Should I use Deep RL to solve complex sequential decision-making problems?

  • Model-free RL approaches are more prevalent than model-based RL approaches for deep RL.

    • Online

    • Requires less space

    • More expressive and flexible

  • People I met at a robotics conference over the summer:

    • "Hell to the Yeah!"

MOTIVATION

  • However, it has been shown "empirically" that model-free algorithms suffer from a higher sample complexity than model-based approaches.

     
  • To train a physical robot for a simple task, a model-based method may take about 20 minutes, while a policy-gradient method may take weeks.

MOTIVATION

  • No theoretical result or analysis to support or explain this empirical observation.

  • That leads to the theoretical question,
    “Can we design model-free algorithms that are sample efficient?“

  • Should we combine model-free approaches with model-based approaches?
  • Answer unknown (until this paper came out)

  • The question was unanswered even for problems with finitely many states and actions

CONTRIBUTION

  • They showed that model-free algorithms that are sample-efficient can be designed.

     

  • The policy obtained using vanilla Q-learning with a UCB exploration policy and an additional bonus term is sample efficient.

    • Q-learning with \(\epsilon\)-greedy exploration policy is not


       
  • This paper presented the first-ever theoretical analysis on sample complexity and regret for model-free algorithms

    • \(\sqrt T\) regret or  

    • \(O(1/\epsilon^2)\) samples for ε-optimal policy

Problem Setting \(\rightarrow\) Background

PROBLEM SETTING

  • They considered a tabular episodic Markov Decision Process

    • MDP (\(S,A,H,\mathbb{P},r\)) 

    • \(S\) : Set of states, \(A\) : Set of actions

    • \(H\) : Number of steps in each episode 

    • \(\mathbb{P}\): State transition matrix

      • \(\mathbb{P}_h(\cdot|x,a)\) : Distribution over next states when action \(a\) is taken in state \(x\) at step \(h\)

    • \(r_h: S \times A \rightarrow [0,1]\) : Deterministic reward at step \(h\)
       

  • Policy \( \pi \) - Collection of \(H\) functions

    • \( \{ \pi_h : S \rightarrow A  \}_{h \in [H]} \)

PROBLEM SETTING

  • State value function at step h under policy \(\pi\)

    • \(V_h^{\pi}: S \rightarrow \mathbb{R}\) 

    • \(V_h^{\pi}(x) =  E\Big[\sum_{h'=h}^H r_{h'}(x_{h'}, \pi_{h'}(x_{h'})) \mid x_h=x\Big]\)

       

  • State-Action value function at step h under policy \(\pi\)

    • \(Q_h^{\pi}: S \times A \rightarrow \mathbb{R}\) 

    • \(Q_h^{\pi}(x,a) = r_h(x,a) + E\Big[\sum_{h'=h+1}^H r_{h'}(x_{h'}, \pi_{h'}(x_{h'})) \mid x_h=x,a_h=a\Big]\)

PROBLEM SETTING

  • Since the state space, the action space, and the horizon are all finite, there exists an optimal policy \( \pi^\star \) that achieves the optimal value \(V_h^\star(x) = \sup_{\pi} V_h^{\pi}(x)\) for all \(x \in S\) and \(h \in [H]\)

PROBLEM SETTING

  • Bellman Equation and Bellman Optimality Equation

    • \(V_h^{\pi}(x) = Q_h^{\pi}(x, \pi_h(x))\) and \(Q_h^{\pi}(x,a) = r_h(x,a) + [\mathbb{P}_h V_{h+1}^{\pi}](x,a)\), with \(V_{H+1}^{\pi} \equiv 0\)

    • \(V_h^{\star}(x) = \max_{a \in A} Q_h^{\star}(x,a)\) and \(Q_h^{\star}(x,a) = r_h(x,a) + [\mathbb{P}_h V_{h+1}^{\star}](x,a)\), with \(V_{H+1}^{\star} \equiv 0\)

where \( [\mathbb{P}_hV_{h+1}](x,a) =  E_{x' \sim \mathbb{P}_h(\cdot|x,a)}\Big[ V_{h+1}(x')\Big]\)
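
As a concrete illustration of these recursions, here is a minimal Python sketch of finite-horizon policy evaluation by backward induction; the array shapes and the names P, r, pi are assumptions made for the example, not notation from the paper.

```python
import numpy as np

def evaluate_policy(P, r, pi, H, S, A):
    """Backward-induction policy evaluation for a tabular episodic MDP.

    P[h, x, a] : probability vector over next states (length S)
    r[h, x, a] : reward in [0, 1]
    pi[h, x]   : action chosen by the policy at step h in state x
    Returns V[h, x] and Q[h, x, a] for h = 0, ..., H-1.
    """
    V = np.zeros((H + 1, S))          # V_{H+1} = 0 by convention
    Q = np.zeros((H, S, A))
    for h in reversed(range(H)):      # Bellman recursion, last step first
        for x in range(S):
            for a in range(A):
                Q[h, x, a] = r[h, x, a] + P[h, x, a] @ V[h + 1]
            V[h, x] = Q[h, x, pi[h, x]]
    return V, Q
```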

PROBLEM SETTING

  • The agent plays the game for K episodes 

    • So, the total number of steps is \(T = KH\)
       

  • For each episode, an adversary picks the starting state \(x_1^k\) and the agent picks a policy \(\pi_k\)
     

  • Total expected Regret is defined as

    \(Regret(K) = \sum_{k=1}^K \Big[ V_1^\star(x_1^k) - V_1^{\pi_k}(x_1^k) \Big] \)

BACKGROUND

  • Reinforcement Learning (RL) is a control-theoretic problem.

    • The agent tries to maximize its cumulative rewards via interacting with an unknown environment.
       

  • Two main approaches to RL:

    • Model-based algorithms: Learn a model of the environment through interaction, and generate a control policy based on this learned model.
       

    • Model-free algorithms: Don’t learn the model. Instead, directly update the value function or the policy.

      (Define Value function and policy?)

BACKGROUND

  • Probably approximately correct (PAC) learning theory

    • helps analyze whether and under what conditions a learner L will probably output an approximately correct classifier.
       

  • Approximate?

    • A learner is approximately correct if \(error_D(L) < \epsilon \) where D is the distribution over inputs.
       

  • Probably?

    • If L will output an approximately correct classifier with probability at least \(1-\delta\), where \(0 \le \delta \le 0.5\), then L is probably approximately correct.

       

BACKGROUND

  • Probably approximately correct (PAC) learning theory

    • helps analyze whether and under what conditions a learner L will probably output an approximately correct classifier.
       

  • So, what is the relation with the sample size \(m\)? (one standard bound is sketched below)
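
For the simplest setting (a finite hypothesis class and a learner that outputs a hypothesis consistent with the training data), a standard textbook bound (not taken from these slides) is

\[
m \;\ge\; \frac{1}{\epsilon}\left(\ln|\mathcal{H}| + \ln\frac{1}{\delta}\right)
\]

examples suffice for the output to be probably (with probability at least \(1-\delta\)) approximately (within error \(\epsilon\)) correct.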

What is the corresponding question for PAC learning in RL?

BACKGROUND

  • Sample Complexity - intuitively, it means "how many examples are required to guarantee a probably approximately correct solution" (PAC Solution)

     

  • How does this relate to regret?

  • Intuitively:

    • The cumulative regret is a function of the number of samples (the total number of steps, \(T\)).

    • If the regret grows sublinearly, e.g. like \(\sqrt T\), then the average suboptimality per episode shrinks as more samples are collected; setting that average to \(\epsilon\) and solving for \(T\) gives the sample complexity of learning an \(\epsilon\)-optimal policy.
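
As a rough worked conversion (a sketch of the standard argument using the paper's \(O(\sqrt{H^3SAT})\) regret bound, not a calculation from the slides): with \(T = KH\),

\[
\frac{1}{K}\sum_{k=1}^{K}\Big[V_1^\star(x_1^k) - V_1^{\pi_k}(x_1^k)\Big]
\;\le\; \frac{c\sqrt{H^{3}SAT}}{K}
\;=\; c\,H^{2}\sqrt{\frac{SA}{K}},
\]

so requiring the right-hand side to be at most \(\epsilon\) gives \(K = O(H^{4}SA/\epsilon^{2})\) episodes, i.e. \(O(1/\epsilon^{2})\) samples as a function of \(\epsilon\).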

THEORY

  • Prior work in the multi-armed bandit setting has shown that the choice of exploration policy plays an essential role in the efficiency of a learning algorithm.

  • To achieve good sample efficiency, manage the tradeoff between exploration and exploitation

THEORY

  • In episodic MDP, Q-learning with the commonly used ε-greedy exploration strategy can be very inefficient

  • It can take exponentially many episodes to learn

THEORY

Model-based RL

THEORY

  • This work’s main theoretical result: a regret (and hence sample-complexity) bound for variants of Q-learning that incorporate UCB-based exploration!


     

  • Two proposed algorithms are:

    • Q-learning with UCB-Hoeffding

    • Q-learning with UCB-Bernstein

THEORY

  • Hoeffding inequality

  • Intuition: provides an upper bound on the probability that the sum of bounded independent random variables deviates from its expected value by more than a certain amount
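
For reference, the standard statement (a textbook form, not copied from the slides): for independent random variables \(X_1,\dots,X_n\) with \(X_i \in [a_i, b_i]\) and \(S_n = \sum_{i=1}^n X_i\),

\[
\Pr\big(|S_n - E[S_n]| \ge t\big) \;\le\; 2\exp\!\left(\frac{-2t^2}{\sum_{i=1}^{n}(b_i-a_i)^2}\right).
\]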

THEORY

  • Bernstein inequality

  • Intuition: like Hoeffding, it upper bounds the probability that the sum of bounded independent random variables deviates from its expected value by more than a certain amount, but the bound also uses the variance of the variables, so it is tighter when that variance is small
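
For reference, one standard form (again a textbook statement, not from the slides): for independent zero-mean random variables \(X_1,\dots,X_n\) with \(|X_i| \le M\) almost surely and \(\sigma^2 = \sum_{i=1}^n E[X_i^2]\),

\[
\Pr\Big(\sum_{i=1}^{n} X_i \ge t\Big) \;\le\; \exp\!\left(\frac{-t^2/2}{\sigma^2 + Mt/3}\right).
\]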

ALGORITHM 1:  Q-learning with UCB-Hoeffding
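
The algorithm listing from this slide is an image and is not reproduced here. Below is a minimal Python sketch of Q-learning with UCB-Hoeffding following the paper's description; the environment interface (env.reset, env.step) and the constant c are assumptions for the example, not the authors' code.

```python
import numpy as np

def q_learning_ucb_hoeffding(env, S, A, H, K, p=0.05, c=1.0):
    """Sketch of tabular Q-learning with a UCB-Hoeffding bonus (Jin et al., 2018).

    env.reset() -> initial state index; env.step(h, x, a) -> (reward, next_state)
    are assumed interfaces. S, A: numbers of states/actions; H: horizon; K: episodes.
    """
    iota = np.log(S * A * K * H / p)            # log factor, iota = log(SAT/p)
    Q = np.full((H + 1, S, A), float(H))        # optimistic initialization Q_h(x, a) = H
    Q[H] = 0.0                                  # convention: values beyond step H are 0
    N = np.zeros((H, S, A), dtype=int)          # visit counts N_h(x, a)

    for _ in range(K):
        x = env.reset()
        for h in range(H):
            a = int(np.argmax(Q[h, x]))         # act greedily w.r.t. the optimistic Q
            r, x_next = env.step(h, x, a)
            N[h, x, a] += 1
            t = N[h, x, a]
            alpha = (H + 1) / (H + t)           # learning rate alpha_t = (H + 1) / (H + t)
            bonus = c * np.sqrt(H ** 3 * iota / t)          # UCB-Hoeffding bonus b_t
            V_next = min(float(H), Q[h + 1, x_next].max())  # V_{h+1}(x') = min(H, max_a Q)
            Q[h, x, a] = (1 - alpha) * Q[h, x, a] + alpha * (r + V_next + bonus)
            x = x_next
    return Q
```

The optimistic initialization and the additive bonus together keep the maintained Q an upper bound on \(Q^\star\), which is the key property the analysis exploits.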

THEORY

  • Theorem 1 shows that under a rather simple choice of exploration bonus, Q-learning can be made very efficient, enjoying an \(O(\sqrt T)\) regret.

  • This is the first analysis of a model-free procedure that features a \(\sqrt T\) regret without requiring access to a “simulator.”

THEORY

  • Their regret bound has a slightly worse dependency on H. However,

    • the algorithm is online

    • does not store additional data besides the table of Q values

  • Regret of Q-learning (UCB-H) is as good as the best model-based one

    • Same dependency on S, A and T.

ALGORITHM 2: Q-learning with UCB-Bernstein

In Q-learning (UCB-H), the bonus \(b_t\) is of the form \(c\sqrt{H^3\iota/t}\) (with \(\iota = \log(SAT/p)\)), a worst-case Hoeffding-style confidence width. In Q-learning (UCB-B), the bonus is instead built from an empirical estimate of the variance of the next-step value, i.e. a Bernstein-style confidence width that is tighter when that variance is small.

THEORY

  • Theorem 2 shows that for Q-learning with UCB-B exploration, the leading term in regret scales as \(\sqrt T\).

    • It also improves by a factor of \(\sqrt H\) over UCB-H exploration, at the cost of computing a more complicated exploration bonus term.
       

  • Theorem 2 has an additive term in its regret, which dominates the total regret when T is not very large compared with S, A and H. 

THEORY

  • Regret of UCB-B is only one \(\sqrt H\) factor worse than the best regret achieved by model-based algorithms. However,

    • the algorithm is online

    • does not store additional data besides the table of Q values

  • Regret of Q-learning (UCB-B) is as good as the best model-based one

    • Same dependency on S, A and T.

THEORY

  • Theorem 3 shows that both variants of their algorithm are nearly optimal!

    • Their regret bounds differ from the information-theoretic lower bound only by a factor of \(H\) (UCB-H) and \(\sqrt H\) (UCB-B)

RESULTS

RESULTS - THEOREM 1 PROOF

Step 1: Define new terms and intermediate variables

RESULTS - THEOREM 1 PROOF

Step 1: Define new terms and intermediate variables

Why such a value for \(\alpha_t = \frac{H+1}{H+t}\)? To ensure the regret is not exponential in \(H\)
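
For reference, the weights that this learning rate induces on past updates (my reconstruction of the paper's notation, so treat the exact form as a sketch) are

\[
\alpha_t^{0} = \prod_{j=1}^{t}(1-\alpha_j), \qquad
\alpha_t^{i} = \alpha_i \prod_{j=i+1}^{t}(1-\alpha_j),
\]

with \(\sum_{i=1}^{t} \alpha_t^{i} = 1\) and \(\sum_{t=i}^{\infty} \alpha_t^{i} \le 1 + \frac{1}{H}\). The second property is what prevents errors from compounding: each step amplifies propagated error by at most a \((1+1/H)\) factor, so over \(H\) steps the blow-up is at most \((1+1/H)^H \approx e\) rather than exponential in \(H\).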

RESULTS - THEOREM 1 PROOF

Step 2: Define Lemmas 

(Intuition: Just defining intermediate rules or properties of the variables that can be used for simplification)

RESULTS - THEOREM 1 PROOF

Step 2: Define Lemmas 

(Intuition: A lemma that gives a recursive formula for \(Q\) − \(Q^\star\) , as a weighted average of previous updates.)

Proof?

RESULTS - THEOREM 1 PROOF

Step 2: Define Lemmas 

(Intuition: this lemma shows that \(Q^k\) is always an upper bound on \(Q^\star\) at any episode k, and that \(Q\) − \(Q^\star\) can be bounded by quantities from the next step)

Proof?

RESULTS - THEOREM 1 PROOF

Step 3: Use Lemmas and other definitions to prove the Theorem 

Proof? - The main idea of the rest of the proof is to upper bound \(\delta_h^k\) by values from the next step, \(\delta_{h+1}^k\).

RESULTS - THEOREM 1 PROOF

Step 3: Use Lemmas and other definitions to prove the Theorem 

RESULTS - THEOREM 1 PROOF

Step 3: Use Lemmas and other definitions to prove the Theorem 

CRITIQUE

  • They did not discuss how this analysis can be extended to other MDP settings.

    • Why choose an episodic MDP?

       

  • No mention of any future work in the paper.

    • Or intuition for extension to problems with continuous state and actions

       

  • Establishing the connection between regret and sample efficiency should have been done in the earlier sections of the paper.

IMPACT AND LEGACY

FUTURE WORK and ADDITIONAL READING

CONTRIBUTIONS (RECAP)

  • They showed that model-free algorithms that are sample-efficient can be designed.
     

  • Q-learning, when equipped with a UCB exploration policy that incorporates estimates of the confidence of Q values and assigns exploration bonuses, achieves total regret \(O(\sqrt{H^3SAT})\). 

    • S and A are the numbers of states and actions, H is the number of steps per episode, and T is the total number of steps. 

    • Q-learning is online and has a significant advantage over model-based algorithms in terms of time and space complexities.
       

  • This is the first theoretical analysis for model-free algorithms featuring \(\sqrt T\) regret or, equivalently, \(O(1/\epsilon^2)\) samples for an ε-optimal policy.

CONTRIBUTIONS (RECAP)

  • They showed that model-free algorithms that are sample-efficient can be designed.

     

  • The policy obtained using vanilla Q-learning with a UCB exploration policy and an additional bonus term is sample efficient.

    • Q-learning with \(\epsilon\)-greedy exploration policy is not


       
  • This paper presented the first-ever theoretical analysis on sample complexity and regret for model-free algorithms

    • \(\sqrt T\) regret or  

    • \(O(1/\epsilon^2)\) samples for ε-optimal policy

ASEN_6519_DMU++_Paper Presentation_#2

By Himanshu Gupta
