Himanshu Gupta
Date - 11 October 2023
Jin et al., "Is Q-Learning Provably Efficient?" (NeurIPS 2018)
Should I use Deep RL to solve complex sequential decision-making problems?
Model-free RL approaches are more prevalent than model-based RL approaches for deep RL.
Online
Requires less space
More expressive and flexible
People I met at a robotics conference over the summer:
"Hell to the Yeah!"
No theoretical result or analysis to support or explain this empirical observation.
That leads to the theoretical question,
“Can we design model-free algorithms that are sample efficient?”
Answer unknown (until this paper came out)
Question was unanswered even for problems with finitely many states and actions
They showed that model-free algorithms that are sample-efficient can be designed.
The policy obtained using vanilla Q-learning with a UCB exploration policy and an additional bonus term is sample efficient.
Q-learning with an \(\epsilon\)-greedy exploration policy is not sample efficient
This paper presented the first theoretical analysis of sample complexity and regret for model-free algorithms:
\(\sqrt T\) regret or
\(O(1/\epsilon^2)\) samples for an \(\epsilon\)-optimal policy
They considered a tabular episodic Markov Decision Process
MDP \((S, A, H, \mathbb{P}, r)\)
\(S\): Set of states; \(A\): Set of actions
\(H\): Number of steps in each episode
\(\mathbb{P}\): Collection of state transition kernels
\(\mathbb{P}_h(\cdot \mid x,a)\): distribution over next states when action \(a\) is taken in state \(x\) at step \(h\)
\(r_h: S \times A \rightarrow [0,1]\): deterministic reward function at step \(h\)
Policy \( \pi \) - Collection of \(H\) functions
\( \{ \pi_h : S \rightarrow A \}_{h \in [H]} \)
State value function at step h under policy \(\pi\)
\(V_h^{\pi}: S \rightarrow \mathbb{R}\)
\(V_h^{\pi}(x) = E\Big[\sum_{h'=h}^H r_{h'}(x_{h'}, \pi_{h'}(x_{h'})) \mid x_h=x\Big]\)
State-Action value function at step h under policy \(\pi\)
\(Q_h^{\pi}: S \times A \rightarrow \mathbb{R}\)
\(Q_h^{\pi}(x,a) = r_h(x,a) + E\Big[\sum_{h'=h+1}^H r_{h'}(x_{h'}, \pi_{h'}(x_{h'})) \mid x_h=x, a_h=a\Big]\)
Since the state space, action space, and horizon are all finite, there exists an optimal policy \(\pi^\star\) attaining the optimal value \(V_h^\star(x) = \sup_\pi V_h^\pi(x)\) for all \(x\) and \(h\)
Bellman Equation and Optimality Equation
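Restating the standard Bellman equations for a policy \(\pi\) (with the convention \(V_{H+1}^\pi \equiv 0\)):
\[ V_h^\pi(x) = Q_h^\pi(x, \pi_h(x)), \qquad Q_h^\pi(x,a) = r_h(x,a) + [\mathbb{P}_h V_{h+1}^\pi](x,a) \]
and the Bellman optimality equations (with \(V_{H+1}^\star \equiv 0\)):
\[ V_h^\star(x) = \max_{a \in A} Q_h^\star(x,a), \qquad Q_h^\star(x,a) = r_h(x,a) + [\mathbb{P}_h V_{h+1}^\star](x,a) \]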
where \([\mathbb{P}_h V_{h+1}](x,a) := E_{x' \sim \mathbb{P}_h(\cdot \mid x,a)}\big[V_{h+1}(x')\big]\)
The agent plays the game for K episodes
So, the total number of steps is \(T = KH\)
For each episode, an adversary picks the starting state \(x_1^k\) and the agent picks a policy \(\pi_k\)
The total expected regret is defined as
\(Regret(K) = \sum_{k=1}^K \Big[ V_1^\star(x_1^k) - V_1^{\pi_k}(x_1^k) \Big] \)
Reinforcement Learning (RL) is a control-theoretic problem.
The agent tries to maximize its cumulative rewards by interacting with an unknown environment.
Two main approaches to RL:
Model-based algorithms: Learn a model for the environment by interactions, and generate a control policy based on this learned model.
Model-free algorithms: Don’t learn the model. Instead, directly update the value function or the policy.
(Define Value function and policy?)
Probably approximately correct (PAC) learning theory
helps analyze whether and under what conditions a learner L will probably output an approximately correct classifier.
Approximate?
A learner is approximately correct if \(error_D(L) < \epsilon \) where D is the distribution over inputs.
Probably?
If L outputs an approximately correct classifier with probability at least \(1-\delta\), where \(0 \le \delta \le 0.5\), then L is probably approximately correct.
So, what is the relation with the sample size \(m\)?
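As a classic reference point (standard PAC theory, not from this paper): for a finite hypothesis class \(\mathcal{H}\) and a learner that outputs a hypothesis consistent with the training data,
\[ m \ge \frac{1}{\epsilon}\Big(\ln|\mathcal{H}| + \ln\frac{1}{\delta}\Big) \]
i.i.d. samples suffice for the output to be \(\epsilon\)-accurate with probability at least \(1-\delta\).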
What is the analogous PAC question in RL?
Sample complexity - intuitively, how many samples are required to guarantee a probably approximately correct (PAC) solution
How does this relate to regret?
Intuitively:
Compute the regret; it will be a function of the number of samples.
A sublinear regret bound then converts into a PAC-style sample-complexity guarantee: require the average per-episode regret to be at most \(\epsilon\) and solve for the number of samples needed.
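A back-of-the-envelope version of this conversion, using this paper's headline bound \(\mathrm{Regret}(K) \le c\sqrt{H^3SAT}\) with \(T = KH\):
\[ \frac{\mathrm{Regret}(K)}{K} \le c\sqrt{\frac{H^4SA}{K}} \le \epsilon \quad\Longrightarrow\quad K \ge \frac{c^2H^4SA}{\epsilon^2}, \]
so the policy of a uniformly sampled episode is \(\epsilon\)-optimal in expectation after \(O(1/\epsilon^2)\) episodes (treating \(H\), \(S\), \(A\) as constants).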
Prior work in the multi-armed bandit setting has shown that the choice of exploration policy plays an essential role in the efficiency of a learning algorithm.
In the episodic MDP setting, Q-learning with the commonly used \(\epsilon\)-greedy exploration strategy can be very inefficient, taking exponentially many episodes to learn in the worst case
In contrast, the best existing sample-efficiency guarantees in this setting belonged to model-based RL algorithms.
This work’s main theoretical result: a sample-efficiency (regret) guarantee for variants of Q-learning that incorporate UCB-based exploration!
The two proposed algorithms are:
Q-learning with UCB-Hoeffding
Q-learning with UCB-Bernstein
Hoeffding inequality: a concentration bound that uses only the range of the random variables
Bernstein inequality: a sharper concentration bound that also uses their variance
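To make the UCB-Hoeffding variant concrete, here is a minimal Python sketch of the algorithm as described in the paper; the environment interface (`env.reset`, `env.step`) and the constant `c` are illustrative assumptions, not the paper's code.

```python
import numpy as np

def q_learning_ucb_hoeffding(env, S, A, H, K, p=0.05, c=1.0):
    """Tabular Q-learning with UCB-Hoeffding bonuses (after Jin et al. 2018)."""
    iota = np.log(S * A * H * K / p)       # log factor: iota = log(SAT/p), T = KH
    Q = np.full((H, S, A), float(H))       # optimistic initialization: Q_h(x, a) = H
    V = np.zeros((H + 1, S))               # V_{H+1} = 0 by convention
    V[:H] = H                              # consistent with the optimistic Q
    N = np.zeros((H, S, A), dtype=int)     # visit counts N_h(x, a)

    for k in range(K):
        x = env.reset()                    # episode start state x_1^k
        for h in range(H):
            a = int(np.argmax(Q[h, x]))    # act greedily w.r.t. optimistic Q
            r, x_next = env.step(h, x, a)  # assumed interface: (reward, next state)
            N[h, x, a] += 1
            t = N[h, x, a]
            alpha = (H + 1) / (H + t)      # the paper's learning rate
            b = c * np.sqrt(H**3 * iota / t)   # Hoeffding-style exploration bonus
            target = r + V[h + 1, x_next] + b
            Q[h, x, a] = (1 - alpha) * Q[h, x, a] + alpha * target
            V[h, x] = min(H, Q[h, x].max())    # value truncated at H
            x = x_next
    return Q, V
```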
Theorem 1 shows that under a rather simple choice of exploration bonus, Q-learning can be made very efficient, enjoying an \(O(\sqrt{T})\) regret.
Its regret has a slightly worse dependency on \(H\). However:
the algorithm is online
does not store additional data besides the table of Q values
Regret of Q-learning (UCB-H) is as good as the best model-based one
Same dependency on S, A and T.
In Q-learning (UCB-H), the exploration bonus is \(b_t = c\sqrt{H^3\iota/t}\) for an absolute constant \(c\), where \(\iota := \log(SAT/p)\) and \(p\) is the failure probability,
and the learning rate is \(\alpha_t = \frac{H+1}{H+t}\)
Theorem 2 shows that for Q-learning with UCB-B exploration, the leading term in regret scales as \(\sqrt T\).
It also improves by a factor of \(\sqrt H\) over UCB-H exploration, at the cost of computing a more complicated exploration bonus term.
Theorem 2 has an additive term in its regret, which dominates the total regret when T is not very large compared with S, A and H.
Regret of UCB-B is only one \(\sqrt H\) factor worse than the best regret achieved by model-based algorithms. However,
the algorithm is online
does not store additional data besides the table of Q values
Regret of Q-learning (UCB-B) is as good as the best model-based one
Same dependency on S, A and T.
Theorem 3 shows that both variants of their algorithm are nearly optimal!
They differ from the optimal regret only by factors of \(H\) and \(\sqrt{H}\), respectively
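For reference, the bounds being compared, up to logarithmic factors (as reported in the paper's comparison table):
\[ \text{Q-learning (UCB-H)}: \tilde{O}(\sqrt{H^4SAT}), \qquad \text{Q-learning (UCB-B)}: \tilde{O}(\sqrt{H^3SAT}), \]
\[ \text{best model-based}: \tilde{O}(\sqrt{H^2SAT}), \qquad \text{lower bound}: \Omega(\sqrt{H^2SAT}) \]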
Step 1: Define new terms and intermediate variables
Why this particular value for \(\alpha_t\)? To ensure the regret is not exponential in \(H\) (as it would be with the standard \(\alpha_t = 1/t\) rate)
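Concretely, the paper defines the weights
\[ \alpha_t^0 := \prod_{j=1}^{t}(1-\alpha_j), \qquad \alpha_t^i := \alpha_i \prod_{j=i+1}^{t}(1-\alpha_j), \]
so that after \(t\) updates of \((x,a)\) at step \(h\),
\[ Q_h(x,a) = \alpha_t^0 H + \sum_{i=1}^{t} \alpha_t^i \Big[ r_h(x,a) + V_{h+1}^{k_i}(x_{h+1}^{k_i}) + b_i \Big], \]
a weighted average in which the choice \(\alpha_t = \frac{H+1}{H+t}\) keeps too much weight from concentrating on early, inaccurate updates.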
Step 2: Define Lemmas
(Intuition: these just establish properties of the weights and intermediate variables that are reused later for simplification.)
Step 2: Define Lemmas
(Intuition: a lemma that gives a recursive formula for \(Q - Q^\star\) as a weighted average of previous updates.)
Proof?
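For reference, the recursion has roughly the following form (Lemma 4.2 in the paper), where \(k_1 < \dots < k_t\) are the episodes in which \((x,a)\) was previously taken at step \(h\):
\[ (Q_h^k - Q_h^\star)(x,a) = \alpha_t^0\big(H - Q_h^\star(x,a)\big) + \sum_{i=1}^{t} \alpha_t^i \Big[ (V_{h+1}^{k_i} - V_{h+1}^\star)(x_{h+1}^{k_i}) + \big[(\hat{\mathbb{P}}_h^{k_i} - \mathbb{P}_h)V_{h+1}^\star\big](x,a) + b_i \Big], \]
with \(\hat{\mathbb{P}}_h^{k_i}\) the empirical single-sample transition.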
Step 2: Define Lemmas
(Intuition: the lemma shows that \(Q^k\) is always an upper bound on \(Q^\star\) at any episode \(k\), and that \(Q - Q^\star\) can be bounded by quantities from the next step.)
Proof?
Step 3: Use Lemmas and other definitions to prove the Theorem
Proof? The main idea of the rest of the proof is to upper bound \(\delta_h^k := (V_h^k - V_h^{\pi_k})(x_h^k)\) by values from the next step, \(\delta_{h+1}^k\).
They did not discuss how this analysis can be extended to other MDP settings.
Why choose an episodic MDP?
No mention of any future work in the paper.
Nor any intuition for extending the results to problems with continuous states and actions
Establishing the connection between regret and sample efficiency should have been done earlier in the paper.
Reading:
If you found this interesting and want to read similar work, then check this out - https://sites.google.com/view/cjin/publications?authuser=0
Also, this paper - "When Is Partially Observable Reinforcement Learning Not Scary?"
Future Work:
I am interested in seeing whether the same theoretical results can be extended to deep RL techniques.
They showed that model-free algorithms that are sample-efficient can be designed.
Q-learning, when equipped with a UCB exploration policy that incorporates estimates of the confidence of Q values and assigns exploration bonuses, achieves total regret \(\tilde{O}(\sqrt{H^3SAT})\).
\(S\) and \(A\) are the numbers of states and actions, \(H\) is the number of steps per episode, and \(T\) is the total number of steps.
Q-learning is online and has a significant advantage over model-based algorithms in terms of time and space complexities.
This is the first theoretical analysis for model-free algorithms, featuring \(\sqrt{T}\) regret or, equivalently, \(O(1/\epsilon^2)\) samples for an \(\epsilon\)-optimal policy.