Exploration and Exploitation (Bandits)

Last Time

  • What is Reinforcement Learning?
  • What are the main challenges in Reinforcement Learning?
  • How do we categorize RL approaches?

Last Time

Tabular Maximum Likelihood Model-Based Reinforcement Learning

First RL Algorithm:

loop

    choose action \(a\)

    gain experience

    estimate \(T\), \(R\)

    solve MDP with \(T\), \(R\)
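
A minimal Python sketch of this loop, under assumptions made only for illustration (the 2-state toy MDP, the helper names step and solve_mdp, and the use of value iteration are not from the slides): transition and reward counts give maximum-likelihood estimates of \(T\) and \(R\), the estimated MDP is re-solved each step, and the action is chosen greedily with respect to the estimates, which is exactly the behavior the rest of the lecture questions.

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical 2-state, 2-action MDP used only to generate experience;
    # the agent never reads TRUE_T or TRUE_R directly.
    TRUE_T = np.array([[[0.9, 0.1], [0.2, 0.8]],   # TRUE_T[s, a, s']
                       [[0.7, 0.3], [0.1, 0.9]]])
    TRUE_R = np.array([[0.0, 0.0], [1.0, 0.5]])    # TRUE_R[s, a]
    S, A, gamma = 2, 2, 0.95

    def step(s, a):
        """Sample one transition from the (hidden) true MDP."""
        sp = rng.choice(S, p=TRUE_T[s, a])
        return sp, TRUE_R[s, a]

    def solve_mdp(T, R, iters=100):
        """Value iteration on the estimated model; returns a greedy policy."""
        Q = np.zeros((S, A))
        for _ in range(iters):
            Q = R + gamma * T @ Q.max(axis=1)
        return Q.argmax(axis=1)

    # Tabular counts for the maximum-likelihood estimates of T and R.
    N = np.ones((S, A, S))        # transition counts (start at 1 so T_hat is defined)
    Rsum = np.zeros((S, A))
    Rcnt = np.zeros((S, A))

    s = 0
    for t in range(1000):
        T_hat = N / N.sum(axis=2, keepdims=True)   # estimate T
        R_hat = Rsum / np.maximum(Rcnt, 1)         # estimate R
        a = solve_mdp(T_hat, R_hat)[s]             # solve MDP, choose action greedily
        sp, r = step(s, a)                         # gain experience
        N[s, a, sp] += 1
        Rsum[s, a] += r
        Rcnt[s, a] += 1
        s = sp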

 

Guiding Questions

  • What are the best ways to trade off Exploration and Exploitation?

Bandits

According to Peter Whittle, “efforts to solve [bandit problems] so sapped the energies and minds of Allied analysts that the suggestion was made that the problem be dropped over Germany as the ultimate instrument of intellectual sabotage.”
  • Bernoulli Bandit with parameters \(\theta_1, \dots, \theta_n\) (one win probability per arm)
  • \(\theta^* \equiv \max_a \theta_a\)
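
As a concrete reference point, a Bernoulli bandit can be simulated in a few lines (the arm probabilities below are made up for illustration): each pull of arm \(a\) returns a win with probability \(\theta_a\).

    import numpy as np

    rng = np.random.default_rng(1)
    theta = np.array([0.2, 0.5, 0.7])   # made-up true win probabilities, hidden from the agent
    theta_star = theta.max()            # best achievable expected payoff per pull

    def pull(a):
        """Pull arm a: return 1 (win) with probability theta[a], else 0 (loss)."""
        return int(rng.random() < theta[a])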

Greedy Strategy

\(\rho_a = \frac{\text{number of wins}+1}{\text{number of tries}+1}\)

Choose \(\underset{a}{\text{argmax}} \, \rho_a\)
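
In code, the greedy rule keeps win/try counts per arm and always plays the current best estimate. A sketch (it reuses pull() from the bandit sketch above, and the +1 terms match the estimator on the slide):

    import numpy as np

    n_arms = 3
    wins = np.zeros(n_arms)
    tries = np.zeros(n_arms)

    def rho():
        return (wins + 1) / (tries + 1)   # estimator from the slide

    for t in range(100):
        a = int(np.argmax(rho()))         # greedy: always exploit the current estimate
        r = pull(a)                       # pull() from the bandit sketch above
        wins[a] += r
        tries[a] += 1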

Undirected Strategies

  • Explore then Commit
    Choose \(a\) randomly for \(k\) steps
    Then choose \(\underset{a}{\text{argmax}} \, \rho_a\)
  • \(\epsilon\) - greedy
    With probability \(\epsilon\), choose randomly
    Otherwise choose \(\underset{a}{\text{argmax}} \, \rho_a\)
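
Both undirected rules differ from greedy only in how the next arm is picked; the win/try updates stay the same. A sketch of the two selection rules, which could replace the argmax line in the greedy loop above (the values of \(k\) and \(\epsilon\) and the function names are illustrative):

    import numpy as np

    rng = np.random.default_rng(2)

    def explore_then_commit(t, rho, k=20):
        """Random arm for the first k pulls, then the arm with the highest estimate."""
        if t < k:
            return int(rng.integers(len(rho)))
        return int(np.argmax(rho))

    def epsilon_greedy(rho, eps=0.1):
        """With probability eps pick a uniformly random arm, otherwise exploit."""
        if rng.random() < eps:
            return int(rng.integers(len(rho)))
        return int(np.argmax(rho))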

Directed Strategies

  • Softmax
    Choose \(a\) with probability proportional to \(e^{\lambda \rho_a}\)
     
  • Upper Confidence Bound (UCB)
    Choose \(\underset{a}{\text{argmax}} \, \rho_a + c\,\sqrt{\frac{\log{N}}{N(a)}}\)

(the gap from the greedy strategy is removed as \(\lambda \to \infty\): softmax approaches pure exploitation)
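
A sketch of the two directed rules (the values of \(\lambda\) and \(c\) are illustrative, and the max guards for untried arms are a simplification; in practice each arm is usually pulled once before UCB is applied):

    import numpy as np

    rng = np.random.default_rng(3)

    def softmax_select(rho, lam=5.0):
        """Sample arm a with probability proportional to exp(lam * rho_a)."""
        p = np.exp(lam * (rho - rho.max()))   # subtract max for numerical stability
        p /= p.sum()
        return int(rng.choice(len(rho), p=p))

    def ucb_select(rho, tries, c=1.0):
        """Greedy on rho plus a bonus that shrinks as an arm accumulates tries."""
        N = tries.sum()
        bonus = c * np.sqrt(np.log(max(N, 1.0)) / np.maximum(tries, 1.0))
        return int(np.argmax(rho + bonus))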

Break

Discuss with your neighbor: Suppose you have the following belief about the parameters \(\theta\). Which arm should you choose to pull next?

[Figure: belief distributions \(P(\theta_i)\) plotted against \(\theta_i\) for each arm]

Bayesian Estimation

Bernoulli Distribution

\(\text{Bernoulli}(\theta)\)

Discussion: Given that I have received \(w\) wins and \(l\) losses, what should my belief (probability distribution) about \(\theta\) look like?

Bayesian Estimation

Bernoulli Distribution

\(\text{Bernoulli}(\theta)\)

Beta Distribution

(distribution over Bernoulli distributions)

\(\text{Beta}(\alpha, \beta)\)

Bayesian Estimation

Given a \(\text{Beta}(1,1)\) prior distribution

The posterior distribution of \(\theta\) is \(\text{Beta}(w+1, l+1)\)
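
As a quick check (assuming scipy is available; the counts are made up): after \(w = 6\) wins and \(l = 2\) losses the belief is \(\text{Beta}(7, 3)\), whose mean is \((w+1)/(w+l+2) = 0.7\).

    from scipy.stats import beta

    w, l = 6, 2                     # observed wins and losses for one arm
    posterior = beta(w + 1, l + 1)  # Beta(1,1) prior  ->  Beta(w+1, l+1) posterior
    print(posterior.mean())         # (w+1)/(w+l+2) = 0.7
    print(posterior.ppf(0.9))       # 0.9 quantile of the belief (compare Quantile Selection later)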

Bayesian Estimation

\(t\) = time

\(a\) = arm pulled

\(r\) = reward

Bayesian Bandit Algorithms

  • Quantile Selection
    Choose the \(a\) for which the \(\alpha\) quantile of \(b(\theta_a)\) is highest
     
  • Thompson Sampling
    Sample \(\hat{\theta}_a\) from the belief \(b(\theta_a)\) for each arm
    Choose \(\underset{a}{\text{argmax}} \, \hat{\theta}_a\)

(e.g., \(\alpha = 0.9\) for Quantile Selection)
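
A sketch of both rules with a Beta belief per arm (the win/loss counts are made up, and scipy is assumed for the quantile; \(\alpha = 0.9\) matches the quantile above):

    import numpy as np
    from scipy.stats import beta

    rng = np.random.default_rng(4)

    # Beta belief per arm: Beta(1,1) prior plus made-up win/loss counts.
    alphas = 1 + np.array([3, 10, 7])   # 1 + wins per arm
    betas  = 1 + np.array([2, 5, 9])    # 1 + losses per arm

    def quantile_selection(alpha_q=0.9):
        """Pull the arm whose alpha_q quantile of b(theta_a) is highest."""
        return int(np.argmax(beta.ppf(alpha_q, alphas, betas)))

    def thompson_sampling():
        """Sample theta_hat_a from each arm's belief, then act greedily on the samples."""
        return int(np.argmax(rng.beta(alphas, betas)))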

Optimal Algorithm - Dynamic Programming

Review

Guiding Questions

  • What are the best ways to trade off Exploration and Exploitation?

Exploration and Exploitation (Bandits)

By Zachary Sunberg