Tabular Maximum Likelihood Model-Based Reinforcement Learning
First RL Algorithm:
loop
  choose action \(a\)
  gain experience (observe the resulting transition and reward)
  estimate \(T\), \(R\) by maximum likelihood from the observed counts
  solve the MDP with the estimated \(T\), \(R\)
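A minimal Python sketch of this loop, under assumptions not stated on the slide: a finite MDP with `n_states` and `n_actions`, a hypothetical `env` object whose `reset()` returns a state and whose `step(a)` returns the next state and reward, value iteration as the MDP solver, and re-solving after every step (in practice one would solve less often and add exploration).

```python
import numpy as np

def value_iteration(T, R, gamma=0.95, tol=1e-6):
    # T[s, a, s2]: estimated transition probabilities; R[s, a]: estimated rewards
    U = np.zeros(T.shape[0])
    while True:
        Q = R + gamma * (T @ U)      # Q[s, a] = R[s, a] + gamma * sum_s2 T[s, a, s2] * U[s2]
        U_new = Q.max(axis=1)
        if np.max(np.abs(U_new - U)) < tol:
            return Q.argmax(axis=1)  # greedy policy for the estimated MDP
        U = U_new

def model_based_rl(env, n_states, n_actions, n_steps=1000):
    N = np.ones((n_states, n_actions, n_states))  # transition counts, +1 smoothing
    Rsum = np.zeros((n_states, n_actions))        # accumulated rewards
    policy = np.zeros(n_states, dtype=int)
    s = env.reset()
    for _ in range(n_steps):
        a = policy[s]                             # choose action (no exploration here)
        s2, r = env.step(a)                       # gain experience
        N[s, a, s2] += 1
        Rsum[s, a] += r
        T = N / N.sum(axis=2, keepdims=True)      # maximum likelihood estimate of T
        R = Rsum / N.sum(axis=2)                  # mean-reward estimate of R
        policy = value_iteration(T, R)            # solve the MDP with T, R
        s = s2
    return policy
```

Acting greedily with respect to the estimated model can get stuck on a suboptimal arm or action, which is what motivates the exploration strategies that follow.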
According to Peter Whittle, ‘‘efforts to solve [bandit problems] so sapped the energies and
minds of Allied analysts that the suggestion was made that the problem be dropped over Germany as the ultimate instrument of intellectual sabotage.’’
\(\rho_a = \frac{\text{number of wins} + 1}{\text{number of tries} + 2}\) (the posterior mean of \(\theta_a\) under a uniform prior)
Choose \(\underset{a}{\text{argmax}} \, \rho_a\)
(remove gap with \(\lambda \to \infty\))
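A short sketch of these selection rules in Python; the `wins` and `tries` arrays are hypothetical example counts, and reading \(\lambda\) as the precision parameter of a softmax rule is an assumption suggested by the parenthetical above:

```python
import numpy as np

wins  = np.array([3, 10, 2])       # hypothetical win counts per arm
tries = np.array([5, 20, 4])       # hypothetical pull counts per arm

rho = (wins + 1) / (tries + 2)     # smoothed win-rate estimates (posterior means)
greedy_arm = int(np.argmax(rho))   # greedy choice: argmax_a rho_a

lam = 10.0                                       # softmax precision (assumed meaning of lambda)
p = np.exp(lam * rho) / np.exp(lam * rho).sum()  # softmax selection probabilities
# As lam -> infinity, p concentrates all mass on the greedy arm, closing the
# gap between softmax and greedy selection.
```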
Discuss with your neighbor: Suppose you have the following belief about the parameters \(\theta\). Which arm should you choose to pull next?
[Figure: belief distributions \(P(\theta_i)\) over each arm's payoff probability \(\theta_i\)]
Bernoulli Distribution
\(\text{Bernoulli}(\theta)\)
Discussion: Given that I have received \(w\) wins and \(l\) losses, what should my belief (probability distribution) about \(\theta\) look like?
Beta Distribution
(distribution over Bernoulli distributions)
\(\text{Beta}(\alpha, \beta)\)
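A small illustration of this hierarchy, assuming SciPy is available; the values \(\alpha = 2\), \(\beta = 3\) are arbitrary examples:

```python
from scipy.stats import beta, bernoulli

theta_dist = beta(a=2, b=3)       # Beta(2, 3): a distribution over Bernoulli parameters
theta = theta_dist.rvs()          # sample a win probability theta from the Beta
outcome = bernoulli(theta).rvs()  # sample a win/loss outcome from Bernoulli(theta)
print(theta_dist.mean())          # E[theta] = alpha / (alpha + beta) = 0.4
```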
Given a \(\text{Beta}(1,1)\) (uniform) prior distribution, the posterior distribution of \(\theta\) after observing \(w\) wins and \(l\) losses is \(\text{Beta}(w+1, l+1)\).
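A sketch of this conjugate update in Python; `w` and `l` are example counts, not values from the slides:

```python
from scipy.stats import beta

w, l = 7, 3                         # example: 7 wins, 3 losses
posterior = beta(a=w + 1, b=l + 1)  # Beta(1, 1) prior  ->  Beta(w + 1, l + 1) posterior
print(posterior.mean())             # (w + 1) / (w + l + 2) = 2/3, matching rho_a above
```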
Simulation notation: \(t\) = time, \(a\) = arm pulled, \(r\) = reward; \(\alpha = 0.9\).
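To connect the pieces, here is a minimal bandit simulation in Python. The true win probabilities, the \(\epsilon\)-greedy exploration strategy, and the reading of \(\alpha = 0.9\) as an exploration-decay factor are all assumptions, since the slide gives only the symbols:

```python
import numpy as np

rng = np.random.default_rng(0)
true_theta = np.array([0.3, 0.5, 0.7])  # true (hidden) win probabilities, example values
wins = np.zeros(3)
losses = np.zeros(3)
eps, alpha = 1.0, 0.9                   # epsilon-greedy with decay; alpha's role is assumed

for t in range(1000):
    rho = (wins + 1) / (wins + losses + 2)   # posterior-mean estimate per arm
    if rng.random() < eps:
        a = int(rng.integers(3))             # explore: pull a random arm
    else:
        a = int(np.argmax(rho))              # exploit: pull the greedy arm
    r = int(rng.random() < true_theta[a])    # reward r: 1 for a win, 0 for a loss
    wins[a] += r
    losses[a] += 1 - r
    eps *= alpha                             # decay exploration over time

print((wins + 1) / (wins + losses + 2))      # final estimates concentrate on the best arm
```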