Exploration and Exploitation (Bandits)
Last Time
- What is Reinforcement Learning?
- What are the main challenges in Reinforcement Learning?
- How do we categorize RL approaches?
Tabular Maximum Likelihood Model-Based Reinforcement Learning
First RL Algorithm:
loop:
    choose action \(a\)
    gain experience
    estimate \(T\), \(R\)
    solve MDP with \(T\), \(R\)
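A minimal Python sketch of this loop, assuming a gym-style environment with reset() and step(a) returning (s', r) (a hypothetical interface, not from the slides), with value iteration as the MDP solver and \(\epsilon\)-greedy as a placeholder action choice:

import numpy as np

def value_iteration(T, R, gamma=0.95, iters=100):
    # T: (S, A, S') estimated transition probabilities, R: (S, A) estimated rewards
    Q = np.zeros(R.shape)
    for _ in range(iters):
        Q = R + gamma * T @ Q.max(axis=1)  # Bellman backup; T @ V has shape (S, A)
    return Q

def tabular_mbrl(env, n_states, n_actions, n_steps, eps=0.1, gamma=0.95):
    N = np.zeros((n_states, n_actions, n_states))  # transition counts
    Rsum = np.zeros((n_states, n_actions))         # summed rewards
    s = env.reset()
    for _ in range(n_steps):
        # estimate T, R by maximum likelihood (unvisited (s, a) pairs stay at zero)
        visits = np.maximum(N.sum(axis=2), 1)
        T_hat = N / visits[:, :, None]
        R_hat = Rsum / visits
        # solve MDP with T_hat, R_hat
        Q = value_iteration(T_hat, R_hat, gamma)
        # choose action: epsilon-greedy here, but how to choose is this lecture's topic
        a = np.random.randint(n_actions) if np.random.rand() < eps else int(Q[s].argmax())
        sp, r = env.step(a)  # gain experience
        N[s, a, sp] += 1
        Rsum[s, a] += r
        s = sp
    return Q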
Guiding Questions
- What are the best ways to trade off Exploration and Exploitation?
Bandits
According to Peter Whittle, "efforts to solve [bandit problems] so sapped the energies and minds of Allied analysts that the suggestion was made that the problem be dropped over Germany as the ultimate instrument of intellectual sabotage."
- Bernoulli Bandit with parameters \(\theta = (\theta_1, \ldots, \theta_n)\): pulling arm \(a\) yields reward 1 with probability \(\theta_a\) and 0 otherwise
- \(\theta^* \equiv \max_a \theta_a\)
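As a concrete sketch, one way to simulate a Bernoulli bandit in Python; the values in theta and the name pull are illustrative, and the later sketches reuse them:

import numpy as np

theta = np.array([0.2, 0.5, 0.7])  # hidden true win probabilities (example values)

def pull(a):
    # one Bernoulli trial of arm a: reward 1 with probability theta[a], else 0
    return int(np.random.rand() < theta[a])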
Greedy Strategy
\(\rho_a = \frac{\text{number of wins with arm } a + 1}{\text{number of tries of arm } a + 1}\)
Choose \(\underset{a}{\text{argmax}} \, \rho_a\)
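A sketch of the greedy strategy, assuming per-arm count arrays wins and tries (hypothetical names); note that the +1 terms give every untried arm the optimistic estimate \(\rho_a = 1\):

import numpy as np

def greedy_arm(wins, tries):
    rho = (wins + 1) / (tries + 1)  # rho_a from above; untried arms get rho_a = 1
    return int(np.argmax(rho))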
Undirected Strategies
- Explore then Commit
Choose \(a\) randomly for \(k\) steps
Then choose \(\underset{a}{\text{argmax}} \, \rho_a\)
- \(\epsilon\)-greedy
With probability \(\epsilon\), choose randomly
Otherwise choose \(\underset{a}{\text{argmax}} \, \rho_a\)
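Sketches of both undirected strategies, reusing the \(\rho_a\) estimate and the pull function above; k, eps, and horizon are free parameters:

import numpy as np

def epsilon_greedy_arm(wins, tries, eps=0.1):
    if np.random.rand() < eps:
        return np.random.randint(len(tries))         # explore: uniform random arm
    return int(np.argmax((wins + 1) / (tries + 1)))  # exploit: greedy on rho_a

def explore_then_commit(pull, n_arms, k, horizon):
    wins, tries = np.zeros(n_arms), np.zeros(n_arms)
    for t in range(horizon):
        if t < k:
            a = np.random.randint(n_arms)                 # explore phase
        else:
            a = int(np.argmax((wins + 1) / (tries + 1)))  # committed greedy phase
        r = pull(a)
        wins[a] += r
        tries[a] += 1
    return wins, tries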
Directed Strategies
- Softmax
Choose \(a\) with probability proportional to \(e^{\lambda \rho_a}\)
(the gap to the greedy strategy vanishes as \(\lambda \to \infty\))
- Upper Confidence Bound (UCB)
Choose \(\underset{a}{\text{argmax}} \, \rho_a + c\,\sqrt{\frac{\log{N}}{N(a)}}\)
(\(N\) = total number of pulls, \(N(a)\) = number of pulls of arm \(a\))
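Possible implementations of the two directed strategies; lam and c stand for \(\lambda\) and \(c\), with illustrative default values:

import numpy as np

def softmax_arm(wins, tries, lam=5.0):
    rho = (wins + 1) / (tries + 1)
    p = np.exp(lam * (rho - rho.max()))  # subtract max for numerical stability
    p /= p.sum()
    return int(np.random.choice(len(rho), p=p))

def ucb_arm(wins, tries, c=2.0):
    rho = (wins + 1) / (tries + 1)
    N = tries.sum()
    # exploration bonus is large for rarely tried arms, forcing them to be sampled
    bonus = c * np.sqrt(np.log(max(N, 1)) / np.maximum(tries, 1e-9))
    return int(np.argmax(rho + bonus))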
Break
Discuss with your neighbor: Suppose you have the following belief about the parameters \(\theta\). Which arm should you choose to pull next?
[Figure: candidate belief distributions \(P(\theta_i)\) plotted over parameter values \(\theta_i\)]
Bayesian Estimation
Bernoulli Distribution
\(\text{Bernoulli}(\theta)\)
Discussion: Given that I have received \(w\) wins and \(l\) losses, what should my belief (probability distribution) about \(\theta\) look like?
Beta Distribution
(distribution over Bernoulli distributions)
\(\text{Beta}(\alpha, \beta)\)
Given a \(\text{Beta}(1,1)\) (uniform) prior distribution, the posterior distribution of \(\theta\) after \(w\) wins and \(l\) losses is \(\text{Beta}(w+1, l+1)\)
Notation: \(t\) = time, \(a\) = arm pulled, \(r\) = reward
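With scipy, the conjugate update is just adding the counts to the prior's parameters; a sketch with example counts (w = 7, l = 3 chosen arbitrarily):

from scipy.stats import beta

w, l = 7, 3                      # example: 7 wins, 3 losses observed on one arm
posterior = beta(w + 1, l + 1)   # Beta(1, 1) prior + data -> Beta(w + 1, l + 1)
print(posterior.mean())          # posterior mean (w + 1) / (w + l + 2) = 2/3
print(posterior.ppf(0.9))        # 0.9 quantile, used by quantile selection below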
Bayesian Bandit Algorithms
- Quantile Selection
Choose the \(a\) for which the \(\alpha\) quantile of \(b(\theta_a)\) is highest (e.g., \(\alpha = 0.9\))
- Thompson Sampling
Sample \(\hat{\theta} \sim b(\theta)\)
Choose \(\underset{a}{\text{argmax}} \, \hat{\theta}_a\)
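Both algorithms as sketches, assuming each arm's belief is the \(\text{Beta}(w_a+1, l_a+1)\) posterior from above:

import numpy as np
from scipy.stats import beta

def quantile_selection(wins, losses, alpha=0.9):
    # choose the arm whose posterior alpha-quantile is highest
    q = beta.ppf(alpha, wins + 1, losses + 1)
    return int(np.argmax(q))

def thompson_sampling(wins, losses):
    # draw one theta_hat per arm from its posterior, then act greedily on the draw
    theta_hat = np.random.beta(wins + 1, losses + 1)
    return int(np.argmax(theta_hat))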
Optimal Algorithm - Dynamic Programming
Treating the per-arm counts \((w_a, l_a)\) as a belief state turns the bandit into an MDP that dynamic programming can solve exactly, giving the Bayes-optimal trade-off between exploration and exploitation.
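A minimal sketch of this dynamic program for Bernoulli arms with \(\text{Beta}(1,1)\) priors, assuming a finite horizon; the recursion is exact but only tractable for short horizons and few arms:

from functools import lru_cache

@lru_cache(maxsize=None)
def dp_value(counts, steps_left):
    # counts: tuple of (wins, losses) pairs, one per arm
    # returns (expected reward to go, best arm to pull now)
    if steps_left == 0:
        return 0.0, None
    best_v, best_a = -1.0, None
    for a, (w, l) in enumerate(counts):
        p_win = (w + 1) / (w + l + 2)  # posterior predictive probability of a win
        win = counts[:a] + ((w + 1, l),) + counts[a + 1:]
        lose = counts[:a] + ((w, l + 1),) + counts[a + 1:]
        v = p_win * (1.0 + dp_value(win, steps_left - 1)[0]) \
            + (1 - p_win) * dp_value(lose, steps_left - 1)[0]
        if v > best_v:
            best_v, best_a = v, a
    return best_v, best_a

# e.g. two arms, uniform priors, 10 pulls remaining:
# dp_value(((0, 0), (0, 0)), 10)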
Review
Guiding Questions
- What are the best ways to trade off Exploration and Exploitation?