Tabular Maximum Likelihood Model-Based Reinforcement Learning
First RL Algorithm:
loop
  choose action \(a\)
  gain experience (observe the resulting transition and reward)
  estimate \(T\), \(R\) by maximum likelihood from the observed counts
  solve the MDP with the estimated \(T\), \(R\)
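A minimal Python sketch of this loop, under assumptions not stated on the slide: a finite MDP with `n_states` and `n_actions`, a hypothetical `env` object whose `reset()` returns a state and whose `step(a)` returns the next state and reward, value iteration as the MDP solver, and re-solving after every step (in practice one would solve less often and add exploration).

```python
import numpy as np

def value_iteration(T, R, gamma=0.95, tol=1e-6):
    # T[s, a, s2]: estimated transition probabilities; R[s, a]: estimated rewards
    U = np.zeros(T.shape[0])
    while True:
        Q = R + gamma * (T @ U)      # Q[s, a] = R[s, a] + gamma * sum_s2 T[s, a, s2] * U[s2]
        U_new = Q.max(axis=1)
        if np.max(np.abs(U_new - U)) < tol:
            return Q.argmax(axis=1)  # greedy policy for the estimated MDP
        U = U_new

def model_based_rl(env, n_states, n_actions, n_steps=1000):
    N = np.ones((n_states, n_actions, n_states))  # transition counts, +1 smoothing
    Rsum = np.zeros((n_states, n_actions))        # accumulated rewards
    policy = np.zeros(n_states, dtype=int)
    s = env.reset()
    for _ in range(n_steps):
        a = policy[s]                             # choose action (no exploration here)
        s2, r = env.step(a)                       # gain experience
        N[s, a, s2] += 1
        Rsum[s, a] += r
        T = N / N.sum(axis=2, keepdims=True)      # maximum likelihood estimate of T
        R = Rsum / N.sum(axis=2)                  # mean-reward estimate of R
        policy = value_iteration(T, R)            # solve the MDP with T, R
        s = s2
    return policy
```

Acting greedily with respect to the estimated model can get stuck on a suboptimal arm or action, which is what motivates the exploration strategies that follow.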
According to Peter Whittle, ‘‘efforts to solve [bandit problems] so sapped the energies and
minds of Allied analysts that the suggestion was made that the problem be dropped over Germany as the ultimate instrument of intellectual sabotage.’’
\(\rho_a = \frac{\text{number of wins} + 1}{\text{number of tries} + 2}\) (the posterior mean of \(\theta_a\) under a uniform prior)
Choose \(\underset{a}{\text{argmax}} \, \rho_a\)
(remove gap with \(\lambda \to \infty\))
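A short sketch of these selection rules in Python; the `wins` and `tries` arrays are hypothetical example counts, and reading \(\lambda\) as the precision parameter of a softmax rule is an assumption suggested by the parenthetical above:

```python
import numpy as np

wins  = np.array([3, 10, 2])       # hypothetical win counts per arm
tries = np.array([5, 20, 4])       # hypothetical pull counts per arm

rho = (wins + 1) / (tries + 2)     # smoothed win-rate estimates (posterior means)
greedy_arm = int(np.argmax(rho))   # greedy choice: argmax_a rho_a

lam = 10.0                                       # softmax precision (assumed meaning of lambda)
p = np.exp(lam * rho) / np.exp(lam * rho).sum()  # softmax selection probabilities
# As lam -> infinity, p concentrates all mass on the greedy arm, closing the
# gap between softmax and greedy selection.
```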
Discuss with your neighbor: Suppose you have the following belief about the parameters \(\theta\). Which arm should you choose to pull next?
[Figure: belief distributions \(P(\theta_i)\) over each arm's payoff probability \(\theta_i\)]
Bernoulli Distribution
\(\text{Bernoulli}(\theta)\)
Discussion: Given that I have received \(w\) wins and \(l\) losses, what should my belief (probability distribution) about \(\theta\) look like?
Beta Distribution
(distribution over Bernoulli distributions)
\(\text{Beta}(\alpha, \beta)\)
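A small illustration of this hierarchy, assuming SciPy is available; the values \(\alpha = 2\), \(\beta = 3\) are arbitrary examples:

```python
from scipy.stats import beta, bernoulli

theta_dist = beta(a=2, b=3)       # Beta(2, 3): a distribution over Bernoulli parameters
theta = theta_dist.rvs()          # sample a win probability theta from the Beta
outcome = bernoulli(theta).rvs()  # sample a win/loss outcome from Bernoulli(theta)
print(theta_dist.mean())          # E[theta] = alpha / (alpha + beta) = 0.4
```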
Given a \(\text{Beta}(1,1)\) (uniform) prior distribution, the posterior distribution of \(\theta\) after observing \(w\) wins and \(l\) losses is \(\text{Beta}(w+1, l+1)\).
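A sketch of this conjugate update in Python; `w` and `l` are example counts, not values from the slides:

```python
from scipy.stats import beta

w, l = 7, 3                         # example: 7 wins, 3 losses
posterior = beta(a=w + 1, b=l + 1)  # Beta(1, 1) prior  ->  Beta(w + 1, l + 1) posterior
print(posterior.mean())             # (w + 1) / (w + l + 2) = 2/3, matching rho_a above
```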
Simulation notation: \(t\) = time, \(a\) = arm pulled, \(r\) = reward; \(\alpha = 0.9\).
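To connect the pieces, here is a minimal bandit simulation in Python. The true win probabilities, the \(\epsilon\)-greedy exploration strategy, and the reading of \(\alpha = 0.9\) as an exploration-decay factor are all assumptions, since the slide gives only the symbols:

```python
import numpy as np

rng = np.random.default_rng(0)
true_theta = np.array([0.3, 0.5, 0.7])  # true (hidden) win probabilities, example values
wins = np.zeros(3)
losses = np.zeros(3)
eps, alpha = 1.0, 0.9                   # epsilon-greedy with decay; alpha's role is assumed

for t in range(1000):
    rho = (wins + 1) / (wins + losses + 2)   # posterior-mean estimate per arm
    if rng.random() < eps:
        a = int(rng.integers(3))             # explore: pull a random arm
    else:
        a = int(np.argmax(rho))              # exploit: pull the greedy arm
    r = int(rng.random() < true_theta[a])    # reward r: 1 for a win, 0 for a loss
    wins[a] += r
    losses[a] += 1 - r
    eps *= alpha                             # decay exploration over time

print((wins + 1) / (wins + losses + 2))      # final estimates concentrate on the best arm
```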