\[(I, S, A, T, R, \gamma)\]
(agents \(I\), states \(S\), joint action space \(A\), transition function \(T\), reward functions \(R\), discount factor \(\gamma\))
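As a concrete sketch, the tuple could be carried around as a simple container; the field names and callable signatures below are illustrative assumptions, not notation from the source:

```python
from typing import Callable, NamedTuple, Sequence

class MarkovGame(NamedTuple):
    """Illustrative container for the tuple (I, S, A, T, R, gamma)."""
    agents: Sequence[int]                   # I: agent indices
    states: Sequence[int]                   # S: state space
    actions: Sequence[Sequence[int]]        # A: one action set per agent
    T: Callable[[int, tuple, int], float]   # T(s, a, s') = Pr(s' | s, joint action a)
    R: Callable[[int, int, tuple], float]   # R(i, s, a): reward to agent i
    gamma: float                            # discount factor
```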
Calculated with Dynamic Programming [Rhoads and Bartholdi, 2012]
This is not a game payoff matrix!
Loop (see the code sketch below):
1. Simulate with \(\pi^{BR}\)
2. Update \(N(j, a^j, s)\) (\(N\) should reflect *all* past simulations)
3. \(\pi^j (a^j \mid s) \propto N(j, a^j, s) \quad \forall j\)
4. \(\pi^{BR} \gets\) best response to \(\pi\)
\(\pi\) (not necessarily \(\pi^{BR}\)) converges to a Nash equilibrium in some cases, notably 2-player zero-sum games
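A minimal runnable sketch of this loop, assuming a single-state two-player zero-sum game (rock-paper-scissors) so the state argument \(s\) can be dropped; the payoff matrix and variable names are illustrative, not from the source:

```python
import numpy as np

# Rock-paper-scissors: payoff[a0, a1] is player 0's reward
# (zero-sum, so player 1 receives the negation).
payoff = np.array([[ 0, -1,  1],
                   [ 1,  0, -1],
                   [-1,  1,  0]])

n_actions = payoff.shape[0]
N = np.zeros((2, n_actions))   # N[j, a]: count of times player j played a
br = [0, 0]                    # arbitrary initial best responses

for t in range(20000):
    # 1. Simulate: each player plays its current best response.
    for j in range(2):
        N[j, br[j]] += 1
    # 2./3. Average policy: pi^j(a) proportional to N[j, a].
    pi = N / N.sum(axis=1, keepdims=True)
    # 4. Best response against the opponent's average policy.
    br[0] = int(np.argmax(payoff @ pi[1]))        # player 0 maximizes payoff
    br[1] = int(np.argmax(-(payoff.T @ pi[0])))   # player 1 maximizes -payoff

print(pi)  # both rows approach the Nash equilibrium (1/3, 1/3, 1/3)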
Terminology
Minimax Tree
MDP Expectimax Tree
\[V(s) = \max_{a \in \mathcal{A}}\left(R(s, a) + \mathbb{E}_{s' \sim T(\cdot \mid s, a)}\left[V(s')\right]\right)\]
\[V(s) = \max_{a \in \mathcal{A}_1}\left(R(s, a) + \min_{a' \in \mathcal{A}_2} \left(R(s', a') + V(s'')\right)\right)\]
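A minimal recursive sketch of this alternating max/min backup, assuming a hypothetical `game` interface with `actions`, `step`, and `is_terminal` (names are illustrative); rewards are player 1's, which player 2 minimizes:

```python
def minimax_value(s, depth, game, max_to_move=True):
    """Depth-limited minimax value of state s.

    `game` is an assumed interface: actions(s) yields legal actions,
    step(s, a) returns (next_state, reward), is_terminal(s) tests for
    terminal states. Each ply adds its reward and flips max <-> min.
    """
    if depth == 0 or game.is_terminal(s):
        return 0.0
    values = []
    for a in game.actions(s):
        s_next, r = game.step(s, a)
        values.append(r + minimax_value(s_next, depth - 1, game, not max_to_move))
    return max(values) if max_to_move else min(values)
```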
Why is this harder than an MDP? (think back to sparse sampling)
Note: the above example does not follow UCB exploration
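For reference, the standard UCB1/UCT selection rule (with exploration constant \(c\), a free parameter) would choose
\[a = \operatorname*{arg\,max}_{a \in \mathcal{A}}\left(Q(s, a) + c\sqrt{\frac{\ln N(s)}{N(s, a)}}\right),\]
where \(N(s) = \sum_a N(s, a)\) is the total visit count of \(s\).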
(Alternative to POMGs that is more common in the literature)
Extensive-form game definition (\(h\) is a sequence of actions called a "history"):
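One standard formulation (the exact symbols here are an assumption, following the usual Osborne–Rubinstein-style presentation):
\[(\mathcal{N},\, \mathcal{H},\, P,\, \sigma_c,\, \{u^i\}_{i \in \mathcal{N}},\, \{\mathcal{I}^i\}_{i \in \mathcal{N}})\]
where \(\mathcal{N}\) is the set of players, \(\mathcal{H}\) the set of histories \(h\), \(P(h)\) the player to act after \(h\) (possibly chance), \(\sigma_c\) the chance probabilities, \(u^i\) player \(i\)'s payoff over terminal histories, and \(\mathcal{I}^i\) player \(i\)'s information partition (histories in the same information state are indistinguishable to \(i\)).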
Exponential in number of info states!
This slide is not covered on the exam
Heinrich et al., 2015, "Fictitious Self-Play in Extensive-Form Games"