Value-Based Model Free RL

Last Time

  • Policy Optimization
  • Policy Gradient
  • Tricks for Policy Gradient

Map

Today

  • Basic On- and Off-Policy value based model free RL algorithms
  • Tricks for tabular value based RL algorithms
  • Understanding of On- vs Off-Policy

Why learn Q?

Incremental Mean Estimation

\[\hat{x}_m = \frac{1}{m} \sum_{i=1}^m x^{(i)}\]

\[= \frac{1}{m} \left( x^{(m)} + \sum_{i=1}^{m-1} x^{(i)} \right)\]

\[= \frac{1}{m} \left( x^{(m)} + (m-1) \, \hat{x}_{m-1} \right)\]

\[= \hat{x}_{m-1} + \frac{1}{m} \left( x^{(m)} - \hat{x}_{m-1}  \right)\]

  loop

\[\hat{x} \gets \hat{x} + \alpha \left( x - \hat{x} \right)\]

"Temporal Difference (TD) Error"

Q Learning

Want: \(Q(s, a) \gets Q(s, a) + \alpha \left(\hat{q} - Q(s, a)\right)\)

Want: \(Q(s, a) \gets Q(s, a) + \alpha \left(\hat{q}(s, a, r, s') - Q(s, a)\right)\)

\[Q(s, a) = R(s, a) + \gamma E[V(s')]\]

\[= R(s, a) + \gamma E\left[\max_{a'} Q(s', a')\right]\]

\[= E\left[r + \gamma \max_{a'} Q(s', a')\right]\]
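A single observed transition \((s, a, r, s')\) therefore gives an unbiased sample of this expectation, which is exactly the target \(\hat{q}(s, a, r, s')\) wanted above:

\[\hat{q}(s, a, r, s') = r + \gamma \max_{a'} Q(s', a')\]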

Q learning and SARSA

\(Q(s, a) \gets 0\)

\(s \gets s_0\)

loop

    \(a \gets \underset{a}{\text{argmax}} \, Q(s, a) \ \text{w.p.} \ 1-\epsilon, \quad \text{rand}(A) \ \text{o.w.}\)

    \(r \gets \text{act!}(\text{env}, a)\)

    \(s' \gets \text{observe}(\text{env})\)

    \(Q(s, a) \gets Q(s, a) + \alpha\, \left(r + \gamma \max_{a'} Q(s', a') - Q(s, a)\right)\)

    \(s \gets s'\)
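A runnable Julia sketch of this loop. The `act!`/`observe` environment interface mirrors the pseudocode above; the 5-state `ChainEnv` toy MDP, the dictionary-backed Q table, and all parameter values are illustrative assumptions, not part of the slides.

```julia
mutable struct ChainEnv      # toy 5-state chain; reaching state 5 gives reward 1
    s::Int
end

observe(env::ChainEnv) = env.s
actions(env::ChainEnv) = [:left, :right]
reset!(env::ChainEnv) = (env.s = 1)
function act!(env::ChainEnv, a)          # step the environment and return the reward
    env.s = clamp(env.s + (a == :right ? 1 : -1), 1, 5)
    return env.s == 5 ? 1.0 : 0.0
end

function q_learning!(Q, env; γ=0.95, α=0.1, ϵ=0.1, steps=10_000)
    reset!(env)
    s = observe(env)
    A = actions(env)
    for _ in 1:steps
        # ϵ-greedy behavior policy
        a = rand() < ϵ ? rand(A) : argmax(x -> get(Q, (s, x), 0.0), A)
        r = act!(env, a)
        s′ = observe(env)
        # off-policy target: bootstrap with the max over next actions
        q = get(Q, (s, a), 0.0)
        Q[(s, a)] = q + α * (r + γ * maximum(x -> get(Q, (s′, x), 0.0), A) - q)
        s = s′
        if s == 5                        # goal reached; start a new episode
            reset!(env)
            s = observe(env)
        end
    end
    return Q
end

Q = q_learning!(Dict{Tuple{Int,Symbol},Float64}(), ChainEnv(1))
```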

Q-Learning

SARSA

Q-learning: \(Q(s, a) \gets Q(s, a) + \alpha\, \left(r + \gamma \max_{a'} Q(s', a') - Q(s, a)\right)\)

SARSA: \(Q(s, a) \gets Q(s, a) + \alpha\, \left(r + \gamma Q(s', a') - Q(s, a)\right)\)
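A sketch of the corresponding SARSA loop, reusing the illustrative `ChainEnv` setup from the Q-learning sketch above. The key difference is that the next action \(a'\) is chosen by the same ϵ-greedy behavior policy, and its value, not the max, forms the target.

```julia
function sarsa!(Q, env; γ=0.95, α=0.1, ϵ=0.1, steps=10_000)
    A = actions(env)
    policy(s) = rand() < ϵ ? rand(A) : argmax(x -> get(Q, (s, x), 0.0), A)
    reset!(env)
    s = observe(env)
    a = policy(s)
    for _ in 1:steps
        r = act!(env, a)
        s′ = observe(env)
        a′ = policy(s′)                  # on-policy: a′ is the action actually taken next
        q = get(Q, (s, a), 0.0)
        Q[(s, a)] = q + α * (r + γ * get(Q, (s′, a′), 0.0) - q)
        s, a = s′, a′
        if s == 5                        # goal reached; start a new episode
            reset!(env)
            s = observe(env)
            a = policy(s)
        end
    end
    return Q
end
```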

Illustrative Problem

  1. After a few episodes, what is Q(3, a) for a in 1-4?
  2. After a few episodes, what is Q(1, L)?
  3. Why is this a problem and what are some possible solutions?

Big Problem: Maximization Bias

Even if each \(Q(s', a')\) is an unbiased estimate, \(\max_{a'} Q(s', a')\) is biased upward: \(E\left[\max_{a'} Q(s', a')\right] \geq \max_{a'} E\left[Q(s', a')\right]\) by Jensen's inequality!

Solution: Double Q Learning

Maintain two independent estimates, \(Q_1\) and \(Q_2\):

\[Q_1(s, a) \gets Q_1(s, a) + \alpha\, \left(r + \gamma \, Q_2\left(s', \underset{a'}{\text{argmax}} Q_1(s', a')\right) - Q_1(s, a)\right)\]
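A sketch of the tabular double Q-learning update step (same illustrative dictionary-backed tables as above). In the standard algorithm the two tables swap roles with probability 0.5 on each step, so each table's maximizing action is evaluated by the other, independent estimate; the behavior policy can be ϵ-greedy with respect to \(Q_1 + Q_2\).

```julia
function double_q_update!(Q1, Q2, s, a, r, s′, A; γ=0.95, α=0.1)
    if rand() < 0.5
        a_sel = argmax(x -> get(Q1, (s′, x), 0.0), A)   # Q1 selects the next action...
        q = get(Q1, (s, a), 0.0)
        Q1[(s, a)] = q + α * (r + γ * get(Q2, (s′, a_sel), 0.0) - q)   # ...Q2 evaluates it
    else
        a_sel = argmax(x -> get(Q2, (s′, x), 0.0), A)   # roles swapped
        q = get(Q2, (s, a), 0.0)
        Q2[(s, a)] = q + α * (r + γ * get(Q1, (s′, a_sel), 0.0) - q)
    end
    return nothing
end
```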

Eligibility Traces

SARSA-λ

\(Q(s, a), N(s, a) \gets 0\)

\(\text{initialize}\, s, a, r, s'\)

loop

    \(a' \gets \underset{a'}{\text{argmax}} \, Q(s', a') \ \text{w.p.} \ 1-\epsilon, \quad \text{rand}(A) \ \text{o.w.}\)

    \(N(s, a) \gets N(s, a) + 1\)

    \(\delta \gets r + \gamma Q(s', a') - Q(s, a)\)

    \(Q(s, a) \gets Q(s, a) + \alpha\, \delta\, N(s, a) \quad \forall s, a\)

    \(N(s, a) \gets \gamma \lambda N(s, a) \quad \forall s, a\)

    \(s \gets s'\),   \(a \gets a'\)

    \(r \gets \text{act!}(\text{env}, a)\)

    \(s' \gets \text{observe}(\text{env})\)
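A sketch of SARSA(λ) with a dictionary of eligibility traces, again reusing the illustrative `ChainEnv` setup from the Q-learning sketch; the traces are what let a single TD error \(\delta\) update every recently visited \((s, a)\) pair at once.

```julia
function sarsa_lambda!(Q, env; γ=0.95, α=0.1, ϵ=0.1, λ=0.9, episodes=500)
    A = actions(env)
    policy(s) = rand() < ϵ ? rand(A) : argmax(x -> get(Q, (s, x), 0.0), A)
    for _ in 1:episodes
        N = Dict{Tuple{Int,Symbol},Float64}()   # eligibility traces, one per (s, a)
        reset!(env)
        s = observe(env)
        a = policy(s)
        while s != 5                            # state 5 is the toy goal state
            r = act!(env, a)
            s′ = observe(env)
            a′ = policy(s′)
            N[(s, a)] = get(N, (s, a), 0.0) + 1.0        # accumulate the trace
            δ = r + γ * get(Q, (s′, a′), 0.0) - get(Q, (s, a), 0.0)
            for sa in collect(keys(N))                   # one TD error updates all traced pairs
                Q[sa] = get(Q, sa, 0.0) + α * δ * N[sa]
                N[sa] = γ * λ * N[sa]                    # decay the trace
            end
            s, a = s′, a′
        end
    end
    return Q
end
```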


Convergence

  • Q-learning converges to the optimal Q-values w.p. 1, provided every \((s, a)\) pair continues to be updated and the step sizes decrease appropriately
    (Sutton and Barto, p. 131)
  • SARSA converges to the optimal Q-values w.p. 1, provided that \(\pi \to \text{greedy}\) in the limit (e.g. \(\epsilon \to 0\)) and every \((s, a)\) pair continues to be visited
    (Sutton and Barto, p. 129)

On vs Off-Policy

On Policy

Off Policy

Q-learning:

\(Q(s, a) \gets Q(s, a) + \alpha\, \left(r + \gamma \max_{a'} Q(s', a') - Q(s, a)\right)\)

SARSA:

\(Q(s, a) \gets Q(s, a) + \alpha\, \left(r + \gamma Q(s', a') - Q(s, a)\right)\)

Will eligibility traces work with Q-learning?

Not easily: the \(\max\) in the Q-learning target is off-policy, so the backward-view trace is only valid while the agent acts greedily; Watkins's Q(λ) handles this by cutting the traces to zero whenever an exploratory action is taken.

Policy Gradient:

\(\theta \gets \theta + \alpha \sum_{k=0}^d \nabla_\theta \log \pi_\theta(a_k \mid s_k) R(\tau) \)

Today

  • Basic On- and Off-Policy value based model free RL algorithms
  • Tricks for tabular value based RL algorithms
  • Understanding of On- vs Off-Policy

By Zachary Sunberg