Reinforcement Learning

Last Time

  • What tools do we have to solve MDPs with continuous \(S\) and \(A\)?

Course Map

  • Outcome Uncertainty, Immediate vs Future Rewards (MDP)
  • Model Uncertainty (Reinforcement Learning)
  • State Uncertainty (POMDP)
  • Interaction Uncertainty (Game)

Guiding Questions

  • What is Reinforcement Learning?
  • What are the main challenges in Reinforcement Learning?
  • How do we categorize RL approaches?

Problem from HW2

Reinforcement Learning

Previously: \((S, A, T, R, \gamma)\)

r = act!(env, a)
s = observe(env)

Note: This differs from a generative model \(s', r = G(s, a)\) — here the reward comes from acting and the next state from a separate observation.

In Python, typically:

s, r = env.step(a)
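A full episodic rollout with this style of interface can be sketched as follows; the `CoinEnv` environment below is a made-up toy example, and the three-value return from `step` is a simplification of the usual Gym-style tuple.

```python
import random

class CoinEnv:
    """Toy episodic environment: reward 1 if the action matches a hidden coin."""
    def reset(self):
        self.t = 0
        self.coin = random.randint(0, 1)
        return 0  # single dummy state

    def step(self, a):
        self.t += 1
        r = 1.0 if a == self.coin else 0.0  # reward for guessing the coin
        self.coin = random.randint(0, 1)    # re-flip for the next step
        done = self.t >= 10                 # episodes last 10 steps
        return 0, r, done                   # (next state, reward, terminated)

env = CoinEnv()
s = env.reset()
total, done = 0.0, False
while not done:
    a = random.randint(0, 1)   # random policy
    s, r, done = env.step(a)
    total += r
print(total)
```

The learner only sees the rewards and observations the environment emits; it has no access to \(T\) or \(R\) — that is the model uncertainty RL must cope with.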

Now: Episodic Simulator

Learning Curve

Break

Challenges

  1. Exploration vs Exploitation
  2. Credit Assignment
  3. Generalization
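The simplest response to the exploration-exploitation tradeoff is \(\varepsilon\)-greedy action selection, used in the model-based algorithm later in this lecture. A minimal sketch (the Q-values and action names here are illustrative):

```python
import random

def epsilon_greedy(Q, s, actions, eps=0.1):
    """With probability eps explore uniformly; otherwise exploit argmax_a Q[s, a]."""
    if random.random() < eps:
        return random.choice(actions)      # explore
    return max(actions, key=lambda a: Q[(s, a)])  # exploit

Q = {(0, 'left'): 0.2, (0, 'right'): 0.8}
a = epsilon_greedy(Q, 0, ['left', 'right'], eps=0.0)  # eps=0: pure exploitation
print(a)  # -> right
```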

Classifications

  • Model Based: Attempt to learn \(T\) and \(R\), then find \(\pi^*\) by solving MDP
  • Model Free: Attempt to find \(Q^*\) or \(\pi^*\) directly
  • On-Policy: The exploration policy is the same as the learned policy.
  • Off-Policy: The exploration policy may be different than the learned policy.
  • Batch: Learn only from previously-generated experience (no exploration policy).
  • Tabular: Keep track of learned values for each state in a table
  • Deep: Use a neural network to approximate learned values

Tabular Maximum Likelihood Model-Based RL

\(\textbf{Given}\) \(\text{env}\), \(S\), \(A\), \(\gamma\)
\(N[s, a, s'] \leftarrow 0 \quad \forall s, a, s'\)
\(\rho[s, a] \leftarrow 0 \quad \forall s, a\)
\(\pi \leftarrow \text{random policy}\)
\(\textbf{loop}\)
    \(\text{reset!(env)}\)
    \(s \leftarrow \text{observe}(\text{env})\)
    \(\textbf{while}\text{ not terminated(env)}\)
        \(a \leftarrow \begin{cases} \text{rand}(A) & \text{w.p. } \varepsilon \\ \pi(s) & \text{w.p. } 1-\varepsilon \end{cases}\)
        \(r \leftarrow \text{act!}(\text{env}, a)\)
        \(s' \leftarrow \text{observe}(\text{env})\)
        \(N[s, a, s'] \mathrel{+}= 1\)
        \(\rho[s, a] \mathrel{+}= r\)
        \(s \leftarrow s'\)
    \(T^a[s, s'] \leftarrow \dfrac{N[s,a,s']}{\sum_{s'} N[s,a,s']} \quad \forall s,a,s'\)
    \(R^a[s] \leftarrow \dfrac{\rho[s,a]}{\sum_{s'} N[s,a,s']} \quad \forall s,a\)
    \(\pi \leftarrow \text{solve}((S, A, T, R, \gamma))\)
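The pseudocode above can be sketched in Python. Only the counting and maximum likelihood estimation mirror the algorithm; the two-state chain environment and the value-iteration `solve` are illustrative stand-ins for `env` and the MDP solver.

```python
import random
from collections import defaultdict

S, A = [0, 1], [0, 1]
gamma, eps = 0.9, 0.2

class ChainEnv:
    """Toy env: action 1 moves to state 1 with prob 0.9; reward 1 in state 1."""
    def reset(self):
        self.s, self.t = 0, 0
    def act(self, a):
        self.t += 1
        self.s = 1 if (a == 1 and random.random() < 0.9) else 0
        return 1.0 if self.s == 1 else 0.0
    def observe(self):
        return self.s
    def terminated(self):
        return self.t >= 20

def solve(T, R):
    """Value iteration on the estimated model; returns a greedy policy."""
    V = {s: 0.0 for s in S}
    for _ in range(100):
        V = {s: max(R[(s, a)] + gamma * sum(T[(s, a, sp)] * V[sp] for sp in S)
                    for a in A) for s in S}
    return {s: max(A, key=lambda a: R[(s, a)] +
                   gamma * sum(T[(s, a, sp)] * V[sp] for sp in S)) for s in S}

N = defaultdict(int)      # N[s, a, s']: transition counts
rho = defaultdict(float)  # rho[s, a]: summed rewards
pi = {s: random.choice(A) for s in S}
env = ChainEnv()

for episode in range(50):
    env.reset()
    s = env.observe()
    while not env.terminated():
        a = random.choice(A) if random.random() < eps else pi[s]  # eps-greedy
        r = env.act(a)
        sp = env.observe()
        N[(s, a, sp)] += 1
        rho[(s, a)] += r
        s = sp
    # Maximum likelihood estimates of T and R from the counts
    T, R = {}, {}
    for si in S:
        for a in A:
            n = sum(N[(si, a, sp)] for sp in S)
            for sp in S:
                T[(si, a, sp)] = N[(si, a, sp)] / n if n else 1.0 / len(S)
            R[(si, a)] = rho[(si, a)] / n if n else 0.0
    pi = solve(T, R)

print(pi)
```

Note the `1/len(S)` fallback for unvisited \((s, a)\) pairs, which the slide's formula leaves undefined (division by zero); how to initialize estimates for unexplored pairs is itself an exploration design choice.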

Guiding Questions

  • What is Reinforcement Learning?
  • What are the main challenges in Reinforcement Learning?
  • How do we categorize RL approaches?

090 Reinforcement Learning

By Zachary Sunberg
