Reinforcement Learning
Last Time
- What tools do we have to solve MDPs with continuous \(S\) and \(A\)?
Course Map
- Outcome Uncertainty, Immediate vs Future Rewards (MDP)
- Model Uncertainty (Reinforcement Learning)
- State Uncertainty (POMDP)
- Interaction Uncertainty (Game)
Guiding Questions
- What is Reinforcement Learning?
- What are the main challenges in Reinforcement Learning?
- How do we categorize RL approaches?
Problem from HW2


Reinforcement Learning
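Previously: \((S, A, T, R, \gamma)\)
Now: Episodic Simulator
r = act!(env, a)
s = observe(env)
Note: Different from \(s', r = G(s, a)\). The simulator keeps its own state, so we can only act from the current state, not sample from an arbitrary \(s\).
In Python, typically
s, r = env.step(a)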
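A minimal sketch of what such an episodic simulator could look like in Python. The `ChainEnv` class, its 5-state chain dynamics, and the method names `reset`, `observe`, `act`, `terminated`, and `step` are all hypothetical, chosen only to mirror the calls shown above; they are not any particular library's API.

```python
import random

class ChainEnv:
    """Hypothetical 5-state chain: start in the middle, episode ends at either end."""

    def __init__(self, n=5, seed=0):
        self.n = n
        self.rng = random.Random(seed)
        self.s = n // 2  # the simulator owns the state

    def reset(self):
        self.s = self.n // 2

    def observe(self):
        return self.s

    def terminated(self):
        return self.s == 0 or self.s == self.n - 1

    def act(self, a):
        # a is -1 (left) or +1 (right); with probability 0.2 the move is reversed
        move = a if self.rng.random() < 0.8 else -a
        self.s = min(max(self.s + move, 0), self.n - 1)
        return 1.0 if self.s == self.n - 1 else 0.0  # reward only at the right end

    def step(self, a):
        # Python-style convenience wrapper: act, then observe
        r = self.act(a)
        return self.observe(), r


env = ChainEnv()
env.reset()
total = 0.0
while not env.terminated():
    s, r = env.step(+1)  # always try to move right
    total += r
```

Note that there is no way to ask this object "what happens from state \(s\)?" for an arbitrary \(s\). Real Python RL libraries typically return additional values from `step`, such as a termination flag.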
Learning Curve
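A learning curve is usually drawn as return per episode against the number of training episodes, often smoothed because single-episode returns are noisy. A small sketch of one possible smoothing, assuming a plain moving average; the function name `moving_average` and the return values are made up for illustration.

```python
def moving_average(returns, window):
    """Average each episode's return with up to window-1 preceding episodes."""
    smoothed = []
    running = 0.0
    for i, g in enumerate(returns):
        running += g
        if i >= window:
            running -= returns[i - window]  # drop the return that left the window
        smoothed.append(running / min(i + 1, window))
    return smoothed


episode_returns = [0.0, 1.0, 0.0, 1.0, 1.0, 1.0]  # made-up data, not real results
curve = moving_average(episode_returns, window=2)
# curve == [0.0, 0.5, 0.5, 0.5, 1.0, 1.0]
```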
Break
Challenges
- Exploration vs Exploitation
- Credit Assignment
- Generalization
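One simple way to trade off exploration against exploitation is \(\varepsilon\)-greedy action selection, the same rule used in the tabular algorithm later in these slides. A minimal sketch, assuming action values are stored in a list; the name `epsilon_greedy` and the values in `q` are hypothetical.

```python
import random

def epsilon_greedy(q_values, epsilon, rng):
    """Pick a random action index w.p. epsilon, otherwise the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploit


rng = random.Random(0)
q = [0.1, 0.5, 0.2]  # hypothetical value estimates for three actions
greedy_picks = [epsilon_greedy(q, 0.0, rng) for _ in range(100)]
explore_picks = [epsilon_greedy(q, 1.0, rng) for _ in range(1000)]
```

With \(\varepsilon = 0\) the agent only ever picks action 1 here and never finds out whether its estimates for the other actions are wrong; with \(\varepsilon = 1\) it ignores what it has learned.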
Classifications
- Model Based: Attempt to learn \(T\) and \(R\), then find \(\pi^*\) by solving MDP
- Model Free: Attempt to find \(Q^*\) or \(\pi^*\) directly
- On-Policy: The exploration policy is the same as the learned policy.
- Off-Policy: The exploration policy may be different from the learned policy.
- Batch: Learn only from previously-generated experience (no exploration policy).
- Tabular: Keep track of learned values for each state in a table
- Deep: Use a neural network to approximate learned values
Tabular Maximum Likelihood Model-Based RL
\(\textbf{Given}\) \(\text{env}\), \(S\), \(A\), \(\gamma\), \(\varepsilon\)
\(N[s, a, s'] \leftarrow 0 \quad \forall s, a, s'\)
\(\rho[s, a] \leftarrow 0 \quad \forall s, a\)
\(\pi \leftarrow \text{random policy}\)
\(\textbf{loop}\)
\(\quad \text{reset!(env)}\)
\(\quad s \leftarrow \text{observe}(\text{env})\)
\(\quad \textbf{while}\text{ not terminated(env)}\)
\(\quad\quad a \leftarrow \begin{cases} \text{rand}(A) & \text{w.p. } \varepsilon \\ \pi(s) & \text{w.p. } 1-\varepsilon \end{cases}\)
\(\quad\quad r \leftarrow \text{act!}(\text{env}, a)\)
\(\quad\quad s' \leftarrow \text{observe}(\text{env})\)
\(\quad\quad N[s, a, s'] \mathrel{+}= 1\)
\(\quad\quad \rho[s, a] \mathrel{+}= r\)
\(\quad\quad s \leftarrow s'\)
\(\quad T^a[s, s'] \leftarrow \dfrac{N[s,a,s']}{\sum_{s''} N[s,a,s'']} \quad \forall s,a,s'\)
\(\quad R^a[s] \leftarrow \dfrac{\rho[s,a]}{\sum_{s''} N[s,a,s'']} \quad \forall s,a\)
\(\quad \pi \leftarrow \text{solve}((S, A, T, R, \gamma))\)
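The algorithm above can be sketched in Python roughly as follows. This is an illustration under several assumptions, not a reference implementation: `ChainEnv` is a hypothetical toy environment, `solve` is assumed to be plain value iteration, the model is re-estimated once per episode, and state-action pairs that have never been visited are simply given value 0 (the pseudocode does not say how to handle the resulting division by zero).

```python
import random
from collections import defaultdict

class ChainEnv:
    """Hypothetical 5-state chain used only for illustration."""
    def __init__(self, n=5, seed=0):
        self.n, self.rng, self.s = n, random.Random(seed), n // 2
    def reset(self):
        self.s = self.n // 2
    def observe(self):
        return self.s
    def terminated(self):
        return self.s in (0, self.n - 1)
    def act(self, a):
        move = a if self.rng.random() < 0.8 else -a  # 20% chance the move is reversed
        self.s = min(max(self.s + move, 0), self.n - 1)
        return 1.0 if self.s == self.n - 1 else 0.0

def estimate_model(N, rho, S, A):
    """Maximum likelihood T and R from counts; unvisited (s, a) pairs are left out."""
    T, R = {}, {}
    for s in S:
        for a in A:
            n_sa = sum(N[(s, a, sp)] for sp in S)
            if n_sa > 0:
                T[(s, a)] = {sp: N[(s, a, sp)] / n_sa for sp in S}
                R[(s, a)] = rho[(s, a)] / n_sa
    return T, R

def solve(S, A, T, R, gamma, sweeps=100):
    """Value iteration on the estimated MDP (assumption: unvisited pairs have value 0)."""
    V = {s: 0.0 for s in S}
    def q(s, a):
        if (s, a) not in T:
            return 0.0
        return R[(s, a)] + gamma * sum(p * V[sp] for sp, p in T[(s, a)].items())
    for _ in range(sweeps):
        V = {s: max(q(s, a) for a in A) for s in S}
    return {s: max(A, key=lambda a: q(s, a)) for s in S}

def train(env, S, A, gamma=0.95, epsilon=0.1, episodes=200, seed=1):
    rng = random.Random(seed)
    N = defaultdict(int)       # N[s, a, s'] transition counts
    rho = defaultdict(float)   # rho[s, a] summed rewards
    pi = {s: rng.choice(A) for s in S}  # random initial policy
    for _ in range(episodes):
        env.reset()
        s = env.observe()
        while not env.terminated():
            a = rng.choice(A) if rng.random() < epsilon else pi[s]
            r = env.act(a)
            sp = env.observe()
            N[(s, a, sp)] += 1
            rho[(s, a)] += r
            s = sp
        T, R = estimate_model(N, rho, S, A)
        pi = solve(S, A, T, R, gamma)
    return pi, T, R

S, A = list(range(5)), [-1, +1]
pi, T, R = train(ChainEnv(), S, A)
```

Treating unvisited pairs as worth 0 is a pessimistic choice that leaves all exploration to the \(\varepsilon\) term; other initializations are possible.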
Guiding Questions
- What is Reinforcement Learning?
- What are the main challenges in Reinforcement Learning?
- How do we categorize RL approaches?
090 Reinforcement Learning
By Zachary Sunberg