[Deep Reinforcement Learning Nanodegree Chapter 1] Foundations of Reinforcement Learning

1. Foundations of Reinforcement Learning

Official course description

The first part begins with a simple introduction to reinforcement learning. You’ll learn how to define real-world problems as Markov Decision Processes (MDPs), so that they can be solved with reinforcement learning.

How might we use reinforcement learning to teach a robot to walk? (Source)

Then, you’ll implement classical methods such as SARSA and Q-learning to solve several environments in OpenAI Gym. You’ll then explore how to use techniques such as tile coding and coarse coding to expand the size of the problems that can be solved with traditional reinforcement learning algorithms.

Train a car to navigate a steep hill using Q-learning.

[toc]

Foundations of Reinforcement Learning

Reinforcement Learning (RL)

  • Building code that can learn to perform complex tasks by itself

Applications

  • Games: AlphaGo, Atari Breakout, Dota 2
  • Self-driving: cars (Uber, Google), ships, airplanes
  • Robotics: walking robots

Terminologies

  • Agent: the learner or decision-maker, born into the world without any understanding of how anything works
  • The agent learns by interacting with the environment
  • Feedback: rewards (positive feedback) or discouraging (negative) feedback
  • Goal: maximize (cumulative) reward

Exploration-Exploitation Dilemma

  • Exploration: Exploring potential hypotheses for how to choose actions

  • Exploitation: Exploiting limited knowledge about what is already known to work well

  • Balancing these competing requirements

Reinforcement learning framework

Setting

[Figure: the agent-environment interaction loop]

  • State (observation, $S_t$): the environment presents a situation to the agent
  • Action ($A_t$): the agent responds with an appropriate action
  • Reward ($R_{t+1}$): one time step later, the agent receives a reward (along with the next state $S_{t+1}$)
  • The goal of the agent: maximize the expected cumulative reward (see the interaction-loop sketch below)
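A minimal sketch of this interaction loop with the classic OpenAI Gym API (random agent; the environment name and the pre-0.26 reset/step signatures are assumptions here):

```python
import gym

env = gym.make("FrozenLake-v0")      # any discrete-state Gym environment works here
state = env.reset()                   # the environment presents the first state S_0
done = False
total_reward = 0.0
while not done:
    action = env.action_space.sample()               # the agent picks an action A_t
    state, reward, done, info = env.step(action)     # one step later: R_{t+1}, S_{t+1}
    total_reward += reward
print("Cumulative reward for this episode:", total_reward)
```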

Episodic task & Continuing task

  • Task: an instance of the reinforcement learning problem
  • Episodic task
    • Tasks with a well-defined starting and ending point
    • There is a specific ending point (the terminal state)
    • An episode is one complete sequence of interaction, ending at some final time step
    • The reward (e.g., win or lose) is often given at the ending point
  • Continuing task
    • Tasks that continue forever, without end
    • Interaction continues without limit

Rewards

  • Reward Hypothesis: All goals can be framed as the maximization of expected cumulative reward

  • Cumulative reward (return): $G_t = R_{t+1} + R_{t+2} + R_{t+3} + \dots$
    • At time step $t$, the agent picks $A_t$ to maximize the (expected) return $G_t$
  • Discounted reward (return): $G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots$ (see the sketch below)
    • $\gamma \in [0, 1]$ is the discount rate
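A small sketch of how a discounted return could be computed for a finite list of rewards (the reward values and $\gamma$ are illustrative):

```python
def discounted_return(rewards, gamma=0.9):
    """G_t = R_{t+1} + gamma*R_{t+2} + gamma^2*R_{t+3} + ..., computed backwards."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

print(discounted_return([1, 0, 0, 10], gamma=0.9))  # 1 + 0.9**3 * 10 = 8.29
```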

Markov Decision Process (MDP)


  • A (finite) MDP is defined by
    • a (finite) set of states $\mathcal{S}$
    • a (finite) set of actions $\mathcal{A}$
    • a (finite) set of rewards $\mathcal{R}$
    • the one-step dynamics of the environment, $p(s', r \mid s, a) = \mathbb{P}(S_{t+1} = s', R_{t+1} = r \mid S_t = s, A_t = a)$, for all $s$, $a$, $s'$, and $r$ (a toy example is sketched below)
    • a discount rate $\gamma \in [0, 1]$
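A toy illustration of how the one-step dynamics $p(s', r \mid s, a)$ of a finite MDP could be written down explicitly; the two-state, two-action MDP below is made up purely for this example:

```python
# p(s', r | s, a): each (state, action) maps to a list of (probability, next_state, reward).
dynamics = {
    ("s1", "a1"): [(0.7, "s1", 1.0), (0.3, "s2", 0.0)],
    ("s1", "a2"): [(1.0, "s2", 0.5)],
    ("s2", "a1"): [(0.4, "s1", 2.0), (0.6, "s2", -1.0)],
    ("s2", "a2"): [(1.0, "s2", 0.0)],
}
gamma = 0.9  # discount rate

# Sanity check: the outcome probabilities for each (s, a) pair sum to 1.
for outcomes in dynamics.values():
    assert abs(sum(p for p, _, _ in outcomes) - 1.0) < 1e-9
```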

Policies

Policies
  • A policy determines how an agent chooses an action in response to the current state
  • It specifies how the agent responds to situations that the environment has presented.

  • Deterministic policy ($\pi: \mathcal{S} \to \mathcal{A}$)
    • Returns a specific action for each state
  • Stochastic policy ($\pi: \mathcal{S} \times \mathcal{A} \to [0, 1]$)
    • Returns the probability $\pi(a \mid s)$ of taking each action $a$ in state $s$ (both kinds are sketched below)
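A minimal sketch of the two kinds of policies for a made-up state/action set (all names are illustrative):

```python
import random

# Deterministic policy: a direct mapping from state to action.
deterministic_policy = {"s1": "a1", "s2": "a2"}

# Stochastic policy: for each state, a probability distribution over actions.
stochastic_policy = {
    "s1": {"a1": 0.8, "a2": 0.2},
    "s2": {"a1": 0.5, "a2": 0.5},
}

def sample_action(policy, state):
    """Draw an action a with probability pi(a|s)."""
    actions, probs = zip(*policy[state].items())
    return random.choices(actions, weights=probs, k=1)[0]

print(deterministic_policy["s1"])               # always 'a1'
print(sample_action(stochastic_policy, "s1"))   # 'a1' about 80% of the time
```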
Optimal Policies
  1. State-Value Function
    • The value of state $s$ under a policy $\pi$: $v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$
    • For each state $s$, it yields the expected return ($\mathbb{E}_\pi[G_t]$) if the agent starts in state $s$ ($S_t = s$) and then uses the policy ($\pi$) to choose its actions for all time steps.

    • Bellman Expectation Equation: $v_\pi(s) = \mathbb{E}_\pi[R_{t+1} + \gamma\, v_\pi(S_{t+1}) \mid S_t = s]$
      • The value of any state is the expected sum of the immediate reward and the discounted value of the state that follows (written out with the one-step dynamics below).
  2. Action-Value Function

    • The value of taking action $a$ in state $s$ under a policy $\pi$: $q_\pi(s, a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]$
  3. Optimality
    • Comparing two policies
      • $\pi' \ge \pi$ if and only if $v_{\pi'}(s) \ge v_\pi(s)$ for all states $s$
    • The optimal policy $\pi_*$ satisfies $\pi_* \ge \pi$ for all policies $\pi$
    • Once the agent determines the optimal action-value function $q_*$, it can quickly obtain an optimal policy by setting $\pi_*(s) = \arg\max_a q_*(s, a)$.
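For reference, the Bellman expectation equation written out with the one-step dynamics $p(s', r \mid s, a)$ from the MDP definition above, together with how an optimal policy is read off from $q_*$:

$$
v_\pi(s) = \sum_{a} \pi(a \mid s) \sum_{s',\, r} p(s', r \mid s, a)\,\bigl[r + \gamma\, v_\pi(s')\bigr],
\qquad
\pi_*(s) = \arg\max_{a} q_*(s, a)
$$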

Monte Carlo Methods

  • Equiprobable random policy: the agent chooses each action in the action set with equal probability.

  • The action-value function as a Q-table
    • To find an optimal policy, the agent runs many episodes.
    • The Q-table is a matrix of expected values indexed by states and actions, estimated from the results of those episodes.
      • Each episode creates one such matrix, and the final Q-table holds the average values of these matrices.
  • MC Prediction (a first-visit sketch follows this list)
    • Every-visit MC Prediction: fill out the Q-table with the average return following every visit to each state-action pair
    • First-visit MC Prediction: fill out the Q-table with the average return following only the first visit to each state-action pair in each episode
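A minimal first-visit MC prediction sketch over a tabular Q, assuming episodes are given as lists of (state, action, reward) tuples (this data layout is an assumption, not the course's exact interface):

```python
from collections import defaultdict

def first_visit_mc_prediction(episodes, n_actions, gamma=1.0):
    """Estimate Q by averaging the return that follows the FIRST visit
    to each (state, action) pair in every episode."""
    returns_sum = defaultdict(lambda: [0.0] * n_actions)
    visit_count = defaultdict(lambda: [0] * n_actions)
    Q = defaultdict(lambda: [0.0] * n_actions)

    for episode in episodes:
        # Discounted return G_t following each time step, computed backwards.
        G = 0.0
        returns = []
        for _, _, reward in reversed(episode):
            G = reward + gamma * G
            returns.append(G)
        returns.reverse()

        seen = set()
        for (state, action, _), G_t in zip(episode, returns):
            if (state, action) in seen:      # first-visit: skip later visits
                continue
            seen.add((state, action))
            returns_sum[state][action] += G_t
            visit_count[state][action] += 1
            Q[state][action] = returns_sum[state][action] / visit_count[state][action]
    return Q
```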

MC Control

  • Control Problem: Estimate the optimal policy
  • MC Control Method: a solution for the control problem
    • Step 1: use the current policy $\pi$ to construct the Q-table (policy evaluation)
    • Step 2: improve the policy by changing it to be $\epsilon$-greedy with respect to the Q-table ($\pi' \leftarrow \epsilon\text{-greedy}(Q)$, then $\pi \leftarrow \pi'$)
    • Eventually, this loop yields the optimal policy $\pi_*$

Greedy Policies

  • Greedy Policies
    • Collect episodes with the current policy $\pi$ and estimate the Q-table
    • The initial policy is the equiprobable random policy
    • Use the Q-table to find a better (greedy) policy $\pi'$ and set $\pi \leftarrow \pi'$
    • Iterate these steps
  • Epsilon-Greedy Policies
    • A greedy policy always selects the greedy action
    • An epsilon-greedy policy selects the greedy action only most of the time
    • Set a specific value $\epsilon \in [0, 1]$
    • $\epsilon$ is the probability that the agent selects a random action instead of the greedy action (see the selection sketch below)
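A minimal sketch of epsilon-greedy action selection over a tabular Q stored as a dict mapping each state to a list of action values (the layout and names are illustrative):

```python
import random

def epsilon_greedy(Q, state, n_actions, eps):
    """With probability eps pick a random action, otherwise the greedy action."""
    if random.random() < eps:
        return random.randrange(n_actions)
    return max(range(n_actions), key=lambda a: Q[state][a])
```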

Exploration-Exploitation Dilemma

  • Exploration: prefer to select actions randomly
  • Exploitation: prefer to select the greedy action
  • Greedy in the Limit with Infinite Exploration (GLIE)
    • The conditions that guarantee MC control converges to the optimal policy
    • Every state-action pair $(s, a)$ (for all $s \in \mathcal{S}$ and $a \in \mathcal{A}(s)$) is visited infinitely many times
      • The agent continues to explore (it never stops exploring)
    • The policy converges to a policy that is greedy with respect to the action-value function estimate $Q$
      • The agent gradually exploits more and explores less (e.g., by decaying $\epsilon_i = 1/i$ over the episodes $i$)

Incremental Mean

  • To reduce waiting time, update the Q-table after every episode instead of only after all episodes have been collected.
  • The updated Q-table can then be used to improve the policy right away.
  • Problem (the update rule is written out below)
    • As $N$ (the number of times the state-action pair has been visited) gets bigger and bigger, the effect of the most recent return ($\frac{1}{N}(G_t - Q)$) gets smaller and smaller
    • Returns observed later therefore barely change the estimate, so the updates do not affect the state-action values evenly
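The incremental-mean update rule, where $N(S_t, A_t)$ counts how many times the pair has been visited:

$$
N(S_t, A_t) \leftarrow N(S_t, A_t) + 1, \qquad
Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \frac{1}{N(S_t, A_t)}\bigl(G_t - Q(S_t, A_t)\bigr)
$$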

Constant-alpha

  • To solve the problem with the incremental mean, constant-$\alpha$ MC uses a constant step size $\alpha$ instead of $\frac{1}{N}$ (update rule below)
    • If $\alpha = 0$, the Q-table is never updated (new returns are ignored); if $\alpha = 1$, the previous estimate (the previous $Q$) is always discarded in favor of the most recent return
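The constant-$\alpha$ update simply replaces $\frac{1}{N(S_t, A_t)}$ with a fixed step size:

$$
Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha\,\bigl(G_t - Q(S_t, A_t)\bigr), \qquad 0 \le \alpha \le 1
$$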

Temporal-Difference Methods

  • Monte Carlo methods must wait until the interaction (episode) ends before updating
    • For a self-driving car, MC methods would only update the policy after a crash
  • Temporal-Difference methods update the policy at every time step
  • On-policy TD control methods (like Expected Sarsa and Sarsa) have better online performance than off-policy TD control methods (like Q-learning).
  • Expected Sarsa generally achieves better performance than Sarsa.

TD Control

  • It is similar to constant-$\alpha$ MC
  • The value is updated at every step instead of after the episode ends
  • In MC, the Q-value is updated toward the sum of future rewards once the episode has ended. In TD, the Q-value is updated at every single step toward the current reward plus the next Q-value (the expected sum of future rewards after $t+1$)
    • This means the agent uses the next action’s Q-value as an estimate of the sum of future rewards
    • MC: update rule $Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha\,(G_t - Q(S_t, A_t))$, where $G_t$ is the cumulative (discounted) reward after time $t$
    • TD (Sarsa): use $R_{t+1} + \gamma\, Q(S_{t+1}, A_{t+1})$ as the target instead of $G_t$

Sarsa (Sarsa(0))
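A minimal tabular Sarsa(0) sketch, assuming the classic Gym API and reusing the epsilon_greedy helper from the Epsilon-Greedy sketch above (hyperparameter values are illustrative):

```python
from collections import defaultdict

def sarsa(env, n_episodes, alpha=0.1, gamma=1.0, eps=0.1):
    """Tabular Sarsa(0); epsilon_greedy is the helper sketched earlier."""
    n_actions = env.action_space.n
    Q = defaultdict(lambda: [0.0] * n_actions)
    for _ in range(n_episodes):
        state = env.reset()
        action = epsilon_greedy(Q, state, n_actions, eps)
        done = False
        while not done:
            next_state, reward, done, _ = env.step(action)
            next_action = epsilon_greedy(Q, next_state, n_actions, eps)
            # Sarsa(0) target: R_{t+1} + gamma * Q(S_{t+1}, A_{t+1}); zero bootstrap if terminal.
            target = reward + gamma * Q[next_state][next_action] * (not done)
            Q[state][action] += alpha * (target - Q[state][action])
            state, action = next_state, next_action
    return Q
```

For the Taxi mini-project below, this could be called as, e.g., sarsa(gym.make("Taxi-v2"), 20000) (the environment version may differ).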

Q-Learning (Sarsamax)

  • Sarsa uses $Q(S_{t+1}, A_{t+1})$ (the Q-value of the next action actually taken) to estimate future rewards
  • Q-Learning uses $\max_a Q(S_{t+1}, a)$ instead
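Relative to the Sarsa sketch above, only the bootstrap target changes; a minimal update function over the same tabular Q layout might look like this:

```python
def q_learning_update(Q, state, action, reward, next_state, done, alpha, gamma):
    # Q-learning (Sarsamax) bootstraps from the greedy next action,
    # regardless of which action the agent actually takes next.
    target = reward + gamma * max(Q[next_state]) * (not done)
    Q[state][action] += alpha * (target - Q[state][action])
```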

Expected Sarsa

  • Uses the expected Q-value of the next state (the probability of each action under the policy × that action’s Q-value, summed over actions) instead of the maximum Q-value used in Q-learning
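A corresponding sketch for the Expected Sarsa target, assuming the policy being evaluated is $\epsilon$-greedy over the same tabular Q:

```python
def expected_sarsa_update(Q, state, action, reward, next_state, done,
                          alpha, gamma, eps):
    # Expected Sarsa bootstraps from the expected Q-value under the
    # epsilon-greedy policy: sum_a pi(a|S_{t+1}) * Q(S_{t+1}, a).
    q_next = Q[next_state]
    n_actions = len(q_next)
    greedy = max(range(n_actions), key=lambda a: q_next[a])
    probs = [eps / n_actions] * n_actions
    probs[greedy] += 1.0 - eps
    expected_q = sum(p * q for p, q in zip(probs, q_next))
    target = reward + gamma * expected_q * (not done)
    Q[state][action] += alpha * (target - Q[state][action])
```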

Mini-Project (OpenAI Gym Taxi)

Deep Reinforcement Learning

Overview of Reinforcement Learning in Discrete Spaces

  • In finite Markov Decision Processes (MDPs), reinforcement learning environments where the number of states and actions is limited, it is possible to represent the action-value function with a table, dictionary, or other finite structure.

  • But what about MDPs with much larger spaces? Consider that the Q-table must have a row for each state. For instance, if there are 10 million possible states, the Q-table must have 10 million rows. Furthermore, if the state space is the set of continuous real-valued numbers (an infinite set!), it becomes impossible to represent the action values in a finite structure!

  • Markov Decision Processes (MDPs):
    • State transition and reward model: $p(s', r \mid s, a)$
    • State-value function: $v_\pi(s)$
    • Action-value function: $q_\pi(s, a)$
    • Goal: find an optimal policy $\pi_*$ that maximizes the total expected reward
  • Reinforcement Learning Algorithms
    • Model-Based Learning (Dynamic Programming)
      • Policy Iteration
      • Value Iteration
    • Model-Free Learning
      • Monte Carlo Methods
      • Temporal-Difference Methods
  • Deep Reinforcement Learning
    • RL in Continuous Spaces
    • Deep Q-Learning
    • Policy Gradients
    • Actor-Critic Methods

Continuous Spaces

  • Discretization
    • Continuous spaces → Discrete spaces
    • Non-Uniform Discretization
    • Tile Coding, Coarse Coding
      • Use multiple Q-tables (one per layer)
      • The greedy action is the action with the maximum Q-value averaged across the layers
      • Tile coding builds the layers from overlapping rectangular grids; coarse coding builds them from overlapping circles (a simplified tile-coding sketch follows at the end of this section)
  • Function Approximation
    • Linear Function Approximation
    • Non-Linear Function Approximation
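A simplified one-dimensional tile-coding sketch (grid sizes, offsets, and helper names are all illustrative; real implementations cover n-dimensional state spaces):

```python
import numpy as np

def create_tilings(low, high, bins, offsets):
    """Build several overlapping uniform grids ('tilings') over a 1-D continuous range."""
    return [np.linspace(low, high, bins + 1)[1:-1] + off for off in offsets]

def tile_encode(x, tilings):
    """Return the tile index of x in each tiling."""
    return [int(np.digitize(x, grid)) for grid in tilings]

tilings = create_tilings(low=0.0, high=1.0, bins=10, offsets=[0.0, 0.033, 0.066])
print(tile_encode(0.42, tilings))  # e.g. [4, 3, 3]: one tile index per layer
```

Each tiling would then get its own Q-table, and the greedy action is the one whose Q-value averaged across tilings is largest, as described above.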