[Deep Reinforcement Learning Nanodegree Chapter 1] Foundations of Reinforcement Learning

1. Foundations of Reinforcement Learning

Official course description

The first part begins with a simple introduction to reinforcement learning. You’ll learn how to define real-world problems as Markov Decision Processes (MDPs), so that they can be solved with reinforcement learning.

How might we use reinforcement learning to teach a robot to walk? (Source)

Then, you’ll implement classical methods such as SARSA and Q-learning to solve several environments in OpenAI Gym. You’ll then explore how to use techniques such as tile coding and coarse coding to expand the size of the problems that can be solved with traditional reinforcement learning algorithms.

Train a car to navigate a steep hill using Q-learning.

[toc]

Foundations of Reinforcement Learning

Reinforcement Learning (RL)

  • Building code that can learn to perform complex tasks by itself

Applications

  • Games: AlphaGo, Atari Breakout, Dota 2
  • Self-driving: cars (Uber, Google), ships, airplanes
  • Robotics: walking robots

Terminologies

  • Agent: the learner or decision-maker, born into the world without any understanding of how anything works
  • The agent learns by interacting with the environment
  • Feedback: rewards (positive feedback) or discouraging (negative) feedback
  • Goal: maximize (cumulative) reward

Exploration-Exploitation Dilemma

  • Exploration: Exploring potential hypotheses for how to choose actions

  • Exploitation: Exploiting limited knowledge about what is already known to work well

  • Balancing these competing requirements

Reinforcement learning framework

Setting

[Figure: the agent-environment interaction loop]

  • State (observation, $S_t$): the environment presents a situation to the agent
  • Action ($A_t$): the agent responds with an appropriate action
  • Reward ($R_{t+1}$): one time step later, the agent receives a reward (along with the next state $S_{t+1}$)
  • The goal of the agent: maximize the expected cumulative reward (see the interaction-loop sketch below)
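A minimal sketch of this interaction loop with the classic OpenAI Gym API (random agent; the environment name and the pre-0.26 reset/step signatures are assumptions here):

```python
import gym

env = gym.make("FrozenLake-v0")      # any discrete-state Gym environment works here
state = env.reset()                   # the environment presents the first state S_0
done = False
total_reward = 0.0
while not done:
    action = env.action_space.sample()               # the agent picks an action A_t
    state, reward, done, info = env.step(action)     # one step later: R_{t+1}, S_{t+1}
    total_reward += reward
print("Cumulative reward for this episode:", total_reward)
```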

Episodic task & Continuing task

  • Task: an instance of the reinforcement learning problem
  • Episodic task
    • Tasks with a well-defined starting and ending point
    • There is a specific ending point (the terminal state)
    • An episode is one complete sequence of interaction, ending at some final time step
    • The reward (e.g., win or lose) is often given at the ending point
  • Continuing task
    • Tasks that continue forever, without end
    • Interaction continues without limit

Rewards

  • Reward Hypothesis: All goals can be framed as the maximization of expected cumulative reward

  • Cumulative reward (return): $G_t = R_{t+1} + R_{t+2} + R_{t+3} + \dots$
    • At time step $t$, the agent picks $A_t$ to maximize the (expected) return $G_t$
  • Discounted reward (return): $G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots$ (see the sketch below)
    • $\gamma \in [0, 1]$ is the discount rate
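A small sketch of how a discounted return could be computed for a finite list of rewards (the reward values and $\gamma$ are illustrative):

```python
def discounted_return(rewards, gamma=0.9):
    """G_t = R_{t+1} + gamma*R_{t+2} + gamma^2*R_{t+3} + ..., computed backwards."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

print(discounted_return([1, 0, 0, 10], gamma=0.9))  # 1 + 0.9**3 * 10 = 8.29
```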

Markov Decision Process (MDP)


  • A (finite) MDP is defined by
    • a (finite) set of states $\mathcal{S}$
    • a (finite) set of actions $\mathcal{A}$
    • a (finite) set of rewards $\mathcal{R}$
    • the one-step dynamics of the environment, $p(s', r \mid s, a) = \mathbb{P}(S_{t+1} = s', R_{t+1} = r \mid S_t = s, A_t = a)$, for all $s$, $a$, $s'$, and $r$ (a toy example is sketched below)
    • a discount rate $\gamma \in [0, 1]$
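A toy illustration of how the one-step dynamics $p(s', r \mid s, a)$ of a finite MDP could be written down explicitly; the two-state, two-action MDP below is made up purely for this example:

```python
# p(s', r | s, a): each (state, action) maps to a list of (probability, next_state, reward).
dynamics = {
    ("s1", "a1"): [(0.7, "s1", 1.0), (0.3, "s2", 0.0)],
    ("s1", "a2"): [(1.0, "s2", 0.5)],
    ("s2", "a1"): [(0.4, "s1", 2.0), (0.6, "s2", -1.0)],
    ("s2", "a2"): [(1.0, "s2", 0.0)],
}
gamma = 0.9  # discount rate

# Sanity check: the outcome probabilities for each (s, a) pair sum to 1.
for outcomes in dynamics.values():
    assert abs(sum(p for p, _, _ in outcomes) - 1.0) < 1e-9
```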

Policies

Policies
  • A policy determines how an agent chooses an action in response to the current state
  • It specifies how the agent responds to situations that the environment has presented.

  • Deterministic policy ($\pi: \mathcal{S} \to \mathcal{A}$)
    • Returns a specific action for each state
  • Stochastic policy ($\pi: \mathcal{S} \times \mathcal{A} \to [0, 1]$)
    • Returns the probability $\pi(a \mid s)$ of taking each action $a$ in state $s$ (both kinds are sketched below)
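A minimal sketch of the two kinds of policies for a made-up state/action set (all names are illustrative):

```python
import random

# Deterministic policy: a direct mapping from state to action.
deterministic_policy = {"s1": "a1", "s2": "a2"}

# Stochastic policy: for each state, a probability distribution over actions.
stochastic_policy = {
    "s1": {"a1": 0.8, "a2": 0.2},
    "s2": {"a1": 0.5, "a2": 0.5},
}

def sample_action(policy, state):
    """Draw an action a with probability pi(a|s)."""
    actions, probs = zip(*policy[state].items())
    return random.choices(actions, weights=probs, k=1)[0]

print(deterministic_policy["s1"])               # always 'a1'
print(sample_action(stochastic_policy, "s1"))   # 'a1' about 80% of the time
```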
Optimal Policies
  1. State-Value Function
    • The value of state $s$ under a policy $\pi$: $v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$
    • For each state $s$, it yields the expected return ($\mathbb{E}_\pi[G_t]$) if the agent starts in state $s$ ($S_t = s$) and then uses the policy ($\pi$) to choose its actions for all time steps.

    • Bellman Expectation Equation: $v_\pi(s) = \mathbb{E}_\pi[R_{t+1} + \gamma\, v_\pi(S_{t+1}) \mid S_t = s]$
      • The value of any state is the expected sum of the immediate reward and the discounted value of the state that follows (written out with the one-step dynamics below).
  2. Action-Value Function

    • The value of taking action $a$ in state $s$ under a policy $\pi$: $q_\pi(s, a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]$
  3. Optimality
    • Comparing two policies
      • $\pi' \ge \pi$ if and only if $v_{\pi'}(s) \ge v_\pi(s)$ for all states $s$
    • The optimal policy $\pi_*$ satisfies $\pi_* \ge \pi$ for all policies $\pi$
    • Once the agent determines the optimal action-value function $q_*$, it can quickly obtain an optimal policy by setting $\pi_*(s) = \arg\max_a q_*(s, a)$.
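For reference, the Bellman expectation equation written out with the one-step dynamics $p(s', r \mid s, a)$ from the MDP definition above, together with how an optimal policy is read off from $q_*$:

$$
v_\pi(s) = \sum_{a} \pi(a \mid s) \sum_{s',\, r} p(s', r \mid s, a)\,\bigl[r + \gamma\, v_\pi(s')\bigr],
\qquad
\pi_*(s) = \arg\max_{a} q_*(s, a)
$$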

Monte Carlo Methods

  • Equiprobable random policy: the agent chooses each action in the action set with equal probability.

  • The action-value function as a Q-table
    • To find an optimal policy, the agent runs many episodes.
    • The Q-table is a matrix of expected values indexed by states and actions, estimated from the results of those episodes.
      • Each episode creates one such matrix, and the final Q-table holds the average values of these matrices.
  • MC Prediction (a first-visit sketch follows this list)
    • Every-visit MC Prediction: fill out the Q-table with the average return following every visit to each state-action pair
    • First-visit MC Prediction: fill out the Q-table with the average return following only the first visit to each state-action pair in each episode
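A minimal first-visit MC prediction sketch over a tabular Q, assuming episodes are given as lists of (state, action, reward) tuples (this data layout is an assumption, not the course's exact interface):

```python
from collections import defaultdict

def first_visit_mc_prediction(episodes, n_actions, gamma=1.0):
    """Estimate Q by averaging the return that follows the FIRST visit
    to each (state, action) pair in every episode."""
    returns_sum = defaultdict(lambda: [0.0] * n_actions)
    visit_count = defaultdict(lambda: [0] * n_actions)
    Q = defaultdict(lambda: [0.0] * n_actions)

    for episode in episodes:
        # Discounted return G_t following each time step, computed backwards.
        G = 0.0
        returns = []
        for _, _, reward in reversed(episode):
            G = reward + gamma * G
            returns.append(G)
        returns.reverse()

        seen = set()
        for (state, action, _), G_t in zip(episode, returns):
            if (state, action) in seen:      # first-visit: skip later visits
                continue
            seen.add((state, action))
            returns_sum[state][action] += G_t
            visit_count[state][action] += 1
            Q[state][action] = returns_sum[state][action] / visit_count[state][action]
    return Q
```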

MC Control

  • Control Problem: Estimate the optimal policy
  • MC Control Method: a solution for the control problem
    • Step 1: use the current policy $\pi$ to construct the Q-table (policy evaluation)
    • Step 2: improve the policy by changing it to be $\epsilon$-greedy with respect to the Q-table ($\pi' \leftarrow \epsilon\text{-greedy}(Q)$, then $\pi \leftarrow \pi'$)
    • Eventually, this loop yields the optimal policy $\pi_*$

Greedy Policies

  • Greedy Policies
    • Collect episodes with the current policy $\pi$ and estimate the Q-table
    • The initial policy is the equiprobable random policy
    • Use the Q-table to find a better (greedy) policy $\pi'$ and set $\pi \leftarrow \pi'$
    • Iterate these steps
  • Epsilon-Greedy Policies
    • A greedy policy always selects the greedy action
    • An epsilon-greedy policy selects the greedy action only most of the time
    • Set a specific value $\epsilon \in [0, 1]$
    • $\epsilon$ is the probability that the agent selects a random action instead of the greedy action (see the selection sketch below)
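A minimal sketch of epsilon-greedy action selection over a tabular Q stored as a dict mapping each state to a list of action values (the layout and names are illustrative):

```python
import random

def epsilon_greedy(Q, state, n_actions, eps):
    """With probability eps pick a random action, otherwise the greedy action."""
    if random.random() < eps:
        return random.randrange(n_actions)
    return max(range(n_actions), key=lambda a: Q[state][a])
```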

Exploration-Exploitation Dilemma

  • Exploration: prefer to select actions randomly
  • Exploitation: prefer to select the greedy action
  • Greedy in the Limit with Infinite Exploration (GLIE)
    • The conditions that guarantee MC control converges to the optimal policy
    • Every state-action pair $(s, a)$ (for all $s \in \mathcal{S}$ and $a \in \mathcal{A}(s)$) is visited infinitely many times
      • The agent continues to explore (it never stops exploring)
    • The policy converges to a policy that is greedy with respect to the action-value function estimate $Q$
      • The agent gradually exploits more and explores less (e.g., by decaying $\epsilon_i = 1/i$ over the episodes $i$)

Incremental Mean

  • To reduce waiting time, update the Q-table after every episode instead of only after all episodes have been collected.
  • The updated Q-table can then be used to improve the policy right away.
  • Problem (the update rule is written out below)
    • As $N$ (the number of times the state-action pair has been visited) gets bigger and bigger, the effect of the most recent return ($\frac{1}{N}(G_t - Q)$) gets smaller and smaller
    • Returns observed later therefore barely change the estimate, so the updates do not affect the state-action values evenly
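The incremental-mean update rule, where $N(S_t, A_t)$ counts how many times the pair has been visited:

$$
N(S_t, A_t) \leftarrow N(S_t, A_t) + 1, \qquad
Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \frac{1}{N(S_t, A_t)}\bigl(G_t - Q(S_t, A_t)\bigr)
$$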

Constant-alpha

  • To solve the problem with the incremental mean, constant-$\alpha$ MC uses a constant step size $\alpha$ instead of $\frac{1}{N}$ (update rule below)
    • If $\alpha = 0$, the Q-table is never updated (new returns are ignored); if $\alpha = 1$, the previous estimate (the previous $Q$) is always discarded in favor of the most recent return
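The constant-$\alpha$ update simply replaces $\frac{1}{N(S_t, A_t)}$ with a fixed step size:

$$
Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha\,\bigl(G_t - Q(S_t, A_t)\bigr), \qquad 0 \le \alpha \le 1
$$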

Temporal-Difference Methods

  • Monte Carlo methods must wait until the interaction (episode) ends before updating
    • For a self-driving car, MC methods would only update the policy after a crash
  • Temporal-Difference methods update the policy at every time step
  • On-policy TD control methods (like Expected Sarsa and Sarsa) have better online performance than off-policy TD control methods (like Q-learning).
  • Expected Sarsa generally achieves better performance than Sarsa.

TD Control

  • It is similar to constant-$\alpha$ MC
  • The value is updated at every step instead of after the episode ends
  • In MC, the Q-value is updated toward the sum of future rewards once the episode has ended. In TD, the Q-value is updated at every single step toward the current reward plus the next Q-value (the expected sum of future rewards after $t+1$)
    • This means the agent uses the next action’s Q-value as an estimate of the sum of future rewards
    • MC: update rule $Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha\,(G_t - Q(S_t, A_t))$, where $G_t$ is the cumulative (discounted) reward after time $t$
    • TD (Sarsa): use $R_{t+1} + \gamma\, Q(S_{t+1}, A_{t+1})$ as the target instead of $G_t$

Sarsa (Sarsa(0))
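A minimal tabular Sarsa(0) sketch, assuming the classic Gym API and reusing the epsilon_greedy helper from the Epsilon-Greedy sketch above (hyperparameter values are illustrative):

```python
from collections import defaultdict

def sarsa(env, n_episodes, alpha=0.1, gamma=1.0, eps=0.1):
    """Tabular Sarsa(0); epsilon_greedy is the helper sketched earlier."""
    n_actions = env.action_space.n
    Q = defaultdict(lambda: [0.0] * n_actions)
    for _ in range(n_episodes):
        state = env.reset()
        action = epsilon_greedy(Q, state, n_actions, eps)
        done = False
        while not done:
            next_state, reward, done, _ = env.step(action)
            next_action = epsilon_greedy(Q, next_state, n_actions, eps)
            # Sarsa(0) target: R_{t+1} + gamma * Q(S_{t+1}, A_{t+1}); zero bootstrap if terminal.
            target = reward + gamma * Q[next_state][next_action] * (not done)
            Q[state][action] += alpha * (target - Q[state][action])
            state, action = next_state, next_action
    return Q
```

For the Taxi mini-project below, this could be called as, e.g., sarsa(gym.make("Taxi-v2"), 20000) (the environment version may differ).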

Q-Learning (Sarsamax)

  • Sarsa uses $Q(S_{t+1}, A_{t+1})$ (the Q-value of the next action actually taken) to estimate future rewards
  • Q-Learning uses $\max_a Q(S_{t+1}, a)$ instead
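Relative to the Sarsa sketch above, only the bootstrap target changes; a minimal update function over the same tabular Q layout might look like this:

```python
def q_learning_update(Q, state, action, reward, next_state, done, alpha, gamma):
    # Q-learning (Sarsamax) bootstraps from the greedy next action,
    # regardless of which action the agent actually takes next.
    target = reward + gamma * max(Q[next_state]) * (not done)
    Q[state][action] += alpha * (target - Q[state][action])
```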

Expected Sarsa

  • Uses the expected Q-value of the next state (the probability of each action under the policy × that action’s Q-value, summed over actions) instead of the maximum Q-value used in Q-learning
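A corresponding sketch for the Expected Sarsa target, assuming the policy being evaluated is $\epsilon$-greedy over the same tabular Q:

```python
def expected_sarsa_update(Q, state, action, reward, next_state, done,
                          alpha, gamma, eps):
    # Expected Sarsa bootstraps from the expected Q-value under the
    # epsilon-greedy policy: sum_a pi(a|S_{t+1}) * Q(S_{t+1}, a).
    q_next = Q[next_state]
    n_actions = len(q_next)
    greedy = max(range(n_actions), key=lambda a: q_next[a])
    probs = [eps / n_actions] * n_actions
    probs[greedy] += 1.0 - eps
    expected_q = sum(p * q for p, q in zip(probs, q_next))
    target = reward + gamma * expected_q * (not done)
    Q[state][action] += alpha * (target - Q[state][action])
```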

Mini-Project (OpenAI Gym Taxi)

Deep Reinforcement Learning

Overview of Reinforcement Learning in Discrete Spaces

  • In finite Markov Decision Processes (MDPs), reinforcement learning environments where the number of states and actions is limited, it is possible to represent the action-value function with a table, dictionary, or other finite structure.

  • But what about MDPs with much larger spaces? Consider that the Q-table must have a row for each state. For instance, if there are 10 million possible states, the Q-table must have 10 million rows. Furthermore, if the state space is the set of continuous real-valued numbers (an infinite set!), it becomes impossible to represent the action values in a finite structure!

  • Markov Decision Processes (MDPs):
    • State transition and reward model: $p(s', r \mid s, a)$
    • State-value function: $v_\pi(s)$
    • Action-value function: $q_\pi(s, a)$
    • Goal: find an optimal policy $\pi_*$ that maximizes the total expected reward
  • Reinforcement Learning Algorithms
    • Model-Based Learning (Dynamic Programming)
      • Policy Iteration
      • Value Iteration
    • Model-Free Learning
      • Monte Carlo Methods
      • Temporal-Difference Methods
  • Deep Reinforcement Learning
    • RL in Continuous Spaces
    • Deep Q-Learning
    • Policy Gradients
    • Actor-Critic Methods

Continuous Spaces

  • Discretization
    • Continuous spaces → Discrete spaces
    • Non-Uniform Discretization
    • Tile Coding, Coarse Coding
      • Use multiple Q-tables (one per layer)
      • The greedy action is the action with the maximum Q-value averaged across the layers
      • Tile coding builds the layers from overlapping rectangular grids; coarse coding builds them from overlapping circles (a simplified tile-coding sketch follows at the end of this section)
  • Function Approximation
    • Linear Function Approximation
    • Non-Linear Function Approximation
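A simplified one-dimensional tile-coding sketch (grid sizes, offsets, and helper names are all illustrative; real implementations cover n-dimensional state spaces):

```python
import numpy as np

def create_tilings(low, high, bins, offsets):
    """Build several overlapping uniform grids ('tilings') over a 1-D continuous range."""
    return [np.linspace(low, high, bins + 1)[1:-1] + off for off in offsets]

def tile_encode(x, tilings):
    """Return the tile index of x in each tiling."""
    return [int(np.digitize(x, grid)) for grid in tilings]

tilings = create_tilings(low=0.0, high=1.0, bins=10, offsets=[0.0, 0.033, 0.066])
print(tile_encode(0.42, tilings))  # e.g. [4, 3, 3]: one tile index per layer
```

Each tiling would then get its own Q-table, and the greedy action is the one whose Q-value averaged across tilings is largest, as described above.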