Capstone Project: Reinforcement Learning from Scratch with Python

What will we cover?

We will learn what Reinforcement Learning is and how it works. Then, using Object-Oriented Programming techniques (more about Object-Oriented Programming), we implement a Reinforcement Learning model that figures out where to pick up and drop off an item on a field.

Step 1: What is Reinforcement Learning?

Reinforcement Learning is one of the 3 main categories of Machine Learning (get started with Machine Learning here) and is concerned with how intelligent agents ought to take actions in an environment in order to maximize the notion of cumulative reward.

How Reinforcement Learning works

Reinforcement Learning teaches the machine to think for itself based on past action rewards.

  • Basically, the Reinforcement Learning algorithm tries to predict actions that give rewards and avoid punishment.
  • It is like training a dog. You and the dog do not speak the same language, but the dog learns how to act based on rewards (and punishment, which I do not advise or advocate).
  • Hence, if a dog is rewarded for a certain action in a given situation, then next time it is exposed to a similar situation it will act the same.
  • Translate that to Reinforcement Learning.
    • The agent is the dog that is exposed to the environment.
    • Then the agent encounters a state.
    • The agent performs an action to transition to a new state.
    • Then after the transition the agent receives a reward or penalty (punishment).
    • From these rewards and penalties the agent forms a policy: a strategy for choosing actions in a given state (a minimal version of this loop is sketched below).
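
To make the loop of states, actions, and rewards concrete, here is a tiny, purely illustrative sketch in Python. The toy_environment function and the "sit"/"bark" actions are made up for this example and are not part of the project we build below.

import random

def toy_environment(state, action):
    # A made-up environment: reward the desired action, punish anything else.
    if action == "sit":
        return state, 1
    return state, -1

state = "start"
total_reward = 0
for _ in range(10):
    action = random.choice(["sit", "bark"])          # the agent performs an action
    state, reward = toy_environment(state, action)   # ...and transitions to a new state
    total_reward += reward                           # the reward/penalty is the feedback it learns from
print(total_reward)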

What algorithms are used for Reinforcement Learning?

  • The most common algorithms for Reinforcement Learning are:
    • Q-Learning: is a model-free reinforcement learning algorithm to learn a policy telling an agent what action to take under what circumstances.
    • Temporal Difference: refers to a class of model-free reinforcement learning methods which learn by bootstrapping from the current estimate of the value function.
    • Deep Adversarial Network: is a technique employed in the field of machine learning which attempts to fool models through malicious input.
  • We will focus on the Q-learning algorithm as it is easy to understand as well as powerful.

How does the Q-learning algorithm work?

  • As already noted, I just love this algorithm. It is “easy” to understand and powerful as you will see.
  • The Q-Learning algorithm has a Q-table (a matrix of dimension states x actions; don’t worry if you do not understand what a matrix is, you will not need the mathematical aspects of it, as it is just an indexed “container” of numbers).
  • The agent (or Q-Learning algorithm) will be in a state.
  • Then in each iteration the agent needs to take an action.
  • The agent will continuously update the values in the Q-table based on the rewards it receives.
  • The learning can come from either exploiting or exploring.
  • This translates into the following pseudo algorithm for Q-Learning.
  • The agent is in a given state and needs to choose an action.

Algorithm

  • Initialise the Q-table to all zeros
  • Iterate
    • The agent is in the current state (state).
    • With probability epsilon choose to explore, else exploit.
      • If explore, then choose a random action.
      • If exploit, then choose the best action based on the current Q-table.
    • Update the Q-table entry for that state and action with the new reward and the value of the new state.
    • Q[state, action] = (1 - alpha) * Q[state, action] + alpha * (reward + gamma * max(Q[new_state]))

Variables

As you can see, we have introduced the following variables.

  • epsilon: the probability to take a random action, which is done to explore new territory.
  • alpha: the learning rate, i.e., how big a step the algorithm takes in each update; it should be in the interval from 0 to 1.
  • gamma: the discount factor used to balance the immediate and future reward. This value is usually between 0.8 and 0.99.
  • reward: the feedback on the action; it can be any number. Negative is a penalty (or punishment) and positive is a reward. A small worked example of the update rule follows below.
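
To make the update rule concrete, here is a single update with made-up numbers. The values of q_current, reward, and max_q_new_state are purely illustrative and not taken from the project.

alpha = 0.1    # learning rate
gamma = 0.6    # discount factor

q_current = 2.0        # current estimate Q[state, action]
reward = -1            # feedback on the action taken (here: a legal move)
max_q_new_state = 5.0  # best value in the row Q[new_state]

q_updated = (1 - alpha) * q_current + alpha * (reward + gamma * max_q_new_state)
print(q_updated)  # 0.9 * 2.0 + 0.1 * (-1 + 0.6 * 5.0) = 2.0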

Step 2: The problem we want to solve

Here we have a description of the task we want to solve.

  • To keep it simple, we create a field of size 10×10 positions. In that field there is an item that needs to be picked up and moved to a drop-off point.
  • At each position there are 6 different actions that can be taken.
    • Action 0: Go South if on field.
    • Action 1: Go North if on field.
    • Action 2: Go West if on field.
    • Action 3: Go East if on field.
    • Action 4: Pickup item (it can try even if it is not there)
    • Action 5: Drop-off item (it can try even if it does not have it)
  • Based on these actions we will make a reward system.
    • If the agent tries to go off the field, punish with -10 in reward.
    • If the agent makes a (legal) move, punish with -1 in reward, as we do not want to encourage endless walking around.
    • If the agent tries to pick up item, but it is not there or it has it already, punish with -10 in reward.
    • If the agent picks up the item in the correct place, reward with 20.
    • If the agent tries to drop off the item in the wrong place or does not have the item, punish with -10 in reward.
    • If the agent drops off the item in the correct place, reward with 20.
  • That translates into the code in the next step. I prefer to implement this code myself, as I think the standard libraries that provide similar frameworks hide some important details. As an example, and shown later: how do you map this into a state in the Q-table?
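
Before the full implementation in Step 3, this is one possible way to write the action space and reward scheme down as plain constants. The names below are my own and are not used in the code that follows; they only summarise the list above.

# Hypothetical constants summarising the actions and rewards (names are my own).
ACTIONS = {
    0: "go south",
    1: "go north",
    2: "go west",
    3: "go east",
    4: "pick up item",
    5: "drop off item",
}

REWARD_OFF_FIELD = -10       # trying to move off the field
REWARD_LEGAL_MOVE = -1       # a legal move; discourages endless walking around
REWARD_WRONG_PICKUP = -10    # picking up in the wrong place (or already carrying the item)
REWARD_PICKUP = 20           # picking up the item in the correct place
REWARD_WRONG_DROP_OFF = -10  # dropping off in the wrong place (or not carrying the item)
REWARD_DROP_OFF = 20         # dropping off the item in the correct place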

Step 3: Implementing the field

First we need a way to represent the field, i.e., the environment our model lives in. It was defined in Step 2 and could be implemented as follows, with a short usage example after the class.

class Field:
    def __init__(self, size, item_pickup, item_drop_off, start_position):
        self.size = size
        self.item_pickup = item_pickup
        self.item_drop_off = item_drop_off
        self.position = start_position
        self.item_in_car = False
        
    def get_number_of_states(self):
        # Position (x, y), item pickup position (x, y) and whether the item
        # is in the car: size * size * size * size * 2 combinations in total.
        return self.size*self.size*self.size*self.size*2
    
    def get_state(self):
        # Encode the current configuration as a single index into the Q-table
        state = self.position[0]*self.size*self.size*self.size*2
        state = state + self.position[1]*self.size*self.size*2
        state = state + self.item_pickup[0]*self.size*2
        state = state + self.item_pickup[1]*2
        if self.item_in_car:
            state = state + 1
        return state
        
    def make_action(self, action):
        (x, y) = self.position
        if action == 0:  # Go South
            if y == self.size - 1:
                return -10, False
            else:
                self.position = (x, y + 1)
                return -1, False
        elif action == 1:  # Go North
            if y == 0:
                return -10, False
            else:
                self.position = (x, y - 1)
                return -1, False
        elif action == 2:  # Go West
            if x == 0:
                return -10, False
            else:
                self.position = (x - 1, y)
                return -1, False
        elif action == 3:  # Go East
            if x == self.size - 1:
                return -10, False
            else:
                self.position = (x + 1, y)
                return -1, False
        elif action == 4:  # Pickup item
            if self.item_in_car:
                return -10, False
            elif self.item_pickup != (x, y):
                return -10, False
            else:
                self.item_in_car = True
                return 20, False
        elif action == 5:  # Drop off item
            if not self.item_in_car:
                return -10, False
            elif self.item_drop_off != (x, y):
                # Wrong place: the item is dropped here and has to be picked up again
                self.item_pickup = (x, y)
                self.item_in_car = False
                return -10, False
            else:
                return 20, True
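
As a quick, optional sanity check (this snippet is not part of the original walkthrough), you can create a Field and see how positions and actions map to state indices and rewards. The positions are the same as the ones used in the next steps.

field = Field(size=10, item_pickup=(0, 0), item_drop_off=(9, 9), start_position=(0, 9))

print(field.get_number_of_states())  # 20000 states for a 10x10 field
print(field.get_state())             # the state index of the starting configuration

print(field.make_action(0))  # go South from y=9: off the field, so (-10, False)
print(field.make_action(1))  # go North: a legal move, so (-1, False)
print(field.make_action(4))  # try to pick up away from (0, 0): (-10, False)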

Step 4: A Naive approach to solve it (NON-Machine Learning)

A naive approach would be to just take random actions and hope for the best. This is obviously not optimal, but it is nice to have as a baseline to compare with.

import random

def naive_solution():
    size = 10
    item_start = (0, 0)
    item_drop_off = (9, 9)
    start_position = (0, 9)
    
    field = Field(size, item_start, item_drop_off, start_position)
    done = False
    steps = 0
    
    while not done:
        action = random.randint(0, 5)
        reward, done = field.make_action(action)
        steps = steps + 1
    
    return steps

To estimate how many steps this takes on average, you can run this code.

runs = [naive_solution() for _ in range(100)]
print(sum(runs)/len(runs))

Here we use a list comprehension (learn more about list comprehension). This gave 143579.21. Notice that you will most likely get something different, as there is a high level of randomness involved.

Step 5: Implementing our Reinforcement Learning Model

Here we repeat the algorithm we need to implement.

Algorithm

  • Initialise the Q-table to all zeros
  • Iterate
    • The agent is in the current state (state).
    • With probability epsilon choose to explore, else exploit.
      • If explore, then choose a random action.
      • If exploit, then choose the best action based on the current Q-table.
    • Update the Q-table entry for that state and action with the new reward and the value of the new state.
    • Q[state, action] = (1 - alpha) * Q[state, action] + alpha * (reward + gamma * max(Q[new_state]))

Then we end up with the following code to train our Q-table.

import numpy as np
import random

size = 10
item_start = (0, 0)
item_drop_off = (9, 9)
start_position = (0, 9)

field = Field(size, item_start, item_drop_off, start_position)

number_of_states = field.get_number_of_states()
number_of_actions = 6

q_table = np.zeros((number_of_states, number_of_actions))

epsilon = 0.1
alpha = 0.1
gamma = 0.6

for _ in range(10000):
    field = Field(size, item_start, item_drop_off, start_position)
    done = False
    
    while not done:
        state = field.get_state()
        if random.uniform(0, 1) < epsilon:
            action = random.randint(0, 5)
        else:
            action = np.argmax(q_table[state])
            
        reward, done = field.make_action(action)
        # Q-learning update: Q[state, action] = (1 - alpha) * Q[state, action] + alpha * (reward + gamma * max(Q[new_state]))
        
        new_state = field.get_state()
        new_state_max = np.max(q_table[new_state])
        
        q_table[state, action] = (1 - alpha)*q_table[state, action] + alpha*(reward + gamma*new_state_max)

Then we can apply our model as follows.

def reinforcement_learning():
    epsilon = 0.1
    alpha = 0.1
    gamma = 0.6
    
    field = Field(size, item_start, item_drop_off, start_position)
    done = False
    steps = 0
    
    while not done:
        state = field.get_state()
        if random.uniform(0, 1) < epsilon:
            action = random.randint(0, 5)
        else:
            action = np.argmax(q_table[state])
            
        reward, done = field.make_action(action)
        # Q-learning update: Q[state, action] = (1 - alpha) * Q[state, action] + alpha * (reward + gamma * max(Q[new_state]))
        
        new_state = field.get_state()
        new_state_max = np.max(q_table[new_state])
        
        q_table[state, action] = (1 - alpha)*q_table[state, action] + alpha*(reward + gamma*new_state_max)
        
        steps = steps + 1
    
    return steps

And evaluate it as follows.

runs_rl = [reinforcement_learning() for _ in range(100)]
print(sum(runs_rl)/len(runs_rl))

This resulted in 47.45. Again, you will most likely get something different.

Compared to taking random moves (Step 4), this is roughly a factor of 3,000 better. The small snippet below computes the comparison directly.
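
Assuming the runs and runs_rl lists from the snippets above are still in memory, the improvement factor can be computed like this.

# Compare the average number of steps of the two approaches.
avg_naive = sum(runs) / len(runs)
avg_rl = sum(runs_rl) / len(runs_rl)
print(avg_naive, avg_rl, avg_naive / avg_rl)  # the last number is the improvement factor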

Want more?

Want to learn more Python? Then this is part of an 8-hour FREE video course with full explanations, projects at each level, and guided solutions.

The course is structured with the following resources to improve your learning experience.

  • 17 video lessons teaching you everything you need to know to get started with Python.
  • 34 Jupyter Notebooks with lesson code and projects.
  • A FREE 70+ pages eBook with all the learnings from the lessons.

See the full FREE course page here.

If you instead want to learn more about Machine Learning, do not worry.

Then check out my Machine Learning with Python course.

  • 15 video lessons teaching you all aspects of Machine Learning
  • 30 Jupyter Notebooks with lesson code and projects
  • 10 hours FREE video content to support your learning journey.

Go to the course page for details.
