What will we learn in this article?
The Q-Learning algorithm is a simple, easy-to-understand algorithm used in the Reinforcement Learning paradigm of Machine Learning. It can be implemented from scratch, and that is what we will do in this article.
After you go through this article, you will know what Reinforcement Learning is, which main types of algorithms are used, fully understand the Q-learning algorithm, and have implemented an awesome example from scratch in Python.
The steps towards that are:
- Learn what reinforcement learning is in machine learning.
- See which algorithms are most commonly used in reinforcement learning.
- Take a deep dive to understand the Q-learning algorithm.
- Implement a task that we want the Q-learning algorithm to learn – first we let random choices try to solve it (1540 steps on average).
- Then we implement the Q-learning algorithm from scratch and let it learn how to solve the task (22 steps).
Step 1: What is Reinforcement Learning?
Reinforcement learning teaches the machine to think for itself based on past action rewards.

Basically, a Reinforcement Learning algorithm tries to predict actions that give rewards and avoid punishment.
It is like training a dog. You and the dog do not speak the same language, but the dog learns how to act based on rewards (and punishment, which I do not advise or advocate).
Hence, if a dog is rewarded for a certain action in a given situation, then next time it is exposed to a similar situation it will act the same.
Translated to Reinforcement Learning:
- The agent is the dog that is exposed to the environment.
- Then the agent encounters a state.
- The agent performs an action to transition from that state to a new state.
- Then after the transition the agent receives a reward or penalty (punishment).
- This forms a policy: a strategy for choosing actions in a given state.
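To make these terms concrete, here is a tiny, self-contained sketch (a toy illustration of my own, not the task we build later): an agent walks on positions 0 to 4 and is rewarded for reaching position 4.
import random

# A toy agent-environment loop illustrating the terms above (not the task built later).
state = 0                                   # the agent starts in a state
done = False
while not done:
    action = random.choice([-1, +1])        # a (random) policy chooses an action
    state = max(0, min(4, state + action))  # the action transitions the agent to a new state
    reward = 1 if state == 4 else -1        # the environment gives a reward or a penalty
    done = state == 4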
Step 2: Which algorithms are used for Reinforcement Learning?
The most common algorithms for Reinforcement Learning are:
- Q-Learning: is a model-free reinforcement learning algorithm to learn a policy telling an agent what action to take under what circumstances.
- Temporal Difference: refers to a class of model-free reinforcement learning methods which learn by bootstrapping from the current estimate of the value function.
- Deep Adversarial Network: is a technique employed in the field of machine learning which attempts to fool models through malicious input.
We will focus on the Q-learning algorithm as it is easy to understand as well as powerful.
Step 3: Understand the Q-Learning algorithm
As already noted, I just love this algorithm. It is “easy” to understand and seems very powerful.

The Q-Learning algorithm has a Q-table (a matrix of dimension states x actions – don’t worry if you do not understand what a matrix is, you will not need the mathematical aspects of it – it is just an indexed “container” with numbers).
- The agent (or Q-Learning algorithm) will be in a state.
- Then, in each iteration, the agent needs to take an action.
- The agent continuously updates the values in the Q-table based on the rewards it receives.
- The learning can come from either exploiting or exploring.
This translates into the following pseudo algorithm for Q-Learning.
The agent is in a given state and needs to choose an action.
- Initialise the Q-table to all zeros
- Iterate:
- The agent is in the current state (state).
- With probability epsilon choose to explore, else exploit.
- If explore, then choose a random action.
- If exploit, then choose the best action based on the current Q-table.
- Update the Q-table entry for the previous state and action with the new reward:
- Q[state, action] = (1 - alpha) * Q[state, action] + alpha * (reward + gamma * max(Q[new_state]))
As you can see, we have introduced the following variables:
- epsilon: the probability to take a random action, which is done to explore new territory.
- alpha: the learning rate, which controls how big a step the algorithm takes in each update; it should be in the interval from 0 to 1.
- gamma: the discount factor used to balance immediate and future rewards. This value is usually between 0.8 and 0.99.
- reward: is the feedback on the action and can be any number. Negative is penalty (or punishment) and positive is a reward.
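As a small sketch of the update rule (assuming the Q-table is a NumPy array of shape states x actions – the full implementation follows in the later steps), it could be written as a standalone function:
import numpy as np

# A sketch of the update rule above; q_table is assumed to be a NumPy array of shape (states, actions).
def q_update(q_table, state, action, reward, new_state, alpha, gamma):
    # Blend the old estimate with the immediate reward plus the discounted
    # best value achievable from the new state.
    q_table[state, action] = (1 - alpha) * q_table[state, action] + \
        alpha * (reward + gamma * np.max(q_table[new_state]))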
Step 4: A task we want the Q-learning algorithm to master
We need to test and understand the above algorithm. So far, it is quite abstract. To do that, we will create a simple task and show how the Q-learning algorithm learns to solve it efficiently from rewards.
To keep it simple, we create a field of 10×10 positions. In that field there is an item that needs to be picked up and moved to a drop-off point.
At each position there are 6 different actions that can be taken.
- Action 0: Go south if on field.
- Action 1: Go north if on field.
- Action 2: Go east if on field.
- Action 3: Go west if on field.
- Action 4: Pick up the item (the agent can try even if the item is not there).
- Action 5: Drop off the item (the agent can try even if it does not have the item).
Based on these actions we will make a reward system:
- If the agent tries to go off the field, punish with -10 in reward.
- If the agent makes a (legal) move, punish with -1 in reward, as we do not want to encourage endless walking around.
- If the agent tries to pick up the item, but it is not there or the agent already has it, punish with -10 in reward.
- If the agent picks up the item at the correct place, reward with 20.
- If the agent tries to drop off the item without having it, punish with -10 in reward.
- If the agent drops off the item in the wrong place, punish with -20 in reward.
- If the agent drops off the item at the correct place, reward with 20.
That translates into the following code. I prefer to implement this code myself, as I think the standard libraries that provide similar frameworks hide some important details. For example, as shown later, how do you map this into a state in the Q-table?
class Field:
    def __init__(self, size, item_pickup, item_drop_off, start_position):
        self.size_x = size
        self.size_y = size
        self.item_in_car = False
        self.item_position = item_pickup
        self.item_drop_off = item_drop_off
        self.position = start_position

    def move_driver(self, action):
        (x, y) = self.position  # the driver's current position
        if action == 0:  # south
            if y == 0:
                return -10, False  # off the field
            else:
                self.position = (x, y - 1)
                return -1, False
        elif action == 1:  # north
            if y == self.size_y - 1:
                return -10, False  # off the field
            else:
                self.position = (x, y + 1)
                return -1, False
        elif action == 2:  # east
            if x == self.size_x - 1:
                return -10, False  # off the field
            else:
                self.position = (x + 1, y)
                return -1, False
        elif action == 3:  # west
            if x == 0:
                return -10, False  # off the field
            else:
                self.position = (x - 1, y)
                return -1, False
        elif action == 4:  # pick up the item
            if self.item_in_car:
                return -10, False  # already carrying it
            elif self.item_position != (x, y):
                return -10, False  # the item is not here
            else:
                self.item_in_car = True
                return 20, False
        elif action == 5:  # drop off the item
            if not self.item_in_car:
                return -10, False  # nothing to drop off
            elif self.item_drop_off != (x, y):
                self.item_in_car = False
                self.item_position = (x, y)  # the item is dropped in the wrong place
                return -20, False
            else:
                return 20, True  # task solved
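Before letting the agent loose, a quick sanity check can be reassuring. The short sketch below uses the same 10×10 setup as in the next code block and tries one illegal and one legal move.
# Quick sanity check of the Field class (same 10x10 setup as used below).
field = Field(10, (0, 0), (9, 9), (9, 0))
print(field.move_driver(0))  # going south from y=0 falls off the field: (-10, False)
print(field.move_driver(1))  # going north is a legal move: (-1, False)
print(field.position)        # the driver is now at (9, 1)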
If we let the agent just take random actions, how long will it take before it succeeds (is done)? Let us try that out.
import random

size = 10
item_start = (0, 0)
item_drop_off = (9, 9)
start_position = (9, 0)

field = Field(size, item_start, item_drop_off, start_position)
done = False
steps = 0
while not done:
    action = random.randrange(0, 6)
    reward, done = field.move_driver(action)
    steps += 1
print(steps)
A single run of that resulted in 2756 steps. That seems inefficient. I ran it 1000 times to find an average, which came out to 1540 steps.
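The averaging can be done with a loop along these lines (a sketch; the exact numbers will of course vary from run to run):
# Repeat the random run 1000 times and compute the average number of steps.
total_steps = 0
runs = 1000
for _ in range(runs):
    field = Field(size, item_start, item_drop_off, start_position)
    done = False
    steps = 0
    while not done:
        action = random.randrange(0, 6)
        reward, done = field.move_driver(action)
        steps += 1
    total_steps += steps
print(total_steps / runs)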
Step 5: How the Q-learning algorithm can improve that
There is a learning phase where the Q-table is updated iteratively. But before that, we need to add two helper functions to our Field class.
- We need to be able to map the current state of the field to an index in the Q-table.
- Further, we need to get the total number of states, which we need to know when we initialise the Q-table.
import numpy as np
import random

class Field:
    def __init__(self, size, item_pickup, item_drop_off, start_position):
        self.size_x = size
        self.size_y = size
        self.item_in_car = False
        self.item_position = item_pickup
        self.item_drop_off = item_drop_off
        self.position = start_position

    def get_number_of_states(self):
        # item position x driver position x whether the item is in the car
        return self.size_x * self.size_y * self.size_x * self.size_y * 2

    def get_state(self):
        # Encode (item position, driver position, item_in_car) as one unique index.
        state = self.item_position[0] * (self.size_y * self.size_x * self.size_y * 2)
        state += self.item_position[1] * (self.size_x * self.size_y * 2)
        state += self.position[0] * (self.size_y * 2)
        state += self.position[1] * 2
        if self.item_in_car:
            state += 1
        return state

    def move_driver(self, action):
        (x, y) = self.position  # the driver's current position
        if action == 0:  # south
            if y == 0:
                return -10, False  # off the field
            else:
                self.position = (x, y - 1)
                return -1, False
        elif action == 1:  # north
            if y == self.size_y - 1:
                return -10, False  # off the field
            else:
                self.position = (x, y + 1)
                return -1, False
        elif action == 2:  # east
            if x == self.size_x - 1:
                return -10, False  # off the field
            else:
                self.position = (x + 1, y)
                return -1, False
        elif action == 3:  # west
            if x == 0:
                return -10, False  # off the field
            else:
                self.position = (x - 1, y)
                return -1, False
        elif action == 4:  # pick up the item
            if self.item_in_car:
                return -10, False  # already carrying it
            elif self.item_position != (x, y):
                return -10, False  # the item is not here
            else:
                self.item_in_car = True
                return 20, False
        elif action == 5:  # drop off the item
            if not self.item_in_car:
                return -10, False  # nothing to drop off
            elif self.item_drop_off != (x, y):
                self.item_in_car = False
                self.item_position = (x, y)  # the item is dropped in the wrong place
                return -20, False
            else:
                return 20, True  # task solved
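The encoding in get_state is essentially a mixed-radix index over the item position, the driver position and the item_in_car flag. A quick check for the 10×10 setup used in this article:
# Quick check of the state encoding for the 10x10 setup.
field = Field(10, (0, 0), (9, 9), (9, 0))
print(field.get_number_of_states())  # 10 * 10 * 10 * 10 * 2 = 20000 possible states
print(field.get_state())             # 9 * (10 * 2) = 180 for this starting configuration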
Then we can train our Q-table by iterating over the task 1000 times (just an arbitrary number I chose). As you can see, it simply runs the task again and again, updating the Q-table with the “learnings” based on the rewards.
field = Field(size, item_start, item_drop_off, start_position)  # recreate the field with the updated class
states = field.get_number_of_states()
actions = 6

q_table = np.zeros((states, actions))

alpha = 0.1
gamma = 0.6
epsilon = 0.1

for i in range(1000):
    field = Field(size, item_start, item_drop_off, start_position)
    done = False
    steps = 0
    while not done:
        state = field.get_state()
        if random.uniform(0, 1) < epsilon:
            action = random.randrange(0, 6)  # explore
        else:
            action = np.argmax(q_table[state])  # exploit
        reward, done = field.move_driver(action)
        next_state = field.get_state()
        old_value = q_table[state, action]
        next_max = np.max(q_table[next_state])
        new_value = (1 - alpha) * old_value + alpha * (reward + gamma * next_max)
        q_table[state, action] = new_value
        steps += 1
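As an optional check, we can peek at what the agent has learned, for example the greedy action in the starting state:
# Optional: inspect the learned behaviour - the greedy action (0-5) in the starting state.
field = Field(size, item_start, item_drop_off, start_position)
print(np.argmax(q_table[field.get_state()]))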
After that, our Q-table is updated and we can use it. To test it, we will run the same code again, just with the updated Q-table.
alpha = 0.1
gamma = 0.6
epsilon = 0.1

field = Field(size, item_start, item_drop_off, start_position)
done = False
steps = 0
while not done:
    state = field.get_state()
    if random.uniform(0, 1) < epsilon:
        action = random.randrange(0, 6)  # still explore with probability epsilon
    else:
        action = np.argmax(q_table[state])  # exploit the learned Q-table
    reward, done = field.move_driver(action)
    next_state = field.get_state()
    old_value = q_table[state, action]
    next_max = np.max(q_table[next_state])
    new_value = (1 - alpha) * old_value + alpha * (reward + gamma * next_max)
    q_table[state, action] = new_value  # keep updating the Q-table while using it
    steps += 1
print(steps)
This resulted in 22 steps. That is awesome.
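For a fairer comparison with the 1540-step average from the random agent, the evaluation above can be wrapped in the same kind of averaging loop (a sketch; here the Q-table is not updated further, and the exact average will depend on the training run):
# Average the Q-table-driven run over many episodes, mirroring the random baseline.
total_steps = 0
runs = 1000
for _ in range(runs):
    field = Field(size, item_start, item_drop_off, start_position)
    done = False
    steps = 0
    while not done:
        state = field.get_state()
        if random.uniform(0, 1) < epsilon:
            action = random.randrange(0, 6)
        else:
            action = np.argmax(q_table[state])
        reward, done = field.move_driver(action)
        steps += 1
    total_steps += steps
print(total_steps / runs)
The result can be compared directly with the random baseline from Step 4.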