Reinforcement Learning Explained with Real Problem and Code from Scratch

What will we cover?

  • Understand how Reinforcement Learning works
  • Learn about Agent and Environment
  • How the Agent iterates and gets rewards based on its actions
  • How to continuously learn new things
  • Create your own Reinforcement Learning agent from scratch

Step 1: Reinforcement Learning simply explained

Reinforcement Learning

Reinforcement Learning is like training a dog. You and the dog speak different languages, which makes it difficult to explain to the dog what you want.

A common way to train a dog works just like Reinforcement Learning. When the dog does something good, it gets a reward. This teaches the dog that you want it to repeat that behaviour.

Said differently, relating it to the illustration above: the Agent is the dog. The dog is exposed to an Environment, which it observes as a State. Based on this State, the Agent (the dog) takes an Action. Depending on whether you (the owner) like the Action, you Reward the Agent.

The goal of the Agent is to collect the most Reward. This lets you, the owner, get the desired behaviour by adjusting the Reward according to the Actions.
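To make the loop concrete, here is a minimal sketch of the dog example in Python. The action names, the `owner_reward` function, and the `train_dog` helper are all made up for illustration, and the sketch ignores the State entirely to keep things short.

```python
import random

ACTIONS = ["sit", "bark"]  # hypothetical Actions the dog can take


def owner_reward(action):
    """Hypothetical owner: rewards sitting and gives nothing for barking."""
    return 1 if action == "sit" else 0


def train_dog(episodes=200, seed=0):
    rng = random.Random(seed)
    total_reward = {action: 0 for action in ACTIONS}
    for _ in range(episodes):
        action = rng.choice(ACTIONS)                  # the Agent takes an Action
        total_reward[action] += owner_reward(action)  # the owner gives a Reward
    # The behaviour the dog ends up preferring: the Action with the most Reward
    return max(total_reward, key=total_reward.get)


best = train_dog()
print(best)
```

Changing `owner_reward` changes which Action wins, which is the point made above: the owner shapes the behaviour through the Reward.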

Step 2: Markov Decision Process

A Markov Decision Process (MDP) is a model for decision-making that represents States (from the Environment), Actions (from the Agent), and Rewards.

Written a bit more mathematically:

  • S is the set of States
  • Actions(s) is the set of Actions when in state s
  • The transition model is P(s' | s, a), the probability of ending up in state s' after taking action a in state s
  • The Reward function is R(s, a, s')
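As a rough illustration, the four ingredients above could be written down in Python like this. The 5-cell line world, the function names, and the reward values are assumptions chosen for this sketch, not part of any library.

```python
# A hypothetical 5-cell line world: cell 0 is losing, cell 4 is winning.
STATES = [0, 1, 2, 3, 4]
LEFT, RIGHT = 0, 1


def actions(s):
    """Actions(s): nothing can be done once a terminal cell is reached."""
    return [] if s in (0, 4) else [LEFT, RIGHT]


def transition(s_next, s, a):
    """P(s' | s, a): moves are deterministic here, so this is 1.0 or 0.0."""
    target = s - 1 if a == LEFT else s + 1
    return 1.0 if s_next == target else 0.0


def reward(s, a, s_next):
    """R(s, a, s'): -1 for entering the losing cell, +1 for the winning cell."""
    return {0: -1, 4: 1}.get(s_next, 0)


print(actions(2), transition(3, 2, RIGHT), reward(3, RIGHT, 4))
```

In an environment with randomness (say, the Agent sometimes slips), `transition` would return values between 0 and 1 that sum to 1 over all s'.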

Step 3: Q-Learning

Q-learning is a model-free reinforcement learning algorithm to learn the value of an action in a particular state. It does not require a model of the environment (hence “model-free”), and it can handle problems with stochastic transitions and rewards without requiring adaptations. (wiki)

This can be modeled by a learning function Q(s, a), which estimates the value of performing action a when in state s.

It works as follows:

  • Start with Q(s, a) = 0 for all s, a
  • Update Q when we take an action

๐‘„(๐‘ ,๐‘Ž)=๐‘„(๐‘ ,๐‘Ž)+๐›ผ(Q(s,a)=Q(s,a)+ฮฑ(reward+๐›พmax(๐‘ โ€ฒ,๐‘Žโ€ฒ)โˆ’๐‘„(๐‘ ,๐‘Ž))=(1โˆ’๐›ผ)๐‘„(๐‘ ,๐‘Ž)+๐›ผ(+ฮณmax(sโ€ฒ,aโ€ฒ)โˆ’Q(s,a))=(1โˆ’ฮฑ)Q(s,a)+ฮฑ(reward+๐›พmax(๐‘ โ€ฒ,๐‘Žโ€ฒ))+ฮณmax(sโ€ฒ,aโ€ฒ))

The ε-Greedy Decision Making

The idea behind it is to either explore or exploit:

  • With probability ε take a random move
  • Otherwise, take the action a with maximum Q(s, a)
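The two bullets translate almost directly into a small function. This is only a sketch: the name `epsilon_greedy` and the choice to pass the Q-values of the current state as a plain list are assumptions made for this example.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Return an action index: random with probability epsilon, otherwise the best one."""
    if rng.random() < epsilon:
        # Explore: pick any action
        return rng.randrange(len(q_values))
    # Exploit: pick the action with the highest Q-value
    return max(range(len(q_values)), key=lambda a: q_values[a])


print(epsilon_greedy([0.1, 0.9], epsilon=0.0))  # always exploits, so this picks action 1
```

With `epsilon=0.0` the Agent never explores, and with `epsilon=1.0` it moves completely at random. Values in between trade off trying new things against using what has been learned.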

Let’s demonstrate it with code.

Step 3: Code Example

Assume we have the following Environment

Environment
  • You start at a random point.
  • You can either move left or right.
  • You lose if you hit a red box
  • You win if you hit the green box

Quite simple, but how can you program an Agent using Reinforcement Learning? And how can you do it from scratch?

A good way is to use an object representing the field (environment).

Field representing the Environment

import random

import numpy as np


class Field:
    def __init__(self):
        # -1 is a losing (red) box, 1 is the winning (green) box, 0 is an empty box
        self.states = [-1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0]
        # Start at a random position on the field
        self.state = random.randrange(0, len(self.states))

    def done(self):
        # The episode is over as soon as the Agent stands on a non-empty box
        return self.states[self.state] != 0

    # action: 0 => left
    # action: 1 => right
    def get_possible_actions(self):
        actions = [0, 1]
        # No moving left from the first box or right from the last box
        if self.state == 0:
            actions.remove(0)
        if self.state == len(self.states) - 1:
            actions.remove(1)
        return actions

    def update_next_state(self, action):
        # Trying to leave the field keeps the state and gives a reward of -10
        if action == 0:
            if self.state == 0:
                return self.state, -10
            self.state -= 1
        elif action == 1:
            if self.state == len(self.states) - 1:
                return self.state, -10
            self.state += 1

        # The reward is the value of the box the Agent lands on
        reward = self.states[self.state]
        return self.state, reward
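

field = Field()
# One row per state, one column per action (0 => left, 1 => right)
q_table = np.zeros((len(field.states), 2))

alpha = .5    # learning rate
epsilon = .5  # probability of taking a random (exploring) action
gamma = .5    # discount factor for future rewards

for _ in range(10000):
    # Each episode starts on a fresh field at a random position
    field = Field()
    while not field.done():
        actions = field.get_possible_actions()
        if random.uniform(0, 1) < epsilon:
            # Explore: take a random possible action
            action = random.choice(actions)
        else:
            # Exploit: take the action with the highest Q-value
            action = np.argmax(q_table[field.state])

        cur_state = field.state
        next_state, reward = field.update_next_state(action)

        # Q-learning update
        q_table[cur_state, action] = (1 - alpha)*q_table[cur_state, action] + alpha*(reward + gamma*np.max(q_table[next_state]))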
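Once training has finished, the Q-table can be read as a policy: in each state, take the action with the highest Q-value. The sketch below shows the idea on a small hand-written table. The values in `example_q` are hypothetical and not the result of running the code above, and `greedy_policy` is a helper name made up for this example.

```python
# Hypothetical Q-values for a 4-state field; columns are [left, right].
example_q = [
    [0.0, 0.0],
    [-0.5, 0.25],
    [0.1, 0.5],
    [0.0, 0.0],
]

ACTION_NAMES = {0: "left", 1: "right"}


def greedy_policy(q_table):
    """For each state, return the name of the action with the highest Q-value."""
    policy = []
    for q_values in q_table:
        best_action = max(range(len(q_values)), key=lambda a: q_values[a])
        policy.append(ACTION_NAMES[best_action])
    return policy


print(greedy_policy(example_q))
```

The same function should also work on the NumPy `q_table` from the training loop, since it only iterates over rows and compares values. Rows where both values are equal (for example terminal states that were never updated) default to "left".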

Step 4: A more complex Example

Check out the video to see a more complex example.

Want to learn more?

This is part of a FREE 10h Machine Learning course with Python.

  • 15 video lessons: explain Machine Learning concepts, demonstrate models on real data, introduce projects and show a solution (YouTube playlist).
  • 30 Jupyter Notebooks: the full code and explanations from the lectures and projects (GitHub).
  • 15 projects: step-by-step guides to help you structure your solutions, with the solutions explained at the end of the video lessons (GitHub).
