# Reinforcement Learning Explained with Real Problem and Code from Scratch

## What will we cover?

• Understand how Reinforcement Learning works
• Learn about Agent and Environment
• How the Agent iterates and gets Rewards based on its Actions
• How it continuously learns new things
• Create your own Reinforcement Learning model from scratch

## Step 1: Reinforcement Learning simply explained

Reinforcement Learning is like training a dog. You and the dog speak different languages, which makes it difficult to explain to the dog what you want.

A common way to train a dog works like Reinforcement Learning. When the dog does something good, it gets a reward. This teaches the dog that you want it to repeat that behaviour.

In Reinforcement Learning terms, the Agent is the dog. The dog observes the state of its Environment. Based on this state, the Agent (the dog) takes an Action. Depending on whether you (the owner) like the Action, you Reward the Agent.

The goal of the Agent is to collect the most Reward. This lets you, the owner, get the desired behaviour by adjusting the Reward according to the Actions.

## Step 2: Markov Decision Process

A Markov Decision Process is a model for decision-making that represents States (from the Environment), Actions (from the Agent), and Rewards.

Written a bit more mathematically:

• S is the set of States
• Actions(s) is the set of Actions when in state s
• The transition model is P(s′ | s, a), the probability of ending up in state s′ after taking action a in state s
• The Reward function is R(s, a, s′)
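
The four ingredients above can be written down as a small data structure. This is only a sketch: the `MDP` class, the `State`/`Action` aliases, and the three-cell `corridor` example are hypothetical names chosen for illustration, and the transitions are assumed to be deterministic.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

State = int
Action = str


@dataclass
class MDP:
    states: List[State]                                        # S
    actions: Callable[[State], List[Action]]                   # Actions(s)
    transition: Callable[[State, Action], Dict[State, float]]  # P(s' | s, a)
    reward: Callable[[State, Action, State], float]            # R(s, a, s')


# Toy corridor with three cells: 0 (lose), 1 (start), 2 (win).
def actions(s: State) -> List[Action]:
    # No actions are available in the two terminal cells.
    return [] if s in (0, 2) else ["left", "right"]


def transition(s: State, a: Action) -> Dict[State, float]:
    # Deterministic moves: the next state has probability 1.
    return {s - 1: 1.0} if a == "left" else {s + 1: 1.0}


def reward(s: State, a: Action, s_next: State) -> float:
    # Reward depends only on the cell we land in.
    return {0: -1.0, 2: 1.0}.get(s_next, 0.0)


corridor = MDP(states=[0, 1, 2], actions=actions,
               transition=transition, reward=reward)
```

A stochastic Environment would simply return several next states from `transition`, with probabilities that sum to 1.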

## Step 3: Q-Learning

Q-learning is a model-free reinforcement learning algorithm to learn the value of an action in a particular state. It does not require a model of the environment (hence “model-free”), and it can handle problems with stochastic transitions and rewards without requiring adaptations. (wiki)

This can be modeled by a learning function Q(s, a), which estimates the value of performing action a when in state s.

It works as follows:

• Start with Q(s, a) = 0 for all s, a
• Update Q(s, a) every time we take action a in state s and observe the reward and the next state s′

$$Q(s, a) \leftarrow Q(s, a) + \alpha\left(\text{reward} + \gamma \max_{a'} Q(s', a') - Q(s, a)\right) = (1 - \alpha)\,Q(s, a) + \alpha\left(\text{reward} + \gamma \max_{a'} Q(s', a')\right)$$
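
To make the update rule concrete, here is a minimal sketch of a single update on plain numbers. The function name `q_update` and its parameter names are assumptions made for this example; α = 0.5 and γ = 0.5 are just convenient values. It also checks that the two ways of writing the update give the same result.

```python
def q_update(q_sa, reward, max_q_next, alpha=0.5, gamma=0.5):
    """Return the new Q(s, a) after observing `reward` and the best next value."""
    # Form 1: old value plus a step towards the target.
    step_form = q_sa + alpha * (reward + gamma * max_q_next - q_sa)
    # Form 2: weighted average of the old value and the target.
    blend_form = (1 - alpha) * q_sa + alpha * (reward + gamma * max_q_next)
    assert abs(step_form - blend_form) < 1e-12  # the two forms agree
    return blend_form


first = q_update(q_sa=0.0, reward=1.0, max_q_next=0.0)
second = q_update(q_sa=first, reward=1.0, max_q_next=0.0)
print(first, second)  # 0.5 0.75
```

Each repeated update moves Q(s, a) halfway towards the target, which is why α is often called the learning rate.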

### The ϵ-Greedy Decision Making

The idea behind it is to either explore or exploit:

• With probability ϵ, take a random action (explore)
• Otherwise, take the action a with maximum Q(s, a) (exploit)

Let’s demonstrate it with code.
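
As a first, isolated sketch of the ϵ-greedy rule: the function `epsilon_greedy` and the list `q_values` (the Q-values of one state, indexed by action) are hypothetical names used only in this example.

```python
import random


def epsilon_greedy(q_values, epsilon):
    """Pick an action index for one state, given that state's Q-values."""
    if random.random() < epsilon:
        # Explore: any action, chosen uniformly at random.
        return random.randrange(len(q_values))
    # Exploit: the action with the highest Q-value (first one on ties).
    return max(range(len(q_values)), key=lambda a: q_values[a])


# With epsilon = 0 the choice is always greedy.
print(epsilon_greedy([0.1, 0.9], epsilon=0.0))  # 1
```

A larger ϵ means more exploration; with ϵ = 1 every action is random.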

## Step 3: Code Example

Assume we have the following Environment

• You start at a random point.
• You can either move left or right.
• You lose if you hit a red box
• You win if you hit the green box

Quite simple, but how can you program an Agent using Reinforcement Learning? And how can you do it from scratch?

A good way is to use an object representing the field (the Environment).

There are some background resources for the implementation, if needed.

#### What if there are more states?

    import random

    import numpy as np


    class Field:
        """One-dimensional field: -1 marks a losing box, 1 the winning box, 0 an empty box."""

        def __init__(self):
            self.states = [-1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0]
            # Start at a random position (this may already be a terminal box).
            self.state = random.randrange(0, len(self.states))

        def done(self):
            # The episode is over as soon as we stand on a non-empty box.
            return self.states[self.state] != 0

        # action: 0 => left
        # action: 1 => right
        def get_possible_actions(self):
            actions = [0, 1]
            if self.state == 0:
                actions.remove(0)
            if self.state == len(self.states) - 1:
                actions.remove(1)
            return actions

        def update_next_state(self, action):
            # Trying to walk off either end keeps the state and is punished with -10.
            if action == 0:
                if self.state == 0:
                    return self.state, -10
                self.state -= 1
            if action == 1:
                if self.state == len(self.states) - 1:
                    return self.state, -10
                self.state += 1

            reward = self.states[self.state]
            return self.state, reward


    field = Field()
    # One row per state, one column per action (0 => left, 1 => right).
    q_table = np.zeros((len(field.states), 2))

    alpha = .5    # learning rate
    epsilon = .5  # probability of a random (exploring) move
    gamma = .5    # discount factor

    for _ in range(10000):
        field = Field()
        while not field.done():
            actions = field.get_possible_actions()
            if random.uniform(0, 1) < epsilon:
                # Explore: random move among the possible actions.
                action = random.choice(actions)
            else:
                # Exploit: best known action for the current state.
                action = np.argmax(q_table[field.state])

            cur_state = field.state
            next_state, reward = field.update_next_state(action)

            # Q-learning update.
            q_table[cur_state, action] = (1 - alpha)*q_table[cur_state, action] + alpha*(reward + gamma*np.max(q_table[next_state]))
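
Once training has finished, one way to inspect the result is to read off the best action for every state. The helper below is a hypothetical sketch: `greedy_policy` and `ACTION_NAMES` are not part of the code above, and the Q-values are made-up illustrative numbers, not real training output. It only indexes each row, so it should also work when given the NumPy `q_table` directly.

```python
# Same action encoding as in the Field class: 0 => left, 1 => right.
ACTION_NAMES = {0: "left", 1: "right"}


def greedy_policy(q_rows):
    """Return the name of the highest-valued action for every state."""
    policy = []
    for row in q_rows:
        # Index of the largest Q-value in this row (first one on ties).
        best = max(range(len(row)), key=lambda a: row[a])
        policy.append(ACTION_NAMES[best])
    return policy


# Illustrative values only: four states, two actions each.
example_q = [
    [0.0, 0.0],    # terminal state, never updated
    [-1.0, 0.25],
    [0.1, 0.5],
    [0.0, 0.0],    # terminal state, never updated
]
print(greedy_policy(example_q))  # ['left', 'right', 'right', 'left']
```

Rows of terminal states stay at zero because no action is ever taken from them, so their entry in the policy carries no meaning.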


## Step 4: A more complex Example

Check out the video to see a more complex example.