Simple Machine Learning Trading Bot in Python – Evaluating how it Performs

What will we cover in this tutorial?

  • How to create a machine learning trading bot in Python.
  • How to build a simple Reinforcement Learning trading bot.
  • The idea behind the Reinforcement Learning trading bot.
  • Evaluating how the trading bot performs.

Machine Learning and Trading?

First things first: a Machine Learning trading bot? Machine Learning can be used for various things when it comes to trading.

Well, it is good to set our expectations. This tutorial is experimental and does not claim to produce a bullet-proof Machine Learning trading bot that will make you rich. I strongly advise you not to use it for automated trading.

This tutorial is only intended to test and learn about how a Reinforcement Learning strategy can be used to build a Machine Learning Trading Bot.

Step 1: The idea behind the Reinforcement Learning strategy

I wanted to test how a Reinforcement Learning algorithm would do in the market.

First let us understand what Reinforcement Learning is. Reinforcement learning teaches the machine to think for itself based on past action rewards.

Reinforcement Learning (in Machine Learning) teaches the machine to think based on past action rewards.

It is like training a dog. You and the dog do not speak the same language, but the dog learns how to act based on rewards (and punishment, which I do not advise or advocate).

Hence, if a dog is rewarded for a certain action in a given situation, then next time it is exposed to a similar situation it will act the same. 

Translate that to Reinforcement Learning. 

  • The agent is the dog that is exposed to the environment.
  • Then the agent encounters a state.
  • The agent performs an action to transition from that state to a new state.
  • Then after the transition the agent receives a reward or penalty (punishment).
  • This forms a policy to create a strategy to choose actions in a given state.

That seems like it could fit well with trading, or does it? That is what I want to investigate.

Step 2: The idea behind how to use Reinforcement Learning in Trading

The environment in trading can be translated into rewards and penalties (punishment). You win or lose on the stock market, right?

But we also want to keep the environment simple for the bot and not make it too complex. Hence, in this experiment, the bot only knows 1 stock and has to decide whether to buy, keep or sell.

Said differently.

  • The trading bot (agent) is exposed to the stock history (environment).
  • Then the trading bot (agent) encounters the new stock price (state).
  • The trading bot (agent) then performs a choice to keep, sell or buy (action), which brings it to a new state.
  • Then the trading bot (agent) receives a reward based on the value difference from day to day.

The reward will often only appear some time after the action was taken; hence, the feedback from future steps should be weighted highly. Or at least, that is my expectation.
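
To get a feel for why, here is a small, hypothetical illustration of how a high discount factor (the gamma introduced in the next step) keeps a delayed profit relevant for today's decision; the numbers are made up.

gamma = 0.95
daily_rewards = [0, 0, 0, 0, 50]  # the profit is only realised on day 5
discounted = sum(gamma**t * r for t, r in enumerate(daily_rewards))
print(round(discounted, 1))  # 40.7 - the delayed profit still counts heavily today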

Step 3: Understand Q-learning as the Reinforcement Learning model

The Q-learning model is easy to understand and has the potential to be very powerful. Of course, it is never better than its design. But before we can design it, we need to understand the mechanism behind it.

Q-Learning algorithm (Reinforcement / Machine Learning) – exploit or explore – update Q-table

The Q-Learning algorithm has a Q-table (a matrix of dimension states x actions – don’t worry if you do not understand what a matrix is, you will not need the mathematical aspects of it – it is just an indexed “container” with numbers).
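
In code, the Q-table is nothing more than a 2-dimensional NumPy array indexed by (state, action). A tiny illustration (the number of states is made up; the two actions match what our trading bot will have):

import numpy as np

states = 8   # made-up number of states for this illustration
actions = 2  # our trading bot will only have two actions
q_table = np.zeros((states, actions))
print(q_table.shape)  # (8, 2)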

  • The agent (or Q-Learning algorithm) will be in a state.
  • Then in each iteration the agent needs to take an action.
  • The agent will continuously update the reward in the Q-table.
  • The learning can come from either exploiting or exploring.

This translates into the following pseudo algorithm for the Q-Learning. 

The agent is in a given state and needs to choose an action.

  • Initialise the Q-table to all zeros
  • Iterate:
    • Agent is in state state.
    • With probability epsilon choose to explore, else exploit.
      • If explore, then choose a random action.
      • If exploit, then choose the best action based on the current Q-table.
    • Update the Q-table from the new reward to the previous state.
      • Q[state, action] = (1 - alpha) * Q[state, action] + alpha * (reward + gamma * max(Q[new_state]))

As you can see, we have introduced the following variables.

  • epsilon: the probability to take a random action, which is done to explore new territory.
  • alpha: the learning rate, i.e. how much the algorithm adjusts its estimate in each iteration; it should be in the interval from 0 to 1.
  • gamma: the discount factor used to balance the immediate and future reward. This value is usually between 0.8 and 0.99.
  • reward: the feedback on the action; it can be any number. Negative is a penalty (or punishment) and positive is a reward.
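
To make the update rule concrete, here is a tiny numeric example with made-up values.

alpha, gamma = 0.5, 0.9
q_state_action = 2.0    # current Q[state, action]
reward = 1.0            # reward observed after taking the action
max_q_new_state = 4.0   # max(Q[new_state])
q_state_action = (1 - alpha) * q_state_action + alpha * (reward + gamma * max_q_new_state)
print(round(q_state_action, 2))  # 0.5*2.0 + 0.5*(1.0 + 0.9*4.0) = 3.3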

Step 4: The choices we need to take

Based on that, we need to decide how the algorithm should map the stock information to a state. We want the model to be fairly simple and not have too many states, as it would take a long time to populate the Q-table with data.

There are many parameters to choose from here. While we do not want to tell the algorithm what to do, we still need to feed it the data we find relevant.

In this case I used the following features.

  • The volatility of the share.
  • The percentage change of the daily short mean (the average over the last 20 days).
  • The percentage change of the daily long mean (the average over the last 100 days).
  • The daily long mean itself, i.e. the average over the last 100 days.
  • The volume of shares traded that day.

These values need to be calculated for the share we use. That can be done by the following code.

import datetime as dt
import os
import pickle
import random

import numpy as np
import pandas as pd
import pandas_datareader as pdr
from dateutil.relativedelta import relativedelta


VALUE = 'Adj Close'
ID = 'id'
NAME = 'name'
DATA = 'data'

STATES_DIM = 10    # number of bins per feature - choose what you find appropriate
short_window = 20  # days in the short rolling mean
long_window = 100  # days in the long rolling mean


def get_data(name, years_ago):
    start = dt.datetime.now() - relativedelta(years=years_ago)
    end = dt.datetime.now()
    df = pdr.get_data_yahoo(name, start, end)
    return df


def process():
    stock = {ID: 'AAPL', NAME: 'AAPL'}

    stock[DATA] = get_data(stock[ID], 20)

    # Update it with all the values we need for the state
    stock[DATA]['Short Mean'] = stock[DATA][VALUE].rolling(window=short_window).mean()
    stock[DATA]['Long Mean'] = stock[DATA][VALUE].rolling(window=long_window).mean()

    stock[DATA]['Daily Change'] = stock[DATA][VALUE].pct_change()
    stock[DATA]['Daily Short Change'] = stock[DATA]['Short Mean'].pct_change()
    stock[DATA]['Daily Long Change'] = stock[DATA]['Long Mean'].pct_change()
    stock[DATA]['Volatility'] = stock[DATA]['Daily Change'].rolling(75).std()*np.sqrt(75)

As you probably notice, this creates a challenge. You need to put the values into bins, that is, a fixed number of “boxes” for them to fit in.

def process():
    #...
    # Let's put data in bins
    stock[DATA]['Vla bin'] = pd.cut(stock[DATA]['Volatility'], bins=STATES_DIM, labels=False)
    stock[DATA]['Srt ch bin'] = pd.cut(stock[DATA]['Daily Short Change'], bins=STATES_DIM, labels=False)
    stock[DATA]['Lng ch bin'] = pd.cut(stock[DATA]['Daily Long Change'], bins=STATES_DIM, labels=False)
    # stock[DATA]['Srt mn bin'] = pd.cut(stock[DATA]['Short Mean'], bins=DIM, labels=False)
    stock[DATA]['Lng mn bin'] = pd.cut(stock[DATA]['Long Mean'], bins=STATES_DIM, labels=False)
    stock[DATA]['Vol bin'] = pd.cut(stock[DATA]['Volume'], bins=STATES_DIM, labels=False)

This quantizes each of the 5 dimensions into STATES_DIM bins, where STATES_DIM is a constant you can set to whatever you find appropriate.
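
If you have not used pd.cut before, here is a quick standalone illustration of what labels=False returns (made-up values and only 4 bins):

import pandas as pd

values = pd.Series([0.01, 0.05, 0.20, 0.90])
print(pd.cut(values, bins=4, labels=False).tolist())  # [0, 0, 0, 3]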

Step 5: How to model it

This can be done by creating an environment that will play the role of your trading account.

class Account:
    def __init__(self, cash=1000000, brokerage=0.001):
        self.cash = cash
        self.brokerage = brokerage
        self.stocks = 0
        self.stock_id = None
        self.has_stocks = False

    def get_value(self, row):
        if self.has_stocks:
            return self.cash + row[VALUE] * self.stocks
        else:
            return self.cash

    def buy_stock(self, stock_id, row):
        if self.has_stocks:
            return
        self.stock_id = stock_id
        self.stocks = int(self.cash // (row[VALUE]*(1.0 + self.brokerage)))
        self.cash -= self.stocks*row[VALUE]*(1.0 + self.brokerage)
        self.has_stocks = True
        self.print_status(row, "Buy")

    def sell_stock(self, row):
        if not self.has_stocks:
            return
        self.print_status(row, "Sell")
        self.cash += self.stocks * (row[VALUE]*(1.0 - self.brokerage))
        self.stock_id = None
        self.stocks = 0
        self.has_stocks = False

    def print_status(self, row, title="Status"):
        if self.has_stocks:
            print(title, self.stock_id, "TOTAL:", self.cash + self.stocks*float(row[VALUE]))
            print(" - ", row.name, "price", row[VALUE])
            print(" - ", "Short", row['Daily Short Change'])
            print(" - ", "Long", row['Daily Long Change'])
        else:
            print(title, "TOTAL", self.cash)

Then we iterate over a period of time where the trading bot can decide what to do each day.

def process():
    # Now let's prepare our model
    q_learning = QModel()
    account = Account()

    state = None
    reward = 0.0
    action = 0
    last_value = 0.0
    for index, row in stock[DATA].iterrows():
        if state is not None:
            # The reward is the immediate return
            reward = account.get_value(row) - last_value
            # You update the day after the action, when you know the results of your actions
            q_learning.update_reward(row, account.has_stocks, action, state, reward)
        action, state = q_learning.get_action(row, account.has_stocks)

        if action == 0:
            pass
        elif action == 1:
            if account.has_stocks:
                account.sell_stock(row)
            else:
                account.buy_stock(stock[ID], row)
        last_value = account.get_value(row)
    account.print_status(row)
    q_learning.save_pickle()
    return last_value

This code simply executes whatever the trading bot decides. Notice that there are only two actions: action 0 does nothing, while action 1 toggles the position, i.e. it sells if the account holds the stock and buys if it does not.

Step 6: The Q-learning model

Now to the core of it all: the actual trading bot, which knows nothing about trading. But can we train it to earn money on trading, and how much? We will see that later.

class QModel:
    def __init__(self, alpha=0.5, gamma=0.7, epsilon=0.1):
        self.alpha = alpha
        self.gamma = gamma
        self.epsilon = epsilon

        self.states_per_dim = STATES_DIM
        self.dim = 5
        # 5 binned features, times 2 for whether we currently hold the stock
        self.states = (self.states_per_dim ** self.dim) * 2
        self.actions = 2
        self.pickle = "q_model7.pickle"
        self.q_table = np.zeros((self.states, self.actions))
        if os.path.isfile(self.pickle):
            print("Loading pickle")
            with open(self.pickle, "rb") as f:
                self.q_table = pickle.load(f)

    def save_pickle(self):
        with open(self.pickle, "wb") as f:
            pickle.dump(self.q_table, f)

    def get_state(self, row, has_stock):
        # Collect the 5 bin indices; rows where a rolling window is not yet
        # filled have NaN bins, which we map to bin 0
        dim = []
        for column in ['Vla bin', 'Srt ch bin', 'Lng ch bin', 'Lng mn bin', 'Vol bin']:
            value = row[column]
            dim.append(0 if pd.isna(value) else int(value))
        # Combine the 5 bins and the holding flag into a single state index
        dimension = 0
        if has_stock:
            dimension = 1 * (self.states_per_dim ** self.dim)
        dimension += dim[4] * (self.states_per_dim ** 4)
        dimension += dim[3] * (self.states_per_dim ** 3)
        dimension += dim[2] * (self.states_per_dim ** 2)
        dimension += dim[1] * (self.states_per_dim ** 1)
        dimension += dim[0]
        return dimension

    def get_action(self, row, has_stock):
        state = self.get_state(row, has_stock)

        # Explore with probability epsilon, otherwise exploit the Q-table
        if random.uniform(0, 1) < self.epsilon:
            action = random.randrange(0, self.actions)
        else:
            action = np.argmax(self.q_table[state])
        return action, state

    def update_reward(self, row, has_stock, last_action, last_state, reward):
        next_state = self.get_state(row, has_stock)

        old_value = self.q_table[last_state, last_action]
        next_max = np.max(self.q_table[next_state])

        new_value = (1 - self.alpha) * old_value + self.alpha * (reward + self.gamma * next_max)
        self.q_table[last_state, last_action] = new_value

Now we have the full code to try it out (the full code is at the end of the tutorial).

Step 7: Training the model

Now we need to train the model.

For that purpose, I have made a list of 134 stocks that I used and placed them in a CSV file.

Then the training is simply to read one of the 134 stocks with 10 years of historical data, pick a random 1-year window, and run the algorithm on it.

Then repeat.

if __name__ == "__main__":
    # source: http://www.nasdaqomxnordic.com/shares/listed-companies/copenhagen
    csv_stock_file = 'DK-Stocks.csv'

    while True:
        iterations = 1000
        for i in range(iterations):
            # Go at most 9 years back, as we only have 10 years available and need 1 year of data
            days_back = random.randrange(0, 9*365)
            process(csv_stock_file)

Then let it run and run and run and run again.

Step 8: Testing the algorithm

Of course, the testing should be done on unknown data, that is, a stock it does not know. You also cannot re-run it on the same stock, as the bot will keep learning from it (unless you do not save the Q-table state between runs).

Hence, I chose a well-performing stock to see how the bot would do, and whether it could beat the buy-on-the-first-day-and-sell-on-the-last-day strategy.

The results of the trading bot on the Apple stock were as follows.

A 1,000,000$ investment managed by the trading bot ended at approximately 1,344,500$, a return of roughly 34% for one year. Let us compare that with the stock price itself.

The stock price was 201.55$ on July 1st, 2019 and 362.09$ on June 30th, 2020. This would give the following end value (the 0.10% brokerage is included, as the trading bot pays that on each buy and sell).

  • 1,792,847$

That does not look that good. It means that a simple strategy of buying on day one and selling on the last day would return more than the bot.

Of course, you cannot conclude that it is impossible to do better on other stocks, but for this case the result was not impressive.
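
As a rough sanity check of that buy-and-hold figure, the calculation with the quoted prices and the same 0.10% brokerage on both the buy and the sell looks roughly like this.

cash = 1_000_000
brokerage = 0.001
buy_price, sell_price = 201.55, 362.09  # the prices quoted above

shares = int(cash // (buy_price * (1 + brokerage)))
cash -= shares * buy_price * (1 + brokerage)
cash += shares * sell_price * (1 - brokerage)
print(round(cash))  # roughly 1.79 million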

Variations and next step

There are many variables to adjust; I especially think I set gamma too low. There are also other parameters that could be used to build the state. You could remove some that might just add noise, and add others that are more relevant. Also, the number of bins can be adjusted. That the bins are made independently of each other might also be a problem.

Also read the tutorial on reinforcement learning.

5 Steps to Master the Reinforcement Learning with a Q-Learning Python Example

What will we learn in this article?

The Q-Learning algorithm is a nice and easy-to-understand algorithm used within the Reinforcement Learning paradigm in Machine Learning. It can be implemented from scratch, and we will do that in this article.

After you go through this article you will know what Reinforcement Learning is, the main types of algorithms used, fully understand the Q-learning algorithm, and have implemented an awesome example from scratch in Python.

The steps towards that are.

  • Learn and understand what reinforcement learning in machine learning is.
  • What are the main algorithms in reinforcement learning?
  • Deep dive to understand the Q-learning algorithm.
  • Implement a task that we want the Q-learning algorithm to learn – first we let random choices try to solve it (1540 steps on average).
  • Then we implement the Q-learning algorithm from scratch and let it learn how to solve the task (22 steps).

Step 1: What is Reinforcement Learning?

Reinforcement learning teaches the machine to think for itself based on past action rewards.

Reinforcement Learning (in Machine Learning) teaches the machine to think based on past action rewards.

Basically, the Reinforcement Learning algorithm tries to predict actions that give rewards and avoid punishment.

It is like training a dog. You and the dog do not speak the same language, but the dog learns how to act based on rewards (and punishment, which I do not advise or advocate).

Hence, if a dog is rewarded for a certain action in a given situation, then next time it is exposed to a similar situation it will act the same.

Translate that to Reinforcement Learning.

  • The agent is the dog that is exposed to the environment.
  • Then the agent encounters a state.
  • The agent performs an action to transition from that state to a new state.
  • Then after the transition the agent receives a reward or penalty (punishment).
  • This forms a policy to create a strategy to choose actions in a given state.

Step 2: What are the algorithms used for Reinforcement Learning?

There are a handful of commonly used algorithms for Reinforcement Learning.

We will focus on the Q-learning algorithm as it is easy to understand as well as powerful.

Step 3: Understand the Q-Learning algorithm

As already noted, I just love this algorithm. It is “easy” to understand and seems very powerful.

Q-Learning algorithm (Reinforcement / Machine Learning) – exploit or explore – update Q-table

The Q-Learning algorithm has a Q-table (a matrix of dimension states x actions – don’t worry if you do not understand what a matrix is, you will not need the mathematical aspects of it – it is just an indexed “container” with numbers).

  • The agent (or Q-Learning algorithm) will be in a state.
  • Then in each iteration the agent needs to take an action.
  • The agent will continuously update the reward in the Q-table.
  • The learning can come from either exploiting or exploring.

This translates into the following pseudo algorithm for the Q-Learning.

The agent is in a given state and needs to choose an action.

  • Initialise the Q-table to all zeros
  • Iterate:
    • Agent is in state state.
    • With probability epsilon choose to explore, else exploit.
      • If explore, then choose a random action.
      • If exploit, then choose the best action based on the current Q-table.
    • Update the Q-table from the new reward to the previous state.
      • Q[state, action] = (1 - alpha) * Q[state, action] + alpha * (reward + gamma * max(Q[new_state]))

As you can see, we have introduced the following variables.

  • epsilon: the probability to take a random action, which is done to explore new territory.
  • alpha: the learning rate, i.e. how much the algorithm adjusts its estimate in each iteration; it should be in the interval from 0 to 1.
  • gamma: the discount factor used to balance the immediate and future reward. This value is usually between 0.8 and 0.99.
  • reward: the feedback on the action; it can be any number. Negative is a penalty (or punishment) and positive is a reward.

Step 4: A task we want the Q-learning algorithm to master

We need to test and understand the above algorithm. So far, it is quite abstract. To do that we will create a simple task and show how the Q-learning algorithm learns to solve it efficiently through rewards.

To keep it simple, we create a field of size 10×10 positions. In that field there is an item that needs to be picked up and moved to a drop-off point.

At each position there are 6 different actions that can be taken.

  • Action 0: Go south if on field.
  • Action 1: Go north if on field.
  • Action 2: Go east if on field.
  • Action 3: Go west if on field.
  • Action 4: Pickup item (it can try even if it is not there)
  • Action 5: Drop-off item (it can try even if it does not have it)

Based on these actions we will make a reward system.

  • If the agent tries to go off the field, punish with -10 in reward.
  • If the agent makes a (legal) move, punish with -1 in reward, as we do not want to encourage endless walking around.
  • If the agent tries to pick up the item, but it is not there or it already has it, punish with -10.
  • If the agent picks up the item in the correct place, reward with 20.
  • If the agent tries to drop off the item without having it, punish with -10; if it drops it off in the wrong place, punish with -20.
  • If the agent drops off the item in the correct place, reward with 20.

That translates into the following code. I prefer to implement this code myself, as I think the standard libraries that provide similar frameworks hide some important details. As an example, and shown later, how do you map this into a state in the Q-table?

class Field:
    def __init__(self, size, item_pickup, item_drop_off, start_position):
        self.size_x = size
        self.size_y = size
        self.item_in_car = False
        self.item_position = item_pickup
        self.item_drop_off = item_drop_off
        self.position = start_position

    def move_driver(self, action):
        # Returns (reward, done) for the chosen action
        (x, y) = self.position
        if action == 0: # south
            if y == 0:
                return -10, False
            else:
                self.position = (x, y-1)
                return -1, False
        elif action == 1: # north
            if y == self.size_y - 1:
                return -10, False
            else:
                self.position = (x, y+1)
                return -1, False
        elif action == 2: # east
            if x == self.size_x - 1:
                return -10, False
            else:
                self.position = (x+1, y)
                return -1, False
        elif action == 3: # west
            if x == 0:
                return -10, False
            else:
                self.position = (x-1, y)
                return -1, False
        elif action == 4: # pickup
            if self.item_in_car:
                return -10, False
            elif self.item_position != (x, y):
                return -10, False
            else:
                self.item_in_car = True
                return 20, False
        elif action == 5: # drop-off
            if not self.item_in_car:
                return -10, False
            elif self.item_drop_off != (x, y):
                self.item_in_car = False
                self.item_position = (x, y)
                return -20, False
            else:
                return 20, True

If you let the agent just take random actions, how long will it take before it succeeds (before it is done)? Let us try that out.

import random


size = 10
item_start = (0, 0)
item_drop_off = (9, 9)
start_position = (9, 0)

field = Field(size, item_start, item_drop_off, start_position)
done = False
steps = 0
while not done:
    action = random.randrange(0, 6)
    reward, done = field.move_driver(action)
    steps += 1
print(steps)

A single run of that resulted in 2756 steps. That seems inefficient. I ran it 1000 times to find an average, which turned out to be 1540 steps.
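
For reference, here is a minimal sketch of how such an average can be measured, reusing the Field class and the variables defined above.

import random

runs = 1000
total_steps = 0
for _ in range(runs):
    field = Field(size, item_start, item_drop_off, start_position)
    done = False
    steps = 0
    while not done:
        reward, done = field.move_driver(random.randrange(0, 6))
        steps += 1
    total_steps += steps
print(total_steps / runs)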

Step 5: How the Q-learning algorithm can improve that

There is a learning phase where the Q-table is updated iteratively. But before that, we need to add two helper functions to our Field class.

  • We need to be able to map the current state of the Field to an index in the Q-table.
  • Further, we need to get the number of states needed in the Q-table, which we need to know when we initialise the Q-table.

import numpy as np
import random


class Field:
    def __init__(self, size, item_pickup, item_drop_off, start_position):
        self.size_x = size
        self.size_y = size
        self.item_in_car = False
        self.item_position = item_pickup
        self.item_drop_off = item_drop_off
        self.position = start_position

    def get_number_of_states(self):
        # item position (x, y) * driver position (x, y) * whether the item is in the car
        return self.size_x*self.size_y*self.size_x*self.size_y*2

    def get_state(self):
        # Encode (item position, driver position, item in car) as one index
        state = self.item_position[0]*(self.size_y*self.size_x*self.size_y*2)
        state += self.item_position[1]*(self.size_x*self.size_y*2)
        state += self.position[0] * (self.size_y * 2)
        state += self.position[1] * (2)
        if self.item_in_car:
            state += 1
        return state

    def move_driver(self, action):
        # Returns (reward, done) for the chosen action
        (x, y) = self.position
        if action == 0: # south
            if y == 0:
                return -10, False
            else:
                self.position = (x, y-1)
                return -1, False
        elif action == 1: # north
            if y == self.size_y - 1:
                return -10, False
            else:
                self.position = (x, y+1)
                return -1, False
        elif action == 2: # east
            if x == self.size_x - 1:
                return -10, False
            else:
                self.position = (x+1, y)
                return -1, False
        elif action == 3: # west
            if x == 0:
                return -10, False
            else:
                self.position = (x-1, y)
                return -1, False
        elif action == 4: # pickup
            if self.item_in_car:
                return -10, False
            elif self.item_position != (x, y):
                return -10, False
            else:
                self.item_in_car = True
                return 20, False
        elif action == 5: # drop-off
            if not self.item_in_car:
                return -10, False
            elif self.item_drop_off != (x, y):
                self.item_in_car = False
                self.item_position = (x, y)
                return -20, False
            else:
                return 20, True

Then we can generate our Q-table by iterating over the task 1000 times (that is just an arbitrary number I chose). As you see, it simply runs over the task again and again, but updates the Q-table with the “learnings” based on the rewards.

states = field.get_number_of_states()
actions = 6

q_table = np.zeros((states, actions))

alpha = 0.1
gamma = 0.6
epsilon = 0.1

for i in range(1000):
    field = Field(size, item_start, item_drop_off, start_position)
    done = False
    steps = 0
    while not done:
        state = field.get_state()
        if random.uniform(0, 1) < epsilon:
            action = random.randrange(0, 6)
        else:
            action = np.argmax(q_table[state])

        reward, done = field.move_driver(action)
        next_state = field.get_state()

        old_value = q_table[state, action]
        next_max = np.max(q_table[next_state])

        new_value = (1 - alpha) * old_value + alpha * (reward + gamma * next_max)
        q_table[state, action] = new_value

        steps += 1

After that, our Q-table is updated and we can use it. To test it, we will run the same code again, just with the updated Q-table.

alpha = 0.1
gamma = 0.6
epsilon = 0.1

field = Field(size, item_start, item_drop_off, start_position)
done = False
steps = 0
while not done:
    state = field.get_state()
    if random.uniform(0, 1) < epsilon:
        action = random.randrange(0, 6)
    else:
        action = np.argmax(q_table[state])

    reward, done = field.move_driver(action)
    next_state = field.get_state()

    old_value = q_table[state, action]
    next_max = np.max(q_table[next_state])

    new_value = (1 - alpha) * old_value + alpha * (reward + gamma * next_max)
    q_table[state, action] = new_value

    steps += 1

print(steps)

This resulted in 22 steps. That is awesome.