Simple Machine Learning Trading Bot in Python – Evaluating how it Performs

What will we cover in this tutorial

  • To create a machine learning trading bot in Python
  • How to build a simple Reinforcement Learning Trading bot.
  • The idea behind the Reinforcement Learning trading bot
  • Evaluate how the trading bot performs

Machine Learning and Trading?

First things first. A Machine Learning trading bot? Machine Learning can be used for various things in trading.

Well, it is good to set our expectations. This tutorial is experimental and does not claim to produce a bullet-proof Machine Learning trading bot that will make you rich. I strongly advise you not to use it for automated trading.

This tutorial is only intended to test and learn about how a Reinforcement Learning strategy can be used to build a Machine Learning Trading Bot.

Step 1: The idea behind the Reinforcement Learning strategy

I wanted to test how a Reinforcement Learning algorithm would do in the market.

First, let us understand what Reinforcement Learning is. Reinforcement Learning teaches the machine to choose its actions based on the rewards its past actions received.

Reinforcement Learning (in Machine Learning) teaches the machine to think based on past action rewards.

It is like training a dog. You and the dog do not speak the same language, but the dog learns how to act based on rewards (and punishment, which I do not advise or advocate).

Hence, if a dog is rewarded for a certain action in a given situation, then next time it is exposed to a similar situation it will act the same. 

Translate that to Reinforcement Learning. 

  • The agent is the dog that is exposed to the environment
  • Then the agent encounters a state
  • The agent performs an action to transition from that state to a new state
  • Then after the transition the agent receives a reward or penalty (punishment).
  • Over time this forms a policy: a strategy for choosing actions in a given state

That sounds like it could fit well with trading. But does it? That is what I want to investigate.

Step 2: The idea behind how to use Reinforcement Learning in Trading

The outcomes in trading translate naturally into rewards and penalties (punishment). You win or lose on the stock market, right?

But we also want to keep the environment simple for the bot and not make it too complex. Hence, in this experiment, the bot only knows 1 stock and has to decide whether to buy, keep or sell.

Said differently.

  • The trading bot (agent) is exposed to the stock history (environment).
  • Then the trading bot (agent) encounters the new stock price (state).
  • The trading bot (agent) then performs a choice to keep, sell or buy (action), which brings it to a new state.
  • Then the trading bot (agent) receives a reward based on the difference in account value from day to day.

The reward of an action often only shows up after some time. Hence, the weight given to feedback from later steps should be set high. Or at least, that is my expectation.
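As a small illustration of the reward signal described above, the sketch below computes the day-to-day change in account value for a made-up price series. The prices, share count and cash balance are invented for the example and are not taken from the bot.

```python
# Hypothetical closing prices for four consecutive days (made-up numbers).
prices = [100.0, 102.0, 101.0, 105.0]
shares = 10   # assumed position, held on all four days
cash = 50.0   # assumed uninvested cash

# Account value each day: cash plus the market value of the shares.
values = [cash + shares * price for price in prices]

# The reward for a day is the change in value since the previous day.
rewards = [today - yesterday for yesterday, today in zip(values, values[1:])]

print(rewards)  # [20.0, -10.0, 40.0]
```

A positive number is a reward and a negative number is a penalty, which is all the agent gets to learn from.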

Step 3: Understand Q-learning as the Reinforcement Learning model

The Q-learning model is easy to understand and has the potential to be very powerful. Of course, it is only as good as its design. But before we can design it, we need to understand the mechanism behind it.

Q-Learning algorithm (Reinforcement / Machine Learning) - exploit or explore - Update Q-table

The Q-Learning algorithm has a Q-table (a Matrix of dimension state x actions – don’t worry if you do not understand what a Matrix is, you will not need the mathematical aspects of it – it is just an indexed “container” with numbers).

  • The agent (or Q-Learning algorithm) will be in a state.
  • Then in each iteration the agent needs to take an action.
  • The agent will continuously update the reward in the Q-table.
  • The learning can come from either exploiting or exploring.

This translates into the following pseudocode for the Q-Learning algorithm.

The agent is in a given state and needs to choose an action.

  • Initialise the Q-table to all zeros
  • Iterate:
    • The agent is in a given state (called state below).
    • With probability epsilon choose to explore, else exploit.
      • If explore, then choose a random action.
      • If exploit, then choose the best action based on the current Q-table.
    • Update the Q-table from the new reward to the previous state.
      • Q[state, action] = (1 - alpha) * Q[state, action] + alpha * (reward + gamma * max(Q[new_state]))

As you can see, we have introduced the following variables.

  • epsilon: the probability to take a random action, which is done to explore new territory.
  • alpha: the learning rate, which controls how much each iteration changes the Q-table. It should be in the interval from 0 to 1.
  • gamma: is the discount factor used to balance the immediate and future reward. This value is usually between 0.8 and 0.99
  • reward: is the feedback on the action and can be any number. Negative is penalty (or punishment) and positive is a reward.
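To see the explore/exploit loop and the update in action, here is a minimal sketch on a toy problem that has nothing to do with trading: an agent walks along 5 positions and is rewarded for reaching the last one. The environment, the names N_STATES and step, and the random tie-breaking are assumptions made for this illustration. The update is written in the form (1 - alpha) * old value + alpha * (reward + gamma * best next value).

```python
import random
import numpy as np

random.seed(0)

N_STATES = 5              # positions 0..4 on a line; position 4 is the goal
MOVES = [-1, +1]          # action 0 moves left, action 1 moves right
alpha, gamma, epsilon = 0.5, 0.9, 0.1

q_table = np.zeros((N_STATES, len(MOVES)))

def step(state, action):
    """Toy environment: reward 1 for reaching the last position, else 0."""
    new_state = min(max(state + MOVES[action], 0), N_STATES - 1)
    reward = 1.0 if new_state == N_STATES - 1 else 0.0
    return new_state, reward

def choose_action(state):
    # Explore with probability epsilon, otherwise exploit the Q-table.
    if random.uniform(0, 1) < epsilon:
        return random.randrange(len(MOVES))
    # Break ties at random so the untrained agent does not get stuck.
    best = np.flatnonzero(q_table[state] == q_table[state].max())
    return int(random.choice(best.tolist()))

for episode in range(200):
    state = 0
    while state != N_STATES - 1:
        action = choose_action(state)
        new_state, reward = step(state, action)
        # Q-learning update of the entry for the state we just left.
        q_table[state, action] = (1 - alpha) * q_table[state, action] \
            + alpha * (reward + gamma * np.max(q_table[new_state]))
        state = new_state

# Best action per non-goal position after training.
best_actions = np.argmax(q_table[:-1], axis=1)
print(best_actions)  # [1 1 1 1]
```

After training, the best action in every position is 1 (move right), even though only the very last step is ever rewarded directly. The reward has been propagated backwards through the table by the gamma term.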

Step 4: The choices we need to take

Based on that, we need to see how the algorithm should map the stock information to a state. We want the model to be fairly simple and not have too many states, as it would take a long time to populate them with data.

There are many parameters to choose from here. Although we do not want to tell the algorithm what to do, we still need to feed it what we find to be relevant data.

In this case it was the following.

  • Volatility of the share.
  • The percentage change of the daily short mean (average over last 20 days).
  • The percentage change of the daily long mean (average over the last 100 days).
  • The daily long mean, which is the average over the last 100 days.
  • The volume of the sales that day.

These values need to be calculated for the share we use. That can be done by the following code.

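import datetime as dt

import numpy as np
import pandas as pd
import pandas_datareader as pdr
from dateutil.relativedelta import relativedelta

VALUE = 'Adj Close'
ID = 'id'
NAME = 'name'
DATA = 'data'

def get_data(name, years_ago):
    # Fetch daily prices from years_ago years back until today
    start = dt.datetime.now() - relativedelta(years=years_ago)
    end = dt.datetime.now()
    df = pdr.get_data_yahoo(name, start, end)
    return df

def process():
    # ID holds the ticker symbol that is passed on to get_data
    stock = {ID: 'AAPL', NAME: 'AAPL'}

    stock[DATA] = get_data(stock[ID], 20)

    # Window lengths in days, matching the 20 and 100 day means listed above
    short_window = 20
    long_window = 100

    # Update it with all values
    stock[DATA]['Short Mean'] = stock[DATA][VALUE].rolling(window=short_window).mean()
    stock[DATA]['Long Mean'] = stock[DATA][VALUE].rolling(window=long_window).mean()

    stock[DATA]['Daily Change'] = stock[DATA][VALUE].pct_change()
    stock[DATA]['Daily Short Change'] = stock[DATA]['Short Mean'].pct_change()
    stock[DATA]['Daily Long Change'] = stock[DATA]['Long Mean'].pct_change()
    # Volatility: standard deviation of the daily change over 75 days, scaled by sqrt(75)
    stock[DATA]['Volatility'] = stock[DATA]['Daily Change'].rolling(75).std()*np.sqrt(75)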

As you have probably noticed, this creates a challenge. The values are continuous, while the Q-table needs a limited number of states. You need to put them into bins, that is, a fixed number of “boxes” to fit them in.

def process():
    # Let's put data in bins
    stock[DATA]['Vla bin'] = pd.cut(stock[DATA]['Volatility'], bins=STATES_DIM, labels=False)
    stock[DATA]['Srt ch bin'] = pd.cut(stock[DATA]['Daily Short Change'], bins=STATES_DIM, labels=False)
    stock[DATA]['Lng ch bin'] = pd.cut(stock[DATA]['Daily Long Change'], bins=STATES_DIM, labels=False)
    # stock[DATA]['Srt mn bin'] = pd.cut(stock[DATA]['Short Mean'], bins=DIM, labels=False)
    stock[DATA]['Lng mn bin'] = pd.cut(stock[DATA]['Long Mean'], bins=STATES_DIM, labels=False)
    stock[DATA]['Vol bin'] = pd.cut(stock[DATA]['Volume'], bins=STATES_DIM, labels=False)

This will quantize each of the 5 dimensions into STATES_DIM bins, a number you can set to whatever you think is appropriate.
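To get a feel for what pd.cut does here, the following sketch bins a handful of made-up percentage changes. The value STATES_DIM = 4 and the numbers in the series are assumptions chosen for illustration only.

```python
import pandas as pd

STATES_DIM = 4  # assumed number of bins per feature, for illustration

# Made-up daily percentage changes (not market data).
changes = pd.Series([-0.04, -0.03, -0.01, 0.01, 0.03, 0.04])

# With an integer for bins, pd.cut splits the observed range into that many
# equal-width intervals. With labels=False it returns each value's bin index.
bins = pd.cut(changes, bins=STATES_DIM, labels=False)
print(bins.tolist())  # [0, 0, 1, 2, 3, 3]

# Five binned features plus a "holds stock" flag give this many states.
n_states = (STATES_DIM ** 5) * 2
print(n_states)  # 2048
```

Note that the bin edges depend on the smallest and largest value in the data, so a single extreme day stretches all the bins. The number of states also grows quickly: going from 4 to 10 bins per feature gives 200,000 states instead of 2,048.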

Step 5: How to model it

This can be done by creating an environment that plays the role of your trading account.

class Account:
    def __init__(self, cash=1000000, brokerage=0.001):
        self.cash = cash
        self.brokerage = brokerage
        self.stocks = 0
        self.stock_id = None
        self.has_stocks = False

    def get_value(self, row):
        # Total value: cash plus the market value of any shares held
        if self.has_stocks:
            return self.cash + row[VALUE] * self.stocks
        else:
            return self.cash

    def buy_stock(self, stock_id, row):
        if self.has_stocks:
            return
        self.stock_id = stock_id
        # Buy as many whole shares as the cash allows, brokerage included
        self.stocks = int(self.cash // (row[VALUE]*(1.0 + self.brokerage)))
        self.cash -= self.stocks*row[VALUE]*(1.0 + self.brokerage)
        self.has_stocks = True
        self.print_status(row, "Buy")

    def sell_stock(self, row):
        if not self.has_stocks:
            return
        self.print_status(row, "Sell")
        # Sell everything and pay brokerage on the proceeds
        self.cash += self.stocks * (row[VALUE]*(1.0 - self.brokerage))
        self.stock_id = None
        self.stocks = 0
        self.has_stocks = False

    def print_status(self, row, title="Status"):
        if self.has_stocks:
            print(title, self.stock_id, "TOTAL:", self.cash + self.stocks*float(row[VALUE]))
            print(" - ", self.cash, "price", row[VALUE])
            print(" - ", "Short", row['Daily Short Change'])
            print(" - ", "Long", row['Daily Long Change'])
        else:
            print(title, "TOTAL", self.cash)

Then we iterate over a period of time, letting the trading bot decide what to do each day.

def process():
    # Now let's prepare our model
    q_learning = QModel()
    account = Account()

    state = None
    reward = 0.0
    action = 0
    last_value = 0.0
    for index, row in stock[DATA].iterrows():
        if state is not None:
            # The reward is the immediate return
            reward = account.get_value(row) - last_value
            # You update the day after the action, when you know the results of your actions
            q_learning.update_reward(row, account.has_stocks, action, state, reward)
        action, state = q_learning.get_action(row, account.has_stocks)

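        if action == 0:
            # Action 0: keep the current position
            pass
        elif action == 1:
            # Action 1: switch position, selling if we hold stock and buying otherwise
            if account.has_stocks:
                account.sell_stock(row)
            else:
                account.buy_stock(stock[ID], row)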
        last_value = account.get_value(row)
    return last_value

This code will do whatever the trading bot tells it to do.

Step 6: The Q-learning model

Now to the core of the thing. The actual trading bot, that knows nothing about trading. But can we train it to earn money on trading and how much? We will see that later.

import os
import pickle
import random

class QModel:
    def __init__(self, alpha=0.5, gamma=0.7, epsilon=0.1):
        self.alpha = alpha
        self.gamma = gamma
        self.epsilon = epsilon

        self.states_per_dim = STATES_DIM
        self.dim = 5
        self.states = (self.states_per_dim ** self.dim) * 2
        self.actions = 2
        self.pickle = "q_model7.pickle"
        self.q_table = np.zeros((self.states, self.actions))
        if os.path.isfile(self.pickle):
            print("Loading pickle")
            with open(self.pickle, "rb") as f:
                self.q_table = pickle.load(f)

    def save_pickle(self):
        with open(self.pickle, "wb") as f:
            pickle.dump(self.q_table, f)

    def get_state(self, row, has_stock):
        columns = ['Vla bin', 'Srt ch bin', 'Lng ch bin', 'Lng mn bin', 'Vol bin']
        # Rows from the rolling warm-up period have NaN bins; map those to bin 0
        dim = [0 if np.isnan(row[c]) else int(row[c]) for c in columns]
        # Encode the bins as the digits of a number in base states_per_dim,
        # with the has_stock flag selecting the upper half of the table
        dimension = 0
        if has_stock:
            dimension = 1 * (self.states_per_dim ** self.dim)
        for i, value in enumerate(dim):
            dimension += value * (self.states_per_dim ** i)
        return dimension

    def get_action(self, row, has_stock):
        state = self.get_state(row, has_stock)

        if random.uniform(0, 1) < self.epsilon:
            # Explore: pick a random action
            action = random.randrange(0, self.actions)
        else:
            # Exploit: pick the best known action for this state
            action = np.argmax(self.q_table[state])
        return action, state

    def update_reward(self, row, has_stock, last_action, last_state, reward):
        next_state = self.get_state(row, has_stock)

        old_value = self.q_table[last_state, last_action]
        next_max = np.max(self.q_table[next_state])

        new_value = (1 - self.alpha) * old_value + self.alpha * (reward + self.gamma * next_max)
        self.q_table[last_state, last_action] = new_value

Now we have the full code to try it out (the full code is at the end of the tutorial).
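The way get_state turns five bin numbers and the has_stock flag into a single row index of the Q-table can be checked in isolation. The sketch below is a standalone version of that encoding together with its inverse. The function names encode_state and decode_state and the value STATES_DIM = 4 are assumptions for this example, not part of the bot.

```python
STATES_DIM = 4  # assumed bins per feature
DIM = 5         # number of binned features

def encode_state(bins, has_stock):
    """Treat the bins as digits of a base-STATES_DIM number."""
    index = sum(b * STATES_DIM ** i for i, b in enumerate(bins))
    if has_stock:
        # States where the bot holds stock live in the upper half of the table.
        index += STATES_DIM ** DIM
    return index

def decode_state(index):
    """Inverse of encode_state, handy for inspecting rows of the Q-table."""
    has_stock, rest = divmod(index, STATES_DIM ** DIM)
    bins = []
    for _ in range(DIM):
        rest, digit = divmod(rest, STATES_DIM)
        bins.append(digit)
    return bins, bool(has_stock)

state = encode_state([1, 0, 3, 2, 1], has_stock=True)
print(state)                # 1457
print(decode_state(state))  # ([1, 0, 3, 2, 1], True)
```

Every combination of bins and has_stock gets its own row, which is why the table has (STATES_DIM ** 5) * 2 rows.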

Step 7: Training the model

Now we need to train the model.

For that purpose, I have made a list of 134 stocks that I used and placed them in a CSV file.

Then the training is simply to read in 1 of the 134 stocks with 10 years of historical data, find a 1 year window, and run the algorithm on it.

Then repeat.

if __name__ == "__main__":
    # source:
    csv_stock_file = 'DK-Stocks.csv'

    while True:
        iterations = 1000
        for i in range(iterations):
            # Go at most 9 years back, as we only have 10 years available and need 1 year of data
            days_back = random.randrange(0, 9*365)

Then let it run and run and run and run again.
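The snippet above only shows the start of the training loop. As a hedged sketch of the window-picking idea, the code below selects a random one-year slice from a synthetic 10-year price history. The synthetic data, the helper name random_one_year_window, and counting the offset forward from the first date are assumptions for illustration and not the original training code.

```python
import random

import numpy as np
import pandas as pd

random.seed(1)

# Synthetic 10-year daily price history (a random walk, not market data).
dates = pd.date_range("2010-01-01", periods=10 * 365, freq="D")
steps = np.random.default_rng(0).normal(0, 1, len(dates))
prices = pd.DataFrame({"Adj Close": 100 + steps.cumsum()}, index=dates)

def random_one_year_window(df):
    """Return a randomly placed one-year slice of a date-indexed frame."""
    # Start at most 9 years in, so a full year of data is left after it.
    offset = random.randrange(0, 9 * 365)
    start = df.index[0] + pd.Timedelta(days=offset)
    end = start + pd.Timedelta(days=365)
    # Label-based slicing with .loc includes both end points.
    return df.loc[start:end]

window = random_one_year_window(prices)
print(window.index[0].date(), window.index[-1].date(), len(window))
```

Each training run would then iterate over the rows of such a window, just as the loop in Step 5 iterates over stock[DATA].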

Step 8: Testing the algorithm

Of course, the testing should be done on unknown data, that is, on a stock the bot has not seen. You also cannot re-run the test on the same stock, as the bot will learn from it (unless you do not save the state afterwards).

Hence, I chose a good performing stock to see how it would do, to see if it could beat the buy-first-day-and-sell-last-day strategy.

The results of the trading bot on Apple stocks.

The return of a 1,000,000$ investment with the Trading Bot was approximately 1,344,500$. This is a return of 34% for one year. Compare that with the stock price itself.

The stock price was 201.55$ on July 1st, 2019 and 362.09$ on June 30th, 2020. This would give the following return (0.10% in brokerage should be included in the calculations, as the Trading Bot pays that on each sell and buy).

  • 1,792,847$
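The buy-and-hold figure can be roughly reproduced with a few lines, using the same whole-share and brokerage logic as the Account class. This is a sketch based only on the two prices quoted above; it lands within a few dollars of the figure in the list, and the small gap is presumably due to rounding of the quoted prices.

```python
cash = 1_000_000.0
brokerage = 0.001                     # 0.10% on each buy and sell
buy_price, sell_price = 201.55, 362.09

# Buy as many whole shares as possible on the first day, brokerage included.
shares = int(cash // (buy_price * (1.0 + brokerage)))
cash -= shares * buy_price * (1.0 + brokerage)

# Sell everything on the last day and pay brokerage on the proceeds.
cash += shares * sell_price * (1.0 - brokerage)

print(shares, round(cash))  # 4956 shares, roughly 1.79 million dollars
```

That is a return of about 79% for buy-and-hold, against about 34% for the bot.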

That does not look that good. That means that a simple strategy to buy on day one and sell on the last day would return more than the bot.

Of course, you can’t conclude it is not possible to do better on other stocks, but for this case it was not impressive.

Variations and next step

There are many variables to adjust, and I especially think I set the gamma too low. There are other parameters that could be used to make the state: some could be removed, as they might just be adding noise, and more relevant ones could be added. The number of bins can also be adjusted. That the bins are made independently of each other might also be a problem.
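To get a feeling for why gamma matters, recall that in a discounted setting a reward that arrives k steps in the future is weighted by roughly gamma to the power k. The sketch below compares the QModel default of 0.7 with two higher values. The helper name discount_weight and the chosen horizons are assumptions for illustration.

```python
def discount_weight(gamma, steps_ahead):
    """Weight given to a reward that arrives steps_ahead steps in the future."""
    return gamma ** steps_ahead

for gamma in (0.7, 0.9, 0.99):
    weights = [round(discount_weight(gamma, k), 4) for k in (1, 5, 20)]
    print(gamma, weights)
```

With gamma at 0.7, a reward 20 trading days ahead is weighted by less than 0.001, so the bot is effectively blind to anything beyond a few days. With 0.99 the same reward still counts for about 0.82.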

Also read the tutorial on reinforcement learning.
