What will we cover in this tutorial
- How to create a Machine Learning trading bot in Python.
- How to build a simple Reinforcement Learning trading bot.
- The idea behind the Reinforcement Learning trading bot.
- How to evaluate the trading bot's performance.
Machine Learning and Trading?
First things first: a Machine Learning trading bot? Machine Learning can be used for various things when it comes to trading.
- Some claim that Machine Learning has difficulties in Day-trading as it sees the market as noise.
- Others use Machine Learning for stock market clustering, for example with K-Means.
- Also, some claim that Machine Learning can help traders, but not beat them.
Well, it is good to set our expectations. This tutorial is experimental and does not claim to produce a bullet-proof Machine Learning trading bot that will make you rich. I strongly advise you not to use it for automated trading.
This tutorial is only intended to test and learn about how a Reinforcement Learning strategy can be used to build a Machine Learning Trading Bot.
Step 1: The idea behind the Reinforcement Learning strategy
I wanted to test how a Reinforcement Learning algorithm would do in the market.
First, let us understand what Reinforcement Learning is: it teaches the machine to think for itself based on the rewards of past actions.

It is like training a dog. You and the dog do not talk the same language, but the dog learns how to act based on rewards (and punishment, which I do not advise or advocate).
Hence, if a dog is rewarded for a certain action in a given situation, then next time it is exposed to a similar situation it will act the same.
Translate that to Reinforcement Learning.
- The agent is the dog that is exposed to the environment.
- Then the agent encounters a state.
- The agent performs an action to transition from that state to a new state.
- Then after the transition the agent receives a reward or penalty (punishment).
- This forms a policy to create a strategy to choose actions in a given state.
Could that fit well with trading? That is what I want to investigate.
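To make the terms concrete before we turn to trading, here is a tiny, self-contained toy example (with a made-up price environment and a purely random policy) showing how agent, state, action and reward fit together:
import random

# Toy environment: the "state" is a price that drifts, the agent holds (0) or flips its
# position (1), and the reward is the price change it captures while holding.
def toy_episode(steps=10):
    price, holding, total_reward = 100.0, False, 0.0
    for _ in range(steps):
        change = random.uniform(-1, 1)       # the environment moves to a new state
        action = random.randrange(2)         # the agent acts (here: a random policy)
        if action == 1:
            holding = not holding
        reward = change if holding else 0.0  # reward or penalty after the transition
        total_reward += reward               # a learning agent would update its policy here
        price += change
    return total_reward

print(toy_episode())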
Step 2: The idea behind how to use Reinforcement Learning in Trading
The environment in trading can be translated to rewards and penalties (punishment). You win or lose on the stock market, right?
But we also want to keep the environment simple for the bot, not to make it too complex. Hence, in this experiment, the bot only knows one stock and has to decide whether to buy, keep, or sell.
Said differently.
- The trading bot (agent) is exposed to the stock history (environment).
- Then the trading bot (agent) encounters the new stock price (state).
- The trading bot (agent) then performs a choice to keep, sell or buy (action), which brings it to a new state.
- Then the trading bot (agent) receives a reward based on the value difference from day to day.
The reward will often only materialize some time after the action, hence the feedback from later steps should be weighted highly. Or at least, that is my expectation.
Step 3: Understand Q-learning as the Reinforcement Learning model
The Q-learning model is easy to understand and has the potential to be very powerful. Of course, it is only as good as its design. But before we can design it, we need to understand the mechanism behind it.

The Q-Learning algorithm has a Q-table (a matrix of dimension states x actions). Don't worry if you do not know what a matrix is; you will not need the mathematical aspects of it. It is just an indexed "container" of numbers.
- The agent (or Q-Learning algorithm) will be in a state.
- Then in each iteration the agent needs to take an action.
- The agent will continuously update the reward in the Q-table.
- The learning can come from either exploiting or exploring.
This translates into the following pseudo-algorithm for Q-Learning.
The agent is in a given state and needs to choose an action.
- Initialise the Q-table to all zeros
- Iterate:
- The agent is in a state.
- With probability epsilon choose to explore, else exploit.
- If explore, then choose a random action.
- If exploit, then choose the best action based on the current Q-table.
- Update the Q-table from the new reward to the previous state.
- Q[state, action] = (1 - alpha) * Q[state, action] + alpha * (reward + gamma * max(Q[new_state]))
As you can see, we have introduced the following variables.
- epsilon: the probability of taking a random action, which is done to explore new territory.
- alpha: the learning rate, that is, how big a step the algorithm takes in each iteration; it should be in the interval from 0 to 1.
- gamma: the discount factor used to balance immediate and future reward. This value is usually between 0.8 and 0.99.
- reward: the feedback on the action; it can be any number, where negative is a penalty (or punishment) and positive is a reward.
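As a minimal, standalone sketch of the Q-table and the update rule above (the state, new state and reward values are made up purely for illustration):
import numpy as np
import random

n_states, n_actions = 5, 2
alpha, gamma, epsilon = 0.5, 0.9, 0.1

q_table = np.zeros((n_states, n_actions))   # initialise the Q-table to all zeros

state = 0
# With probability epsilon explore (random action), otherwise exploit the current Q-table
if random.uniform(0, 1) < epsilon:
    action = random.randrange(n_actions)
else:
    action = int(np.argmax(q_table[state]))

new_state, reward = 1, 1.0                  # pretend the environment returned these
q_table[state, action] = (1 - alpha) * q_table[state, action] + \
    alpha * (reward + gamma * np.max(q_table[new_state]))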
Step 4: The choices we need to take
Based on that, we need to see how the algorithm should map the stock information to a state. We want the model to be fairly simple and not have too many states, as it will take a long time to populate the table with data.
There are many parameters to choose from here. While we do not want to tell the algorithm what to do, we still need to feed it the data we find relevant.
In this case it was the following.
- Volatility of the share.
- The percentage change of the daily short mean (the average over the last 20 days).
- The percentage change of the daily long mean (the average over the last 100 days).
- The daily long mean, which is the average over the last 100 days.
- The trading volume that day.
These values need to be calculated for the share we use. That can be done by the following code.
import datetime as dt
import pandas as pd
import pandas_datareader as pdr
import numpy as np
from dateutil.relativedelta import relativedelta

VALUE = 'Adj Close'
ID = 'id'
NAME = 'name'
DATA = 'data'

STATES_DIM = 10     # number of bins per feature (set it to what you find appropriate)
short_window = 20   # the short mean is the average over the last 20 days
long_window = 100   # the long mean is the average over the last 100 days

def get_data(name, years_ago):
    start = dt.datetime.now() - relativedelta(years=years_ago)
    end = dt.datetime.now()
    df = pdr.get_data_yahoo(name, start, end)
    return df

def process():
    stock = {ID: 'AAPL', NAME: 'AAPL'}
    stock[DATA] = get_data(stock[ID], 20)
    # Update it with all the values
    stock[DATA]['Short Mean'] = stock[DATA][VALUE].rolling(window=short_window).mean()
    stock[DATA]['Long Mean'] = stock[DATA][VALUE].rolling(window=long_window).mean()
    stock[DATA]['Daily Change'] = stock[DATA][VALUE].pct_change()
    stock[DATA]['Daily Short Change'] = stock[DATA]['Short Mean'].pct_change()
    stock[DATA]['Daily Long Change'] = stock[DATA]['Long Mean'].pct_change()
    stock[DATA]['Volatility'] = stock[DATA]['Daily Change'].rolling(75).std()*np.sqrt(75)
As you have probably noticed, this creates a challenge: the values are continuous, so you need to put them into bins, that is, a fixed number of "boxes" they can fall into.
def process():
    # ... (feature calculations from above)
    # Let's put the data into bins
    stock[DATA]['Vla bin'] = pd.cut(stock[DATA]['Volatility'], bins=STATES_DIM, labels=False)
    stock[DATA]['Srt ch bin'] = pd.cut(stock[DATA]['Daily Short Change'], bins=STATES_DIM, labels=False)
    stock[DATA]['Lng ch bin'] = pd.cut(stock[DATA]['Daily Long Change'], bins=STATES_DIM, labels=False)
    # stock[DATA]['Srt mn bin'] = pd.cut(stock[DATA]['Short Mean'], bins=STATES_DIM, labels=False)
    stock[DATA]['Lng mn bin'] = pd.cut(stock[DATA]['Long Mean'], bins=STATES_DIM, labels=False)
    stock[DATA]['Vol bin'] = pd.cut(stock[DATA]['Volume'], bins=STATES_DIM, labels=False)
This quantizes each of the 5 dimensions into STATES_DIM bins, where STATES_DIM is a number you can set to what you think is appropriate.
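To see what the binning does, here is a small standalone example with made-up numbers (assuming 10 bins):
import pandas as pd

values = pd.Series([0.1, 0.5, 0.9, 1.4, 2.0, 3.5, 5.0])
bins = pd.cut(values, bins=10, labels=False)   # each value is mapped to a bin index 0..9
print(bins.tolist())   # [0, 0, 1, 2, 3, 6, 9]

# With e.g. 10 bins per feature and 5 features, the Q-table below gets
# 10**5 * 2 = 200,000 states (times 2 for whether the bot holds the stock or not)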
Step 5: How to model it
This can be done by creating an environment that plays the role of your trading account.
class Account:
    def __init__(self, cash=1000000, brokerage=0.001):
        self.cash = cash
        self.brokerage = brokerage
        self.stocks = 0
        self.stock_id = None
        self.has_stocks = False

    def get_value(self, row):
        if self.has_stocks:
            return self.cash + row[VALUE] * self.stocks
        else:
            return self.cash

    def buy_stock(self, stock_id, row):
        if self.has_stocks:
            return
        self.stock_id = stock_id
        self.stocks = int(self.cash // (row[VALUE]*(1.0 + self.brokerage)))
        self.cash -= self.stocks*row[VALUE]*(1.0 + self.brokerage)
        self.has_stocks = True
        self.print_status(row, "Buy")

    def sell_stock(self, row):
        if not self.has_stocks:
            return
        self.print_status(row, "Sell")
        self.cash += self.stocks * (row[VALUE]*(1.0 - self.brokerage))
        self.stock_id = None
        self.stocks = 0
        self.has_stocks = False

    def print_status(self, row, title="Status"):
        if self.has_stocks:
            print(title, self.stock_id, "TOTAL:", self.cash + self.stocks*float(row[VALUE]))
            print(" - ", row.name, "price", row[VALUE])
            print(" - ", "Short", row['Daily Short Change'])
            print(" - ", "Long", row['Daily Long Change'])
        else:
            print(title, "TOTAL", self.cash)
Then we iterate over a time period, where the trading bot decides what to do each day.
def process():
    # ... (data preparation from above)
    # Now let's prepare our model
    q_learning = QModel()
    account = Account()
    state = None
    reward = 0.0
    action = 0
    last_value = 0.0
    for index, row in stock[DATA].iterrows():
        if state is not None:
            # The reward is the immediate return
            reward = account.get_value(row) - last_value
            # You update the day after the action, when you know the result of your action
            q_learning.update_reward(row, account.has_stocks, action, state, reward)
        action, state = q_learning.get_action(row, account.has_stocks)
        if action == 0:
            pass
        elif action == 1:
            if account.has_stocks:
                account.sell_stock(row)
            else:
                account.buy_stock(stock[ID], row)
        last_value = account.get_value(row)
    account.print_status(row)
    q_learning.save_pickle()
    return last_value
This code executes whatever the trading bot decides to do.
Step 6: The Q-learning model
Now to the core of it: the actual trading bot, which knows nothing about trading. But can we train it to earn money on trading, and how much? We will see that later.
import os
import pickle
import random

class QModel:
    def __init__(self, alpha=0.5, gamma=0.7, epsilon=0.1):
        self.alpha = alpha
        self.gamma = gamma
        self.epsilon = epsilon
        self.states_per_dim = STATES_DIM
        self.dim = 5
        self.states = (self.states_per_dim ** self.dim) * 2
        self.actions = 2
        self.pickle = "q_model7.pickle"
        self.q_table = np.zeros((self.states, self.actions))
        if os.path.isfile(self.pickle):
            print("Loading pickle")
            with open(self.pickle, "rb") as f:
                self.q_table = pickle.load(f)

    def save_pickle(self):
        with open(self.pickle, "wb") as f:
            pickle.dump(self.q_table, f)

    def get_state(self, row, has_stock):
        dim = []
        # Early rows can have NaN bins (rolling windows not filled yet) - map those to bin 0
        for key in ['Vla bin', 'Srt ch bin', 'Lng ch bin', 'Lng mn bin', 'Vol bin']:
            value = row[key]
            dim.append(0 if pd.isna(value) else int(value))
        dimension = 0
        if has_stock:
            dimension = 1 * (self.states_per_dim ** self.dim)
        dimension += dim[4] * (self.states_per_dim ** 4)
        dimension += dim[3] * (self.states_per_dim ** 3)
        dimension += dim[2] * (self.states_per_dim ** 2)
        dimension += dim[1] * (self.states_per_dim ** 1)
        dimension += dim[0]
        return dimension

    def get_action(self, row, has_stock):
        state = self.get_state(row, has_stock)
        if random.uniform(0, 1) < self.epsilon:
            action = random.randrange(0, self.actions)
        else:
            action = np.argmax(self.q_table[state])
        return action, state

    def update_reward(self, row, has_stock, last_action, last_state, reward):
        next_state = self.get_state(row, has_stock)
        old_value = self.q_table[last_state, last_action]
        next_max = np.max(self.q_table[next_state])
        new_value = (1 - self.alpha) * old_value + self.alpha * (reward + self.gamma * next_max)
        self.q_table[last_state, last_action] = new_value
Now we have all the pieces to try it out; put the code blocks above together to get the full program.
Step 7: Training the model
Now we need to train the model.
For that purpose, I have made a list of 134 stocks that I used and placed them in a CSV file.
Then the training simply reads one of the 134 stocks with 10 years of historical data, picks a 1-year window, and runs the algorithm on it.
Then repeat.
if __name__ == "__main__":
    # source: http://www.nasdaqomxnordic.com/shares/listed-companies/copenhagen
    csv_stock_file = 'DK-Stocks.csv'
    while True:
        iterations = 1000
        for i in range(iterations):
            # Go at most 9 years back, as we only have 10 years available and need 1 year of data
            days_back = random.randrange(0, 9*365)
            process(csv_stock_file)
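Note that the process() defined earlier takes no arguments and days_back is not used above; the training loop assumes a variant of process() that picks a stock and a time window. As a sketch of what such a variant could look like (reusing get_data and the constants from above, and assuming the CSV has a 'Symbol' column with ticker symbols):
import datetime as dt
import random
import pandas as pd
from dateutil.relativedelta import relativedelta

def process(csv_stock_file, days_back):
    # Pick a random ticker from the CSV file (assuming a column named 'Symbol')
    tickers = pd.read_csv(csv_stock_file)['Symbol'].tolist()
    ticker = random.choice(tickers)
    stock = {ID: ticker, NAME: ticker}
    # Fetch 10 years of data and slice out a 1-year window ending days_back days ago
    stock[DATA] = get_data(stock[ID], 10)
    end = dt.datetime.now() - dt.timedelta(days=days_back)
    start = end - relativedelta(years=1)
    stock[DATA] = stock[DATA].loc[start:end]
    # ... then compute the features and bins, and run the Q-learning loop as in the earlier process()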
Then let it run and run and run and run again.
Step 8: Testing the algorithm
Of course, the testing should be done on unknown data, that is, a stock the bot does not know. You also cannot re-run it on the same stock, as it will learn from it (unless you do not save the Q-table between runs).
Hence, I chose a well-performing stock to see how the bot would do, and whether it could beat the buy-on-the-first-day-and-sell-on-the-last-day strategy.

The return of a 1,000,000$ investment with the trading bot was approximately 1,344,500$. This is a return of roughly 34% for one year. Compare that with the stock price itself.
The stock price was 201.55$ on July 1st, 2019 and 362.09$ on June 30th, 2020. Buying on the first day and selling on the last day would give the following return (0.10% brokerage is included in the calculation, as the trading bot also pays that on each buy and sell).
- 1,792,847$
That does not look that good. That means that a simple strategy to buy on day one and sell on the last day would return more than the bot.
Of course, you can’t conclude it is not possible to do better on other stocks, but for this case it was not impressive.
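For reference, here is a small sketch of that buy-and-hold benchmark calculation (assuming the same 1,000,000$ starting cash and 0.10% brokerage paid on both the buy and the sell):
cash = 1000000.0
brokerage = 0.001
buy_price, sell_price = 201.55, 362.09   # July 1st 2019 and June 30th 2020

# Buy as many whole shares as the cash allows, brokerage included
shares = int(cash // (buy_price * (1.0 + brokerage)))
cash -= shares * buy_price * (1.0 + brokerage)
# Sell everything on the last day and pay brokerage again
cash += shares * sell_price * (1.0 - brokerage)
print(round(cash))   # about 1,792,843$, close to the 1,792,847$ above (the difference is rounding)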
Variations and next step
There are many variables to adjust; I especially think I set the gamma too low. There are other parameters that could be used to build the state: some could be removed, as they might just add noise, and more relevant ones could be added. Also, the number of bins can be adjusted. That the bins are made independently of each other might also be a problem.
Also read the tutorial on reinforcement learning.