RandomForestClassifier: Predict Stock Market Direction

What will we cover in this tutorial?

A Forest Classifier is an approach to minimize the heavy bias a Decision Tree can get. A forest classifier simply contains a set of decision trees and uses majority voting to make the prediction.

In this tutorial we will try to use that on the stock market, by creating a few indicators. This tutorial will give a framework to explore if it can predict the direction of a stock. Given a set of indicators, will the stock go up or down the next trading day.

This is a simplified problem of predicting the actual stock value the next day.

Step 1: Getting data and calculate some indicators

If you are new to stock indicators, we can highly recommend you to read about the MACD, RSI, Stochastic Oscillator, where the MACD also includes how to calculate the EMA. Here we assume familiarity to those indicators. Also, that you are familiar with Pandas DataFrames and Pandad-datareader.

import pandas_datareader as pdr
import datetime as dt
import numpy as np

ticker = "^GSPC" # The S&P 500 index
data = pdr.get_data_yahoo(ticker, dt.datetime(2010,1,1), dt.datetime.now(), interval='d')

# Calculate the EMA10 > EMA30 signal
ema10 = data['Close'].ewm(span=10).mean()
ema30 = data['Close'].ewm(span=30).mean()
data['EMA10gtEMA30'] = np.where(ema10 > ema30, 1, -1)

# Calculate where Close is > EMA10
data['ClGtEMA10'] = np.where(data['Close'] > ema10, 1, -1)

# Calculate the MACD signal
exp1 = data['Close'].ewm(span=12).mean()
exp2 = data['Close'].ewm(span=26).mean()
macd = exp1 - exp2
macd_signal = macd.ewm(span=9).mean()
data['MACD'] = macd_signal - macd

# Calculate RSI
delta = data['Close'].diff()
up = delta.clip(lower=0)
down = -1*delta.clip(upper=0)
ema_up = up.ewm(com=13, adjust=False).mean()
ema_down = down.ewm(com=13, adjust=False).mean()
rs = ema_up/ema_down
data['RSI'] = 100 - (100/(1 + rs))

# Stochastic Oscillator
high14= data['High'].rolling(14).max()
low14 = data['Low'].rolling(14).min()
data['%K'] = (data['Close'] - low14)*100/(high14 - low14)

# Williams Percentage Range
data['%R'] = -100*(high14 - data['Close'])/(high14 - low14)

days = 6

# Price Rate of Change
ct_n = data['Close'].shift(days)
data['PROC'] = (data['Close'] - ct_n)/ct_n

print(data)

The choice of indicators is arbitrary but among some popular ones. It should be up to you to change them to other indicators and experiment with them.

                  High         Low        Open       Close       Volume   Adj Close  EMA10gtEMA30  ClGtEMA10      MACD         RSI         %K        %R      PROC
Date                                                                                                                                                             

2020-08-17  3387.590088  3379.219971  3380.860107  3381.989990  3671290000  3381.989990             1          1 -2.498718   68.294286  96.789344  -3.210656  0.009164
2020-08-18  3395.060059  3370.149902  3387.040039  3389.780029  3881310000  3389.780029             1          1 -1.925573   69.176468  97.234576  -2.765424  0.008722
2020-08-19  3399.540039  3369.659912  3392.510010  3374.850098  3884480000  3374.850098             1          1 -0.034842   65.419555  86.228281 -13.771719  0.012347
2020-08-20  3390.800049  3354.689941  3360.479980  3385.510010  3642850000  3385.510010             1          1  0.949607   66.805725  87.801036 -12.198964  0.001526
2020-08-21  3399.959961  3379.310059  3386.010010  3397.159912  3705420000  3397.159912             1          1  1.249066   68.301209  97.534948  -2.465052  0.007034

Step 2: Understand the how the Decision Tree works

Trees are the foundation in the Forest. Or Decision Trees are the foundation in a Forest Classifier. Hence, it is a good starting point to understand how a Decision Tree works. Luckily, they are quite easy to understand.

Let’s try to investigate a Decision Tree that is based on two of the indicators above. We take the RSI (Relative Strength Index) and %K (Stochastic Oscillator). A Decision Tree could look like this (depending on the training data).

Decision Tree for %K and RSI

When we get a new data row with %K and RSI indicators, it will start at the top of the Decision Tree.

  • At the first node it will check if %K <= 4.615, if so, take the left child otherwise the right child.
  • The gini tells us how a randomly chosen element would be incorrectly labeled. Hence, a low value close to 0 is good.
  • Samples tells us how many of the samples of the training set reached this node.
  • Finally, the value tells us how the values are distributed. In the final decision nodes, the category of most element is the prediction.

Looking at the above Decision Tree, it does not seem to be very good. The majority of samples end up the fifth node with a gini on 0.498, close to random, right? And it will label it 1, growth.

But this is the idea with Forest Classifiers, it will take a bunch of Decision Trees, that might not be good, and use majority of them to classify it.

Step 3: Create the Forest Classifier

Now we understand how the Decision Tree and the Forest Classifier work, we just need to run the magic. As this is done by calling a library function.

import pandas_datareader as pdr
import datetime as dt
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score
from sklearn.ensemble import RandomForestClassifier


ticker = "^GSPC"
data = pdr.get_data_yahoo(ticker, dt.datetime(2010,1,1), dt.datetime.now(), interval='d')

# Calculate the EMA10 > EMA30 signal
ema10 = data['Close'].ewm(span=10).mean()
ema30 = data['Close'].ewm(span=30).mean()
data['EMA10gtEMA30'] = np.where(ema10 > ema30, 1, -1)

# Calculate where Close is > EMA10
data['ClGtEMA10'] = np.where(data['Close'] > ema10, 1, -1)

# Calculate the MACD signal
exp1 = data['Close'].ewm(span=12).mean()
exp2 = data['Close'].ewm(span=26).mean()
macd = exp1 - exp2
macd_signal = macd.ewm(span=9).mean()
data['MACD'] = macd_signal - macd

# Calculate RSI
delta = data['Close'].diff()
up = delta.clip(lower=0)
down = -1*delta.clip(upper=0)
ema_up = up.ewm(com=13, adjust=False).mean()
ema_down = down.ewm(com=13, adjust=False).mean()
rs = ema_up/ema_down
data['RSI'] = 100 - (100/(1 + rs))

# Stochastic Oscillator
high14= data['High'].rolling(14).max()
low14 = data['Low'].rolling(14).min()
data['%K'] = (data['Close'] - low14)*100/(high14 - low14)

# Williams Percentage Range
data['%R'] = -100*(high14 - data['Close'])/(high14 - low14)

days = 6

# Price Rate of Change
ct_n = data['Close'].shift(days)
data['PROC'] = (data['Close'] - ct_n)/ct_n

# Set class labels to classify
data['Return'] = data['Close'].pct_change(1).shift(-1)
data['class'] = np.where(data['Return'] > 0, 1, 0)

# Clean for NAN rows
data = data.dropna()
# Minimize dataset
data = data.iloc[-200:]


# Data to predict
predictors = ['EMA10gtEMA30', 'ClGtEMA10', 'MACD', 'RSI', '%K', '%R', 'PROC']
X = data[predictors]
y = data['class']

# Split data into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30)

# Train the model
rfc = RandomForestClassifier(random_state=0)
rfc = rfc.fit(X_train, y_train)

# Test the model by doing some predictions
y_pred = rfc.predict(X_test)

# See how accurate the predictions are
report = classification_report(y_test, y_pred)
print('Model accuracy', accuracy_score(y_test, y_pred, normalize=True))
print(report)

First some notes on a few lines. The train_test_split, divides the data into training set and test set. The test set is set to be 30% of the data. It does it in a randomized way.

Next we create a RandomForestClassifier and fit it.

Then we use our newly created classifier (rfc) to predict on the test set (X_test).

Finally, we calculate the accuracy and the report.

Model accuracy 0.6333333333333333
              precision    recall  f1-score   support

           0       0.56      0.38      0.45        24
           1       0.66      0.81      0.73        36

    accuracy                           0.63        60
   macro avg       0.61      0.59      0.59        60
weighted avg       0.62      0.63      0.62        60

The model accuracy is 0.63, which seems quite good. It is better than random, at least. You can also see that the precision of 1 (growth) is higher than 0 (loss, or negative growth), with 0.66 and 0.56, respectively.

Does that mean it is all good and we can beat the market?

No, far from. Also, notice I chose to only use the last 200 stock days in my experiment out of the 2.500+ possible stock days.

Running a few experiments it showed that it the prediction was close to 50% if all days were used. That means, basically it was not possible to predict.

Step 4: A few more tests on stocks

I have run a few experiments on different stocks and also varying the number of days used.

Stock100 days200 Days400 Days
S&P 5000.530.630.52
AAPL0.530.620.54
F0.670.570.54
KO0.470.520.53
IBM0.570.520.57
MSFT0.500.500.48
AMZN0.570.470.58
TSLA0.500.600.53
NVDA0.570.530.54
The accuracy

Looking in the above table I am not convinced about my hypotheses. First, the 200 days to be better, might have be specific on the stock. Also, if you re-run tests you get new numbers, as the training and test dataset are different from time to time.

I did try a few with the full dataset, and I still think it performed worse (all close to 0.50).

The above looks fine, as it mostly can predict better than just guessing. But still there are a few cases where it is not the case.

Next steps

A few things to remember here.

Firstly, the indicators are chose at random from among the common ones. A further investigation on this could be an idea. It can highly bias the results if it is used does not help the prediction.

Secondly, I might have falsely hypothesized that it was more accurate when we limited to data to a smaller set than the original set.

Thirdly, it could be that the stocks are also having a bias in one direction. If we limit to a smaller period, a bull market will primarily have growth days, hence a biased guess on growth will be better than 0.50. This factor should be investigated further, to see if this favors the predictions.

How to Visualize Time Series Financial Data with Python in 3 Easy Steps

What will we cover

  • The easy way visualize financial data with Python
  • How to fetch data from stocks in Python
  • The easy way to visualize it on a plot
  • How to enrich the data with valuable analysis

Step 1: Fetch the data on your favorite stock

As with most things in Python, somebody made an easy to use library to do all the hard work for you. This is also true if you want to visualize financial data with Python.

The Pandas datareader library lets you fetch financial data from various sources and return them in a Pandas Dataframe.

If you do not know what a Dataframe from Pandas is, do not worry. We will cover the necessary here.

The full list data sources available can be found here, but include Tiingo, IEX, Alpha Vantage, Enigma, Quandl, St.Louis FED (FRED), Kenneth French’s data library, World Bank, and many more.

import pandas_datareader as pdr
import datetime 


aapl = pdr.get_data_yahoo('AAPL', 
                          start=datetime.datetime(2010, 1, 1), 
                          end=datetime.datetime(2020, 1, 1))

print(aapl.head())

Which will result in 10 years of data from Apple. See below.

                 High        Low  ...       Volume  Adj Close
Date                              ...                        
2010-01-04  30.642857  30.340000  ...  123432400.0  26.466835
2010-01-05  30.798571  30.464285  ...  150476200.0  26.512596
2010-01-06  30.747143  30.107143  ...  138040000.0  26.090879
2010-01-07  30.285715  29.864286  ...  119282800.0  26.042646
2010-01-08  30.285715  29.865715  ...  111902700.0  26.215786

For each bank day you will be presented with the following data.

High         2.936800e+02
Low          2.895200e+02
Open         2.899300e+02
Close        2.936500e+02
Volume       2.520140e+07
Adj Close    2.921638e+02
Name: 2019-12-31 00:00:00, dtype: float64
  • High: The highest price traded during that day.
  • Low: The lowest price traded during that day.
  • Open: The opening price that day.
  • Close: The closing price that day, that is the price of the last trade that day.
  • Volume: The number of shares that exchange hands for the stock that day.
  • Adj Close: It accurately reflect the stock’s value after accounting for any corporate actions. It is considered to be the true price of that stock and is often used when examining historical returns.

As you can see, to make further infestations on the data, you should use the Adj Close.

Step 2: Visualize the data

This is where Dataframes come in handy. They integrate easy with matplotlib, which is a comprehensive library for creating static, animated, and interactive visualizations in Python.

import pandas_datareader as pdr
import datetime 
import matplotlib.pyplot as plt


aapl = pdr.get_data_yahoo('AAPL', 
                          start=datetime.datetime(2010, 1, 1), 
                          end=datetime.datetime(2020, 1, 1))

aapl['Adj Close'].plot()
plt.show()

Will result in a graph like this.

Apple stock price.

That was easy.

You can see further ways to visualize the data in the documentation here.

Step 3: Enrich the data

That is another great advantage of Dataframes, it is easy to enrich it with valuable data.

A good example is to enrich the data with running mean values of the stock price.

import pandas_datareader as pdr
import datetime 
import matplotlib.pyplot as plt


aapl = pdr.get_data_yahoo('AAPL', 
                          start=datetime.datetime(2015, 1, 1), 
                          end=datetime.datetime(2020, 1, 1))

aapl['Mean Short'] = aapl['Adj Close'].rolling(window=20).mean()
aapl['Mean Long'] = aapl['Adj Close'].rolling(window=100).mean()
aapl[['Adj Close', 'Mean Short', 'Mean Long']].plot()
plt.show()

Which will result in the following graph.

Apple stock with rolling mean values with window of 20 and 100 days.

Now that was simple. The code simple rolls the window of 20 (or 100) days and sets the average (mean) value of that. This kind of analysis is used to see the average trend of the stock. Also, to see when the short term trend (20 days) is crossing the long term trend (100 days).

A simple trading strategy is decides on to buy and sell based on when these averages crosses. This strategy is called dual moving average crossover. You can read more about it here.

A volatility analysis is used to see how “stable” the share is. The higher the volatility value is, the riskier it is.

import pandas_datareader as pdr
import datetime 
import matplotlib.pyplot as plt
import numpy as np


aapl = pdr.get_data_yahoo('AAPL', 
                          start=datetime.datetime(2015, 1, 1), 
                          end=datetime.datetime(2020, 1, 1))


daily_close = aapl['Close']
daily_pc = daily_close.pct_change()

# See the volatibility
vol = daily_pc.rolling(75).std()*np.sqrt(75)
vol.plot()
plt.show()

Which results in the following graph.

Volatility of Apple stock

This is a good indication on how risky the stock is.

Conclusion

This is just a simple introduction to how to retrieve financial stock data in Python and visualize it. Also, how easy it is to enrich it with more valuable analysis.

There are so much more to explore and learn about it.

Build a Financial Trading Algorithm in Python in 5 Easy Steps

What will we cover in this tutorial?

Step 1: Get time series data on your favorite stock

To build a financial trading algorithm in Python, it needs to be fed with data. Hence, the first step you need to master is how to collect time series data on your favorite stock. Sounds like it is difficult, right?

Luckily, someone made an awesome library pandas-datareader, which does all the hard word you for you. Let’s take a look on how easy it is.

import datetime as dt
import pandas_datareader as pdr
import matplotlib.pyplot as plt
from dateutil.relativedelta import relativedelta


# A Danish jewellery manufacturer and retailer
stock = 'PNDORA.CO'
end = dt.datetime.now()
start = end - relativedelta(years=10)
pndora = pdr.get_data_yahoo(stock, start, end)

pndora['Close'].plot()

plt.show()

Which results in a time series of the closing price as shown here.

Pandora A/S time series stock price

The stock is probably quite unknown, considering it is a Danish company. But to prove a point that you can get data, including for Pandora.

In the code you see that you send a start and end date to the call fetching the data. Here we have taken the last 10 years. The object returned integrates well with the matplotlib library to make the plot.

To understand the data better, we need to explore further.

Step 2: Understand the data available

Let us explore the data object returned by the call (pndora).

To get an overview you should run the following code using the iloc-call to a the Dataframe object (pndora returned by pandas_datareader).

pndora.iloc[-1]

This will show what the last item of the object looks like.

High            365.000000
Low             355.600006
Open            360.000000
Close           356.500000
Volume       446004.000000
Adj Close       356.500000
Name: 2020-06-26 00:00:00, dtype: float64

Where you have the following items.

  • High: The highest price traded during that day.
  • Low: The lowest price traded during that day.
  • Open: The opening price that day.
  • Close: The closing price that day, that is the price of the last trade that day.
  • Volume: The number of shares that exchange hands for the stock that day.
  • Adj Close: It accurately reflect the stock’s value after accounting for any corporate actions. It is considered to be the true price of that stock and is often used when examining historical returns.

That means, it would be natural to use Adj Close in our calculations. Hence, for each day we have the above information available.

Step 3: Learning how to enrich the data (Pandas)

Pandas? Yes, you read correct. But not a Panda like this the picture.

Pandas the Python library.
Pandas the Python library.

There is an awesome library Pandas in Python to make data analysis easy.

Let us explore some useful things we can do.

import datetime as dt
import pandas_datareader as pdr
import matplotlib.pyplot as plt
from dateutil.relativedelta import relativedelta


# A Danish jewellery manufacturer and retailer
stock = 'PNDORA.CO'
end = dt.datetime.now()
start = end - relativedelta(years=10)
pndora = pdr.get_data_yahoo(stock, start, end)

pndora['Short'] = pndora['Adj Close'].rolling(window=20).mean()
pndora['Long'] = pndora['Adj Close'].rolling(window=100).mean()

pndora[['Adj Close', 'Short', 'Long']].plot()
plt.show()

Which will result in the following graph.

Pandora A/S stock prices with Short mean average and Long mean average.
Pandora A/S stock prices with Short mean average and Long mean average.

If you inspect the code above, you see, that you easily added to two new columns (Short and Long) and computed them to be the mean value of the previous 20 and 100 days, respectively.

Further, you can add the daily percentage change for the various entries.

import datetime as dt
import pandas_datareader as pdr
import matplotlib.pyplot as plt
from dateutil.relativedelta import relativedelta


# A Danish jewellery manufacturer and retailer
stock = 'PNDORA.CO'
end = dt.datetime.now()
start = end - relativedelta(years=10)
pndora = pdr.get_data_yahoo(stock, start, end)

pndora['Short'] = pndora['Adj Close'].rolling(window=20).mean()
pndora['Long'] = pndora['Adj Close'].rolling(window=100).mean()

pndora['Pct Change'] = pndora['Adj Close'].pct_change()
pndora['Pct Short'] = pndora['Short'].pct_change()
pndora['Pct Long'] = pndora['Long'].pct_change()

pndora[['Pct Change', 'Pct Short', 'Pct Long']].loc['2020'].plot()
plt.show()

Which will result in this graph.

Pandora A/S stock's daily percentage change for Adj Close, short and long mean.
Pandora A/S stock’s daily percentage change for Adj Close, short and long mean.

Again you can see how easy it is to add new columns in the Dataframe object provided by Pandas library.

Step 4: Building your strategy to buy and sell stocks

For the example we will keep it simple and only focus on one stock. The strategy we will use is called the dual moving average crossover.

Simply explained, you want to buy stocks when the short mean average is higher than the long mean average value.

In the figure above, it is translates to.

  • Buy when the yellow crosses above the green line.
  • Sell when the yellow crosses below the green line.

To implement the simplest version of that it would be as follows.

import datetime as dt
import pandas_datareader as pdr
import matplotlib.pyplot as plt
import numpy as np
from dateutil.relativedelta import relativedelta


# A Danish jewellery manufacturer and retailer
stock = 'PNDORA.CO'
end = dt.datetime.now()
start = end - relativedelta(years=10)
pndora = pdr.get_data_yahoo(stock, start, end)

short_window = 20
long_window = 100

pndora['Short'] = pndora['Adj Close'].rolling(window=short_window).mean()
pndora['Long'] = pndora['Adj Close'].rolling(window=long_window).mean()

# Let us create some signals
pndora['signal'] = 0.0
pndora['signal'][short_window:] = np.where(pndora['Short'][short_window:] > pndora['Long'][short_window:], 1.0, 0.0)

pndora['positions'] = pndora['signal'].diff()

To visually see where to buy and sell you can use the following code afterwards on pndora.

fig = plt.figure()

ax1 = fig.add_subplot(111, ylabel='Price')

pndora[['Adj Close', 'Short', 'Long']].plot(ax=ax1)

ax1.plot(pndora.loc[pndora.positions == 1.0].index,
         pndora.Short[pndora.positions == 1.0],
         '^', markersize=10, color='m')

ax1.plot(pndora.loc[pndora.positions == -1.0].index,
         pndora.Short[pndora.positions == -1.0],
         'v', markersize=10, color='k')

plt.show()

Which would result in the following graph.

Finally, you need to see how your algorithm performs.

Step 5: Testing you trading algorithm

There are many ways to test an algorithm. Here we go all in each cycle. We buy as much as we can and sell them all when we sell.

import datetime as dt
import pandas_datareader as pdr
import matplotlib.pyplot as plt
import numpy as np
from dateutil.relativedelta import relativedelta


# A Danish jewellery manufacturer and retailer
stock = 'PNDORA.CO'
end = dt.datetime.now()
start = end - relativedelta(years=10)
pndora = pdr.get_data_yahoo(stock, start, end)

short_window = 20
long_window = 100

pndora['Short'] = pndora['Adj Close'].rolling(window=short_window).mean()
pndora['Long'] = pndora['Adj Close'].rolling(window=long_window).mean()

# Let us create some signals
pndora['signal'] = 0.0
pndora['signal'][short_window:] = np.where(pndora['Short'][short_window:] > pndora['Long'][short_window:], 1.0, 0.0)

pndora['positions'] = pndora['signal'].diff()

cash = 1000000
stocks = 0
for index, row in pndora.iterrows():
    if row['positions'] == 1.0:
        stocks = int(cash//row['Adj Close'])
        cash -= stocks*row['Adj Close']
    elif row['positions'] == -1.0:
        cash += stocks*row['Adj Close']
        stocks = 0

print("Total:", cash + stocks*pndora.iloc[-1]['Adj Close'])

Which results in.

Total: 2034433.8826065063

That is a double in 10 years, which is less than 8% per year. Not so good.

As this is one specific stock, it is not fair to judge the algorithm being poor, it can be the stock which was not performing good, or the variables can be further adjusted.

If compared with the scenario where you bought the stocks at day one and sold them on the last day, your earnings would be 1,876,232. Hence, the algorithm beats that.

Conclusion and further work

This is a simple financial trading algorithm in Python and there are variables that can be adjusted. The algorithm was performing better than the naive strategy to buy on day one and sell on the last day.

It could be interesting to add more data into the decision in the algorithm, which might be some future work to do. Also, can it be combined with some Machine Learning?