## What will we cover in this tutorial?

A Forest Classifier is an approach to reduce the overfitting a single Decision Tree is prone to. A forest classifier simply contains a set of decision trees and uses majority voting to make the prediction.

In this tutorial we will try that on the stock market, using a few indicators. The tutorial gives a framework to explore whether a Forest Classifier can predict the direction of a stock: given a set of indicators, will the stock go up or down the next trading day?

This is a simplified version of the problem of predicting the actual stock value the next day.

## Step 1: Get data and calculate some indicators

If you are new to stock indicators, we highly recommend reading about the MACD, RSI, and Stochastic Oscillator, where the MACD material also covers how to calculate the EMA. Here we assume familiarity with those indicators, as well as with Pandas DataFrames and pandas-datareader.

```python
import pandas_datareader as pdr
import datetime as dt
import numpy as np

ticker = "^GSPC"  # The S&P 500 index
data = pdr.get_data_yahoo(ticker, dt.datetime(2010,1,1), dt.datetime.now(), interval='d')

# Calculate the EMA10 > EMA30 signal
ema10 = data['Close'].ewm(span=10).mean()
ema30 = data['Close'].ewm(span=30).mean()
data['EMA10gtEMA30'] = np.where(ema10 > ema30, 1, -1)

# Calculate where Close is > EMA10
data['ClGtEMA10'] = np.where(data['Close'] > ema10, 1, -1)

# Calculate the MACD signal
exp1 = data['Close'].ewm(span=12).mean()
exp2 = data['Close'].ewm(span=26).mean()
macd = exp1 - exp2
macd_signal = macd.ewm(span=9).mean()
data['MACD'] = macd_signal - macd

# Calculate RSI
delta = data['Close'].diff()
up = delta.clip(lower=0)
down = -1*delta.clip(upper=0)
ema_up = up.ewm(com=13, adjust=False).mean()
ema_down = down.ewm(com=13, adjust=False).mean()
rs = ema_up/ema_down
data['RSI'] = 100 - (100/(1 + rs))

# Stochastic Oscillator
high14 = data['High'].rolling(14).max()
low14 = data['Low'].rolling(14).min()
data['%K'] = (data['Close'] - low14)*100/(high14 - low14)

# Williams Percentage Range
data['%R'] = -100*(high14 - data['Close'])/(high14 - low14)

days = 6
# Price Rate of Change
ct_n = data['Close'].shift(days)
data['PROC'] = (data['Close'] - ct_n)/ct_n

print(data)
```

The choice of indicators is arbitrary, though they are among the popular ones. Feel free to swap in other indicators and experiment with them.

```
Date               High          Low         Open        Close      Volume    Adj Close  EMA10gtEMA30  ClGtEMA10       MACD        RSI         %K          %R      PROC
2020-08-17  3387.590088  3379.219971  3380.860107  3381.989990  3671290000  3381.989990             1          1  -2.498718  68.294286  96.789344   -3.210656  0.009164
2020-08-18  3395.060059  3370.149902  3387.040039  3389.780029  3881310000  3389.780029             1          1  -1.925573  69.176468  97.234576   -2.765424  0.008722
2020-08-19  3399.540039  3369.659912  3392.510010  3374.850098  3884480000  3374.850098             1          1  -0.034842  65.419555  86.228281  -13.771719  0.012347
2020-08-20  3390.800049  3354.689941  3360.479980  3385.510010  3642850000  3385.510010             1          1   0.949607  66.805725  87.801036  -12.198964  0.001526
2020-08-21  3399.959961  3379.310059  3386.010010  3397.159912  3705420000  3397.159912             1          1   1.249066  68.301209  97.534948   -2.465052  0.007034
```

## Step 2: Understand how the Decision Tree works

Decision Trees are the foundation of a Forest Classifier. Hence, understanding how a Decision Tree works is a good starting point. Luckily, they are quite easy to understand.

Let’s try to investigate a Decision Tree that is based on two of the indicators above. We take the **RSI** (Relative Strength Index) and **%K** (Stochastic Oscillator). A Decision Tree could look like this (depending on the training data).
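To inspect such a tree yourself, one option is to fit a small `DecisionTreeClassifier` and print its rules as text. The sketch below uses randomly generated stand-in values for RSI and %K and random labels (an assumption, so the example runs without downloading data), which means the thresholds it prints are meaningless. With the real `data` from Step 1 you would pass `data[['RSI', '%K']]` and the class labels defined in Step 3 instead.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Stand-in data (assumption): 200 random rows instead of real indicators
rng = np.random.default_rng(0)
X = rng.uniform(0, 100, size=(200, 2))   # columns play the role of RSI and %K
y = rng.integers(0, 2, size=200)         # 1 = up, 0 = down

# A shallow tree keeps the printed structure readable
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(X, y)

# Print the split rules as indented text
tree_text = export_text(tree, feature_names=['RSI', '%K'])
print(tree_text)
```

Each indented line is one node test of the form `feature <= threshold`, and the leaves show the predicted class.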

When we get a new data row with **%K** and **RSI** indicators, it will start at the top of the Decision Tree.

- At the first node it will check if **%K <= 4.615**; if so, take the left child, otherwise the right child.
- The **gini** tells us how often a randomly chosen element would be incorrectly labeled if it were labeled at random according to the class distribution in the node. Hence, a low value close to 0 is good.
- **Samples** tells us how many of the samples of the training set reached this node.
- Finally, the **value** tells us how the samples are distributed across the classes. In the final decision nodes, the category with the most elements is the prediction.
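The gini of a node can be recomputed from its **value** counts as one minus the sum of the squared class fractions. A minimal sketch, where the counts are made-up examples and not taken from the tree discussed here:

```python
def gini(value):
    """Gini impurity of a node, given the per-class sample counts."""
    total = sum(value)
    return 1.0 - sum((count / total) ** 2 for count in value)

print(gini([50, 50]))   # evenly mixed node -> 0.5
print(gini([100, 0]))   # pure node -> 0.0
print(gini([47, 53]))   # nearly even split, roughly 0.498
```

For two classes the maximum is 0.5, so a gini just below 0.5 means the node is close to an even mix of up and down days.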

Looking at the above Decision Tree, it does not seem to be very good. The majority of samples end up in the fifth node with a **gini** of **0.498**, close to random, right? And it will label them **1**, growth.

But this is the idea behind Forest Classifiers: they take a bunch of Decision Trees, which individually might not be good, and use a majority vote among them to classify.
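Why can a majority of weak trees beat a single weak tree? A rough way to see it is to compute the probability that more than half of the voters are right. The sketch below assumes every tree is correct with the same probability and that the trees err independently. That is an idealization: trees in a real forest are trained on overlapping data and are correlated, so the actual gain is smaller.

```python
from math import comb

def majority_accuracy(n_trees, p):
    """Probability that more than half of n_trees independent voters are right."""
    needed = n_trees // 2 + 1
    return sum(comb(n_trees, k) * p**k * (1 - p)**(n_trees - k)
               for k in range(needed, n_trees + 1))

# Each hypothetical tree is right 55% of the time
for n in (1, 11, 101):
    print(n, round(majority_accuracy(n, 0.55), 3))
```

With an odd number of voters and an individual accuracy above 0.5, the majority accuracy grows as more voters are added.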

## Step 3: Create the Forest Classifier

Now that we understand how the Decision Tree and the Forest Classifier work, we just need to run the magic, which is done by calling a library function.

```python
import pandas_datareader as pdr
import datetime as dt
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score
from sklearn.ensemble import RandomForestClassifier

ticker = "^GSPC"
data = pdr.get_data_yahoo(ticker, dt.datetime(2010,1,1), dt.datetime.now(), interval='d')

# Calculate the EMA10 > EMA30 signal
ema10 = data['Close'].ewm(span=10).mean()
ema30 = data['Close'].ewm(span=30).mean()
data['EMA10gtEMA30'] = np.where(ema10 > ema30, 1, -1)

# Calculate where Close is > EMA10
data['ClGtEMA10'] = np.where(data['Close'] > ema10, 1, -1)

# Calculate the MACD signal
exp1 = data['Close'].ewm(span=12).mean()
exp2 = data['Close'].ewm(span=26).mean()
macd = exp1 - exp2
macd_signal = macd.ewm(span=9).mean()
data['MACD'] = macd_signal - macd

# Calculate RSI
delta = data['Close'].diff()
up = delta.clip(lower=0)
down = -1*delta.clip(upper=0)
ema_up = up.ewm(com=13, adjust=False).mean()
ema_down = down.ewm(com=13, adjust=False).mean()
rs = ema_up/ema_down
data['RSI'] = 100 - (100/(1 + rs))

# Stochastic Oscillator
high14 = data['High'].rolling(14).max()
low14 = data['Low'].rolling(14).min()
data['%K'] = (data['Close'] - low14)*100/(high14 - low14)

# Williams Percentage Range
data['%R'] = -100*(high14 - data['Close'])/(high14 - low14)

days = 6
# Price Rate of Change
ct_n = data['Close'].shift(days)
data['PROC'] = (data['Close'] - ct_n)/ct_n

# Set class labels to classify
data['Return'] = data['Close'].pct_change(1).shift(-1)
data['class'] = np.where(data['Return'] > 0, 1, 0)

# Clean for NAN rows
data = data.dropna()

# Minimize dataset
data = data.iloc[-200:]

# Data to predict
predictors = ['EMA10gtEMA30', 'ClGtEMA10', 'MACD', 'RSI', '%K', '%R', 'PROC']
X = data[predictors]
y = data['class']

# Split data into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30)

# Train the model
rfc = RandomForestClassifier(random_state=0)
rfc = rfc.fit(X_train, y_train)

# Test the model by doing some predictions
y_pred = rfc.predict(X_test)

# See how accurate the predictions are
report = classification_report(y_test, y_pred)
print('Model accuracy', accuracy_score(y_test, y_pred, normalize=True))
print(report)
```

First, some notes on a few lines. **train_test_split** divides the data into a training set and a test set. The test set is set to 30% of the data, and the split is randomized.
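To see what the split does, the sketch below runs `train_test_split` on 200 stand-in rows (an assumption, mirroring the 200 days kept by `data.iloc[-200:]`). It also shows two optional parameters that the code above does not use: `random_state`, which makes the random split repeatable, and `shuffle=False`, which keeps the rows in chronological order so that the test set is the most recent 30% of days. Whether a chronological split suits this experiment better is a judgment call and is not tested here.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data (assumption): 200 rows numbered 0..199 in time order
X = np.arange(200).reshape(-1, 1)
y = np.arange(200) % 2

# Random split; fixing random_state makes the split repeatable
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=0)
print(len(X_train), len(X_test))  # 140 60

# Chronological alternative: the last 30% of rows form the test set
X_train_c, X_test_c, y_train_c, y_test_c = train_test_split(
    X, y, test_size=0.30, shuffle=False)
print(X_test_c[0, 0], X_test_c[-1, 0])  # 140 199
```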

Next we create a RandomForestClassifier and fit it.

Then we use our newly created classifier (**rfc**) to predict on the test set (**X_test**).

Finally, we calculate the accuracy and the report.

```
Model accuracy 0.6333333333333333
              precision    recall  f1-score   support

           0       0.56      0.38      0.45        24
           1       0.66      0.81      0.73        36

    accuracy                           0.63        60
   macro avg       0.61      0.59      0.59        60
weighted avg       0.62      0.63      0.62        60
```

The model accuracy is **0.63**, which seems quite good. It is better than random, at least. You can also see that the precision of **1** (growth) is higher than **0** (loss, or negative growth), with **0.66** and **0.56**, respectively.

Does that mean it is all good and we can beat the market?

No, far from it. Also, notice that I chose to use only the last 200 trading days in my experiment, out of the 2,500+ available.

Running a few experiments showed that the accuracy was close to **50%** if all days were used. That basically means it was not possible to predict.

## Step 4: A few more tests on stocks

I have run a few experiments on different stocks and also varying the number of days used.

| Stock | 100 days | 200 days | 400 days |
| --- | --- | --- | --- |
| S&P 500 | 0.53 | 0.63 | 0.52 |
| AAPL | 0.53 | 0.62 | 0.54 |
| F | 0.67 | 0.57 | 0.54 |
| KO | 0.47 | 0.52 | 0.53 |
| IBM | 0.57 | 0.52 | 0.57 |
| MSFT | 0.50 | 0.50 | 0.48 |
| AMZN | 0.57 | 0.47 | 0.58 |
| TSLA | 0.50 | 0.60 | 0.53 |
| NVDA | 0.57 | 0.53 | 0.54 |

Looking at the above table, I am not convinced about my hypothesis. First, 200 days being better might be specific to the stock. Also, if you re-run the tests you get new numbers, as the training and test datasets differ from run to run.
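One way to gauge how much of the variation comes from the random split alone is to repeat the split with different seeds and look at the spread of the accuracies. The sketch below does this on random stand-in data with no real signal (an assumption, so it runs without downloading prices). The `n_estimators=50` setting and the 10 repetitions are arbitrary choices to keep it quick.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Stand-in data (assumption): random features and labels with no real signal
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 7))
y = rng.integers(0, 2, size=200)

scores = []
for seed in range(10):
    # A different seed gives a different train/test split each round
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.30, random_state=seed)
    rfc = RandomForestClassifier(n_estimators=50, random_state=0)
    rfc.fit(X_train, y_train)
    scores.append(accuracy_score(y_test, rfc.predict(X_test)))

print('mean accuracy', np.mean(scores))
print('std deviation', np.std(scores))
```

For reference, with 60 test rows a pure coin flip already has a standard deviation of about 0.065 in accuracy, so differences of a few points between table cells are within what chance alone can produce.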

I did try a few with the full dataset, and I still think it performed worse (all close to 0.50).

The above looks fine, as it mostly predicts better than just guessing. But there are still a few cases where it does not.

## Next steps

A few things to remember here.

Firstly, the indicators are chosen at random from among the common ones. A further investigation of this could be an idea. Including an indicator that does not help the prediction can heavily bias the results.

Secondly, I might have falsely hypothesized that it was more accurate when we limited the data to a smaller set than the original set.
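A possible starting point for that investigation is the `feature_importances_` attribute of a fitted `RandomForestClassifier`, which reports how much each predictor contributed to the splits. The sketch below fits on random stand-in numbers (an assumption), so the ranking it prints says nothing about the real indicators. With the real data you would fit on `X_train` and `y_train` from Step 3.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

predictors = ['EMA10gtEMA30', 'ClGtEMA10', 'MACD', 'RSI', '%K', '%R', 'PROC']

# Stand-in data (assumption): random numbers in place of the real indicators
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, len(predictors))), columns=predictors)
y = rng.integers(0, 2, size=200)

rfc = RandomForestClassifier(random_state=0).fit(X, y)

# Impurity-based importances; they sum to 1 across all predictors
importances = pd.Series(rfc.feature_importances_, index=predictors)
print(importances.sort_values(ascending=False))
```

One thing worth keeping in mind when reading the real ranking: as computed in Step 1, **%R** always equals **%K** minus 100, so those two columns carry the same information and will share their importance.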

Thirdly, it could be that the stocks also have a bias in one direction. If we limit to a smaller period, a bull market will primarily have growth days, hence a biased guess on growth will be better than **0.50**. This factor should be investigated further, to see if it favors the predictions.
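A quick check of this effect is to compare the model with a baseline that always predicts the most common class. The sketch below uses the support counts from the classification report in Step 3 (24 down days and 36 up days in the test set):

```python
# Support counts from the classification report in Step 3
support = {0: 24, 1: 36}

total = sum(support.values())
# Accuracy of always predicting the most common class
baseline = max(support.values()) / total

print('baseline accuracy', baseline)  # 0.6
print('model accuracy   ', 0.63)
```

On that particular test set, always guessing growth would already score 0.60, so the model's 0.63 is only a small improvement over the biased guess.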