    RandomForestClassifier: Predict Stock Market Direction

    What will we cover in this tutorial?

    A Forest Classifier is an approach to reduce the overfitting (high variance) that a single Decision Tree is prone to. A forest classifier simply contains a set of decision trees and uses majority voting to make the prediction.

    In this tutorial we will try it on the stock market by creating a few indicators. The tutorial gives a framework for exploring whether a Forest Classifier can predict the direction of a stock: given a set of indicators, will the stock go up or down the next trading day?

    This is a simplified version of the problem of predicting the actual stock value the next day.
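
    To make the target concrete, here is a minimal sketch in plain Python of how an up/down label for the next trading day can be derived from closing prices. The prices are the five Close values from the S&P 500 output in Step 1, rounded to two decimals; the helper name next_day_direction is hypothetical and not part of any library.

```python
def next_day_direction(closes):
    """Return 1 if the next close is higher than today's close, else 0.

    The last day has no "next day", so it gets no label.
    """
    return [1 if nxt > cur else 0 for cur, nxt in zip(closes, closes[1:])]


# Rounded Close values for 2020-08-17 to 2020-08-21
closes = [3381.99, 3389.78, 3374.85, 3385.51, 3397.16]
labels = next_day_direction(closes)
print(labels)  # [1, 0, 1, 1]
```

    Each label belongs to the day before the move, so the indicators known at the close of a day are paired with what happens on the following day.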

    Step 1: Get data and calculate some indicators

    If you are new to stock indicators, we highly recommend reading about the MACD, RSI, and Stochastic Oscillator; the MACD tutorial also covers how to calculate the EMA. Here we assume familiarity with those indicators, and also with Pandas DataFrames and Pandas-datareader.
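
    As a quick refresher on the EMA, the following is a minimal sketch in plain Python. It assumes the weighting that pandas documents for ewm(span=...).mean() with its default adjust=True, where alpha = 2 / (span + 1) and the weights are normalized by their sum. The function name ema is hypothetical.

```python
def ema(values, span):
    """Exponential moving average with normalized weights.

    Assumed to mirror pandas' ewm(span=span).mean() with adjust=True:
    y_t = (x_t + (1-a)*x_{t-1} + (1-a)^2*x_{t-2} + ...) / (1 + (1-a) + (1-a)^2 + ...)
    """
    alpha = 2 / (span + 1)
    result = []
    numerator = 0.0    # running weighted sum of the observations
    denominator = 0.0  # running sum of the weights
    for x in values:
        numerator = x + (1 - alpha) * numerator
        denominator = 1 + (1 - alpha) * denominator
        result.append(numerator / denominator)
    return result


prices = [1.0, 2.0, 3.0]
smoothed = ema(prices, span=3)
print([round(v, 4) for v in smoothed])  # [1.0, 1.6667, 2.4286]
```

    Recent prices get the largest weight, so a short span (such as 10) reacts faster to price changes than a long span (such as 30).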

    import pandas_datareader as pdr
    import datetime as dt
    import numpy as np
    ticker = "^GSPC" # The S&P 500 index
    data = pdr.get_data_yahoo(ticker, dt.datetime(2010,1,1), dt.datetime.now(), interval='d')
    # Calculate the EMA10 > EMA30 signal
    ema10 = data['Close'].ewm(span=10).mean()
    ema30 = data['Close'].ewm(span=30).mean()
    data['EMA10gtEMA30'] = np.where(ema10 > ema30, 1, -1)
    # Calculate where Close is > EMA10
    data['ClGtEMA10'] = np.where(data['Close'] > ema10, 1, -1)
    # Calculate the MACD signal line and store signal minus MACD (the negated MACD histogram)
    exp1 = data['Close'].ewm(span=12).mean()
    exp2 = data['Close'].ewm(span=26).mean()
    macd = exp1 - exp2
    macd_signal = macd.ewm(span=9).mean()
    data['MACD'] = macd_signal - macd
    # Calculate RSI
    delta = data['Close'].diff()
    up = delta.clip(lower=0)
    down = -1*delta.clip(upper=0)
    ema_up = up.ewm(com=13, adjust=False).mean()
    ema_down = down.ewm(com=13, adjust=False).mean()
    rs = ema_up/ema_down
    data['RSI'] = 100 - (100/(1 + rs))
    # Stochastic Oscillator
    high14 = data['High'].rolling(14).max()
    low14 = data['Low'].rolling(14).min()
    data['%K'] = (data['Close'] - low14)*100/(high14 - low14)
    # Williams Percentage Range
    data['%R'] = -100*(high14 - data['Close'])/(high14 - low14)
    # Price Rate of Change over the last `days` trading days
    days = 6
    ct_n = data['Close'].shift(days)
    data['PROC'] = (data['Close'] - ct_n)/ct_n
    print(data)
    

    The choice of indicators is arbitrary, though they are among the popular ones. Feel free to swap in other indicators and experiment with them.

                      High         Low        Open       Close       Volume   Adj Close  EMA10gtEMA30  ClGtEMA10      MACD         RSI         %K        %R      PROC
    Date                                                                                                                                                             
    2020-08-17  3387.590088  3379.219971  3380.860107  3381.989990  3671290000  3381.989990             1          1 -2.498718   68.294286  96.789344  -3.210656  0.009164
    2020-08-18  3395.060059  3370.149902  3387.040039  3389.780029  3881310000  3389.780029             1          1 -1.925573   69.176468  97.234576  -2.765424  0.008722
    2020-08-19  3399.540039  3369.659912  3392.510010  3374.850098  3884480000  3374.850098             1          1 -0.034842   65.419555  86.228281 -13.771719  0.012347
    2020-08-20  3390.800049  3354.689941  3360.479980  3385.510010  3642850000  3385.510010             1          1  0.949607   66.805725  87.801036 -12.198964  0.001526
    2020-08-21  3399.959961  3379.310059  3386.010010  3397.159912  3705420000  3397.159912             1          1  1.249066   68.301209  97.534948  -2.465052  0.007034
    

    Step 2: Understand how the Decision Tree works

    Decision Trees are the foundation of a Forest Classifier. Hence, a good starting point is to understand how a Decision Tree works. Luckily, they are quite easy to understand.

    Let’s try to investigate a Decision Tree that is based on two of the indicators above. We take the RSI (Relative Strength Index) and %K (Stochastic Oscillator). A Decision Tree could look like this (depending on the training data).

    [Figure: Decision Tree for %K and RSI]

    When we get a new data row with %K and RSI indicators, it will start at the top of the Decision Tree.

    • At the first node it checks whether %K <= 4.615. If so, it takes the left child; otherwise, the right child.
    • The gini value tells us how often a randomly chosen element of the node would be incorrectly labeled if it were labeled at random according to the class distribution in the node. Hence, a low value close to 0 is good.
    • Samples tells us how many samples of the training set reached this node.
    • Finally, the value tells us how the samples are distributed among the classes. In the final decision nodes, the class with the most elements is the prediction.
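
    The gini number can be computed directly from the value list of a node. The sketch below uses the standard Gini impurity formula, 1 minus the sum of the squared class shares. The sample counts are hypothetical and chosen only for illustration; they are not read from the tree above.

```python
def gini(value):
    """Gini impurity of a node, given the number of samples per class."""
    total = sum(value)
    return 1.0 - sum((count / total) ** 2 for count in value)


print(gini([50, 50]))            # 0.5, the worst case for two classes
print(gini([100, 0]))            # 0.0, a pure node
print(round(gini([47, 53]), 3))  # 0.498, an almost even split
```

    For two classes the impurity can never exceed 0.5, so a node with a gini near 0.5 separates the classes hardly at all.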

    Looking at the above Decision Tree, it does not seem to be very good. The majority of samples end up in the fifth node with a gini of 0.498, close to random (0.5 is the maximum for two classes), right? And it will label them 1, growth.

    But this is the idea behind Forest Classifiers: they take a bunch of Decision Trees, which individually might not be good, and use the majority of them to classify the data.
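
    The voting idea can be sketched in a few lines of plain Python. The tree predictions are hypothetical, and the second function rests on a strong assumption that real trees do not satisfy: that every tree is right independently with the same probability p. Also note that scikit-learn's RandomForestClassifier, as far as its documentation describes, averages the trees' predicted class probabilities instead of counting hard votes, so this illustrates the idea and not the library's internals.

```python
from collections import Counter
from math import comb


def majority_vote(predictions):
    """Return the most common label among the individual tree predictions."""
    return Counter(predictions).most_common(1)[0][0]


def majority_accuracy(p, n_trees):
    """Probability that more than half of n_trees are right.

    Assumes each tree is right independently with probability p.
    """
    return sum(
        comb(n_trees, k) * p**k * (1 - p) ** (n_trees - k)
        for k in range(n_trees // 2 + 1, n_trees + 1)
    )


# Hypothetical predictions from five trees for one data row (1 = up, 0 = down)
tree_predictions = [1, 0, 1, 1, 0]
print(majority_vote(tree_predictions))  # 1

# A single weak tree versus a vote among 101 equally weak, independent trees
print(majority_accuracy(0.55, 1))
print(majority_accuracy(0.55, 101))
```

    Under the independence assumption the vote of many weak trees is right more often than any single one of them. Trees trained on the same data are correlated, so the real gain is smaller.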

    Step 3: Create the Forest Classifier

    Now that we understand how the Decision Tree and the Forest Classifier work, we just need to run the magic, which is done by calling a library function.

    import pandas_datareader as pdr
    import datetime as dt
    import numpy as np
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import classification_report, accuracy_score
    from sklearn.ensemble import RandomForestClassifier
    
    ticker = "^GSPC"
    data = pdr.get_data_yahoo(ticker, dt.datetime(2010,1,1), dt.datetime.now(), interval='d')
    # Calculate the EMA10 > EMA30 signal
    ema10 = data['Close'].ewm(span=10).mean()
    ema30 = data['Close'].ewm(span=30).mean()
    data['EMA10gtEMA30'] = np.where(ema10 > ema30, 1, -1)
    # Calculate where Close is > EMA10
    data['ClGtEMA10'] = np.where(data['Close'] > ema10, 1, -1)
    # Calculate the MACD signal line and store signal minus MACD (the negated MACD histogram)
    exp1 = data['Close'].ewm(span=12).mean()
    exp2 = data['Close'].ewm(span=26).mean()
    macd = exp1 - exp2
    macd_signal = macd.ewm(span=9).mean()
    data['MACD'] = macd_signal - macd
    # Calculate RSI
    delta = data['Close'].diff()
    up = delta.clip(lower=0)
    down = -1*delta.clip(upper=0)
    ema_up = up.ewm(com=13, adjust=False).mean()
    ema_down = down.ewm(com=13, adjust=False).mean()
    rs = ema_up/ema_down
    data['RSI'] = 100 - (100/(1 + rs))
    # Stochastic Oscillator
    high14 = data['High'].rolling(14).max()
    low14 = data['Low'].rolling(14).min()
    data['%K'] = (data['Close'] - low14)*100/(high14 - low14)
    # Williams Percentage Range
    data['%R'] = -100*(high14 - data['Close'])/(high14 - low14)
    # Price Rate of Change over the last `days` trading days
    days = 6
    ct_n = data['Close'].shift(days)
    data['PROC'] = (data['Close'] - ct_n)/ct_n
    # Set class labels to classify
    data['Return'] = data['Close'].pct_change(1).shift(-1)
    data['class'] = np.where(data['Return'] > 0, 1, 0)
    # Clean for NAN rows
    data = data.dropna()
    # Minimize dataset
    data = data.iloc[-200:]
    
    # Data to predict
    predictors = ['EMA10gtEMA30', 'ClGtEMA10', 'MACD', 'RSI', '%K', '%R', 'PROC']
    X = data[predictors]
    y = data['class']
    # Split data into train and test
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30)
    # Train the model
    rfc = RandomForestClassifier(random_state=0)
    rfc = rfc.fit(X_train, y_train)
    # Test the model by doing some predictions
    y_pred = rfc.predict(X_test)
    # See how accurate the predictions are
    report = classification_report(y_test, y_pred)
    print('Model accuracy', accuracy_score(y_test, y_pred, normalize=True))
    print(report)
    

    First, some notes on a few lines. The train_test_split function divides the data into a training set and a test set. The test set is set to be 30% of the data, and the rows are assigned in a randomized way.
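
    What a randomized 70/30 split does can be sketched in plain Python. This is an illustration under simple assumptions, not scikit-learn's implementation; the function name random_split and its seed parameter are hypothetical.

```python
import random


def random_split(n_rows, test_size, seed=None):
    """Shuffle the row indices and split them into train and test index lists."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = round(n_rows * test_size)
    return indices[n_test:], indices[:n_test]


# 200 rows, as kept by data.iloc[-200:] in the code above
train_idx, test_idx = random_split(200, 0.30, seed=0)
print(len(train_idx), len(test_idx))  # 140 60
```

    Because the rows are shuffled, the test days are interleaved in time with the training days. Without a fixed seed, every run produces a different split and therefore slightly different scores.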

    Next we create a RandomForestClassifier and fit it.

    Then we use our newly created classifier (rfc) to predict on the test set (X_test).

    Finally, we calculate the accuracy and the report.

    Model accuracy 0.6333333333333333
                  precision    recall  f1-score   support
               0       0.56      0.38      0.45        24
               1       0.66      0.81      0.73        36
        accuracy                           0.63        60
       macro avg       0.61      0.59      0.59        60
    weighted avg       0.62      0.63      0.62        60
    

    The model accuracy is 0.63, which seems quite good. It is better than random, at least. You can also see that the precision for class 1 (growth) is higher than for class 0 (loss, or negative growth), at 0.66 and 0.56, respectively.
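
    To see where these numbers come from, the sketch below recomputes them from confusion-matrix counts. The counts were not printed by the run above; they are inferred here from the rounded report (support 24 and 36, recall 0.38 and 0.81) and should be read as an assumption that happens to reproduce every value in the report.

```python
# Counts inferred from the rounded report above, not printed by the original run
tp, fn = 29, 7   # actual 1 (growth): predicted 1 / predicted 0
tn, fp = 9, 15   # actual 0 (loss):   predicted 0 / predicted 1

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision_1 = tp / (tp + fp)  # of all "growth" predictions, how many were right
recall_1 = tp / (tp + fn)     # of all actual growth days, how many were found
precision_0 = tn / (tn + fn)
recall_0 = tn / (tn + fp)

print(f"accuracy    {accuracy:.4f}")     # 0.6333
print(f"precision 1 {precision_1:.4f}")  # 0.6591
print(f"recall 1    {recall_1:.4f}")     # 0.8056
print(f"precision 0 {precision_0:.4f}")  # 0.5625
print(f"recall 0    {recall_0:.4f}")     # 0.3750
```

    With these counts the model predicts growth for 44 of the 60 test days, which is why the recall for class 1 is high and the recall for class 0 is low.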

    Does that mean it is all good and we can beat the market?

    No, far from it. Also, notice that I chose to use only the last 200 trading days in my experiment, out of the 2,500+ available.

    Running a few experiments showed that the prediction accuracy was close to 50% when all days were used. That basically means it was not possible to predict the direction.

    Step 4: A few more tests on stocks

    I have run a few experiments on different stocks and also varying the number of days used.

    Stock     100 days  200 days  400 days
    S&P 500   0.53      0.63      0.52
    AAPL      0.53      0.62      0.54
    F         0.67      0.57      0.54
    KO        0.47      0.52      0.53
    IBM       0.57      0.52      0.57
    MSFT      0.50      0.50      0.48
    AMZN      0.57      0.47      0.58
    TSLA      0.50      0.60      0.53
    NVDA      0.57      0.53      0.54
    The accuracy

    Looking at the above table, I am not convinced about my hypothesis. First, 200 days being better might be specific to the stock. Also, if you re-run the tests you get new numbers, as the training and test datasets differ from run to run.
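
    How much the accuracy can move by chance alone can be estimated with the binomial standard error. The sketch below assumes independent test samples and a classifier that is in truth no better than a coin flip (p = 0.5); the function name accuracy_standard_error is hypothetical.

```python
from math import sqrt


def accuracy_standard_error(n_test, p=0.5):
    """Standard error of the measured accuracy on n_test independent samples,
    for a classifier whose true accuracy is p."""
    return sqrt(p * (1 - p) / n_test)


for days in (100, 200, 400):
    n_test = round(days * 0.30)  # 30% test split, as in the code above
    se = accuracy_standard_error(n_test)
    print(days, n_test, round(se, 3))
# 100 30 0.091
# 200 60 0.065
# 400 120 0.046
```

    Under these assumptions, a coin flip scored on 60 test days lands roughly between 0.37 and 0.63 (0.50 plus or minus two standard errors) most of the time, and on 30 test days roughly between 0.32 and 0.68. Every value in the table falls inside the range for its column, so the differences between stocks and periods may well be noise.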

    I did try a few with the full dataset, and I still think it performed worse (all close to 0.50).

    The above looks fine, as it mostly predicts better than just guessing. But there are still a few cases where it does not.

    Next steps

    A few things to remember here.

    Firstly, the indicators are chosen at random from among the common ones. Further investigation of this could be a good idea: an indicator that does not help the prediction can heavily bias the results.

    Secondly, I might have falsely hypothesized that it was more accurate when we limited the data to a smaller set than the original set.

    Thirdly, it could be that the stocks also have a bias in one direction. If we limit the data to a smaller period, a bull market will primarily have growth days, hence a biased guess of growth will be better than 0.50. This factor should be investigated further, to see if it favors the predictions.
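
    This bias can be checked against the report in Step 3. The sketch below computes the accuracy of always guessing the most frequent class, using the class counts from the support column of that report (36 growth days and 24 loss days in the test set). The function name majority_baseline is hypothetical.

```python
def majority_baseline(labels):
    """Accuracy of always predicting the most frequent class (labels are 0 or 1)."""
    ups = sum(labels)
    return max(ups, len(labels) - ups) / len(labels)


# Class counts from the support column of the classification report in Step 3
test_labels = [1] * 36 + [0] * 24
print(majority_baseline(test_labels))  # 0.6
```

    Always guessing growth would have scored 0.60 on that test set, so the model's 0.63 is only slightly better than this trivial baseline.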
