OpenCV and Python: Simple Noise-tolerant Motion Detector

What will we cover in this tutorial?

A simple motion detector can be created by taking the difference of two images of the same view: a base image of all the static elements is compared with a new image of the same view to identify changes.

The challenge with such an approach is that it is not noise-tolerant and it can be difficult to ensure the correct base image.

In this tutorial we will see how to create a simple noise-tolerant motion detector.

The result of this tutorial

Step 1: Understand how we can make it noise-tolerant

In the previous tutorial we made a simple car counter. The approach was to compare the last two frames and take the difference between them. While it works, it is not very robust to simple noise and will create a lot of false positives.

To deal with that, we will modify the approach: the background frame will be the average of the last 30 frames, and the current (foreground) frame will be a weighted average of the last 10 frames.

The process will be like the illustration shows.

Process
  1. The first step is to resize the image for two reasons. First, we don’t need the detail level further down. Second, it will make processing faster.
  2. Then we blur the image to minimize the impact of small details.
  3. The frame is then added to a foreground list, which will be used to calculate the foreground image. This list can be 10 frames long. We will calculate the foreground image from it using a weighted average to give more weight to newer frames.
  4. The frame is also added to the longer background list. The output of that list is a normal average.
  5. The difference between the output of the foreground list and the output of the background list is calculated.
  6. After that, movement is detected on the difference image.
  7. The areas with movement are encapsulated in boxes.
  8. Finally, the boxes are scaled back to the original frame size and drawn on it.

That is the process.

Step 2: The implementation of the above algorithm

The code will be commented with the above numbering to better understand it. The focus has been to keep it simple. If you need to install OpenCV you can read this tutorial.

import cv2
import numpy as np
import imutils
import time
from collections import deque


# Input to Step 5: Helper function
# Calculate the foreground frame based on frames
def get_movement(frames, shape):
    movement_frame = np.zeros(shape, dtype='float32')
    i = 0
    for f in frames:
        i += 1
        # Weight frame number i by i, so newer frames count more
        movement_frame += f * i
    # Normalize by the sum of the weights: 1 + 2 + ... + i = (1 + i) / 2 * i
    movement_frame = movement_frame / ((1 + i) / 2 * i)
    movement_frame[movement_frame > 254] = 255
    return movement_frame


# Input to Step 5: Helper function
# Calculate the background frame based on frames
# This function has obvious improvement potential
# - Could avoid to recalculate full list every time
def get_background(frames, shape):
    bg = np.zeros(shape, dtype='float32')
    for frame in frames:
        bg += frame
    bg /= len(frames)
    bg[bg > 254] = 255
    return bg


# Detect and return boxes of moving parts
def detect(frame, bg_frames, fg_frames, threshold=20, min_box=200):
    # Step 3-4: Add the frame to the our list of foreground and background frames
    fg_frames.append(frame)
    bg_frames.append(frame)

    # Input to Step 5: Calculate the foreground and background frame based on the lists
    fg_frame = get_movement(list(fg_frames), frame.shape)
    bg_frame = get_background(list(bg_frames), frame.shape)

    # Step 5: Calculate the difference to detect movement
    movement = cv2.absdiff(fg_frame, bg_frame)
    movement[movement < threshold] = 0
    movement[movement > 0] = 254
    movement = movement.astype('uint8')
    movement = cv2.cvtColor(movement, cv2.COLOR_BGR2GRAY)
    movement[movement > 0] = 254
    # As we don't return the movement frame, we show it here for debug purposes
    # Should be removed before release
    cv2.imshow('Movement', movement)

    # Step 6: Find the list of contours
    contours = cv2.findContours(movement, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    contours = imutils.grab_contours(contours)

    # Step 7: Convert them to boxes
    boxes = []
    for contour in contours:
        # Ignore small boxes
        if cv2.contourArea(contour) < min_box:
            continue
        # Convert the contour to a box and append it to the list
        box = cv2.boundingRect(contour)
        boxes.append(box)

    return boxes


def main(width=640, height=480, scale_factor=2):
    # Create the buffer of our lists
    bg_frames = deque(maxlen=30)
    fg_frames = deque(maxlen=10)

    # Get the webcam
    cap = cv2.VideoCapture(0)
    cap.set(cv2.CAP_PROP_FRAME_WIDTH, width)
    cap.set(cv2.CAP_PROP_FRAME_HEIGHT, height)

    # We want to see how many frames per second we process
    last_time = time.time()
    while True:
        # Step 0: Read the webcam frame (ignore return code)
        _, frame = cap.read()

        # Resize the frame
        frame = cv2.resize(frame, (width, height))
        # Step 1: Scale down to improve speed (only takes integer scale factors)
        work_frame = cv2.resize(frame, (width // scale_factor, height // scale_factor))
        # Step 2: Blur it and convert the frame to float32
        work_frame = cv2.GaussianBlur(work_frame, (5, 5), 0)
        work_frame_f32 = work_frame.astype('float32')

        # Step 3-7 (steps in function): Detect all the boxes around the moving parts
        boxes = detect(work_frame_f32, bg_frames, fg_frames)

        # Step 8: Draw all boxes (remember to scale back up)
        for x, y, w, h in boxes:
            cv2.rectangle(frame, (x * scale_factor, y * scale_factor), ((x + w) * scale_factor, (y + h) * scale_factor),
                          (0, 255, 0), 2)

        # Add the Frames Per Second (FPS) and show frame
        text = "FPS:" + str(int(1 / (time.time() - last_time)))
        last_time = time.time()
        cv2.putText(frame, text, (10, 20), cv2.FONT_HERSHEY_PLAIN, 2, (0, 255, 0), 2)
        cv2.imshow('Webcam', frame)

        if cv2.waitKey(1) & 0xFF == ord('q'):
            break

    cap.release()
    cv2.destroyAllWindows()


if __name__ == "__main__":
    main()

Step 3: Test it

Let’s try it out in real life.

The above trial was done while recording from my laptop, which makes it quite slow (need a new one). The frame rate (FPS) without recording was about 30 FPS.

The detection can be tuned to your environment. You have the following parameters to adjust.

  • width (default: 640): The size of the webcam frame. Only has a small impact on performance. Notice, if you change this, it also impacts the processed image sizes through scale_factor.
  • height (default: 480): The same as with width.
  • scale_factor (default: 2): Scales the processed image down. As it is implemented, it only works with integer scale factors.
  • threshold (default: 20): Adjusts how much change is needed to count as movement. The lower it is, the more it will detect.
  • min_box (default: 200): The area of the smallest contour that will be turned into a box.
  • bg_frames = deque(maxlen=30): The number of frames to keep in the background list. Should be larger than fg_frames.
  • fg_frames = deque(maxlen=10): The number of frames to keep in the foreground list.

Those are the most obvious parameters you can adjust in your program. Notice that some of them affect each other. It is kept that way to keep the code simple.

Next steps

If you increase the bg_frames length to, say, 120, you will see it slows down the processing. This is due to the full recalculation done in the get_background function. This can obviously be improved, as it is a simple average we calculate. The reason I did not do it is to keep the code simple and understandable.
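As a hint of what such an improvement could look like, here is a minimal sketch (not part of the tutorial code; the class name RunningBackground is just a hypothetical helper) that keeps a running sum, so each new frame only costs one addition and one subtraction instead of a full re-sum:

import numpy as np
from collections import deque


class RunningBackground:
    # Hypothetical helper: maintains the average of the last maxlen frames
    # without re-summing the whole buffer on every update.
    def __init__(self, maxlen=30):
        self.frames = deque(maxlen=maxlen)
        self.running_sum = None

    def update(self, frame):
        if self.running_sum is None:
            self.running_sum = np.zeros(frame.shape, dtype='float32')
        # If the buffer is full, the oldest frame is about to be evicted,
        # so remove it from the running sum first
        if len(self.frames) == self.frames.maxlen:
            self.running_sum -= self.frames[0]
        self.frames.append(frame)
        self.running_sum += frame
        # The background is the plain average of the buffered frames
        bg = self.running_sum / len(self.frames)
        bg[bg > 254] = 255
        return bg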

Also, I would like to try it out in real life to see how to fine-tune it at home, and to see how to set the parameters so it does not give false positives due to changes in the lighting (if the sun suddenly shines into the room, or a cloud covers it).

Performance comparison of Numba vs Vectorization vs Lambda function with NumPy

What will we cover in this tutorial?

We will continue our investigation of Numba from this tutorial.

Numba is a just-in-time compiler for Python that works amazingly with NumPy. As we saw in the last tutorial, the built-in vectorization can, depending on the case and the size of the problem, be faster than Numba.

Here we will explore that further, and also see how Numba compares with lambda functions. Lambda functions have the advantage that they can be passed as an argument down to a library that can optimize the performance and not depend on slow Python code.

Step 1: Example of Vectorization slower than Numba

In the previous tutorial we only investigated an example of vectorization which was faster than Numba. Here we will see that this is not always the case.

import numpy as np
from numba import jit
import time

size = 100
x = np.random.rand(size, size)
y = np.random.rand(size, size)
iterations = 100000


@jit(nopython=True)
def add_numba(a, b):
    c = np.zeros(a.shape)
    for i in range(a.shape[0]):
        for j in range(a.shape[1]):
            c[i, j] = a[i, j] + b[i, j]
    return c


def add_vectorized(a, b):
    return a + b


# We call the function once, to precompile the code
z = add_numba(x, y)
start = time.time()
for _ in range(iterations):
    z = add_numba(x, y)
end = time.time()
print("Elapsed (numba, precompiled) = %s" % (end - start))

start = time.time()
for _ in range(iterations):
    z = add_vectorized(x, y)
end = time.time()
print("Elapsed (vectorized) = %s" % (end - start))

Varying the size of the NumPy array, we can see the performance between the two in the graph below.

Where it is clear that the vectorized approach is slower.

Step 2: Try some more complex example comparing vectorized and Numba

An if-then-else can be expressed in vectorized form using the NumPy where function.

import numpy as np
from numba import jit
import time


size = 1000
x = np.random.rand(size, size)
iterations = 1000


@jit(nopython=True)
def numba(a):
    c = np.zeros(a.shape)
    for i in range(a.shape[0]):
        for j in range(a.shape[1]):
            if a[i, j] < 0.5:
                c[i, j] = 1
    return c


def vectorized(a):
    return np.where(a < 0.5, 1, 0)


# We call the numba function to precompile it before we measure it
z = numba(x)
start = time.time()
for _ in range(iterations):
    z = numba(x)
end = time.time()
print("Elapsed (numba, precompiled) = %s" % (end - start))

start = time.time()
for _ in range(iterations):
    z = vectorized(x)
end = time.time()
print("Elapsed (vectorized) = %s" % (end - start))

This results in the following comparison.

That is close, but the vectorized approach is a bit faster.

Step 3: Compare Numba with lambda functions

I am very curious about this. Lambda functions are controversial in Python, and many are not happy about them, as their syntax is not well aligned with the rest of Python. On the other hand, lambda functions have the advantage that you can pass them down to a library that can optimize over the for-loops.

import numpy as np
import pandas as pd
from numba import jit
import time

size = 1000
x = np.random.rand(size, size)
x_df = pd.DataFrame(x)  # DataFrame view of the same data for the lambda/apply test
iterations = 1000


@jit(nopython=True)
def numba(a):
    c = np.zeros((size, size))
    for i in range(a.shape[0]):
        for j in range(a.shape[1]):
            c[i, j] = a[i, j] + 1
    return c


def lambda_run(a):
    # .apply expects a pandas DataFrame; the lambda is applied column-wise
    return a.apply(lambda x: x + 1)


# Call the numba function to precompile it before time measurement
z = numba(x)
start = time.time()
for _ in range(iterations):
    z = numba(x)
end = time.time()
print("Elapsed (numba, precompiled) = %s" % (end - start))

start = time.time()
for _ in range(iterations):
    z = lambda_run(x_df)
end = time.time()
print("Elapsed (lambda) = %s" % (end - start))

Resulting in the following performance comparison.

This is again tight, but the lambda approach is still a bit faster.

Remember, this is a simple lambda function and we cannot conclude that lambda functions in general are faster than using Numba.

Conclusion

The learning since the last tutorial is that we have found an example where simple vectorization is slower than Numba. This leads to the conclusion that performance highly depends on the task. Further, the lambda function approach seems to give promising performance. Again, both should be compared to the slow approach of a plain Python for-loop without Numba's just-in-time compiled machine code.

deepFace + Python: Create a Look-alike Algorithm

What will we cover in this tutorial?

We will explore the deepFace library, which includes state of the art face recognition algorithms. deepFace is a Python library as we like it: you can do complicated stuff with only a few lines of code.

In this tutorial we will use the deepFace library to create a look-alike algorithm. That is, we will calculate which movie star you look most like. Your movie-star look-alike.

Step 1: Collect your library of movie stars you want to compare yourself to

Well, this requires that you use pictures of movie stars that are available online.

My library consists of the following images.

My library of images

Just to clarify, the deepFace logo is not part of the movie star library. I wonder if the library can detect a face in it?

Let’s try that (the below code is not part of the official tutorial). If you need help to install the deepFace library read this tutorial.

from deepface import DeepFace

result = DeepFace.analyze("deepFaceLogo.png", actions=['age', 'gender', 'race', 'emotion'])
print(result)

Where we have the deepFaceLogo.png to be the following image.

ValueError: Face could not be detected. Please confirm that the picture is a face photo or consider to set enforce_detection param to False.

Bummer. The deepFace library cannot detect a face in its own logo.

Well, back on track. Find a collection of movie stars and save them in a folder.

Step 2: Understand how easy deepFace is to use

A modern face recognition pipeline consists of 4 common stages: detect, align, represent and verify. DeepFace handles all these common stages in the background.

https://pypi.org/project/deepface/

A normal process would first identify where in the picture a face is located (the detect stage). This is needed to remove all the unnecessary background in the image and only focus on the face. Then it would proceed to align it, so the eyes in the head are on a horizontal line (the align stage). That is, the head is not tilting to one side. This makes the face recognition algorithm work better, as all faces will have the same alignment. Further, the model computes a representation of the image (the aligned face part). Finally, it verifies by checking whether the distance between the representations is small. That is, the smaller the distance between the images, the more certain we are that it is the same person.

Let’s take a simple example. Let’s compare if I am Angelina Jolie. (If you need help to install the deepFace library read this tutorial).

from deepface import DeepFace

result = DeepFace.verify("rune.jpg", "pics-db/Angelina Jolie.jpg", model_name="VGG-Face")
print(result)

I have added the picture of me (rune.jpg) in the folder where I run my Python program and a pics-db with a picture of Angelina Jolie.

And the result looks like this.

{
  'verified': False,
  'distance': 0.7834609150886536,
  'max_threshold_to_verify': 0.4,
  'model': 'VGG-Face',
  'similarity_metric': 'cosine'
}

The algorithm has determined that I am not Angelina Jolie (‘verified’: False). Now to the interesting part: it has a distance (‘distance’: 0.7834609150886536), which we will use to determine which movie star you look the most like. Here it is important to understand that the lower the distance, the more you look alike. Also, you see that the maximum threshold to verify is 0.4. That is, the distance should be less than 0.4 to conclude that it is the same person in the picture.

Step 3: Making our Look-alike algorithm

We have collected a library of images and placed them in pics-db. Then I try with a picture of me (the same one I used in the last step). (If you need help to install the deepFace library read this tutorial).

from deepface import DeepFace
import glob

pic = "rune.jpg"
files = glob.glob("pics-db/*")

model = "VGG-Face"
results = []
for file in files:
    result = DeepFace.verify(pic, file, model_name=model)
    results.append((file, result['distance']))

results.sort(key=lambda x: x[1])
print("Model:", model)
for file, distance in results:
    print(file, distance)

Now I am a bit nervous, which one do I look the most like?

Model: VGG-Face
pics-db/Tom Cruise.jpg 0.4702601432800293
pics-db/The Rock.jpg 0.493824303150177
pics-db/Robert Downey Jr.jpg 0.4991753101348877
pics-db/Daniel Craig.jpg 0.5135003626346588
pics-db/Christian Bale.jpg 0.5176380276679993
pics-db/Brad Pitt.jpg 0.5225759446620941
pics-db/Will Smith.jpg 0.5245362818241119
pics-db/Michael Douglas.png 0.5407796204090118
pics-db/Keanu Reeves.jpg 0.5416552424430847
pics-db/Angelina Jolie.jpg 0.7834609150886536

Wow. I like this immediately. Tom Cruise, The Rock, Robert Downey Jr. Luckily, I look the least like Angelina Jolie, which is not that surprising (At least I would think so).

Are we done?

Maybe, it depends. I guess you can make an App or Web-service with a movie star look-alike algorithm.

You can also play around with different models. The deepFace library contains the following: “VGG-Face”, “Facenet”, “OpenFace”, “DeepFace”, “DeepID”, and “Dlib”.
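A small sketch of how the look-alike comparison from Step 3 could be repeated for each of these models (assuming the same rune.jpg and pics-db folder as above; note that distances are not directly comparable across models):

from deepface import DeepFace
import glob

pic = "rune.jpg"
files = glob.glob("pics-db/*")

# Run the same comparison with each of the models listed above
for model in ["VGG-Face", "Facenet", "OpenFace", "DeepFace", "DeepID", "Dlib"]:
    results = []
    for file in files:
        result = DeepFace.verify(pic, file, model_name=model)
        results.append((file, result['distance']))
    # A lower distance means a closer look-alike
    results.sort(key=lambda x: x[1])
    print("Model:", model)
    for file, distance in results[:3]:
        print(" ", file, distance)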

The results are not the same.

Model: Facenet
pics-db/Tom Cruise.jpg 0.7826492786407471
pics-db/Brad Pitt.jpg 0.870269775390625
pics-db/The Rock.jpg 0.8774074390530586
pics-db/Keanu Reeves.jpg 0.9102083444595337
pics-db/Daniel Craig.jpg 0.914682649075985
pics-db/Christian Bale.jpg 0.9467008262872696
pics-db/Robert Downey Jr.jpg 1.0119212558493018
pics-db/Angelina Jolie.jpg 1.0243268758058548
pics-db/Michael Douglas.png 1.067187249660492
pics-db/Will Smith.jpg 1.2313244044780731

The Facenet model says I look less like Will Smith and more like Angelina Jolie. Not sure I trust this one.

If you enjoyed this, I recommend reading the following tutorial by the author of deepface, Sefik Ilkin Serengil.

How to Get Started with DeepFace using PyCharm

What will we cover in this tutorial?

In this tutorial we will show you how to set up your virtual environment in PyCharm to use DeepFace, and then run a small program using DeepFace.

This tutorial has been done for both Mac and Windows. See the additional notes for Windows at the end if you experience problems.

Step 1: Importing the DeepFace library and run our first program

You know how it works.

from deepface import DeepFace


demography = DeepFace.analyze("angelina.jpg", actions=['age', 'gender', 'race', 'emotion'])
print("Age: ", demography["age"])
print("Gender: ", demography["gender"])
print("Emotion: ", demography["dominant_emotion"])
print("Race: ", demography["dominant_race"])

You see that deepface is not available and click to install it.

It seems to work. But then…

Okay. Let’s do it the easy way. Just add an import dlib in your code.

But it fails.

Step 2: Install CMake to make it work

If you search the internet you will see that you also need to install CMake to make DeepFace work.

So let’s import that as well.

We end up with this code.

from deepface import DeepFace
import cmake
import dlib


demography = DeepFace.analyze("angelina.jpg", actions=['age', 'gender', 'race', 'emotion'])
print("Age: ", demography["age"])
print("Gender: ", demography["gender"])
print("Emotion: ", demography["dominant_emotion"])
print("Race: ", demography["dominant_race"])

Where you first install cmake by hovering your mouse over the red line under cmake and choosing install. Then you do the same for dlib.

And by magic it works.

Step 3: Run our little program

You need to get the picture of Angelina (angelina.jpg). Apparently, the authors of this DeepFace have a thing for her and use her as an example.

Angelina

Then when you run the program you will get the following output, as it needs to download a lot of stuff.

Actions to do:  ['age', 'gender', 'race', 'emotion']
facial_expression_model_weights.h5 will be downloaded...
Downloading...
From: https://drive.google.com/uc?id=13iUHHP3SlNg53qSuQZDdHDSDNdBP9nwy
To: /Users/admin/.deepface/weights/facial_expression_model_weights.zip
5.54MB [00:00, 10.4MB/s]
age_model_weights.h5 will be downloaded...
Downloading...
From: https://drive.google.com/uc?id=1YCox_4kJ-BYeXq27uUbasu--yz28zUMV
To: /Users/admin/.deepface/weights/age_model_weights.h5
539MB [01:04, 8.36MB/s]
Downloading...
From: https://drive.google.com/uc?id=1wUXRVlbsni2FN9-jkS_f4UTUrm1bRLyk
To: /Users/admin/.deepface/weights/gender_model_weights.h5
79.7MB [00:08, 10.1MB/s]gender_model_weights.h5 will be downloaded...
537MB [00:58, 9.16MB/s]
Downloading...
From: https://drive.google.com/uc?id=1nz-WDhghGQBC4biwShQ9kYjvQMpO6smj
To: /Users/admin/.deepface/weights/race_model_single_batch.zip
78.3MB [00:08, 9.37MB/s]race_model_single_batch.h5 will be downloaded...
511MB [00:54, 9.35MB/s]
Analyzing:   0%|          | 0/1 [00:00<?, ?it/s]
Finding actions:   0%|          | 0/4 [00:00<?, ?it/s]
Action: age:   0%|          | 0/4 [00:00<?, ?it/s]    
Action: age:  25%|██▌       | 1/4 [00:01<00:05,  1.79s/it]
Action: gender:  25%|██▌       | 1/4 [00:01<00:05,  1.79s/it]
Action: gender:  50%|█████     | 2/4 [00:02<00:03,  1.54s/it]
Action: race:  50%|█████     | 2/4 [00:02<00:03,  1.54s/it]  
Action: race:  75%|███████▌  | 3/4 [00:03<00:01,  1.23s/it]
Action: emotion:  75%|███████▌  | 3/4 [00:03<00:01,  1.23s/it]
Action: emotion: 100%|██████████| 4/4 [00:03<00:00,  1.13it/s]
Analyzing:   0%|          | 0/1 [00:03<?, ?it/s]

Age:  33.10586443589396
Gender:  Woman
Emotion:  neutral
Race:  white

Hmm… Apparently she is 33.1 years old on that picture. Angelina is also a woman and has neutral emotions. Finally, she is white.

Let’s try to see what it says about me.

Me

I am happy to see this.

Age:  27.25606073226045
Gender:  Man
Emotion:  happy
Race:  white

I am still in my 20s. How can you not love that? DeepFace is considered to be state of the art, so let’s not doubt it.

Additional notes for Windows

To ensure that it also worked on Windows, I tried it out and ran into a few challenges there.

First of all, I was using a 32-bit version of Python, which caused TensorFlow not to be installed, as no matching library could be found. Hence, make sure to install the 64-bit version of Python on Windows. You can have both versions running in parallel. The easiest way to get PyCharm to use your 64-bit version is to create a new project and set its default Python interpreter to the 64-bit version.
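If you are unsure whether your interpreter is 32-bit or 64-bit, a quick check (not part of the original tutorial) is:

import struct
import platform

# Prints 64 for a 64-bit interpreter and 32 for a 32-bit one
print(struct.calcsize("P") * 8)
print(platform.architecture())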

Secondly, it was still having trouble: the installed C/C++ compiler was not new enough. In this case I needed to update my Visual Studio.

Then the above tutorial just ran like a charm on PyCharm.

Master Markowitz Portfolio Optimization (Efficient Frontier) in Python using Pandas

What is Markowitz Portfolio Optimization (Efficient Frontier)?

The Efficient Frontier takes a portfolio of investments and optimizes the expected return with regard to the risk. That is, it finds the optimal return for a given risk.

According to investopedia.org, the return is based on the expected Compound Annual Growth Rate (CAGR) and the risk metric is the standard deviation of the return.

But what does all that mean? We will learn that in this tutorial.

Step 1: Get the time series of your stock portfolio

We will use the following portfolio of 4 stocks of Apple (AAPL), Microsoft (MSFT), IBM (IBM) and Nvidia (NVDA).

To get the time series we will use the Yahoo! Finance API through the Pandas-datareader.

We will look 5 years back.

import pandas_datareader as pdr
import pandas as pd
import datetime as dt
from dateutil.relativedelta import relativedelta

years = 5
end_date = dt.datetime.now()
start_date = end_date - relativedelta(years=years)
close_price = pd.DataFrame()
tickers = ['AAPL','MSFT','IBM','NVDA']
for ticker in tickers:
  tmp = pdr.get_data_yahoo(ticker, start_date, end_date)
  close_price[ticker] = tmp['Close']

print(close_price)

Resulting in the following output (or the first few lines).

                  AAPL        MSFT         IBM        NVDA
Date                                                      
2015-08-25  103.739998   40.470001  140.960007   20.280001
2015-08-26  109.690002   42.709999  146.699997   21.809999
2015-08-27  112.919998   43.900002  148.539993   22.629999
2015-08-28  113.290001   43.930000  147.979996   22.730000
2015-08-31  112.760002   43.520000  147.889999   22.480000

It will contain the daily time series for the last 5 years from the current date.

Step 2: Calculate the CAGR, returns, and covariance

To calculate the expected return, we use the Compound Annual Growth Rate (CAGR) based on the last 5 years. The CAGR is used as Investopedia suggests. An alternative that is also used is the mean of the returns. The key thing is to have some common measure of the return.

The CAGR is calculated as follows.

CAGR = (end-price/start-price)^(1/years) – 1

We will also calculate the covariance, as we will use that to calculate the variance of a weighted portfolio. Remember that the standard deviation is given by the following.

sigma = sqrt(variance)

A portfolio is a vector w with the balances of each stock. For example, w = [0.2, 0.3, 0.4, 0.1] says that we have 20% in the first stock, 30% in the second, 40% in the third, and 10% in the final stock. It all sums up to 100%.

Given a weight w of the portfolio, you can calculate the variance of the stocks by using the covariance matrix.

variance = w^T Cov w

Where Cov is the covariance matrix.

This results in the following pre-computations.

returns = close_price/close_price.shift(1)
cagr = (close_price.iloc[-1]/close_price.iloc[0])**(1/years) - 1
cov = returns.cov()

print(cagr)
print(cov)

Where you can see the output here.

# CAGR:
AAPL    0.371509
MSFT    0.394859
IBM    -0.022686
NVDA    0.905011
dtype: float64

# Covariance
          AAPL      MSFT       IBM      NVDA
AAPL  0.000340  0.000227  0.000152  0.000297
MSFT  0.000227  0.000303  0.000164  0.000306
IBM   0.000152  0.000164  0.000260  0.000210
NVDA  0.000297  0.000306  0.000210  0.000879
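
With cagr and cov in place, the variance and risk of one specific portfolio w can be computed directly with NumPy. A small illustration (the weight vector below is just an example, not part of the tutorial code):

import numpy as np

# Example weights for AAPL, MSFT, IBM, NVDA (must sum to 1)
w = np.array([0.2, 0.3, 0.4, 0.1])

expected_return = np.dot(w, cagr.T)        # weighted CAGR of the portfolio
variance = np.dot(np.dot(w.T, cov), w)     # variance = w^T Cov w
sigma = np.sqrt(variance)                  # risk (standard deviation)
print(expected_return, sigma)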

Step 3: Plot the return and risk

This is where the power of computing comes into the picture. The idea is to just try a random portfolio and see how it rates with regards to expected return and risk.

It is that simple. Make a random weighted distribution of your portfolio and plot the point of expected return (based on our CAGR) and the risk based on the standard deviation calculated by the covariance.

import matplotlib.pyplot as plt
import numpy as np

def random_weights(n):
    k = np.random.rand(n)
    return k / sum(k)

exp_return = []
sigma = []
for _ in range(20000):
  w = random_weights(len(tickers))
  exp_return.append(np.dot(w, cagr.T))
  sigma.append(np.sqrt(np.dot(np.dot(w.T, cov), w)))

plt.plot(sigma, exp_return, 'ro', alpha=0.1) 
plt.show()

We introduce a helper function random_weights, which returns a weighted portfolio. That is, it returns a vector with entries that sum up to one. This will give a way to distribute our portfolio of stocks.

Then we iterate 20,000 times (it could be any value, we just want enough points to plot our graph), where we make a random weight vector w, then calculate the expected return as the dot-product of w and cagr transposed. This is done using NumPy’s dot-product function.

What a dot-product of np.dot(w, cagr.T) does is to take elements pairwise from w and cagr and multiply them and sum up. The transpose is only about the orientation of it to make it work.

The standard deviation (assigned to sigma) is calculated similarly, using the formula given in the last step: variance = w^T Cov w (which is computed with dot-products).

This results in the following graph.

Returns vs risks

The graph outlines a parabola. The optimal values lie along the upper half of the parabola. Hence, given a risk, the optimal portfolio is one on the upper border of the filled parabola.
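
To act on this, one could, for instance, pick the random portfolio with the highest expected return among those that stay below a chosen risk level. A minimal sketch, extending the loop above to also keep the weights (the 0.018 risk cap is just an example value):

weights = []
exp_return = []
sigma = []
for _ in range(20000):
    w = random_weights(len(tickers))
    weights.append(w)
    exp_return.append(np.dot(w, cagr.T))
    sigma.append(np.sqrt(np.dot(np.dot(w.T, cov), w)))

max_risk = 0.018  # example risk cap (standard deviation)
candidates = [i for i in range(len(sigma)) if sigma[i] <= max_risk]
if candidates:
    best = max(candidates, key=lambda i: exp_return[i])
    print("Weights:", dict(zip(tickers, weights[best])))
    print("Expected return:", exp_return[best], "Risk:", sigma[best])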

Considerations

The Efficient Frontier gives you a way to balance your portfolio. The above code can, by trial and error, find such a portfolio, but it still leaves out some considerations.

How often should you re-balance? It has a cost to do that.

The theory behind it has some assumptions that may not hold in reality. As Investopedia points out, it assumes that asset returns follow a normal distribution, but in reality returns can be more than 3 standard deviations away. Also, the theory builds on the assumption that investors are rational in their investments, which most consider flawed, as many more factors play into investment decisions.

The full source code

Below here you find the full source code from the tutorial.

import pandas_datareader as pdr
import datetime as dt
import pandas as pd
from dateutil.relativedelta import relativedelta
import matplotlib.pyplot as plt
import numpy as np


years = 5
end_date = dt.datetime.now()
start_date = end_date - relativedelta(years=years)
close_price = pd.DataFrame()
tickers = ['AAPL', 'MSFT', 'IBM', 'NVDA']
for ticker in tickers:
    tmp = pdr.get_data_yahoo(ticker, start_date, end_date)
    close_price[ticker] = tmp['Close']

returns = close_price / close_price.shift(1)
cagr = (close_price.iloc[-1] / close_price.iloc[0]) ** (1 / years) - 1
cov = returns.cov()

def random_weights(n):
    k = np.random.rand(n)
    return k / sum(k)

exp_return = []
sigma = []
for _ in range(20000):
    w = random_weights(len(tickers))
    exp_return.append(np.dot(w, cagr.T))
    sigma.append(np.sqrt(np.dot(np.dot(w.T, cov), w)))

plt.plot(sigma, exp_return, 'ro', alpha=0.1)
plt.show()

RandomForestClassifier: Predict Stock Market Direction

What will we cover in this tutorial?

A Forest Classifier is an approach to reduce the overfitting a single Decision Tree can suffer from. A forest classifier simply contains a set of decision trees and uses majority voting to make the prediction.

In this tutorial we will try to use that on the stock market, by creating a few indicators. This tutorial will give a framework to explore if it can predict the direction of a stock. Given a set of indicators, will the stock go up or down the next trading day.

This is a simplified problem of predicting the actual stock value the next day.

Step 1: Getting data and calculate some indicators

If you are new to stock indicators, we can highly recommend you read about the MACD, RSI, and Stochastic Oscillator, where the MACD article also includes how to calculate the EMA. Here we assume familiarity with those indicators. Also, that you are familiar with Pandas DataFrames and Pandas-datareader.

import pandas_datareader as pdr
import datetime as dt
import numpy as np

ticker = "^GSPC" # The S&P 500 index
data = pdr.get_data_yahoo(ticker, dt.datetime(2010,1,1), dt.datetime.now(), interval='d')

# Calculate the EMA10 > EMA30 signal
ema10 = data['Close'].ewm(span=10).mean()
ema30 = data['Close'].ewm(span=30).mean()
data['EMA10gtEMA30'] = np.where(ema10 > ema30, 1, -1)

# Calculate where Close is > EMA10
data['ClGtEMA10'] = np.where(data['Close'] > ema10, 1, -1)

# Calculate the MACD signal
exp1 = data['Close'].ewm(span=12).mean()
exp2 = data['Close'].ewm(span=26).mean()
macd = exp1 - exp2
macd_signal = macd.ewm(span=9).mean()
data['MACD'] = macd_signal - macd

# Calculate RSI
delta = data['Close'].diff()
up = delta.clip(lower=0)
down = -1*delta.clip(upper=0)
ema_up = up.ewm(com=13, adjust=False).mean()
ema_down = down.ewm(com=13, adjust=False).mean()
rs = ema_up/ema_down
data['RSI'] = 100 - (100/(1 + rs))

# Stochastic Oscillator
high14= data['High'].rolling(14).max()
low14 = data['Low'].rolling(14).min()
data['%K'] = (data['Close'] - low14)*100/(high14 - low14)

# Williams Percentage Range
data['%R'] = -100*(high14 - data['Close'])/(high14 - low14)

days = 6

# Price Rate of Change
ct_n = data['Close'].shift(days)
data['PROC'] = (data['Close'] - ct_n)/ct_n

print(data)

The choice of indicators is arbitrary but among some popular ones. It should be up to you to change them to other indicators and experiment with them.

                  High         Low        Open       Close       Volume   Adj Close  EMA10gtEMA30  ClGtEMA10      MACD         RSI         %K        %R      PROC
Date                                                                                                                                                             

2020-08-17  3387.590088  3379.219971  3380.860107  3381.989990  3671290000  3381.989990             1          1 -2.498718   68.294286  96.789344  -3.210656  0.009164
2020-08-18  3395.060059  3370.149902  3387.040039  3389.780029  3881310000  3389.780029             1          1 -1.925573   69.176468  97.234576  -2.765424  0.008722
2020-08-19  3399.540039  3369.659912  3392.510010  3374.850098  3884480000  3374.850098             1          1 -0.034842   65.419555  86.228281 -13.771719  0.012347
2020-08-20  3390.800049  3354.689941  3360.479980  3385.510010  3642850000  3385.510010             1          1  0.949607   66.805725  87.801036 -12.198964  0.001526
2020-08-21  3399.959961  3379.310059  3386.010010  3397.159912  3705420000  3397.159912             1          1  1.249066   68.301209  97.534948  -2.465052  0.007034

Step 2: Understand how the Decision Tree works

Trees are the foundation of the forest, or rather: Decision Trees are the foundation of a Forest Classifier. Hence, a good starting point is to understand how a Decision Tree works. Luckily, they are quite easy to understand.

Let’s try to investigate a Decision Tree that is based on two of the indicators above. We take the RSI (Relative Strength Index) and %K (Stochastic Oscillator). A Decision Tree could look like this (depending on the training data).

Decision Tree for %K and RSI

When we get a new data row with %K and RSI indicators, it will start at the top of the Decision Tree.

  • At the first node it will check if %K <= 4.615, if so, take the left child otherwise the right child.
  • The gini value tells us how likely a randomly chosen element would be incorrectly labeled. Hence, a low value close to 0 is good.
  • Samples tells us how many of the samples in the training set reached this node.
  • Finally, the value tells us how the samples are distributed over the classes. In the final decision nodes, the category with the most elements is the prediction.

Looking at the above Decision Tree, it does not seem to be very good. The majority of samples end up in the fifth node with a gini of 0.498, close to random, right? And it will label them 1, growth.

But this is the idea with Forest Classifiers: take a bunch of Decision Trees, which individually might not be good, and use a majority vote across them to classify.
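
To illustrate the voting idea, here is a small conceptual sketch with made-up tree predictions (note that scikit-learn's RandomForestClassifier actually averages predicted probabilities rather than counting hard votes):

import numpy as np

# Made-up predictions from 5 individual decision trees for 4 samples
# (1 = growth, 0 = loss)
tree_predictions = np.array([
    [1, 0, 1, 1],
    [1, 1, 0, 1],
    [0, 0, 1, 1],
    [1, 0, 1, 0],
    [1, 0, 0, 1],
])

# Majority vote per sample: the class predicted by most trees wins
votes_for_growth = tree_predictions.sum(axis=0)
forest_prediction = (votes_for_growth > len(tree_predictions) / 2).astype(int)
print(forest_prediction)  # [1 0 1 1]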

Step 3: Create the Forest Classifier

Now that we understand how the Decision Tree and the Forest Classifier work, we just need to run the magic, as this is done by calling a library function.

import pandas_datareader as pdr
import datetime as dt
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score
from sklearn.ensemble import RandomForestClassifier


ticker = "^GSPC"
data = pdr.get_data_yahoo(ticker, dt.datetime(2010,1,1), dt.datetime.now(), interval='d')

# Calculate the EMA10 > EMA30 signal
ema10 = data['Close'].ewm(span=10).mean()
ema30 = data['Close'].ewm(span=30).mean()
data['EMA10gtEMA30'] = np.where(ema10 > ema30, 1, -1)

# Calculate where Close is > EMA10
data['ClGtEMA10'] = np.where(data['Close'] > ema10, 1, -1)

# Calculate the MACD signal
exp1 = data['Close'].ewm(span=12).mean()
exp2 = data['Close'].ewm(span=26).mean()
macd = exp1 - exp2
macd_signal = macd.ewm(span=9).mean()
data['MACD'] = macd_signal - macd

# Calculate RSI
delta = data['Close'].diff()
up = delta.clip(lower=0)
down = -1*delta.clip(upper=0)
ema_up = up.ewm(com=13, adjust=False).mean()
ema_down = down.ewm(com=13, adjust=False).mean()
rs = ema_up/ema_down
data['RSI'] = 100 - (100/(1 + rs))

# Stochastic Oscillator
high14= data['High'].rolling(14).max()
low14 = data['Low'].rolling(14).min()
data['%K'] = (data['Close'] - low14)*100/(high14 - low14)

# Williams Percentage Range
data['%R'] = -100*(high14 - data['Close'])/(high14 - low14)

days = 6

# Price Rate of Change
ct_n = data['Close'].shift(days)
data['PROC'] = (data['Close'] - ct_n)/ct_n

# Set class labels to classify
data['Return'] = data['Close'].pct_change(1).shift(-1)
data['class'] = np.where(data['Return'] > 0, 1, 0)

# Clean for NAN rows
data = data.dropna()
# Minimize dataset
data = data.iloc[-200:]


# Data to predict
predictors = ['EMA10gtEMA30', 'ClGtEMA10', 'MACD', 'RSI', '%K', '%R', 'PROC']
X = data[predictors]
y = data['class']

# Split data into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30)

# Train the model
rfc = RandomForestClassifier(random_state=0)
rfc = rfc.fit(X_train, y_train)

# Test the model by doing some predictions
y_pred = rfc.predict(X_test)

# See how accurate the predictions are
report = classification_report(y_test, y_pred)
print('Model accuracy', accuracy_score(y_test, y_pred, normalize=True))
print(report)

First, some notes on a few lines. The train_test_split divides the data into a training set and a test set. The test set is set to be 30% of the data, and the split is done in a randomized way.

Next we create a RandomForestClassifier and fit it.

Then we use our newly created classifier (rfc) to predict on the test set (X_test).

Finally, we calculate the accuracy and the report.

Model accuracy 0.6333333333333333
              precision    recall  f1-score   support

           0       0.56      0.38      0.45        24
           1       0.66      0.81      0.73        36

    accuracy                           0.63        60
   macro avg       0.61      0.59      0.59        60
weighted avg       0.62      0.63      0.62        60

The model accuracy is 0.63, which seems quite good. It is better than random, at least. You can also see that the precision of 1 (growth) is higher than 0 (loss, or negative growth), with 0.66 and 0.56, respectively.

Does that mean it is all good and we can beat the market?

No, far from it. Also, notice I chose to only use the last 200 trading days in my experiment out of the 2,500+ available.

Running a few experiments showed that the prediction accuracy was close to 50% if all days were used. That means it was basically not possible to predict.

Step 4: A few more tests on stocks

I have run a few experiments on different stocks and also varying the number of days used.

Stock      100 days   200 days   400 days
S&P 500    0.53       0.63       0.52
AAPL       0.53       0.62       0.54
F          0.67       0.57       0.54
KO         0.47       0.52       0.53
IBM        0.57       0.52       0.57
MSFT       0.50       0.50       0.48
AMZN       0.57       0.47       0.58
TSLA       0.50       0.60       0.53
NVDA       0.57       0.53       0.54

The accuracy

Looking at the above table, I am not convinced about my hypotheses. First, that 200 days is better might be specific to the stock. Also, if you re-run the tests you get new numbers, as the training and test datasets differ from run to run.

I did try a few with the full dataset, and I still think it performed worse (all close to 0.50).

The above looks fine, as it can mostly predict better than just guessing. But there are still a few cases where that does not hold.

Next steps

A few things to remember here.

Firstly, the indicators were chosen somewhat arbitrarily from among the common ones. A further investigation of this could be an idea. The results can be highly biased if the indicators used do not actually help the prediction.

Secondly, I might have falsely hypothesized that it was more accurate when we limited the data to a smaller set than the original one.

Thirdly, it could be that the stocks themselves have a bias in one direction. If we limit the data to a shorter period, a bull market will primarily have growth days, hence a biased guess of growth will do better than 0.50. This factor should be investigated further, to see if it favors the predictions.

Master Dow Theory with Python Pandas

What will we cover in this tutorial?

Dow theory was proposed by Charles H. Dow and is not an exact science. It is more a way to identify trends in the market. In this tutorial we investigate the approach by testing it on data. Notice that there are various ways to interpret it, and it is often done by visual approximation, while in this tutorial we will make some rough assumptions and see if it beats a buy-and-hold approach on a stock.

First we will make our assumption on how to implement the Dow theory approach to make buy and sell indicators, which we will use as buy and sell markers in the market.

Step 1: Understand the Dow theory to make buy and sell indicators

The essence of Dow theory is that there are 3 types of trends in the market. The primary trend is a trend of a year or more, like a bull market. Then, in a secondary trend, the market can move in the opposite direction for 3 weeks to 3 months. This can result in a pullback that can seem like a bear market within the bull market. Finally, there are micro trends (less than 3 weeks) which can be considered noise.

According to Dow theory each market has 3 phases. Our objective as an investor is to identify when a bear market turns into a bull market.

A few visual examples will help to understand the above. A general bull market with primary and secondary trends could look like this.

Primary bull market trend with secondary bear market trends.

Where you should notice that the temporary lows are all increasing along the way.

A similar picture for a bear market could be.

Primary bear market trend with secondary bull market trends.

Here you should notice how the secondary bull market peaks are also on a decreasing trend.

Step 2: Identify when a primary market trend changes

The key here is to identify when a primary stock trend goes from bull to bear, or the opposite.

Please also notice that Dow theory talks about the market and we here are looking at a stock. Hence, we have an assumption that the market and the stock have a strong enough correlation to use the same theory.

A change from a primary bear to a primary bull market could look as follows.

From bear to bull market

We have added some markers in the diagram.

  • LL : Lower Low – meaning that the low is lower than the previous low.
  • LH : Lower High – meaning that the high is lower than the previous high.
  • HH : Higher High – meaning that the high is higher than the previous high.
  • HL : Higher Low – meaning that the low is higher than the previous low.

As you see, the bear market consists of consecutive LL and LH, while a bull market consists of consecutive HH and HL. The market changes from bear to bull when we can confidently say that we will get a HH, which we can do when the price crosses from the last LL up above the last LH (before we reach the HH).

Hence, a buy signal can be set when we reach a stock price above the last LH.

Similarly, we can investigate when a primary trend goes from a bull to a bear market.

From bull to a bear trend.

Where we have the same types of markers.

We see that the trend changes from bull to bear when we go from a HL to a LL. Hence, a sell indicator can be set when the price falls below the last low (the last HL), that is, before the LL is actually confirmed.

Again, this is not an exact science and is just a way to interpret it. We will try it out on real stock data to see how it performs.

Step 3: Get some data and calculate points of lows and highs

We will use Pandas-datareader to get the time series data from Yahoo! Finance.

import pandas_datareader as pdr
import datetime as dt


ticker = pdr.get_data_yahoo("TWTR", dt.datetime(2020,1,1), dt.datetime.now())

print(ticker)

Resulting in a time series for Twitter, which has the ticker TWTR. You can find other tickers for other companies by using the Yahoo! Finance ticker lookup.

                 High        Low       Open      Close    Volume  Adj Close
Date                                                                       
2020-01-02  32.500000  31.959999  32.310001  32.299999  10721100  32.299999
2020-01-03  32.099998  31.260000  31.709999  31.520000  14429500  31.520000
2020-01-06  31.709999  31.160000  31.230000  31.639999  12582500  31.639999
2020-01-07  32.700001  31.719999  31.799999  32.540001  13712900  32.540001
2020-01-08  33.400002  32.349998  32.349998  33.049999  14632400  33.049999
...               ...        ...        ...        ...       ...        ...
2020-08-12  38.000000  36.820000  37.500000  37.439999  11013300  37.439999
2020-08-13  38.270000  37.369999  37.430000  37.820000  13259400  37.820000
2020-08-14  37.959999  37.279999  37.740002  37.900002  10377300  37.900002
2020-08-17  38.090000  37.270000  37.950001  37.970001  10188500  37.970001
2020-08-18  38.459999  37.740002  38.279999  38.009998   8548300  38.009998

The first thing we need is to find the lows and highs. The first challenge here is that the stock price goes up and down during the day. To simplify our investigation we will only use the Close price.

That decision might limit the accuracy of the results, but it surely simplifies our work.

Next up, we need to identify the highs and lows. This can be done by finding where the daily difference changes sign.

import pandas_datareader as pdr
import datetime as dt


ticker = pdr.get_data_yahoo("TWTR", dt.datetime(2020,1,1), dt.datetime.now())

ticker['delta'] = ticker['Close'].diff()
growth = ticker['delta'] > 0
ticker['markers'] = growth.diff().shift(-1)

print(ticker)

Please notice the shift(-1), as it moves the indicator to the day of the change.

2020-08-05  37.340000  36.410000  36.560001  36.790001   10052100  36.790001  0.440002   False
2020-08-06  37.810001  36.490002  36.849998  37.689999   10478900  37.689999  0.899998    True
2020-08-07  38.029999  36.730000  37.419998  37.139999   11335100  37.139999 -0.549999    True
2020-08-10  39.169998  37.310001  38.360001  37.439999   29298400  37.439999  0.299999    True
2020-08-11  39.000000  36.709999  37.590000  37.279999   20486000  37.279999 -0.160000    True
2020-08-12  38.000000  36.820000  37.500000  37.439999   11013300  37.439999  0.160000   False
2020-08-13  38.270000  37.369999  37.430000  37.820000   13259400  37.820000  0.380001   False
2020-08-14  37.959999  37.279999  37.740002  37.900002   10377300  37.900002  0.080002   False
2020-08-17  38.090000  37.270000  37.950001  37.970001   10188500  37.970001  0.070000   False
2020-08-18  38.459999  37.740002  38.279999  38.009998    8548300  38.009998  0.039997     NaN

Where we have the output above. The True values mark the days where we reach highs or lows.

Now we have identified all the potential HH, HL, LH, and LL.

Step 4: Implement a simple trial of sell and buy

We continue our example on Twitter and see how we can perform.

Our strategy will be as follows.

  • We either have bought stocks for all our money or not. That is, either we have stocks or not.
  • If we do not have stocks, we buy if the stock price goes above the last high, meaning that a HH is coming.
  • If we do have stocks, we sell if the stock price falls below the last low, meaning that a LL is coming.

This can mean that we enter the market at the end of a bull market. If you were to follow the theory completely, it suggests waiting until a bear market changes into a bull market.

import pandas_datareader as pdr
import datetime as dt


ticker = pdr.get_data_yahoo("TWTR", dt.datetime(2020,1,1), dt.datetime.now())

ticker['delta'] = ticker['Close'].diff()
growth = ticker['delta'] > 0
ticker['markers'] = growth.diff().shift(-1)

# We want to remember the last_high and last_low
# Set to max value not to trigger false buy
last_high = ticker['Close'].max()
last_low = 0.0
# Then setup our account, we can only have stocks or not
# We have a start balance of 100000 $
has_stock = False
balance = 100000
stocks = 0
for index, row in ticker.iterrows():
  # Buy and sell orders
  if not has_stock and row['Close'] > last_high:
    has_stock = True
    stocks = balance//row['Close']
    balance -= row['Close']*stocks
  elif has_stock and row['Close'] < last_low:
    has_stock = False
    balance += row['Close']*stocks
    stocks = 0

  # Update the last_high and last_low
  if row['markers']:
    if row['delta'] > 0:
      last_high = row['Close']
    else:
      last_low = row['Close']


print("Dow returns", balance + stocks*ticker['Close'].iloc[-1])

# Compare this with a simple buy and hold approach.
buy_hold_stocks = 100000//ticker['Close'].iloc[0]
buy_hold = 100000 - buy_hold_stocks*ticker['Close'].iloc[0] + buy_hold_stocks*ticker['Close'].iloc[-1]
print("Buy-and-hold return", buy_hold)

Which gives the following results.

Dow returns 120302.0469455719
Buy-and-hold return 117672.44716644287

That looks promising, but it might just be luck. Hence, we want to validate it with other examples. The results show a return on investment of 20.3% using our Dow theory approach, while a simple buy-and-hold strategy gave 17.7%. This is over a span of less than 8 months.

The thing you would like to achieve with a strategy is to avoid big losses and not lose out on gains. The above test alone does not allow any conclusion on that.

Step 5: Try out some other tickers to test it

A first investigation is to check how the algorithm performs on other stocks. We make one small adjustment, as comparing against buying on day 1 might be quite unfair. If the price is low that day, it is an advantage, while if the price is high, it is a big disadvantage. The code below runs on multiple stocks and compares the Dow approach (as outlined in this tutorial) with a buy-and-hold approach that buys on the date of the first Dow buy. The exit from the market might also be unfair.

import pandas_datareader as pdr
import datetime as dt

def dow_vs_hold_and_buy(ticker_name):
  ticker = pdr.get_data_yahoo(ticker_name, dt.datetime(2020,1,1), dt.datetime.now())

  ticker['delta'] = ticker['Close'].diff()
  growth = ticker['delta'] > 0
  ticker['markers'] = growth.diff().shift(-1)

  # We want to remember the last_high and last_low
  # Set to max value not to trigger false buy
  last_high = ticker['Close'].max()
  last_low = 0.0
  # Then setup our account, we can only have stocks or not
  # We have a start balance of 100000 $
  has_stock = False
  balance = 100000
  stocks = 0
  first_buy = None
  for index, row in ticker.iterrows():
    # Buy and sell orders
    if not has_stock and row['Close'] > last_high:
      has_stock = True
      stocks = balance//row['Close']
      balance -= row['Close']*stocks
      if first_buy is None:
        first_buy = index
    elif has_stock and row['Close'] < last_low:
      has_stock = False
      balance += row['Close']*stocks
      stocks = 0

    # Update the last_high and last_low
    if row['markers']:
      if row['delta'] > 0:
        last_high = row['Close']
      else:
        last_low = row['Close']

  dow_returns = balance + stocks*ticker['Close'].iloc[-1]

  # Compare this with a simple buy and hold approach.
  buy_hold_stocks = 100000//ticker['Close'].loc[first_buy]
  buy_hold_returns = 100000 - buy_hold_stocks*ticker['Close'].loc[first_buy] + buy_hold_stocks*ticker['Close'].iloc[-1]

  print(ticker_name, dow_returns > buy_hold_returns, round(dow_returns/1000 - 100, 1), round(buy_hold_returns/1000 - 100, 1))


tickers = ["TWTR", "AAPL", "TSLA", "BAC", "KO", "GM", "MSFT", "AMZN", "GOOG", "FB", "INTC", "T"]
for ticker in tickers:
  dow_vs_hold_and_buy(ticker)

Resulting in the following output.

TWTR   True  20.3  14.4
AAPL  False  26.4  52.3
TSLA   True 317.6 258.8
BAC    True -16.3 -27.2
KO     True  -8.2 -14.6
GM     True   8.9 -15.1
MSFT  False  26.2  32.1
AMZN  False  32.8  73.9
GOOG  False   7.1  11.0
FB     True  18.3  18.2
INTC  False -34.9 -18.4
T     False -25.3 -20.8

This paints a different picture. First, it seems more random whether it outperforms the buy-and-hold approach.

The one performing best is the General Motors Company (GM), but it might be due to an unlucky entry into the market for buy-and-hold. The stock was high at the beginning of the year and then fell a lot. Hence, here the Dow approach helped to exit and enter the market correctly.

Intel Corporation (INTC) works a lot against us. While buy-and-hold has a big loss (-18.4%), our Dow theory algorithm does not save us from it. There was a big drop in the stock value on the 24th of July, around 20% from the previous day's close to the open. The Dow approach cannot save you from situations like that and will sell near the bottom.

Apple (AAPL) also misses out on a lot of gain. The stock has been in great growth in 2020, with some challenges in March and after (when Corona hit). But looking at the buy and sell signals, the sells happen at lower prices than the following buys, so it loses out on gain.

Amazon (AMZN) seems to be the same story: growth in general, but buying at higher prices than the previous sells, and losing out on profit.

Next steps and considerations

We have made some broad simplifications in our algorithm.

  • We only consider the Close value, while the normal way to find the markers is on an OHLC candlestick chart.
  • If we used the full daily price range, we might limit our losses with an earlier stop-loss order.
  • This is not an exact science, and the trends might need a different way to identify them.

Hence, the above suggests the algorithm can be adjusted further to match real life.

Another thing to keep in mind is that you should never make your investment decision on only one indicator or algorithm choice.

Pandas: Calculate a Heatmap to Visualize Historical CAGR Sector Performance

What is CAGR and why not use AAGR?

Often you see financial advisors with statements showing awesome returns. These returns might be what is called the Average Annual Growth Rate (AAGR). Why should you be skeptical about AAGR?

A simple example will show you.

  • You start by investing 10,000$.
  • The first year you get a 100% return, resulting in 20,000$.
  • The year after you have a fall of 50%, which brings your value back to 10,000$.

Using AAGR, your advisor will tell you that you have (100% – 50%)/2 = 25% AAGR, or call it the average annual return.

But wait a minute? You have the same amount of money after two years, so how can that be 25%?

With Compound Annual Growth Rate the story is different as it only considers the start and end value. Here the difference is a big 0$, resulting in a 0% CAGR.

The formula for calculating CAGR is.

((end value)/(start value))^(1/years) – 1

As in the above example: (10,000/10,000)^(1/2) – 1 = 0
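
The same calculation expressed as small Python helpers (hypothetical functions, just to make the formula concrete):

def cagr(start_value, end_value, years):
    # Compound Annual Growth Rate: only the start and end values matter
    return (end_value / start_value) ** (1 / years) - 1


def aagr(yearly_returns):
    # Average Annual Growth Rate: the plain mean of the yearly returns
    return sum(yearly_returns) / len(yearly_returns)


# The example from above: 10,000$ doubles, then halves
print(cagr(10000, 10000, 2))   # 0.0  -> 0% CAGR
print(aagr([1.0, -0.5]))       # 0.25 -> 25% AAGR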

Step 1: Getting access to financial sector data

In this tutorial we will use Alpha Vantage. To connect to them you need to register to get an API_KEY.

To claim your key go to: https://www.alphavantage.co/support/#api-key

Where you will select Software Developer in the drop-down Which of the following best describes you? Write your organization of choice. Then write your email address and click that you are not a robot. Or are you?

Then it will give you your API_KEY on the screen (not in an email). The key is probably a 16-character string of upper case letters and digits.

Step 2: Get the sector data to play with

Looking at the Pandas-datareader API you will see you can use the get_sector_performance_av() function.

import pandas_datareader.data as web

API_KEY = "INSERT YOUR KEY HERE"

data = web.get_sector_performance_av(api_key=API_KEY)
print(data)

Remember to change API_KEY to the key you got from Step 1.

You should get an output similar to this one (not showing all columns).

                            RT      1D      5D  ...       3Y       5Y      10Y
Communication Services   0.38%   0.38%  -0.20%  ...   24.04%   29.92%   74.78%
Information Technology   0.04%   0.04%  -1.36%  ...  104.45%  183.51%  487.33%
Consumer Discretionary  -0.06%  -0.06%   1.36%  ...   66.06%   92.37%  384.71%
Materials               -0.07%  -0.07%   1.75%  ...   17.50%   37.64%  106.90%
Health Care             -0.16%  -0.17%   0.90%  ...   37.21%   43.20%  268.58%
Consumer Staples        -0.19%  -0.19%   1.42%  ...   15.96%   27.65%  137.66%
Utilities               -0.38%  -0.38%   0.60%  ...   13.39%   34.79%   99.63%
Financials              -0.61%  -0.61%   3.23%  ...    1.67%   23.89%  119.46%
Industrials             -0.65%  -0.65%   4.45%  ...   12.57%   40.05%  155.56%
Real Estate             -1.23%  -1.23%  -0.63%  ...   12.51%      NaN      NaN
Energy                  -1.99%  -1.99%   1.38%  ...  -39.45%  -44.69%  -29.07%

The columns we are interested in are the 1Y, 3Y, 5Y, and 10Y.

Step 3: Convert columns to floats

As you saw in the previous step, the columns all contain a %-sign, which tells you that the entries are strings, not floats, and need to be converted.

This can be done by some string magic. First we need to remove the %-sign before we convert it to a float.

import pandas_datareader.data as web

API_KEY = "INSERT YOUR KEY HERE"

data = web.get_sector_performance_av(api_key=API_KEY)

for column in data.columns:
  data[column] = data[column].str.rstrip('%').astype('float') / 100.0

print(data[['1Y', '3Y', '5Y' , '10Y']])

Where we convert all columns in the for-loop. Then we print only the columns we need.

                            1Y      3Y      5Y     10Y
Communication Services  0.1999  0.2404  0.2992  0.7478
Information Technology  0.4757  1.0445  1.8351  4.8733
Consumer Discretionary  0.2904  0.6606  0.9237  3.8471
Materials               0.1051  0.1750  0.3764  1.0690
Health Care             0.1908  0.3721  0.4320  2.6858
Consumer Staples        0.0858  0.1596  0.2765  1.3766
Utilities               0.0034  0.1339  0.3479  0.9963
Financials             -0.0566  0.0167  0.2389  1.1946
Industrials             0.0413  0.1257  0.4005  1.5556
Real Estate            -0.0658  0.1251     NaN     NaN
Energy                 -0.3383 -0.3945 -0.4469 -0.2907

All looking nice. Also, notice that we converted them to fractions rather than %-values by dividing by 100.

Step 4: Calculate the CAGR

Now we need to apply the CAGR formula to the columns. Note that the 1Y return already covers a single year, so its CAGR is simply the return itself (converted to percent).

import pandas_datareader.data as web

API_KEY = "INSERT YOUR KEY HERE"

data = web.get_sector_performance_av(api_key=API_KEY)

for column in data.columns:
  data[column] = data[column].str.rstrip('%').astype('float') / 100.0

data['1Y-CAGR'] = data['1Y']*100
data['3Y-CAGR'] = ((1 + data['3Y']) ** (1/3) - 1) * 100
data['5Y-CAGR'] = ((1 + data['5Y']) ** (1/5) - 1) * 100
data['10Y-CAGR'] = ((1 + data['10Y']) ** (1/10) - 1) * 100

cols = ['1Y-CAGR','3Y-CAGR', '5Y-CAGR', '10Y-CAGR']

print(data[cols])

This should result in something similar to the following.

                        1Y-CAGR    3Y-CAGR    5Y-CAGR   10Y-CAGR
Communication Services    19.99   7.445258   5.374421   5.742403
Information Technology    47.57  26.919700  23.172477  19.368083
Consumer Discretionary    29.04  18.419079  13.979689  17.097655
Materials                 10.51   5.522715   6.597970   7.541490
Health Care               19.08  11.120773   7.445592  13.933956
Consumer Staples           8.58   5.059679   5.003594   9.042452
Utilities                  0.34   4.277734   6.152820   7.157502
Financials                -5.66   0.553596   4.377587   8.177151
Industrials                4.13   4.025758   6.968677   9.837158
Real Estate               -6.58   4.007273        NaN        NaN
Energy                   -33.83 -15.399801 -11.169781  -3.376449
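
As a quick sanity check of the formula against the figures above: Information Technology's 3Y return of 104.45% gives (1 + 1.0445)^(1/3) − 1 ≈ 0.269, i.e. roughly 26.9% per year, which matches the 3Y-CAGR column.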

Looks like the Information Technology sector is very lucrative.

But to make it more digestible we should visualize it.

Step 5: Create a heatmap

We will use the seaborn library to create it, which is a statistical data visualization library.

The heatmap function simply takes the DataFrame to visualize. It could not be easier.

import pandas_datareader.data as web
import seaborn as sns
import matplotlib.pyplot as plt

API_KEY = "INSERT YOUR KEY HERE"

data = web.get_sector_performance_av(api_key=API_KEY)

for column in data.columns:
  data[column] = data[column].str.rstrip('%').astype('float') / 100.0

data['1Y-CAGR'] = data['1Y']*100
data['3Y-CAGR'] = ((1 + data['3Y']) ** (1/3) - 1) * 100
data['5Y-CAGR'] = ((1 + data['5Y']) ** (1/5) - 1) * 100
data['10Y-CAGR'] = ((1 + data['10Y']) ** (1/10) - 1) * 100

cols = ['1Y-CAGR','3Y-CAGR', '5Y-CAGR', '10Y-CAGR']

sns.heatmap(data[cols], annot=True, cmap="YlGnBu")
plt.show()

Resulting in the following heatmap.

Pandas: Calculate and plot the Bollinger Bands for a Stock

What are the Bollinger Bands?

A Bollinger Band® is a technical analysis tool defined by a set of trendlines plotted two standard deviations (positively and negatively) away from a simple moving average (SMA) of a security’s price, but which can be adjusted to user preferences.

https://www.investopedia.com/terms/b/bollingerbands.asp

The Bollinger Bands are used to discover if a stock is oversold or overbought. It is called a mean reversion indicator, which measures how far a price swing will stretch before a counter impulse triggers a retracement.

It is a lagging indicator, which looks at the historical background of the current price, as opposed to a leading indicator, which tries to predict where the price is heading.

Step 1: Get some time series data on a stock

In this tutorial we will use the Apple stock as an example, which has the ticker AAPL. You can change to any other stock of your interest by changing the ticker below. To find the ticker of your favorite company/stock you can use Yahoo! Finance ticker lookup.

To get some time series of stock data we will use the Pandas-datareader library to collect it from Yahoo! Finance.

import pandas_datareader as pdr
import datetime as dt


ticker = pdr.get_data_yahoo("AAPL", dt.datetime(2020, 1, 1), dt.datetime.now())[['Close', 'High', 'Low']]
print(ticker)

We will use the Close, High and Low columns to do the further calculations.

                 Close        High         Low
Date                                          
2020-01-02  300.350006  300.600006  295.190002
2020-01-03  297.429993  300.579987  296.500000
2020-01-06  299.799988  299.959991  292.750000
2020-01-07  298.390015  300.899994  297.480011
2020-01-08  303.190002  304.440002  297.160004
...                ...         ...         ...
2020-08-06  455.609985  457.649994  439.190002
2020-08-07  444.450012  454.700012  441.170013
2020-08-10  450.910004  455.100006  440.000000
2020-08-11  437.500000  449.929993  436.429993
2020-08-12  452.040009  453.100006  441.190002

Step 2: How are the Bollinger Bands calculated

Luckily, we can refer to Investopedia.org to get the answer, which states that the Bollinger Bands are calculated as follows.

BOLU = MA(TP, n) + m·σ[TP, n]

BOLD = MA(TP, n) − m·σ[TP, n]

Where BOLU is the Upper Bollinger Band and BOLD is the Lower Bollinger Band. MA is the Moving Average. TP and σ are calculated as follows.

TP (typical price) = (High + Low + Close) ÷ 3

σ[TP, n] = Standard Deviation over the last n periods of TP

Where n is the number of days in the smoothing period (typically 20), and m is the number of standard deviations (typically 2).

Step 3: Calculate the Bollinger Bands

This is straightforward. We start by calculating the typical price TP and then its standard deviation over the last 20 days (the typical window). Then we calculate the simple moving average over the same rolling 20-day window. With these values we can calculate the upper and lower Bollinger Bands (BOLU and BOLD).

ticker['TP'] = (ticker['Close'] + ticker['Low'] + ticker['High'])/3
ticker['std'] = ticker['TP'].rolling(20).std(ddof=0)
ticker['MA-TP'] = ticker['TP'].rolling(20).mean()
ticker['BOLU'] = ticker['MA-TP'] + 2*ticker['std']
ticker['BOLD'] = ticker['MA-TP'] - 2*ticker['std']
print(ticker)

Resulting in the following output.

                 Close        High  ...        BOLU        BOLD
Date                                ...                        
2020-01-02  300.350006  300.600006  ...         NaN         NaN
2020-01-03  297.429993  300.579987  ...         NaN         NaN
2020-01-06  299.799988  299.959991  ...         NaN         NaN
2020-01-07  298.390015  300.899994  ...         NaN         NaN
2020-01-08  303.190002  304.440002  ...         NaN         NaN
...                ...         ...  ...         ...         ...
2020-08-06  455.609985  457.649994  ...  445.784036  346.919631
2020-08-07  444.450012  454.700012  ...  453.154374  346.012626
2020-08-10  450.910004  455.100006  ...  459.958160  345.317173
2020-08-11  437.500000  449.929993  ...  464.516981  346.461685
2020-08-12  452.040009  453.100006  ...  469.891271  346.836730

Note that if you compare your results with Yahoo! Finance for Apple, there will be some small differences. The reason is that they by default use the closing price as TP, and not the average of the Close, Low and High. If you change TP to equal the Close only, you will get the same figures as they do.
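
If you want to reproduce the Yahoo! Finance figures, a minimal change (assuming their default described above) would be to replace the TP line before the rolling calculations.

# Assumption: use the closing price as TP to match the Yahoo! Finance default
ticker['TP'] = ticker['Close']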

Step 4: Plotting it on a graph

Plotting the three lines is straightforward by using plot() on the DataFrame. Making a filled area with color between BOLU and BOLD can be achieved by using fill_between().

This results in the full program to be.

import pandas_datareader as pdr
import datetime as dt
import matplotlib.pyplot as plt


ticker = pdr.get_data_yahoo("AAPL", dt.datetime(2020, 1, 1), dt.datetime.now())[['Close', 'High', 'Low']]

# Bollinger Band calculations
ticker['TP'] = (ticker['Close'] + ticker['Low'] + ticker['High'])/3
ticker['std'] = ticker['TP'].rolling(20).std(ddof=0)
ticker['MA-TP'] = ticker['TP'].rolling(20).mean()
ticker['BOLU'] = ticker['MA-TP'] + 2*ticker['std']
ticker['BOLD'] = ticker['MA-TP'] - 2*ticker['std']
ticker = ticker.dropna()
print(ticker)

# Plotting it all together
ax = ticker[['Close', 'BOLU', 'BOLD']].plot(color=['blue', 'orange', 'yellow'])
ax.fill_between(ticker.index, ticker['BOLD'], ticker['BOLU'], facecolor='orange', alpha=0.1)
plt.show()

Giving the following graph.

Apple Stock Closing price with Bollinger Band indicators

Step 5: How to use the Bollinger Band Indicator?

If the stock price is continuously touching the upper Bollinger Band (BOLU), the market is thought to be overbought. While if the price continuously touches the lower Bollinger Band (BOLD), the market is thought to be oversold.

The more volatile the market is, the wider the upper and lower bands will be. Hence, the bands also indicate how volatile the market is in a given period.

When the upper and lower bands are close together it is referred to as a squeeze. This is considered a sign that there will be more volatility in the near future, which opens up possible trading opportunities.
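
As a rough illustration of this, the band width can be computed from the values we already have; small values would then hint at a squeeze. The BandWidth column below is made up for this sketch and is not part of the original calculation.

# Assumption: band width relative to the moving average as a simple volatility proxy
ticker['BandWidth'] = (ticker['BOLU'] - ticker['BOLD']) / ticker['MA-TP']
print(ticker['BandWidth'].tail())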

A common misconception about the bands is that when the price breaks out of the upper or lower band, it is a trading signal. This is not the case.

As with all trading indicators, it should not be used alone to make trading decisions.

Pandas: Calculate the Stochastic Oscillator Indicator for Stocks

What is the Stochastic Oscillator Indicator for a stock?

A stochastic oscillator is a momentum indicator comparing a particular closing price of a security to a range of its prices over a certain period of time.

https://www.investopedia.com/terms/s/stochasticoscillator.asp

The stochastic oscillator is an indicator for the speed and momentum of the price. The indicator changes direction before the price does and is therefore a leading indicator.

Step 1: Get stock data to do the calculations on

In this tutorial we will use the Apple stock as an example, which has the ticker AAPL. You can change to any other stock of your interest by changing the ticker below. To find the ticker of your favorite company/stock you can use Yahoo! Finance ticker lookup.

To get some time series of stock data we will use the Pandas-datareader library to collect it from Yahoo! Finance.

import pandas_datareader as pdr
import datetime as dt


ticker = pdr.get_data_yahoo("AAPL", dt.datetime(2020, 1, 1), dt.datetime.now())
print(ticker)

Where we only focus on data from 2020 until today.

                  High         Low  ...      Volume   Adj Close
Date                                ...                        
2020-01-02  300.600006  295.190002  ...  33870100.0  298.292145
2020-01-03  300.579987  296.500000  ...  36580700.0  295.392120
2020-01-06  299.959991  292.750000  ...  29596800.0  297.745880
2020-01-07  300.899994  297.480011  ...  27218000.0  296.345581
2020-01-08  304.440002  297.160004  ...  33019800.0  301.112640
...                ...         ...  ...         ...         ...
2020-08-05  441.570007  435.589996  ...  30498000.0  439.457642
2020-08-06  457.649994  439.190002  ...  50607200.0  454.790009
2020-08-07  454.700012  441.170013  ...  49453300.0  444.450012
2020-08-10  455.100006  440.000000  ...  53100900.0  450.910004
2020-08-11  449.929993  436.429993  ...  46871100.0  437.500000

[154 rows x 6 columns]

The output does not show all the columns, which are: High, Low, Open, Close, Volume, and Adj Close.

Step 2: Understand the calculation of Stochastic Oscillator Indicator

The Stochastic Oscillator Indicator consists of two values calculated as follows.

%K = (Last Close − Lowest low) / (Highest high − Lowest low) × 100

%D = Simple Moving Average of %K

What %K looks at is the Lowest low and Highest high in a window of some days. The original definition set the window to 14 days, but it can be changed; I’ve seen others use 20 days, as the stock market is open about 20 days per month. The simple moving average for %D was set to 3 days.

The numbers are converted to percentage, hence the indicators are in the range of 0% to 100%.

The idea is that if the indicator is above 80%, the stock is considered to be in the overbought range, while if it is below 20%, it is considered to be oversold.

Step 3: Calculate the Stochastic Oscillator Indicator

With the above description it is straightforward to do.

import pandas_datareader as pdr
import datetime as dt


ticker = pdr.get_data_yahoo("AAPL", dt.datetime(2020, 1, 1), dt.datetime.now())

ticker['14-high'] = ticker['High'].rolling(14).max()
ticker['14-low'] = ticker['Low'].rolling(14).min()
ticker['%K'] = (ticker['Close'] - ticker['14-low'])*100/(ticker['14-high'] - ticker['14-low'])
ticker['%D'] = ticker['%K'].rolling(3).mean()
print(ticker)

Resulting in the following output.

                  High         Low  ...         %K         %D
Date                                ...                      
2020-01-02  300.600006  295.190002  ...        NaN        NaN
2020-01-03  300.579987  296.500000  ...        NaN        NaN
2020-01-06  299.959991  292.750000  ...        NaN        NaN
2020-01-07  300.899994  297.480011  ...        NaN        NaN
2020-01-08  304.440002  297.160004  ...        NaN        NaN
...                ...         ...  ...        ...        ...
2020-08-05  441.570007  435.589996  ...  92.997680  90.741373
2020-08-06  457.649994  439.190002  ...  97.981589  94.069899
2020-08-07  454.700012  441.170013  ...  86.939764  92.639677
2020-08-10  455.100006  440.000000  ...  93.331365  92.750906
2020-08-11  449.929993  436.429993  ...  80.063330  86.778153

[154 rows x 10 columns]

Please notice that we have not included all columns here. Also, see that %K and %D are not available for the first days, as they need 14 days of data to be calculated.
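
As a small illustration of the 80%/20% ranges from Step 2, you could flag the days that fall in the overbought or oversold range. The two boolean columns below are made up for this sketch and are not part of the original calculation.

# Assumption: simple boolean flags based on the 80/20 thresholds from Step 2
ticker['Overbought'] = ticker['%K'] > 80
ticker['Oversold'] = ticker['%K'] < 20
print(ticker[['%K', '%D', 'Overbought', 'Oversold']].tail())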

Step 4: Plotting the data on a graph

We will combine two graphs in one. This can be easily obtained using the Pandas DataFrame plot() function. The argument secondary_y can be used to plot against a secondary y-axis.

The two lines %K and %D are both on the same scale 0-100, while the stock prices are on a different scale depending on the specific stock.

To keep things simple, we also want to plot horizontal lines indicating the 80% high and 20% low levels. This can be done by using axhline() on the Axes object that plot() returns.

The full code results in the following.

import pandas_datareader as pdr
import datetime as dt
import matplotlib.pyplot as plt


ticker = pdr.get_data_yahoo("AAPL", dt.datetime(2020, 1, 1), dt.datetime.now())

ticker['14-high'] = ticker['High'].rolling(14).max()
ticker['14-low'] = ticker['Low'].rolling(14).min()
ticker['%K'] = (ticker['Close'] - ticker['14-low'])*100/(ticker['14-high'] - ticker['14-low'])
ticker['%D'] = ticker['%K'].rolling(3).mean()

ax = ticker[['%K', '%D']].plot()
ticker['Adj Close'].plot(ax=ax, secondary_y=True)
ax.axhline(20, linestyle='--', color="r")
ax.axhline(80, linestyle="--", color="r")
plt.show()

Resulting in the following graph.

Apple with Stochastic Oscillator Indicator %K and %D

Step 5: Interpreting the signals

First a word of warning. Most advice is against using one indicator alone as a buy-sell signal. This also holds for the Stochastic Oscillator indicator. As the name suggests, it is only an indicator, not a predictor.

The indicator signals buy or sell when the two lines cross each other. If %K crosses above %D it signals buy, and when it crosses below, it signals sell.
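
A minimal sketch of flagging these crossovers, assuming the ticker DataFrame from Step 3 (the Signal column is made up for this illustration), could look like this.

# Assumption: +1 where %K crosses above %D (buy), -1 where it crosses below (sell)
above = ticker['%K'] > ticker['%D']
crossed = above != above.shift(fill_value=False)
ticker['Signal'] = 0
ticker.loc[crossed & above, 'Signal'] = 1
ticker.loc[crossed & ~above, 'Signal'] = -1
# Note: the first flagged row may simply be the first day %K becomes available
print(ticker[ticker['Signal'] != 0][['%K', '%D', 'Signal']])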

Looking at the graph, it makes a lot of signals (every time the two lines cross each other). This is a good reason to have other indicators to rely on.

A common misconception is that the signal should only be used when the indicator is in the region below the 20% low or above the 80% high. But the indicator can stay low or high for quite some time. Hence, if we had sold when reaching the 80% high in this case, we would have missed a great opportunity for a big gain.