How To use Matplotlib Object Oriented with NumPy and Pandas

What will we cover in this tutorial?

If you like data visualization with NumPy and Pandas, then you must have encountered Matplotlib.

And if you also like to program in an object oriented fashion, then most tutorials will leave you wondering whether no one loves the art of beautiful code.

Let me elaborate. In most tutorials the interaction with Matplotlib is done through module-level function calls with a lot of side effects. Not nice.

Not sure what I am talking about? We will cover that too.

Step 1: How NumPy is demonstrated to make plots with Matplotlib and what is wrong with it

Let’s make a simple example.

import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 5, 11)
y = x ** 2
plt.plot(x, y)
plt.xlabel("X Label")
plt.ylabel("Y Label")
plt.title("Title")
plt.show()

This will result in the following chart.

That is nice and easy! So what is wrong with it?

Side effects!

What is a side effect in programming?

…that is to say has an observable effect besides returning a value (the main effect) to the invoker of the operation.

https://en.wikipedia.org/wiki/Side_effect_(computer_science)

What does that mean?

Well, let’s examine the above example.

We call plt.plot(x, y) and what happens? Actually, we do not know. We do not get anything useful in return.

We continue to call plt.xlabel(…), plt.ylabel(…), and plt.title(…). Then we call plt.show() to see the result. Each call changes the state of the plt module we imported. Notice that we never created an object; we call the library directly.

As a programmer, this is difficult to understand without deep knowledge of the library.

So how do we do it in a more understandable way?

Step 2: How to create a chart with Matplotlib with NumPy in an object oriented way and why it is better

Let’s look at this code and examine it.

import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 5, 11)
y = x ** 2

fig, ax = plt.subplots()
ax.plot(x, y)
ax.set_xlabel("X Label")
ax.set_ylabel("Y Label")
ax.set_title("Title")
fig.show()
plt.waitforbuttonpress()

Here we do it differently but get the same result. When we call a method on the object ax, it is clear that the state of ax is changing, rather than something hidden in the library being changed as a side effect.

You can also show the figure fig by calling show() on it instead of on the library. This requires that we add waitforbuttonpress() on plt, otherwise the window is destroyed immediately.

Note that you do not have these challenges in a Jupyter notebook – the plots are shown without the call to show().

You could keep plt.show() instead of fig.show() and plt.waitforbuttonpress(), but the object oriented code above is more intuitive and easier to understand.
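For reference, here is a minimal sketch of that alternative: the axes are still configured through the ax object, but pyplot manages the event loop, so no waitforbuttonpress() is needed.

import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 5, 11)
y = x ** 2

fig, ax = plt.subplots()
ax.plot(x, y)
ax.set_xlabel("X Label")
ax.set_ylabel("Y Label")
ax.set_title("Title")
plt.show()  # blocks until the window is closed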

How to create a chart with Matplotlib of a Pandas DataFrame in an object oriented way

This is straightforward, as Matplotlib is well integrated with Pandas.

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

x = np.linspace(0, 5, 11)
y = x ** 2

df = pd.DataFrame(data=y, index=x)

fig, ax = plt.subplots()
ax.plot(df)
ax.set_xlabel("X Label")
ax.set_ylabel("Y Label")
ax.set_title("Title")
fig.show()
plt.waitforbuttonpress()

Notice that the DataFrame is created from the NumPy arrays, so here we do not gain anything from using it. This is just to exemplify how easy it is to use Matplotlib in an object oriented way with Pandas.

Final thoughts

I have found that programmers either hate or love Matplotlib. I do not always know why, but I have discovered that this non-object oriented way of using Matplotlib annoys some programmers.

This is a fair reason to dislike it, but I would say that there is no good alternative to Matplotlib – or at least, the alternatives are built upon Matplotlib.

I like the power and ease of using Matplotlib, and I like the option of using it in an object oriented way, which makes the code more intuitive and easier for other programmers to understand.

How To Extract Numbers From Strings in HTML Table and Export to Excel from Python

What will we cover in this tutorial?

How to import an HTML table into Excel.

But that is easy, right? You can do that directly from Excel.

Yes, but what if the entries contain numbers and text together? Then the import converts them to strings, which makes it difficult to extract the numbers.

Luckily, we will cover how to do that easily with Python.

Step 1: Get the dataset

Find your favorite HTML table online. For the purpose of this tutorial I will use this one from Wikipedia with List of Metro Systems.

View of HTML table of interest

Say, what if we wanted to sum how many stations are in this table (please notice that the table contains more rows than shown in the above picture).

If you import that directly into Excel with the built-in import functionality, you will realize that the Stations column is interpreted as strings. The problem is that an entry will look like 19[13], while we are only interested in the number 19.

There is no built-in functionality to do that directly in Excel.

But let’s try to import this into Python. We will use Pandas to do that. If you are new to Pandas, please see this tutorial.

import pandas as pd


url = "https://en.wikipedia.org/wiki/List_of_metro_systems"
tables = pd.read_html(url)

print(tables[0].head())

Which will result in the following output.

           City    Country  ...          System length Annual ridership(millions)
0       Algiers    Algeria  ...  18.5 km (11.5 mi)[14]           45.3 (2019)[R 1]
1  Buenos Aires  Argentina  ...  56.7 km (35.2 mi)[16]          337.7 (2018)[R 2]
2       Yerevan    Armenia  ...   13.4 km (8.3 mi)[17]           20.2 (2019)[R 3]
3        Sydney  Australia  ...  36 km (22 mi)[19][20]  14.2 (2019) [R 4][R Nb 1]
4        Vienna    Austria  ...  83.3 km (51.8 mi)[21]          459.8 (2019)[R 6]

Where we have the same problem. If we inspect the type of the columns we get the following.

City                          object
Country                       object
Name                          object
Yearopened                    object
Year of lastexpansion         object
Stations                      object
System length                 object
Annual ridership(millions)    object
dtype: object

Here all columns are of type object, which in this case means they are strings.
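If you want to reproduce that overview yourself, you can print the dtypes of the first table (assuming tables is the list returned by read_html above).

print(tables[0].dtypes)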

Step 2: Extract the numbers from Stations and System length column

Each entry in tables is a DataFrame, which is Pandas' main data structure.

As the strings we want to convert to integers contain more than just the numbers, we cannot simply use to_numeric().

We want to convert something of the form 19[13] to 19.
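On a single value, the conversion is just a string split followed by an int conversion.

value = "19[13]"
number = int(value.split('[')[0])
print(number)  # 19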

To do that easily, we will use the apply(…) method on the DataFrame.

The apply method takes a function as an argument and, with axis=1, applies it to each row.

We will use a lambda function as argument. If you are not familiar with lambda functions, please read this tutorial.

import pandas as pd


url = "https://en.wikipedia.org/wiki/List_of_metro_systems"
tables = pd.read_html(url)
table = tables[0]

table['Stations'] = table.apply(lambda row: int(row['Stations'].split('[')[0]), axis=1)
table['System length'] = table.apply(lambda row: float(row['System length'].split()[0]), axis=1)

print(table[['Stations', 'System length']].head())

Which will result in the following output.

   Stations  System length
0        19           18.5
1        90           56.7
2        10           13.4
3        13           36.0
4        98           83.3

This is what we want.

Step 3: Export to Excel

Wow. This needs an entire step?

Well, of course it does.

Here we need to unleash the power of Pandas and use the to_excel(…) method.

import pandas as pd


url = "https://en.wikipedia.org/wiki/List_of_metro_systems"
tables = pd.read_html(url)
table = tables[0]

table['Stations'] = table.apply(lambda row: int(row['Stations'].split('[')[0]), axis=1)
table['System length'] = table.apply(lambda row: float(row['System length'].split()[0]), axis=1)

table.to_excel('output.xlsx')

This will result in an Excel file looking similar to this, where the Stations and System length columns are numeric and not string.

Excel file now with Stations and System length as numbers and not strings

What’s next?

Want to learn more about Python and Excel?

Check out my online guide.

How to Export Pandas DataFrame to Excel and Create a Trendline Graph of Scatter Plot

What will we cover in this tutorial?

We will have some data in a Pandas DataFrame, which we want to export to an Excel sheet. Then we want to create a scatter plot and fit an Excel trendline to it.

Step 1: Get the data

You might have some data already that you want to use. It can be from an HTML page (example) or a CSV file.

For this purpose we just generate some random data. We will use NumPy's uniform function to generate it.

import pandas as pd
import numpy as np


# Generate some random increasing data
data = pd.DataFrame(
    {'A': [np.random.uniform(0.1*i, 0.1*i + 1) for i in range(100)],
     'B': [np.random.uniform(0.1*i, 0.1*i + 1) for i in range(100)]}
)
print(data)

This will generate slightly increasing data, which is nice for fitting a trendline to.

The output could look something like this.

            A          B
0    0.039515   0.778077
1    0.451888   0.210705
2    0.992493   0.961428
3    0.317536   1.046444
4    1.220419   1.388086

Step 2: Create an Excel XlsxWriter engine

This step might require that you install the XlsxWriter library, which Pandas needs to use as the Excel engine here.

This can be done by the following command.

pip install xlsxwriter

Now we can create the engine in our code.

import pandas as pd
import numpy as np


# Generate some random increasing data
data = pd.DataFrame(
    {'A': [np.random.uniform(0.1*i, 0.1*i + 1) for i in range(100)],
     'B': [np.random.uniform(0.1*i, 0.1*i + 1) for i in range(100)]}
)

# Create a Pandas Excel writer using XlsxWriter
excel_file = 'output.xlsx'
sheet_name = 'Data set'
writer = pd.ExcelWriter(excel_file, engine='xlsxwriter')

This sets up an Excel writer engine ready to write to the file output.xlsx.

Step 3: Write the data to Excel and create a scatter graph with a fitted Trendline

This can be done with the following code, which uses the chart's add_series method to add the data series and insert_chart to place the chart in the worksheet.

import pandas as pd
import numpy as np


# Generate some random increasing data
data = pd.DataFrame(
    {'A': [np.random.uniform(0.1*i, 0.1*i + 1) for i in range(100)],
     'B': [np.random.uniform(0.1*i, 0.1*i + 1) for i in range(100)]}
)

# Create a Pandas Excel writer using XlsxWriter
excel_file = 'output.xlsx'
sheet_name = 'Data set'
writer = pd.ExcelWriter(excel_file, engine='xlsxwriter')
data.to_excel(writer, sheet_name=sheet_name)

# Access the XlsxWriter workbook and worksheet objects from the dataframe.
workbook = writer.book
worksheet = writer.sheets[sheet_name]

# Create a scatter chart object.
chart = workbook.add_chart({'type': 'scatter'})

# Get the number of rows and column index
max_row = len(data)
col_x = data.columns.get_loc('A') + 1
col_y = data.columns.get_loc('B') + 1

# Create the scatter plot, use a trendline to fit it
chart.add_series({
    'name':       "Samples",
    'categories': [sheet_name, 1, col_x, max_row, col_x],
    'values':     [sheet_name, 1, col_y, max_row, col_y],
    'marker':     {'type': 'circle', 'size': 4},
    'trendline': {'type': 'linear'},
})

# Set name on axis
chart.set_x_axis({'name': 'Concentration'})
chart.set_y_axis({'name': 'Measured',
                  'major_gridlines': {'visible': False}})

# Insert the chart into the worksheet in field D2
worksheet.insert_chart('D2', chart)

# Close and save the Excel file
writer.save()

Result

The result should be similar to this.

The resulting Excel sheet.

That is how it can be done.

Pandas + GeoPandas + OpenCV: Create a Video of COVID-19 World Map

What will we cover?

How to create a video like the one below using Pandas + GeoPandas + OpenCV in Python.

  1. How to collect newest COVID-19 data in Python using Pandas.
  2. Prepare data and calculate values needed to create Choropleth map
  3. Get Choropleth map from GeoPandas and prepare to combine it
  4. Get the data frame by frame to the video
  5. Combine it all to a video using OpenCV

Step 1: Get the daily reported COVID-19 data world wide

This data is available from the European Centre for Disease Prevention and Control and can be found here.

All we need is to download the csv file, which has all the historic data from all the reported countries.

This can be done as follows.

import pandas as pd


# Just to get more rows, columns and display width
pd.set_option('display.max_rows', 300)
pd.set_option('display.max_columns', 300)
pd.set_option('display.width', 1000)

# Get the updated data
table = pd.read_csv("https://opendata.ecdc.europa.eu/covid19/casedistribution/csv")

print(table)

This will give us an idea of how the data is structured.

          dateRep  day  month  year  cases  deaths countriesAndTerritories geoId countryterritoryCode  popData2019 continentExp  Cumulative_number_for_14_days_of_COVID-19_cases_per_100000
0      01/10/2020    1     10  2020     14       0             Afghanistan    AF                  AFG   38041757.0         Asia                                           1.040961         
1      30/09/2020   30      9  2020     15       2             Afghanistan    AF                  AFG   38041757.0         Asia                                           1.048847         
2      29/09/2020   29      9  2020     12       3             Afghanistan    AF                  AFG   38041757.0         Asia                                           1.114565         
3      28/09/2020   28      9  2020      0       0             Afghanistan    AF                  AFG   38041757.0         Asia                                           1.343261         
4      27/09/2020   27      9  2020     35       0             Afghanistan    AF                  AFG   38041757.0         Asia                                           1.540413         
...           ...  ...    ...   ...    ...     ...                     ...   ...                  ...          ...          ...                                                ...         
46221  25/03/2020   25      3  2020      0       0                Zimbabwe    ZW                  ZWE   14645473.0       Africa                                                NaN         
46222  24/03/2020   24      3  2020      0       1                Zimbabwe    ZW                  ZWE   14645473.0       Africa                                                NaN         
46223  23/03/2020   23      3  2020      0       0                Zimbabwe    ZW                  ZWE   14645473.0       Africa                                                NaN         
46224  22/03/2020   22      3  2020      1       0                Zimbabwe    ZW                  ZWE   14645473.0       Africa                                                NaN         
46225  21/03/2020   21      3  2020      1       0                Zimbabwe    ZW                  ZWE   14645473.0       Africa                                                NaN         

[46226 rows x 12 columns]

First we want to convert the dateRep column to a datetime object (it cannot be seen above, but the dates are represented as strings). Then we use it as the index for easier access later.

import pandas as pd


# Just to get more rows, columns and display width
pd.set_option('display.max_rows', 300)
pd.set_option('display.max_columns', 300)
pd.set_option('display.width', 1000)

# Get the updated data
table = pd.read_csv("https://opendata.ecdc.europa.eu/covid19/casedistribution/csv")

# Convert dateRep to date object
table['date'] = pd.to_datetime(table['dateRep'], format='%d/%m/%Y')
# Use date for index
table = table.set_index('date')

Step 2: Prepare data and compute values needed for plot

What makes sense to plot?

Good question. In a choropleth map you color each region according to a value. Here, the higher a country's value, the darker red it will be colored.

If we plotted the raw number of new COVID-19 cases, the value would simply be high for countries with large populations. Hence, the number of COVID-19 cases per 100,000 people is used.

New COVID-19 cases per 100,000 people can still be volatile and change drastically from day to day. To even that out, a 7-day rolling sum can be used. That is, for each day you take the sum of the last 7 days and continue that process through your data.

To make it even less volatile, the average of the 7-day rolling sum over the last 14 days is used.

And no, it is not just something I invented. It is used by the authorities in my home country to decide which countries are open for travel.

This can be calculated from the data above as follows.

def get_stat(country_code, table):
    data = table.loc[table['countryterritoryCode'] == country_code]
    data = data.reindex(index=data.index[::-1])
    data['7 days sum'] = data['cases'].rolling(7).sum()
    data['7ds/100000'] = data['7 days sum'] * 100000 / data['popData2019']
    data['14 mean'] = data['7ds/100000'].rolling(14).mean()
    return data

The above function takes the table we created in Step 1 and extracts a country based on a country code. Then it reverses the data so the dates are in chronological order.

After that, it computes the 7-day rolling sum, scales it to cases per 100,000 people using the country's population, and finally computes the 14-day average (mean) of that.
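As a minimal usage sketch, assuming the table from Step 1 is loaded and 'DNK' is a valid countryterritoryCode in it:

dnk = get_stat('DNK', table)
print(dnk[['cases', '7 days sum', '7ds/100000', '14 mean']].tail())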

Step 3: Get the Choropleth map data and prepare it

GeoPandas is an amazing library to create Choropleth maps. But it does need your attention when you combine it with other data.

Here we want to join it with the country codes (ISO_A3). If you inspect the data, some of the countries are missing that code.

Other than that, the code is straightforward.

import pandas as pd
import geopandas


# Just to get more rows, columns and display width
pd.set_option('display.max_rows', 300)
pd.set_option('display.max_columns', 300)
pd.set_option('display.width', 1000)

# Get the updated data
table = pd.read_csv("https://opendata.ecdc.europa.eu/covid19/casedistribution/csv")

# Convert dateRep to date object
table['date'] = pd.to_datetime(table['dateRep'], format='%d/%m/%Y')
# Use date for index
table = table.set_index('date')


def get_stat(country_code, table):
    data = table.loc[table['countryterritoryCode'] == country_code]
    data = data.reindex(index=data.index[::-1])
    data['7 days sum'] = data['cases'].rolling(7).sum()
    data['7ds/100000'] = data['7 days sum'] * 100000 / data['popData2019']
    data['14 mean'] = data['7ds/100000'].rolling(14).mean()
    return data


# Read the data to make a choropleth map
world = geopandas.read_file(geopandas.datasets.get_path('naturalearth_lowres'))
world = world[(world.pop_est > 0) & (world.name != "Antarctica")]

# Store data per country to make it easier
data_by_country = {}

for index, row in world.iterrows():
    # The world data is not fully updated with ISO_A3 names
    if row['iso_a3'] == '-99':
        country = row['name']
        if country == "Norway":
            world.at[index, 'iso_a3'] = 'NOR'
            row['iso_a3'] = "NOR"
        elif country == "France":
            world.at[index, 'iso_a3'] = 'FRA'
            row['iso_a3'] = "FRA"
        elif country == 'Kosovo':
            world.at[index, 'iso_a3'] = 'XKX'
            row['iso_a3'] = "XKX"
        elif country == "Somaliland":
            world.at[index, 'iso_a3'] = '---'
            row['iso_a3'] = "---"
        elif country == "N. Cyprus":
            world.at[index, 'iso_a3'] = '---'
            row['iso_a3'] = "---"

    # Add the data for the country
    data_by_country[row['iso_a3']] = get_stat(row['iso_a3'], table)

This will create a dictionary (data_by_country) with the needed data for each country. Notice that we do it this way because not all countries have the same number of data points.

Step 4: Create a Choropleth map for each date and save it as an image

This can be achieved by using matplotlib.

The idea is to go through all dates and, for each country, check whether it has data for that date and use it if it does.

import pandas as pd
import geopandas
import matplotlib.pyplot as plt


# Just to get more rows, columns and display width
pd.set_option('display.max_rows', 300)
pd.set_option('display.max_columns', 300)
pd.set_option('display.width', 1000)

# Get the updated data
table = pd.read_csv("https://opendata.ecdc.europa.eu/covid19/casedistribution/csv")

# Convert dateRep to date object
table['date'] = pd.to_datetime(table['dateRep'], format='%d/%m/%Y')
# Use date for index
table = table.set_index('date')


def get_stat(country_code, table):
    data = table.loc[table['countryterritoryCode'] == country_code]
    data = data.reindex(index=data.index[::-1])
    data['7 days sum'] = data['cases'].rolling(7).sum()
    data['7ds/100000'] = data['7 days sum'] * 100000 / data['popData2019']
    data['14 mean'] = data['7ds/100000'].rolling(14).mean()
    return data


# Read the data to make a choropleth map
world = geopandas.read_file(geopandas.datasets.get_path('naturalearth_lowres'))
world = world[(world.pop_est > 0) & (world.name != "Antarctica")]

# Store data per country to make it easier
data_by_country = {}

for index, row in world.iterrows():
    # The world data is not fully updated with ISO_A3 names
    if row['iso_a3'] == '-99':
        country = row['name']
        if country == "Norway":
            world.at[index, 'iso_a3'] = 'NOR'
            row['iso_a3'] = "NOR"
        elif country == "France":
            world.at[index, 'iso_a3'] = 'FRA'
            row['iso_a3'] = "FRA"
        elif country == 'Kosovo':
            world.at[index, 'iso_a3'] = 'XKX'
            row['iso_a3'] = "XKX"
        elif country == "Somaliland":
            world.at[index, 'iso_a3'] = '---'
            row['iso_a3'] = "---"
        elif country == "N. Cyprus":
            world.at[index, 'iso_a3'] = '---'
            row['iso_a3'] = "---"

    # Add the data for the country
    data_by_country[row['iso_a3']] = get_stat(row['iso_a3'], table)

# Create an image per date
for day in pd.date_range('12-31-2019', '10-01-2020'):
    print(day)
    world['number'] = 0.0
    for index, row in world.iterrows():
        if day in data_by_country[row['iso_a3']].index:
            world.at[index, 'number'] = data_by_country[row['iso_a3']].loc[day]['14 mean']

    world.plot(column='number', legend=True, cmap='OrRd', figsize=(15, 5))
    plt.title(day.strftime("%Y-%m-%d"))
    plt.savefig(f'image-{day.strftime("%Y-%m-%d")}.png')
    plt.close()

This creates an image for each day. In the next step these images are combined into a video.

Step 5: Create a video from images with OpenCV

Using OpenCV to create a video from a sequence of images is quite easy. The only thing you need to ensure is that the images are read in the correct order; since the filenames contain zero-padded dates (image-YYYY-MM-DD.png), a simple lexicographic sort gives chronological order.

import cv2
import glob

img_array = []
filenames = glob.glob('image-*.png')
filenames.sort()
for filename in filenames:
    print(filename)
    img = cv2.imread(filename)
    height, width, layers = img.shape
    size = (width, height)
    img_array.append(img)

out = cv2.VideoWriter('covid.avi', cv2.VideoWriter_fourcc(*'DIVX'), 15, size)

for i in range(len(img_array)):
    out.write(img_array[i])
out.release()

Where we use the VideoWriter from OpenCV.

This results in this video.

Performance comparison of Numba vs Vectorization vs Lambda function with NumPy

What will we cover in this tutorial?

We will continue our investigation of Numba from this tutorial.

Numba is a just-in-time compiler for Python that works amazingly well with NumPy. As we saw in the last tutorial, built-in vectorization can, depending on the case and the size of the instance, be faster than Numba.

Here we will explore that further and also see how Numba compares with lambda functions. Lambda functions have the advantage that they can be passed as arguments down to a library, which can then optimize the performance instead of depending on slow Python code.
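As a minimal illustration of that idea (using Pandas, which is assumed to be installed), a lambda handed to DataFrame.apply runs inside the library, where each column is processed as one vectorized operation.

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.rand(5, 3))
# The lambda receives a whole column (a Series), so the addition is vectorized
print(df.apply(lambda col: col + 1))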

Step 1: Example of Vectorization slower than Numba

In the previous tutorial we only investigated an example of vectorization that was faster than Numba. Here we will see that this is not always the case.

import numpy as np
from numba import jit
import time

size = 100
x = np.random.rand(size, size)
y = np.random.rand(size, size)
iterations = 100000


@jit(nopython=True)
def add_numba(a, b):
    c = np.zeros(a.shape)
    for i in range(a.shape[0]):
        for j in range(a.shape[1]):
            c[i, j] = a[i, j] + b[i, j]
    return c


def add_vectorized(a, b):
    return a + b


# We call the function once, to precompile the code
z = add_numba(x, y)
start = time.time()
for _ in range(iterations):
    z = add_numba(x, y)
end = time.time()
print("Elapsed (numba, precompiled) = %s" % (end - start))

start = time.time()
for _ in range(iterations):
    z = add_vectorized(x, y)
end = time.time()
print("Elapsed (vectorized) = %s" % (end - start))

Varying the size of the NumPy array, we can compare the performance of the two in the graph below.

Where it is clear that the vectorized approach is slower.

Step 2: Try some more complex example comparing vectorized and Numba

An if-then-else can be expressed in vectorized form using the NumPy where function.

import numpy as np
from numba import jit
import time


size = 1000
x = np.random.rand(size, size)
iterations = 1000


@jit(nopython=True)
def numba(a):
    c = np.zeros(a.shape)
    for i in range(a.shape[0]):
        for j in range(a.shape[1]):
            if a[i, j] < 0.5:
                c[i, j] = 1
    return c


def vectorized(a):
    return np.where(a < 0.5, 1, 0)


# We call the numba function to precompile it before we measure it
z = numba(x)
start = time.time()
for _ in range(iterations):
    z = numba(x)
end = time.time()
print("Elapsed (numba, precompiled) = %s" % (end - start))

start = time.time()
for _ in range(iterations):
    z = vectorized(x)
end = time.time()
print("Elapsed (vectorized) = %s" % (end - start))

This results in the following comparison.

That is close, but the vectorized approach is a bit faster.

Step 3: Compare Numba with lambda functions

I am very curious about this. Lambda functions are somewhat controversial in Python, and many are not happy about their syntax. On the other hand, lambda functions have the advantage that you can pass them down into a library that can optimize the loops for you.

import numpy as np
from numba import jit
import time
import pandas as pd

size = 1000
x = np.random.rand(size, size)
# DataFrame with the same data, used for the lambda/apply test below
df = pd.DataFrame(x)
iterations = 1000


@jit(nopython=True)
def numba(a):
    c = np.zeros((size, size))
    for i in range(a.shape[0]):
        for j in range(a.shape[1]):
            c[i, j] = a[i, j] + 1
    return c


def lambda_run(a):
    return a.apply(lambda x: x + 1)


# Call the numba function to precompile it before time measurement
z = numba(x)
start = time.time()
for _ in range(iterations):
    z = numba(x)
end = time.time()
print("Elapsed (numba, precompiled) = %s" % (end - start))

start = time.time()
for _ in range(iterations):
    z = lambda_run(df)
end = time.time()
print("Elapsed (lambda) = %s" % (end - start))

Resulting in the following performance comparison.

This is again tight, but the lambda approach is still a bit faster.

Remember, this is a simple lambda function, and we cannot conclude that lambda functions in general are faster than using Numba.

Conclusion

Since the last tutorial we have found an example where simple vectorization is slower than Numba. This still leads to the conclusion that performance highly depends on the task. Further, the lambda function approach seems to give promising performance. Again, all of this should be compared to the slow baseline of a plain Python for-loop without Numba's just-in-time compiled machine code.

When to use Numba with Python NumPy: Vectorization vs Numba

What will we cover in this tutorial?

You just want your code to run fast, right? Numba is a just-in-time compiler for Python that works amazingly well with NumPy. Does that mean we should always use Numba?

Well, let’s try some examples out and learn. If you know about NumPy, you know you should use vectorization to get speed. Does Numba beat that?

Step 1: Let’s learn how Numba works

Numba compiles Python code into machine code and runs it. What does just-in-time mean? It means that the first time the code is called, Numba compiles it and then runs it. Any time after that, it just runs the already compiled code.

Let’s try that.

import numpy as np
from numba import jit
import time


@jit(nopython=True)
def full_sum_numba(a):
    sum = 0.0
    for i in range(a.shape[0]):
        for j in range(a.shape[1]):
            sum += a[i, j]
    return sum


iterations = 1000
size = 10000
x = np.random.rand(size, size)

start = time.time()
full_sum_numba(x)
end = time.time()
print("Elapsed (Numba) = %s" % (end - start))

start = time.time()
full_sum_numba(x)
end = time.time()
print("Elapsed (Numba) = %s" % (end - start))

Where you get.

Elapsed (Numba) = 0.41634082794189453
Elapsed (Numba) = 0.11176300048828125

Where you see a difference in runtime.

Oh, did you get what happened in the code? Well, if you put @jit(nopython=True) in front of a function, Numba will try to compile it and run it as machine code.

As you see above, the first call has an overhead in run-time, because it first compiles and then runs the code. The second time, it is already compiled and can run immediately.

Step 2: Compare Numba just-in-time code to native Python code

So let us compare how much you gain by using Numba just-in-time (@jit) in our code.

import numpy as np
from numba import jit
import time


def full_sum(a):
    sum = 0.0
    for i in range(a.shape[0]):
        for j in range(a.shape[1]):
            sum += a[i, j]
    return sum


@jit(nopython=True)
def full_sum_numba(a):
    sum = 0.0
    for i in range(a.shape[0]):
        for j in range(a.shape[1]):
            sum += a[i, j]
    return sum


iterations = 1000
size = 10000
x = np.random.rand(size, size)

start = time.time()
full_sum(x)
end = time.time()
print("Elapsed (No Numba) = %s" % (end - start))

start = time.time()
full_sum_numba(x)
end = time.time()
print("Elapsed (Numba) = %s" % (end - start))

start = time.time()
full_sum_numba(x)
end = time.time()
print("Elapsed (Numba) = %s" % (end - start))

Here we added a plain Python function without the @jit decorator and compare it with the one that has it. The results are below.

Elapsed (No Numba) = 38.08543515205383
Elapsed (Numba) = 0.41634082794189453
Elapsed (Numba) = 0.11176300048828125

That is some difference. Also, we have plotted a few more runs in the graph below.

It seems pretty evident.

Step 3: Comparing it with Vectorization

If you don’t know what vectorization is, we can recommend this tutorial. The point of vectorization is to move the expensive for-loops into the library call, where optimized code runs them.

That sounds a lot like what Numba can do. It can change the expensive for-loops into fast machine code.

But which one is faster?

Well, I think there are two parameters to try out. First, the size of the problem. Second, whether the number of iterations matters.

import numpy as np
from numba import jit
import time


@jit(nopython=True)
def full_sum_numba(a):
    sum = 0.0
    for i in range(a.shape[0]):
        for j in range(a.shape[1]):
            sum += a[i, j]
    return sum


def full_sum_vectorized(a):
    return a.sum()


iterations = 1000
size = 10000
x = np.random.rand(size, size)

start = time.time()
full_sum_vectorized(x)
end = time.time()
print("Elapsed (No Numba) = %s" % (end - start))

start = time.time()
full_sum_numba(x)
end = time.time()
print("Elapsed (No Numba) = %s" % (end - start))

start = time.time()
full_sum_numba(x)
end = time.time()
print("Elapsed (No Numba) = %s" % (end - start))

As a function of the size.

It is interesting that Numba is faster for small problem sizes, while the vectorized approach seems to outperform Numba for bigger sizes.

And not surprisingly, the number of iterations only makes the difference bigger.

This is not surprising, as the code in a vectorized call can be more specifically optimized than the more general purpose Numba approach.

Conclusion

Does that mean that Numba does not pay off?

No, not at all. First of all, we have only tried it against one vectorized operation, which was obviously very easy to optimize. Secondly, not all loops can be turned into vectorized code. In general it is difficult to carry state in a vectorized approach, so if you need to keep track of some internal state in a loop, it can be difficult to find a vectorized equivalent.
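As a sketch of that point (a hypothetical example, not from the tutorials above): a running sum that resets whenever it passes a threshold depends on its own state, so it is hard to express as a single NumPy call, but Numba happily compiles the loop.

import numpy as np
from numba import jit


@jit(nopython=True)
def capped_running_sum(a, cap):
    # Running sum that resets to zero whenever it exceeds cap.
    # The reset depends on the running state, which is hard to vectorize.
    out = np.zeros_like(a)
    total = 0.0
    for i in range(a.shape[0]):
        total += a[i]
        if total > cap:
            total = 0.0
        out[i] = total
    return out


x = np.random.rand(1_000_000)
print(capped_running_sum(x, 10.0)[:5])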

Multiple Time Frame Analysis on a Stock using Pandas

What will we investigate in this tutorial?

A key element to success in trading is to understand the market and the trend of the stock before you buy it. In this tutorial we will not cover how to read the market, but take a top-down analysis approach to stock prices. We will use what is called Multiple Time Frame Analysis on a stock starting with a 1-month, 1-week, and 1-day perspective. Finally, we will compare that with a Simple Moving Average with a monthly view.

Step 1: Gather the data with different time frames

We will use the Pandas-datareader library to collect the time series of a stock. The library has an endpoint to read data from Yahoo! Finance, which we will use as it does not require registration and can deliver the data we need.

import pandas_datareader as pdr
import datetime as dt


ticker = "MSFT"
start = dt.datetime(2019, 1, 1)
end = dt.datetime.now()
day = pdr.get_data_yahoo(ticker, start, end, interval='d')
week = pdr.get_data_yahoo(ticker, start, end, interval='wk')
month = pdr.get_data_yahoo(ticker, start, end, interval='mo')

Where the key is to set the interval to ‘d’ (Day), ‘wk’ (Week), and ‘mo’ (Month).

This will give us 3 DataFrames, each indexed with different intervals.

Daily.

                  High         Low  ...      Volume   Adj Close
Date                                ...                        
2019-01-02  101.750000   98.940002  ...  35329300.0   98.860214
2019-01-03  100.190002   97.199997  ...  42579100.0   95.223351
2019-01-04  102.510002   98.930000  ...  44060600.0   99.652115
2019-01-07  103.269997  100.980003  ...  35656100.0   99.779205
2019-01-08  103.970001  101.709999  ...  31514400.0  100.502670

Weekly.

                  High         Low  ...       Volume   Adj Close
Date                                ...                         
2019-01-01  103.269997   97.199997  ...  157625100.0   99.779205
2019-01-08  104.879997  101.260002  ...  150614100.0   99.769432
2019-01-15  107.900002  101.879997  ...  127262100.0  105.302940
2019-01-22  107.879997  104.660004  ...  142112700.0  102.731720
2019-01-29  106.379997  102.169998  ...  203449600.0  103.376968

Monthly.

                  High         Low  ...        Volume   Adj Close
Date                                ...                          
2019-01-01  107.900002   97.199997  ...  7.142128e+08  102.096245
2019-02-01  113.239998  102.349998  ...  4.690959e+08  109.526405
2019-03-01  120.820000  108.800003  ...  5.890958e+08  115.796768
2019-04-01  131.369995  118.099998  ...  4.331577e+08  128.226700
2019-05-01  130.649994  123.040001  ...  5.472188e+08  121.432449
2019-06-01  138.399994  119.010002  ...  5.083165e+08  132.012497

Step 2: Combine data and interpolate missing points

The challenge in connecting the DataFrames is that they have different index entries. If we combine the daily data points with the weekly ones, there will be a lot of entries that the daily data has but the weekly data does not.

                   day        week
Date                              
2019-01-02  101.120003         NaN
2019-01-03   97.400002         NaN
2019-01-04  101.930000         NaN
2019-01-07  102.059998         NaN
2019-01-08  102.800003  102.050003
...                ...         ...
2020-08-13  208.699997         NaN
2020-08-14  208.899994         NaN
2020-08-17  210.279999         NaN
2020-08-18  211.490005  209.699997
2020-08-19  209.699997  209.699997

To deal with that we can choose to interpolate by using the DataFrame interpolate function.

import pandas_datareader as pdr
import datetime as dt
import pandas as pd


ticker = "MSFT"
start = dt.datetime(2019, 1, 1)
end = dt.datetime.now()
day = pdr.get_data_yahoo(ticker, start, end, interval='d')
week = pdr.get_data_yahoo(ticker, start, end, interval='wk')
month = pdr.get_data_yahoo(ticker, start, end, interval='mo')

data = pd.DataFrame()
data['day'] = day['Close']
data['week'] = week['Close']
data['week'] = data['week'].interpolate(method='linear')
print(data)

Which results in the following output.

                   day        week
Date                              
2019-01-02  101.120003         NaN
2019-01-03   97.400002         NaN
2019-01-04  101.930000         NaN
2019-01-07  102.059998         NaN
2019-01-08  102.800003  102.050003
...                ...         ...
2020-08-13  208.699997  210.047998
2020-08-14  208.899994  209.931998
2020-08-17  210.279999  209.815997
2020-08-18  211.490005  209.699997
2020-08-19  209.699997  209.699997

The missing points (except the first entries) are linearly interpolated between the known values. This can be done for the monthly data as well, but we need to be more careful because of three things. First, some dates (the 1st of each month) do not exist in the data DataFrame; to solve that we use an outer join, which will include them. Second, this introduces some extra dates that are not trading dates, so we need to delete them afterwards, which we do by dropping the joined column (drop) and removing rows with NA values (dropna). Third, we need to understand that the monthly view looks backwards: the value for the 1st of January is only finalized on the last day of January. Therefore we shift it back one period in the join.

import pandas_datareader as pdr
import datetime as dt
import pandas as pd


ticker = "MSFT"
start = dt.datetime(2019, 1, 1)
end = dt.datetime.now()
day = pdr.get_data_yahoo(ticker, start, end, interval='d')
week = pdr.get_data_yahoo(ticker, start, end, interval='wk')
month = pdr.get_data_yahoo(ticker, start, end, interval='mo')


data = pd.DataFrame()
data['day'] = day['Close']
data['week'] = week['Close']
data['week'] = data['week'].interpolate(method='index')
data = data.join(month['Close'].shift(), how='outer')
data['month'] = data['Close'].interpolate(method='index')
data = data.drop(columns=['Close']).dropna()
data['SMA20'] = data['day'].rolling(20).mean()

Step 3: Visualize the output and take a look at it

Visualizing it is straightforward using Matplotlib.

import pandas_datareader as pdr
import datetime as dt
import matplotlib.pyplot as plt
import pandas as pd


ticker = "MSFT"
start = dt.datetime(2019, 1, 1)
end = dt.datetime.now()
day = pdr.get_data_yahoo(ticker, start, end, interval='d')
week = pdr.get_data_yahoo(ticker, start, end, interval='wk')
month = pdr.get_data_yahoo(ticker, start, end, interval='mo')


data = pd.DataFrame()
data['day'] = day['Close']
data['week'] = week['Close']
data['week'] = data['week'].interpolate(method='index')
data = data.join(month['Close'].shift(), how='outer')
data['month'] = data['Close'].interpolate(method='index')
data = data.drop(columns=['Close']).dropna()

data.plot()
plt.show()

Which results in the following graph.

As expected, the monthly price is shifted to be the closing price from the end of the previous month. Hence, the monthly curve appears to cross the day curve around the 1st of every month (which is almost true).

To really appreciate Multiple Time Frame Analysis, it is better to keep the graphs separate and interpret each of them in isolation.

Step 4: How to use these different Multiple Time Frame Analysis

Given the picture it is a good idea to start top down. First look at the monthly picture, which shows the overall trend.

Month view of MSFT.

In the case of MSFT it is a clear growing trend, with the exception of two declines. But the overall impression is a company in growth that does not seem to slow down. Even Dow theory (see this tutorial on it) suggests that there will be secondary movements within a general bull trend.

Secondly, we will look at the weekly view.

Weekly view of MSFT

Here the impression is a bit more volatile. It shows many smaller ups and downs, with a big one in March 2020. It could also indicate a small decline in the growth right at the end. Dow theory could suggest that it will turn, though it is not certain.

Finally, the daily view gives an even more volatile picture, which can be used to decide when to enter the market.

Day view of MSFT

Here you could also be a bit worried. Is this the start of a smaller bear market?

To sum up: in the month-view we concluded there is growth, the week-view shows signs of a possible change, and the day-view also shows signs of a possible decline.

As an investor, based on the above, I would not enter the market right now. If both the month-view and week-view showed growth while the day-view showed a decline, that would be a good indicator. You want the higher levels to show growth, while the day-view might show a small decline.

Finally, remember that you should never rely on just one way of interpreting the data when deciding whether to enter the market.

Step 5: Is monthly the same as a Simple Moving Average?

Good question, I am glad you asked. The Simple Moving Average (SMA) can be calculated easily with DataFrames using the rolling and mean functions.

Best way is to just try it.

import pandas_datareader as pdr
import datetime as dt
import matplotlib.pyplot as plt
import pandas as pd


ticker = "MSFT"
start = dt.datetime(2019, 1, 1)
end = dt.datetime.now()
day = pdr.get_data_yahoo(ticker, start, end, interval='d')
week = pdr.get_data_yahoo(ticker, start, end, interval='wk')
month = pdr.get_data_yahoo(ticker, start, end, interval='mo')


data = pd.DataFrame()
data['day'] = day['Close']
data['week'] = week['Close']
data['week'] = data['week'].interpolate(method='index')
data = data.join(month['Close'].shift(), how='outer')
data['month'] = data['Close'].interpolate(method='index')
data = data.drop(columns=['Close']).dropna()
data['SMA20'] = data['day'].rolling(20).mean()

data.plot()
plt.show()

As you see, the SMA is not as reactive to the crisis in March 2020 as the monthly view is. This does not make one better than the other, but it shows a difference in how they react.

Comparing the month-view with a Simple Moving Average of a month (20 trading days)

Please remember that the monthly view is only updated at the end of a month, while the SMA is updated on a daily basis.

Another difference is that the SMA is an average of the last 20 days, while the monthly value is the actual Close of the last day of the month. This implies that the monthly view can be much more volatile than the SMA.
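A small sketch of that difference, assuming day is the daily DataFrame from Step 1: the SMA averages the last 20 closes, while a monthly resample simply picks the last close of each calendar month.

# 20-day simple moving average of the daily close
sma20 = day['Close'].rolling(20).mean()
# The actual close on the last trading day of each calendar month
month_last = day['Close'].resample('M').last()
print(sma20.tail())
print(month_last.tail())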

Conclusion

It is advisable to start the analysis from the bigger time frames and then zoom in. This way you first look at the overall trends and get the bigger picture of the market. This should keep you from getting caught up in a small detail of the market, and instead help you understand it at a higher level.

Master Dow Theory with Python Pandas

What will we cover in this tutorial?

Dow theory was proposed by Charles H. Dow and is not an exact science. It is more a way to identify trends in the market. In this tutorial we investigate the approach by testing it on data. Notice that there are various ways to interpret it, and often it is done by visual approximation, while in this tutorial we will make some rough assumptions to see if it beats a buy-and-hold approach on a stock.

First we will make our assumptions on how to implement the Dow theory approach to generate buy and sell indicators, which we will use as buy and sell signals in the market.

Step 1: Understand the Dow theory to make buy and sell indicators

The essence of Dow theory is that there are 3 types of trends in the market. The primary trend lasts a year or more, like a bull market. On top of that, a secondary trend can move the market in the opposite direction for 3 weeks to 3 months. This can result in a pullback that can seem like a bear market within the bull market. Finally, there are minor trends (less than 3 weeks), which can be considered noise.

According to Dow theory each market trend has 3 phases. Our objective as an investor is to identify when a bear market turns into a bull market.

Some visual examples will help to understand the above. A general bull market with primary and secondary trends could look like this.

Primary bull market trend with secondary bear market trends.

Where you should notice that the temporary lows are all increasing along the way.

A similar picture for a bear market could be.

Primary bear market trend with secondary bull market trends.

Here you should notice how the peaks of the secondary bull markets are also in a decreasing trend.

Step 2: Identify when a primary market trend changes

The key here is to identify when a primary stock trend goes from bull to bear or opposite.

Please also notice that Dow theory talks about the market, while here we are looking at a single stock. Hence, we assume that the market and the stock are correlated strongly enough that the same theory applies.

The change from a primary bear to a primary bull market could look as follows.

From bear to bull market

We have added some markers in the diagram.

  • LL : Lower Low – the low is lower than the previous low.
  • LH : Lower High – the high is lower than the previous high.
  • HH : Higher High – the high is higher than the previous high.
  • HL : Higher Low – the low is higher than the previous low.

As you see, the bear market consists of consecutive LL and LH, while a bull market consists of consecutive HH and HL. The market changes from bear to bull when we can confidently say that we will get a HH, which is when the price crosses from the last LL up over the last LH (before we actually reach the HH).

Hence, a buy signal can be set when the stock price rises above the last LH.

Similarly, we can investigate when a primary trend goes from bull to bear market.

From bull to a bear trend.

Where we have the same types of markers.

We see that the trend changes from bull to bear when we go from HL to LL. Hence, a sell indicator is when the price drops below the last HL, because then we know an LL is coming (before it actually becomes an LL).

Again, this is not an exact science and is just a way to interpret it. We will try it out on real stock data to see how it performs.

Step 3: Get some data and calculate points of lows and highs

We will use Pandas-datareader to get the time series data from Yahoo! Finance.

import pandas_datareader as pdr
import datetime as dt


ticker = pdr.get_data_yahoo("TWTR", dt.datetime(2020,1,1), dt.datetime.now())

print(ticker)

Resulting in a time series for Twitter, which has the ticker TWTR. You can find other tickers for other companies by using the Yahoo! Finance ticker lookup.

                 High        Low       Open      Close    Volume  Adj Close
Date                                                                       
2020-01-02  32.500000  31.959999  32.310001  32.299999  10721100  32.299999
2020-01-03  32.099998  31.260000  31.709999  31.520000  14429500  31.520000
2020-01-06  31.709999  31.160000  31.230000  31.639999  12582500  31.639999
2020-01-07  32.700001  31.719999  31.799999  32.540001  13712900  32.540001
2020-01-08  33.400002  32.349998  32.349998  33.049999  14632400  33.049999
...               ...        ...        ...        ...       ...        ...
2020-08-12  38.000000  36.820000  37.500000  37.439999  11013300  37.439999
2020-08-13  38.270000  37.369999  37.430000  37.820000  13259400  37.820000
2020-08-14  37.959999  37.279999  37.740002  37.900002  10377300  37.900002
2020-08-17  38.090000  37.270000  37.950001  37.970001  10188500  37.970001
2020-08-18  38.459999  37.740002  38.279999  38.009998   8548300  38.009998

The first thing we need to do is find the lows and highs. The first challenge is that the stock price goes up and down during the day. To simplify our investigation we will only use the Close price.

That decision is a limitation and might not give fully correct results, but it surely simplifies our work.

Next up, we need to identify the highs and lows. This can be done by finding where the daily difference changes sign from positive to negative (or vice versa).

import pandas_datareader as pdr
import datetime as dt


ticker = pdr.get_data_yahoo("TWTR", dt.datetime(2020,1,1), dt.datetime.now())

ticker['delta'] = ticker['Close'].diff()
growth = ticker['delta'] > 0
ticker['markers'] = growth.diff().shift(-1)

print(ticker)

Please notice the shift(-1), as it moves the indicator onto the day of the change.

2020-08-05  37.340000  36.410000  36.560001  36.790001   10052100  36.790001  0.440002   False
2020-08-06  37.810001  36.490002  36.849998  37.689999   10478900  37.689999  0.899998    True
2020-08-07  38.029999  36.730000  37.419998  37.139999   11335100  37.139999 -0.549999    True
2020-08-10  39.169998  37.310001  38.360001  37.439999   29298400  37.439999  0.299999    True
2020-08-11  39.000000  36.709999  37.590000  37.279999   20486000  37.279999 -0.160000    True
2020-08-12  38.000000  36.820000  37.500000  37.439999   11013300  37.439999  0.160000   False
2020-08-13  38.270000  37.369999  37.430000  37.820000   13259400  37.820000  0.380001   False
2020-08-14  37.959999  37.279999  37.740002  37.900002   10377300  37.900002  0.080002   False
2020-08-17  38.090000  37.270000  37.950001  37.970001   10188500  37.970001  0.070000   False
2020-08-18  38.459999  37.740002  38.279999  38.009998    8548300  38.009998  0.039997     NaN

The output is shown above. The rows where markers is True are where we reach highs or lows.

Now we have identified all the potential HH, HL, LH, and LL points.
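If you want to list only those turning points, a small sketch on top of the code above could look like this (the comparison with True is used because the last row's marker is NaN).

# Keep only the rows where the marker is True (NaN compares as False)
turning_points = ticker.loc[ticker['markers'] == True, ['Close', 'delta']]
print(turning_points)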

Step 4: Implement a simple trial of sell and buy

We continue our example on Twitter and see how we can perform.

Our strategy will be as follows.

  • We either have bought stocks for all our money or not. That is, either we have stocks or not.
  • If we do not have stocks, we buy if stock price is above last high, meaning that a HH is coming.
  • If we do have stocks, we sell if stock price is below last low, meaning that a LL is coming.

This can mean that we enter the market late in a bull market. If you were to follow the theory completely, it suggests waiting until a bear market changes into a bull market.

import pandas_datareader as pdr
import datetime as dt


ticker = pdr.get_data_yahoo("TWTR", dt.datetime(2020,1,1), dt.datetime.now())

ticker['delta'] = ticker['Close'].diff()
growth = ticker['delta'] > 0
ticker['markers'] = growth.diff().shift(-1)

# We want to remember the last_high and last_low
# Set to max value not to trigger false buy
last_high = ticker['Close'].max()
last_low = 0.0
# Then setup our account, we can only have stocks or not
# We have a start balance of 100000 $
has_stock = False
balance = 100000
stocks = 0
for index, row in ticker.iterrows():
  # Buy and sell orders
  if not has_stock and row['Close'] > last_high:
    has_stock = True
    stocks = balance//row['Close']
    balance -= row['Close']*stocks
  elif has_stock and row['Close'] < last_low:
    has_stock = False
    balance += row['Close']*stocks
    stocks = 0

  # Update the last_high and last_low
  if row['markers']:
    if row['delta'] > 0:
      last_high = row['Close']
    else:
      last_low = row['Close']


print("Dow returns", balance + stocks*ticker['Close'].iloc[-1])

# Compare this with a simple buy and hold approach.
buy_hold_stocks = 100000//ticker['Close'].iloc[0]
buy_hold = 100000 - buy_hold_stocks*ticker['Close'].iloc[0] + buy_hold_stocks*ticker['Close'].iloc[-1]
print("Buy-and-hold return", buy_hold)

Which results in the following results.

Dow returns 120302.0469455719
Buy-and-hold return 117672.44716644287

That looks promising, but it might just be luck. Hence, we want to validate it with other examples. The results show a return on investment of 20.3% using our Dow theory approach, while a simple buy-and-hold strategy gave 17.7%. This is over a span of less than 8 months.

What you want from a strategy is to avoid big losses without losing out on gains. The test above is not enough to conclude anything about that.

Step 5: Try out some other tickers to test it

A first investigation is to check how the algorithm performs on other stocks. We make one small adjustment, as comparing against buying on day 1 might be quite unfair: if the price is low on day 1, that is an advantage for buy-and-hold, while if the price is high, it is a big disadvantage. The code below runs on multiple stocks and lets the buy-and-hold strategy make its first buy on the same day as the Dow approach (as outlined in this tutorial). The exit from the market might still be unfair.

import pandas_datareader as pdr
import datetime as dt

def dow_vs_hold_and_buy(ticker_name):
  ticker = pdr.get_data_yahoo(ticker_name, dt.datetime(2020,1,1), dt.datetime.now())

  ticker['delta'] = ticker['Close'].diff()
  growth = ticker['delta'] > 0
  ticker['markers'] = growth.diff().shift(-1)

  # We want to remember the last_high and last_low
  # Set to max value not to trigger false buy
  last_high = ticker['Close'].max()
  last_low = 0.0
  # Then setup our account, we can only have stocks or not
  # We have a start balance of 100000 $
  has_stock = False
  balance = 100000
  stocks = 0
  first_buy = None
  for index, row in ticker.iterrows():
    # Buy and sell orders
    if not has_stock and row['Close'] > last_high:
      has_stock = True
      stocks = balance//row['Close']
      balance -= row['Close']*stocks
      if first_buy is None:
        first_buy = index
    elif has_stock and row['Close'] < last_low:
      has_stock = False
      balance += row['Close']*stocks
      stocks = 0

    # Update the last_high and last_low
    if row['markers']:
      if row['delta'] > 0:
        last_high = row['Close']
      else:
        last_low = row['Close']

  dow_returns = balance + stocks*ticker['Close'].iloc[-1]

  # Compare this with a simple buy and hold approach.
  buy_hold_stocks = 100000//ticker['Close'].loc[first_buy]
  buy_hold_returns = 100000 - buy_hold_stocks*ticker['Close'].loc[first_buy] + buy_hold_stocks*ticker['Close'].iloc[-1]

  print(ticker_name, dow_returns > buy_hold_returns, round(dow_returns/1000 - 100, 1), round(buy_hold_returns/1000 - 100, 1))


tickers = ["TWTR", "AAPL", "TSLA", "BAC", "KO", "GM", "MSFT", "AMZN", "GOOG", "FB", "INTC", "T"]
for ticker in tickers:
  dow_vs_hold_and_buy(ticker)

Resulting in the following output.

TWTR   True  20.3  14.4
AAPL  False  26.4  52.3
TSLA   True 317.6 258.8
BAC    True -16.3 -27.2
KO     True  -8.2 -14.6
GM     True   8.9 -15.1
MSFT  False  26.2  32.1
AMZN  False  32.8  73.9
GOOG  False   7.1  11.0
FB     True  18.3  18.2
INTC  False -34.9 -18.4
T     False -25.3 -20.8

This paints a different picture. First, it seems more random whether it outperforms the buy-and-hold approach.

The one performing best relative to buy-and-hold is General Motors Company (GM), but it might be due to an unlucky entry into the market. The stock was high at the beginning of the year and then fell a lot. Hence, here the Dow approach helped to exit and re-enter the market at the right times.

Intel Corporation (INTC) works a lot against us. While buy-and-hold already has a big loss (-18.4%), our Dow theory algorithm does not save us. There was a big drop in the stock value on the 24th of July of about 20% from the previous day's close to the open. The Dow approach cannot save you from situations like that and will sell near the very bottom.

Apple (AAPL) also misses out on a lot of gain. The stock is growing strongly in 2020, with some challenges in March and after (when Corona hit). But looking at the buy and sell signals, it repeatedly sells and then buys back at a higher price, losing out on gains.

Amazon (AMZN) seems to be the same story: general growth, but buying back at higher prices than the previous sell, losing out on profit.

Next steps and considerations

We have made some broad simplifications in our algorithm.

  • We only consider the Close value, while the normal way to find the markers is on an OHLC candlestick chart.
  • If we used the full daily price range, we might limit our losses earlier with a stop-loss order.
  • This is not an exact science, and the trends might need a different way to identify them.

Hence, the above suggests the algorithm could be adjusted further to be closer to how it is used in real life.

Another thing to keep in mind is that you should never base your investment decisions on only one indicator or algorithm.

Pandas: Calculate the Relative Strength Index (RSI) on a Stock

What is the Relative Strength Index?

The Relative Strength Index (RSI) on a stock is a technical indicator.

The relative strength index (RSI) is a momentum indicator used in technical analysis that measures the magnitude of recent price changes to evaluate overbought or oversold conditions in the price of a stock or other asset. 

https://www.investopedia.com/terms/r/rsi.asp

A technical indicator is a mathematical calculation based on past prices and volumes of a stock. The RSI has a value between 0 and 100. It is said to be overbought if above 70, and oversold if below 30.

Step 1: How to calculate the RSI

To be quite honest, I found the description on investopedia.com a bit confusing. Therefore I went with the Wikipedia description of it. It is done in a couple of steps, so let us do the same.

  1. If previous price is lower than current price, then set the values.
    • U = close_now – close_previous
    • D = 0
  2. While if the previous price is higher than current price, then set the values
    • U = 0
    • D = close_previous – close_now
  3. Calculate the Smoothed or modified moving average (SMMA) or the exponential moving average (EMA) of D and U. To be aligned with Yahoo! Finance, I have chosen to use the EMA.
  4. Calculate the relative strength (RS)
    • RS = EMA(U)/EMA(D)
  5. Then we end with the final calculation of the Relative Strength Index (RSI).
    • RSI = 100 – (100 / (1 + RS))

Notice that U is the price difference if it is positive and otherwise 0, while D is the absolute value of the price difference if it is negative and otherwise 0.
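To see how U and D behave before we touch real data, here is a tiny sketch on a made-up price series (the numbers are only for illustration); it uses the same clip trick as the code in the next step.

import pandas as pd

# Made-up closing prices, for illustration only
close = pd.Series([10.0, 11.0, 10.5, 10.5, 12.0])

delta = close.diff()               # close_now - close_previous
up = delta.clip(lower=0)           # U: the difference if positive, otherwise 0
down = -1*delta.clip(upper=0)      # D: absolute value of the difference if negative, otherwise 0

print(pd.DataFrame({'delta': delta, 'U': up, 'D': down}))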

Step 2: Get a stock and calculate the RSI

We will use the Pandas-datareader to get some time series data of a stock. If you are new to using Pandas-datareader, we advise you to read this tutorial.

In this tutorial we will use Twitter as an example, which has the TWTR ticker. If you want to do it on some other stock, you can look up the ticker on Yahoo! Finance here.

Then below we have the following calculations.

import pandas_datareader as pdr
import datetime as dt


ticker = pdr.get_data_yahoo("TWTR", dt.datetime(2020,1,1), dt.datetime.now())

delta = ticker['Close'].diff()
up = delta.clip(lower=0)          # U: gains (positive differences), otherwise 0
down = -1*delta.clip(upper=0)     # D: absolute value of losses, otherwise 0
ema_up = up.ewm(com=13, adjust=False).mean()     # com=13 gives a 14-period EMA
ema_down = down.ewm(com=13, adjust=False).mean()
rs = ema_up/ema_down
ticker['RSI'] = 100 - (100/(1 + rs))

print(ticker)

To have a naming that is close to the definition and also aligned with Python, we use up for U and down for D.

This results in the following output.

                 High        Low       Open  ...    Volume  Adj Close        RSI
Date                                         ...                                
2020-01-02  32.500000  31.959999  32.310001  ...  10721100  32.299999        NaN
2020-01-03  32.099998  31.260000  31.709999  ...  14429500  31.520000   0.000000
2020-01-06  31.709999  31.160000  31.230000  ...  12582500  31.639999   1.169582
2020-01-07  32.700001  31.719999  31.799999  ...  13712900  32.540001   9.699977
2020-01-08  33.400002  32.349998  32.349998  ...  14632400  33.049999  14.218360
...               ...        ...        ...  ...       ...        ...        ...
2020-08-11  39.000000  36.709999  37.590000  ...  20486000  37.279999  58.645030
2020-08-12  38.000000  36.820000  37.500000  ...  11013300  37.439999  59.532873
2020-08-13  38.270000  37.369999  37.430000  ...  13259400  37.820000  61.639293
2020-08-14  37.959999  37.279999  37.740002  ...  10377300  37.900002  62.086731
2020-08-17  38.090000  37.270000  37.950001  ...  10186900  37.970001  62.498897

This tutorial was written on 2020-08-18, so we can compare the result with the RSI for Twitter on Yahoo! Finance.

From Yahoo! Finance on Twitter with RSI

As you can see in the lower left corner, the RSI for the same ending day was 62.50, which matches the calculated value. Further checks reveal that the other values also match Yahoo!'s.
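If you want to do the same spot check, you can print the last calculated value rounded to two decimals (this assumes the code from Step 2 with the RSI column added to the DataFrame).

print(round(ticker['RSI'].iloc[-1], 2))  # 62.5 for the last day in the run above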

Step 3: Visualize the RSI with the daily stock price

We will use the matplotlib library to visualize the RSI with the stock price. In this tutorial we will have two rows of graphs by using the subplots function. The function returns an array of axes (along with a figure, which we only use for the overall title).

The axes can be passed to the Pandas DataFrame plot function.

import pandas_datareader as pdr
import datetime as dt
import matplotlib.pyplot as plt


ticker = pdr.get_data_yahoo("TWTR", dt.datetime(2019,1,1), dt.datetime.now())

delta = ticker['Close'].diff()
up = delta.clip(lower=0)
down = -1*delta.clip(upper=0)
ema_up = up.ewm(com=13, adjust=False).mean()
ema_down = down.ewm(com=13, adjust=False).mean()
rs = ema_up/ema_down

ticker['RSI'] = 100 - (100/(1 + rs))

# Skip first 14 days to have real values
ticker = ticker.iloc[14:]

print(ticker)
fig, (ax1, ax2) = plt.subplots(2)
ax1.get_xaxis().set_visible(False)
fig.suptitle('Twitter')

ticker['Close'].plot(ax=ax1)
ax1.set_ylabel('Price ($)')
ticker['RSI'].plot(ax=ax2)
ax2.set_ylim(0,100)
ax2.axhline(30, color='r', linestyle='--')
ax2.axhline(70, color='r', linestyle='--')
ax2.set_ylabel('RSI')

plt.show()

We also remove the x-axis of the first graph (ax1) and limit the y-axis of the second graph (ax2) to the range 0 to 100. Further, we have added two horizontal lines to indicate overbought and oversold at 70 and 30, respectively. Notice that Yahoo! Finance uses 80 and 20 as indicators by default.
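If you prefer a list of the actual days crossing the thresholds over eyeballing the chart, a small addition like this (using the ticker DataFrame from the code above and the 70/30 levels) would do it.

overbought = ticker[ticker['RSI'] > 70]
oversold = ticker[ticker['RSI'] < 30]
print(overbought[['Close', 'RSI']])
print(oversold[['Close', 'RSI']])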

The resulting output.

Pandas: Calculate a Heatmap to Visualize Historical CAGR Sector Performance

What is CAGR and why not use AAGR?

Often you see financial advisors present statements with awesome returns. These returns might be what is called Average Annual Growth Rates (AAGR). Why should you be skeptical about AAGR?

A simple example will show you why.

  • You start by investing $10,000.
  • The first year you get a 100% return, growing your investment to $20,000.
  • The year after you have a fall of 50%, which brings your value back to $10,000.

Using AAGR, your advisor will tell you that you have (100% – 50%)/2 = 25% AAGR, or call it the average annual return.

But wait a minute: you have the same amount of money after two years, so how can that be 25%?

With the Compound Annual Growth Rate (CAGR) the story is different, as it only considers the start and end value. Here the gain is a big $0, resulting in a 0% CAGR.

The formula for calculating CAGR is.

((end value)/(start value))^(1/years) – 1

For the example above: (10,000/10,000)^(1/2) – 1 = 0
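To double-check the arithmetic, here is a small sketch of both calculations on the example above; plain Python, no external data, and the function names are just chosen for this illustration.

def aagr(yearly_returns):
  # Average Annual Growth Rate: a plain average of the yearly returns
  return sum(yearly_returns) / len(yearly_returns)


def cagr(start_value, end_value, years):
  # Compound Annual Growth Rate: only start value, end value and time matter
  return (end_value / start_value) ** (1 / years) - 1


print(aagr([1.00, -0.50]))      # 0.25 -> 25% AAGR
print(cagr(10000, 10000, 2))    # 0.0  -> 0% CAGR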

Step 1: Getting access to financial sector data

In this tutorial we will use Alpha Vantage. To connect to it you need to register to get an API key.

To claim your key go to: https://www.alphavantage.co/support/#api-key

There you select Software Developer in the drop-down Which of the following best describes you?, write your organization of choice, then your email address, and click that you are not a robot. Or are you?

Then it will show your API key on the screen (not in an email). The key is typically a 16-character string of upper case letters and digits.

Step 2: Get the sector data to play with

Looking at the Pandas-datareader API you will see that you can use the get_sector_performance_av() function.

import pandas_datareader.data as web

API_KEY = "INSERT YOUR KEY HERE"

data = web.get_sector_performance_av(api_key=API_KEY)
print(data)

Remember to change API_KEY to the key you got from Step 1.
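If you prefer not to hardcode the key in the script, a small variation like this reads it from an environment variable instead; the variable name ALPHAVANTAGE_API_KEY is just a choice for this sketch, not something the library requires.

import os
import pandas_datareader.data as web

# Assumes you have set the variable first, e.g. export ALPHAVANTAGE_API_KEY="YOURKEY"
API_KEY = os.environ["ALPHAVANTAGE_API_KEY"]

data = web.get_sector_performance_av(api_key=API_KEY)
print(data)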

You should get an output similar to this one (not showing all columns).

                            RT      1D      5D  ...       3Y       5Y      10Y
Communication Services   0.38%   0.38%  -0.20%  ...   24.04%   29.92%   74.78%
Information Technology   0.04%   0.04%  -1.36%  ...  104.45%  183.51%  487.33%
Consumer Discretionary  -0.06%  -0.06%   1.36%  ...   66.06%   92.37%  384.71%
Materials               -0.07%  -0.07%   1.75%  ...   17.50%   37.64%  106.90%
Health Care             -0.16%  -0.17%   0.90%  ...   37.21%   43.20%  268.58%
Consumer Staples        -0.19%  -0.19%   1.42%  ...   15.96%   27.65%  137.66%
Utilities               -0.38%  -0.38%   0.60%  ...   13.39%   34.79%   99.63%
Financials              -0.61%  -0.61%   3.23%  ...    1.67%   23.89%  119.46%
Industrials             -0.65%  -0.65%   4.45%  ...   12.57%   40.05%  155.56%
Real Estate             -1.23%  -1.23%  -0.63%  ...   12.51%      NaN      NaN
Energy                  -1.99%  -1.99%   1.38%  ...  -39.45%  -44.69%  -29.07%

The columns we are interested in are the 1Y, 3Y, 5Y, and 10Y.

Step 3: Convert columns to floats

As you saw in the previous step, the columns all contain a %-sign, which tells you that the entries are strings and not floats, and they need to be converted.

This can be done by some string magic. First we need to remove the %-sign before we convert it to a float.

import pandas_datareader.data as web

API_KEY = "INSERT YOUR KEY HERE"

data = web.get_sector_performance_av(api_key=API_KEY)

for column in data.columns:
  data[column] = data[column].str.rstrip('%').astype('float') / 100.0

print(data[['1Y', '3Y', '5Y' , '10Y']])

Here we convert all the columns in the for-loop and then print only the columns we need.

                            1Y      3Y      5Y     10Y
Communication Services  0.1999  0.2404  0.2992  0.7478
Information Technology  0.4757  1.0445  1.8351  4.8733
Consumer Discretionary  0.2904  0.6606  0.9237  3.8471
Materials               0.1051  0.1750  0.3764  1.0690
Health Care             0.1908  0.3721  0.4320  2.6858
Consumer Staples        0.0858  0.1596  0.2765  1.3766
Utilities               0.0034  0.1339  0.3479  0.9963
Financials             -0.0566  0.0167  0.2389  1.1946
Industrials             0.0413  0.1257  0.4005  1.5556
Real Estate            -0.0658  0.1251     NaN     NaN
Energy                 -0.3383 -0.3945 -0.4469 -0.2907

All looks nice. Also notice that, by dividing by 100, we converted the values to plain floats rather than keeping them as percentages.

Step 4: Calculate the CAGR

Now we need to use the formula on the columns.

import pandas_datareader.data as web

API_KEY = "INSERT YOUR KEY HERE"

data = web.get_sector_performance_av(api_key=API_KEY)

for column in data.columns:
  data[column] = data[column].str.rstrip('%').astype('float') / 100.0

data['1Y-CAGR'] = data['1Y']*100
data['3Y-CAGR'] = ((1 + data['3Y']) ** (1/3) - 1) * 100
data['5Y-CAGR'] = ((1 + data['5Y']) ** (1/5) - 1) * 100
data['10Y-CAGR'] = ((1 + data['10Y']) ** (1/10) - 1) * 100

cols = ['1Y-CAGR','3Y-CAGR', '5Y-CAGR', '10Y-CAGR']

print(data[cols])

This should result in something similar.

                        1Y-CAGR    3Y-CAGR    5Y-CAGR   10Y-CAGR
Communication Services    19.99   7.445258   5.374421   5.742403
Information Technology    47.57  26.919700  23.172477  19.368083
Consumer Discretionary    29.04  18.419079  13.979689  17.097655
Materials                 10.51   5.522715   6.597970   7.541490
Health Care               19.08  11.120773   7.445592  13.933956
Consumer Staples           8.58   5.059679   5.003594   9.042452
Utilities                  0.34   4.277734   6.152820   7.157502
Financials                -5.66   0.553596   4.377587   8.177151
Industrials                4.13   4.025758   6.968677   9.837158
Real Estate               -6.58   4.007273        NaN        NaN
Energy                   -33.83 -15.399801 -11.169781  -3.376449

Looks like the Information Technology sector is very lucrative.

But to make it more digestible we should visualize it.

Step 5: Create a heatmap

We will use the seaborn library to create it; seaborn is a statistical data visualization library.

The heatmap function simply takes the DataFrame to visualize. It could not be easier.

import pandas_datareader.data as web
import seaborn as sns
import matplotlib.pyplot as plt

API_KEY = "INSERT YOUR KEY HERE"

data = web.get_sector_performance_av(api_key=API_KEY)

for column in data.columns:
  data[column] = data[column].str.rstrip('%').astype('float') / 100.0

data['1Y-CAGR'] = data['1Y']*100
data['3Y-CAGR'] = ((1 + data['3Y']) ** (1/3) - 1) * 100
data['5Y-CAGR'] = ((1 + data['5Y']) ** (1/5) - 1) * 100
data['10Y-CAGR'] = ((1 + data['10Y']) ** (1/10) - 1) * 100

cols = ['1Y-CAGR','3Y-CAGR', '5Y-CAGR', '10Y-CAGR']

sns.heatmap(data[cols], annot=True, cmap="YlGnBu")
plt.show()

Resulting in the following output.