Matplotlib Visualization for DataFrame Time Series Data

What will we cover in this tutorial?

We will learn how to visualize time series data in a DataFrame with Matplotlib.

This tutorial will show you:

  • How to use Matplotlib with DataFrames.
  • Use Matplotlib with subplots (the object-oriented way).
  • How to make multiple plots in one figure.
  • How to create bar-plots.

Want to access the code directly in Jupyter Notebook?

You can get the Jupyter Notebooks from GitHub here, where there are also direct links to Colab for an interactive experience.

Step 1: Read time series data into a DataFrame

A DataFrame is a two-dimensional tabular data structure. It is the primary data structure of Pandas, and it has labeled axes (rows and columns).

To get access to a DataFrame data structure, you need to import the Pandas library.

import pandas as pd
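As a minimal illustration of those labeled axes (the values here are made up for the example), you can build a small DataFrame by hand.

df = pd.DataFrame({'Price': [1.0, 2.0, 3.0]},
                  index=['2021-01-01', '2021-01-02', '2021-01-03'])
print(df)  # rows are labeled by the index, columns by name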

Then we need some time series data. You can download your own CSV file from financial pages like Yahoo! Finance.

For this tutorial we will use a dataset available on GitHub.

remote_file = "https://raw.githubusercontent.com/LearnPythonWithRune/FinancialDataAnalysisWithPython/main/AAPL.csv"
data = pd.read_csv(remote_file, index_col=0, parse_dates=True)

The pd.read_csv(…) call does all the magic. We set index_col=0, which makes the first column of the CSV file the index. This is the column of dates.

Then we set parse_dates=True to ensure that the dates are actually parsed as dates and not as strings. This is necessary to take advantage of the time series functionality, such as indexing with time intervals.
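Once the index is a DatetimeIndex, you can slice by partial date strings. A small sketch, assuming (as the bar chart later in this tutorial does) that the dataset covers the summer of 2020.

print(data.index.dtype)                           # datetime64[ns], not object
print(data.loc['2020-07'].head())                 # all rows in July 2020
print(data.loc['2020-07-01':'2020-08-15'].shape)  # an interval of dates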

Step 2: Import Matplotlib in Jupyter Notebook

When you import Matplotlib in Jupyter Notebook, you need to set a rendering mode.

import matplotlib.pyplot as plt
%matplotlib notebook

We will use the notebook mode, which is interactive. This enables you to zoom in on an interval, move around, and save the figure.

It is common to use inline mode for rendering in Jupyter Notebook. The inline mode creates a static image, which is not interactive.
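If you prefer the static rendering, the only change is the magic command.

%matplotlib inline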

Step 3: Use Matplotlib the Object-Oriented way

Matplotlib can be used in a functional way and an object-oriented way. Most use it in a functional way, which often creates more confusion, as it is not always intuitive how it works.

The object-oriented way leads to less confusion at the cost of one extra line of code and passing one argument. Hence, the price is low for the gain.

fig, ax = plt.subplots()
data['Close'].plot(ax=ax)
ax.set_ylabel("Price")
ax.set_title("AAPL")

The first line returns a figure and axis (fig and ax). The figure is where we put the axis, and the axis is the chart.

The actual plot is made by calling plot on the DataFrame; more precisely, we access the column Close, which is the Series of the historic Close prices.

Confused? Don’t worry about the details.

Notice that we pass ax=ax to plot. This ensures that we render the chart on the returned axis ax.

Finally, we add a y-label and a title to our axis.

Step 4: Creating multiple charts in one Matplotlib figure

How can we create multiple charts (or axes) in one Matplotlib figure?

Luckily, this is quite easy.

fig, ax = plt.subplots(2, 2)
data['Open'].plot(ax=ax[0, 0], title="Open")
data['High'].plot(ax=ax[0, 1], title="High")
data['Low'].plot(ax=ax[1, 0], title="Low")
data['Close'].plot(ax=ax[1, 1], title="Close")
plt.tight_layout()

Here we see a few differences. First, notice plt.subplots(2, 2), which returns a figure fig and a 2-by-2 array of axes. Hence, ax is a two-dimensional array of axes.

We can access the first axis with ax[0, 0] and pass it as an argument to plot.

This continues for all 4 plots we make, as you see.

Finally, we use plt.tight_layout(), which ensures that the axes do not overlap. You can try without it to see the difference.

Step 5: Create a bar-chart with Matplotlib

Finally, we will make a bar-chart with Matplotlib.

Actually, we will render a horizontal bar-chart.

fig, ax = plt.subplots()
data['Volume'].loc['2020-07-01':'2020-08-15'].plot.barh(ax=ax)

We do it for the volume and only on a limited interval of time. This shows you how to take advantage of the time series aspect of the DataFrame.
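A related trick is partial string indexing: you can select a whole month without spelling out both endpoints. A sketch on the same data.

fig, ax = plt.subplots()
data['Volume'].loc['2020-07'].plot.barh(ax=ax)  # all trading days in July 2020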

Next step

The above is part of the FREE 2h Video course.

Excel Automation with Simple Moving Average from Python

What will we cover in this tutorial?

We will retrieve the historic stock prices and calculate the moving average. Then we will export the data to Excel and insert a chart, but all done from Python.

See the in-depth explanation in the YouTube video. It also gives advice on how to interpret Simple Moving Averages (SMA).

Step 1: Read historic stock prices

We will use the Pandas-datareader to get the historic prices of NFLX (the ticker for Netflix).

import pandas_datareader as pdr
import datetime as dt

ticker = "NFLX"
start = dt.datetime(2019, 1, 1)

data = pdr.get_data_yahoo(ticker, start)
print(data.head())

And you will get the historic data for Netflix from January 1st, 2019.

                  High         Low        Open       Close    Volume   Adj Close
Date
2019-01-02  269.750000  256.579987  259.279999  267.660004  11679500  267.660004
2019-01-03  275.790009  264.429993  270.200012  271.200012  14969600  271.200012
2019-01-04  297.799988  278.540009  281.880005  297.570007  19330100  297.570007
2019-01-07  316.799988  301.649994  302.100006  315.339996  18620100  315.339996
2019-01-08  320.589996  308.010010  319.980011  320.269989  15359200  320.269989

Step 2: Understand Moving Average

We will calculate the Simple Moving Average as defined on Investopedia.

Simple Moving Average

The Simple Moving Average (from now on just referred to as Moving Average or MA) is defined by a period of days.

That is, the MA of a period of 10 (MA10) will take the average value of the last 10 close prices. This is done in a rolling way; hence, we will get an MA10 for every trading day in our historic data, except the first 9 days in our dataset.

We can similarly calculate an MA50 and MA200, which are Moving Averages of the last 50 and 200 days, respectively.
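A toy example makes the rolling behavior concrete; note how the first 9 entries have no MA10 value (NaN).

import pandas as pd

s = pd.Series(range(1, 13))   # 12 made-up close prices: 1, 2, ..., 12
print(s.rolling(10).mean())   # NaN for the first 9 rows, then 5.5, 6.5, 7.5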

Step 3: Calculating the Moving Averages

We can do that by using rolling and mean.

And it is magic.

data['MA10'] = data['Close'].rolling(10).mean()
data['MA50'] = data['Close'].rolling(50).mean()
data['MA200'] = data['Close'].rolling(200).mean()

print(data.tail())

That was easy, right?

                  High         Low        Open       Close    Volume   Adj Close        MA10        MA50      MA200
Date
2021-01-12  501.089996  485.670013  500.000000  494.250000   5990400  494.250000  515.297998  502.918599  477.08175
2021-01-13  512.349976  493.010010  495.500000  507.790009   5032100  507.790009  512.989999  503.559600  477.76590
2021-01-14  514.500000  499.579987  507.350006  500.859985   4177400  500.859985  510.616995  503.894399  478.39270
2021-01-15  506.320007  495.100006  500.000000  497.980011   5890200  497.980011  506.341998  504.109600  479.06220
2021-01-19  509.250000  493.540009  501.000000  501.769989  11996900  501.769989  504.232999  504.205999  479.72065

Step 4: Visualize it with Matplotlib

We can see the data with Matplotlib.

import matplotlib.pyplot as plt

data[['Close', 'MA10', 'MA50']].loc['2020-01-01':].plot()
plt.show()

Resulting in the following plot.

The output

Where you can see how the MA10 and MA50 move according to the price.

Step 5: Export to Excel

Now we will export the data to Excel.

For this we need to import Pandas and use the XlsxWriter engine; the details are in the code below.

The code can be found here.

import pandas as pd

# Keep only data from 2020 and reverse the rows so the newest date comes first
data = data.loc['2020-01-01':]
data = data.iloc[::-1]
writer = pd.ExcelWriter("technical.xlsx", 
                        engine='xlsxwriter', 
                        date_format = 'yyyy-mm-dd', 
                        datetime_format='yyyy-mm-dd')

sheet_name = 'Moving Average'
data[['Close', 'MA10', 'MA50']].to_excel(writer, sheet_name=sheet_name)


worksheet = writer.sheets[sheet_name]
workbook = writer.book

# Create a format for a green cell
green_cell = workbook.add_format({
    'bg_color': '#C6EFCE',
    'font_color': '#006100'
})

# Create a format for a red cell
red_cell = workbook.add_format({
    'bg_color': '#FFC7CE',                            
    'font_color': '#9C0006'
})


# Set column width of Date
worksheet.set_column(0, 0, 15)


for col in range(1, 4):
    # Create a conditional formatted of type formula
    worksheet.conditional_format(1, col, len(data), col, {
        'type': 'formula',                                    
        'criteria': '=C2>=D2',
        'format': green_cell
    })

    # Create a conditional formatted of type formula
    worksheet.conditional_format(1, col, len(data), col, {
        'type': 'formula',                                    
        'criteria': '=C2<D2',
        'format': red_cell
    })

# Create a new chart object.
chart1 = workbook.add_chart({'type': 'line'})

# Add a series to the chart.
chart1.add_series({
        'name': "MA10",
        'categories': [sheet_name, 1, 0, len(data), 0],
        'values': [sheet_name, 1, 2, len(data), 2],
})

# Create a new chart object.
chart2 = workbook.add_chart({'type': 'line'})

# Add a series to the chart.
chart2.add_series({
        'name': 'MA50',
        'categories': [sheet_name, 1, 0, len(data), 0],
        'values': [sheet_name, 1, 3, len(data), 3],
})

# Combine and insert title, axis names
chart1.combine(chart2)
chart1.set_title({'name': sheet_name + " " + ticker})
chart1.set_x_axis({'name': 'Date'})
chart1.set_y_axis({'name': 'Price'})

# Insert the chart into the worksheet.
worksheet.insert_chart('F2', chart1)

writer.close()

Where the output will be something similar to this.

Generated Excel sheet

How to Plot Time Series with Matplotlib

What will we cover in this tutorial?

In this tutorial we will show how to visualize time series with Matplotlib. We will do that using a Jupyter notebook, and you can download the resources (the notebook and data used) from here.

Step 1: What is a time series?

I am happy you asked.

The easiest way to understand it is to show it. If you downloaded the resources and started the Jupyter notebook, execute the following lines.

import pandas as pd

data = pd.read_csv("stock_data.csv", index_col=0, parse_dates=True)

data.head()

This will produce the following output.

                 High        Low       Open      Close       Volume  Adj Close
Date
2020-01-02  86.139999  84.342003  84.900002  86.052002   47660500.0  86.052002
2020-01-03  90.800003  87.384003  88.099998  88.601997   88892500.0  88.601997
2020-01-06  90.311996  88.000000  88.094002  90.307999   50665000.0  90.307999
2020-01-07  94.325996  90.671997  92.279999  93.811996   89410500.0  93.811996
2020-01-08  99.697998  93.646004  94.739998  98.428001  155721500.0  98.428001

You notice that the far-left column is called Date and that it is the index. This index has a time value, in this case a date.

Time series data is data “stamped” by a time. In this case, it is time indexed by dates.

The data you see is historic stock prices.
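Because the index consists of dates, you can select rows directly by time. A small sketch using dates we know are in the dataset.

print(data.loc['2020-01-07'])      # a single trading day
print(data.loc['2020-01'].head())  # partial string: all of January 2020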

Step 2: How to visualize data with Matplotlib

The above data is kept in a DataFrame (a Pandas data object), which makes it straightforward to visualize.

import matplotlib.pyplot as plt
%matplotlib notebook

data.plot()

Which will result in a chart similar to this one.

Result

This is not impressive. It seems like something is wrong.

Actually, there is nothing wrong. It just does what you asked for: it plots all 6 columns together in one chart. Because the Volume values are so large, all the other columns are squeezed into what looks like a single flat line.

Step 3: Matplotlib has a functional and object-oriented interface

This is often a bit confusing at first.

But Matplotlib has a functional and an object-oriented interface. We used the functional one.

If you try to execute the following in your Jupyter notebook.

data['My col'] = data['Volume']*0.5
data['My col'].plot()

It would seem like nothing happened.

But then investigate your previous plot.

Previous plot

It got updated with a new line. Hence, instead of creating a new chart (or figure), Matplotlib just added the line to the existing one.

If you want to learn more about the functional and object-oriented ways of using Matplotlib, we recommend this tutorial.

Step 4: How to make a new figure

What to do?

Well, you need to use the object-oriented interface of Matplotlib.

You can do that as follows.

fig1, ax1 = plt.subplots()
data['My col'].plot(ax=ax1)

Which will produce what you are looking for. A new figure.

The new figure

Step 5: Make multiple plots in one figure

This is getting fun.

How can you create multiple plots in one figure?

You actually do that when creating the figure.

fig2, ax2 = plt.subplots(2, 2)

data['Open'].plot(ax=ax2[0, 0])
data['High'].plot(ax=ax2[0, 1])
data['Low'].plot(ax=ax2[1, 0])
data['Close'].plot(ax=ax2[1, 1])
plt.tight_layout()

Notice that subplots(2, 2) creates a 2-by-2 array of axes you can use to create plots.

This should result in this chart.

Result

Step 6: Make a bar plot

This can be done as follows.

fig3, ax3 = plt.subplots()

data.loc[:'2020-01-31', 'Volume'].plot.bar(ax=ax3)

Notice that we only take the first month of the Volume data here (data.loc[:'2020-01-31', 'Volume']).

This should result in this figure.

Step 7: Save the figures

This is straightforward.

fig1.savefig("figure-1.png")
fig2.savefig("figure-2.png")
fig3.savefig("figure-3.png")

And the above figures should be available in the same location you are running your Jupyter notebook.

Next step

If you want to learn more about the functional and object-oriented ways of using Matplotlib, we recommend this tutorial.

How To Use Matplotlib Object-Oriented with NumPy and Pandas

What will we cover in this tutorial?

If you like data visualization with NumPy and Pandas, then you must have encountered Matplotlib.

And if you also like to program in an object-oriented fashion, then most tutorials will leave you wondering whether anyone loves the art of beautiful code.

Let me elaborate. The integration and interaction with Matplotlib is done in a functional way with a lot of side effects. Not nice.

Not sure what I am talking about? We will cover that too.

Step 1: How plotting NumPy data with Matplotlib is usually demonstrated and what is wrong with it

Let’s make a simple example.

import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 5, 11)
y = x ** 2
plt.plot(x, y)
plt.xlabel("X Label")
plt.ylabel("Y Label")
plt.title("Title")
plt.show()

This will result in the following chart.

That is nice and easy! So what is wrong with it?

Side effects!

What is a side effect in programming?

…that is to say has an observable effect besides returning a value (the main effect) to the invoker of the operation.

https://en.wikipedia.org/wiki/Side_effect_(computer_science)

What does that mean?

Well, let’s examine the above example.

We call plt.plot(x, y) and what happens? Actually, we don’t know, because we do not use anything it returns.

We continue to call plt.xlabel(…), plt.ylabel(…), and plt.title(…). Then we call plt.show() to see the result. Hence, each call changes the state of the plt library we imported. See, we did not create an object; we call the library directly.

This is difficult as a programmer to understand without having deep knowledge of the library used.
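You can make the hidden state visible yourself. A small sketch: plt.plot does return a value (a list of Line2D objects), but its main effect is mutating the library’s implicit current figure.

import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 5, 11)
lines = plt.plot(x, x ** 2)  # the return value most examples ignore
print(lines)                 # [<matplotlib.lines.Line2D object at ...>]
print(plt.gca())             # the current Axes the call silently modified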

So how do we do it in a more understandable way?

Step 2: How to create a chart with Matplotlib with NumPy in an object oriented way and why it is better

Let’s look at this code and examine it.

import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 5, 11)
y = x ** 2

fig, ax = plt.subplots()
ax.plot(x, y)
ax.set_xlabel("X Label")
ax.set_ylabel("Y Label")
ax.set_title("Title")
fig.show()
plt.waitforbuttonpress()

Here we do it differently but get the same result. It is more understandable: when we call a method on the object ax, the state of ax is changing, and not something hidden in the library as a side effect.

You can also show the figure fig by calling show() on it rather than on the library. This requires that we add waitforbuttonpress() on plt; otherwise the window is destroyed immediately.

Note that you do not have these challenges in a Jupyter notebook – the plots are shown without the call to show().

You could keep the plt.show() instead of fig.show() and plt.waitforbuttonpress(). But the above code is more intuitive and easier to understand.
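For reference, that variant keeps the object-oriented calls and only changes the ending.

fig, ax = plt.subplots()
ax.plot(x, y)
plt.show()  # replaces fig.show() and plt.waitforbuttonpress()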

How to create a chart with Matplotlib of a Pandas DataFrame in an object-oriented way

This is straightforward, as Matplotlib is well integrated with Pandas.

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

x = np.linspace(0, 5, 11)
y = x ** 2

df = pd.DataFrame(data=y, index=x)

fig, ax = plt.subplots()
ax.plot(df)
ax.set_xlabel("X Label")
ax.set_ylabel("Y Label")
ax.set_title("Title")
fig.show()
plt.waitforbuttonpress()

Notice that the DataFrame is created from the NumPy arrays; hence, here we do not gain anything from using it. This is just to exemplify how easy it is to use Matplotlib in an object-oriented way with Pandas.

Final thoughts

I have found that programmers either hate or love Matplotlib. I do not always know why, but I have discovered that this non-object-oriented way of using Matplotlib annoys some programmers.

This is a good reason to hate it, but I would say that there is no good alternative to Matplotlib – or at least, the alternatives are built upon Matplotlib.

I like the power and ease of using Matplotlib, and I do like the option of using it in an object-oriented way, which makes the code more intuitive and easier for other programmers to understand.

How To Extract Numbers From Strings in HTML Table and Export to Excel from Python

What will we cover in this tutorial?

How to import an HTML table into Excel.

But that is easy? You can do that directly from Excel.

Yes, but what if the entries contain numbers and strings together? Then the import will treat them as strings, which makes it difficult to extract the numbers.

Luckily, we will cover how to do that easily with Python.

Step 1: Get the dataset

Find your favorite HTML table online. For the purpose of this tutorial I will use the List of Metro Systems from Wikipedia.

View of HTML table of interest

Say we wanted to sum how many stations there are in this table (please notice that the table contains more rows than shown in the above picture).

If you import it directly into Excel with the import functionality, you will realize that the Stations column is interpreted as strings. The problem is that an entry looks like 19[13], while we are only interested in the number 19.

There is no built-in functionality to do that directly in Excel.

But let’s try to import this into Python. We will use Pandas to do that. If you are new to Pandas, please see this tutorial.

import pandas as pd


url = "https://en.wikipedia.org/wiki/List_of_metro_systems"
tables = pd.read_html(url)

print(tables[0].head())

Which will result in the following output.

           City    Country  ...          System length Annual ridership(millions)
0       Algiers    Algeria  ...  18.5 km (11.5 mi)[14]           45.3 (2019)[R 1]
1  Buenos Aires  Argentina  ...  56.7 km (35.2 mi)[16]          337.7 (2018)[R 2]
2       Yerevan    Armenia  ...   13.4 km (8.3 mi)[17]           20.2 (2019)[R 3]
3        Sydney  Australia  ...  36 km (22 mi)[19][20]  14.2 (2019) [R 4][R Nb 1]
4        Vienna    Austria  ...  83.3 km (51.8 mi)[21]          459.8 (2019)[R 6]

Here we have the same problem. If we inspect the types of the columns with the dtypes attribute, we get the following.
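print(tables[0].dtypes)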

City                          object
Country                       object
Name                          object
Yearopened                    object
Year of lastexpansion         object
Stations                      object
System length                 object
Annual ridership(millions)    object
dtype: object

All columns are actually of type object, which here is equivalent to a string.

Step 2: Extract the numbers from the Stations and System length columns

Each of the tables in tables is a DataFrame, which is Pandas' main data structure.

As the strings we want to convert to integers contain more information than just the numbers, we cannot simply use the Pandas function to_numeric().

We want to convert something of the form 19[13] to 19.

To do that easily, we will use the apply(…) method on the DataFrame.

The apply method takes a function as an argument and applies it to each row (axis=1 makes it row-wise).

We will use a lambda function as the argument. If you are not familiar with lambda functions, please read this tutorial.

import pandas as pd


url = "https://en.wikipedia.org/wiki/List_of_metro_systems"
tables = pd.read_html(url)
table = tables[0]

table['Stations'] = table.apply(lambda row: int(row['Stations'].split('[')[0]), axis=1)
table['System length'] = table.apply(lambda row: float(row['System length'].split()[0]), axis=1)

print(table[['Stations', 'System length']].head())

Which will result in the following output.

   Stations  System length
0        19           18.5
1        90           56.7
2        10           13.4
3        13           36.0
4        98           83.3

This is what we want.
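For completeness, the same cleanup can also be written with Pandas' vectorized string methods instead of apply; a sketch under the same assumptions about the column format.

table['Stations'] = table['Stations'].str.split('[').str[0].astype(int)
table['System length'] = table['System length'].str.split().str[0].astype(float)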

Step 3: Export to Excel

Wow. This needs an entire step?

Well, of course it does.

Here we need to unleash the power of Pandas and use the to_excel(…) method.

import pandas as pd


url = "https://en.wikipedia.org/wiki/List_of_metro_systems"
tables = pd.read_html(url)
table = tables[0]

table['Stations'] = table.apply(lambda row: int(row['Stations'].split('[')[0]), axis=1)
table['System length'] = table.apply(lambda row: float(row['System length'].split()[0]), axis=1)

table.to_excel('output.xlsx')

This will result in an Excel file looking similar to this, where the Stations and System length columns are numeric and not strings.

Excel file now with Stations and System length as numbers and not strings

What’s next?

Want to learn more about Python and Excel?

Check out my online guide.

How to Export Pandas DataFrame to Excel and Create a Trendline Graph of a Scatter Plot

What will we cover in this tutorial?

We will have some data in a Pandas DataFrame, which we want to export to an Excel sheet. Then we want to create a scatter plot and fit an Excel trendline to it.

Step 1: Get the data

You might have some data already that you want to use. It can be from an HTML page (example) or a CSV file.

For this purpose we just generate some random data to use. We will use NumPy’s uniform function to generate it.

import pandas as pd
import numpy as np


# Generate some random increasing data
data = pd.DataFrame(
    {'A': [np.random.uniform(0.1*i, 0.1*i + 1) for i in range(100)],
     'B': [np.random.uniform(0.1*i, 0.1*i + 1) for i in range(100)]}
)
print(data)

Which will generate some slightly increasing data, which is nice to fit a graph to.

The output could look something like this.

            A          B
0    0.039515   0.778077
1    0.451888   0.210705
2    0.992493   0.961428
3    0.317536   1.046444
4    1.220419   1.388086

Step 2: Create an Excel XlsxWriter engine

This step might require that you install the XlsxWriter library, which Pandas needs for this export.

This can be done by the following command.

pip install xlsxwriter

Now we can create the engine in our code.

import pandas as pd
import numpy as np


# Generate some random increasing data
data = pd.DataFrame(
    {'A': [np.random.uniform(0.1*i, 0.1*i + 1) for i in range(100)],
     'B': [np.random.uniform(0.1*i, 0.1*i + 1) for i in range(100)]}
)

# Create a Pandas Excel writer using XlsxWriter
excel_file = 'output.xlsx'
sheet_name = 'Data set'
writer = pd.ExcelWriter(excel_file, engine='xlsxwriter')

This will set up an Excel writer engine, ready to write to the file output.xlsx.

Step 3: Write the data to Excel and create a scatter graph with a fitted Trendline

This can be done by the following code, which uses the add_series function to insert a graph.

import pandas as pd
import numpy as np


# Generate some random increasing data
data = pd.DataFrame(
    {'A': [np.random.uniform(0.1*i, 0.1*i + 1) for i in range(100)],
     'B': [np.random.uniform(0.1*i, 0.1*i + 1) for i in range(100)]}
)

# Create a Pandas Excel writer using XlsxWriter
excel_file = 'output.xlsx'
sheet_name = 'Data set'
writer = pd.ExcelWriter(excel_file, engine='xlsxwriter')
data.to_excel(writer, sheet_name=sheet_name)

# Access the XlsxWriter workbook and worksheet objects from the dataframe.
workbook = writer.book
worksheet = writer.sheets[sheet_name]

# Create a scatter chart object.
chart = workbook.add_chart({'type': 'scatter'})

# Get the number of rows and column index
max_row = len(data)
col_x = data.columns.get_loc('A') + 1
col_y = data.columns.get_loc('B') + 1

# Create the scatter plot, use a trendline to fit it
chart.add_series({
    'name':       "Samples",
    'categories': [sheet_name, 1, col_x, max_row, col_x],
    'values':     [sheet_name, 1, col_y, max_row, col_y],
    'marker':     {'type': 'circle', 'size': 4},
    'trendline': {'type': 'linear'},
})

# Set name on axis
chart.set_x_axis({'name': 'Concentration'})
chart.set_y_axis({'name': 'Measured',
                  'major_gridlines': {'visible': False}})

# Insert the chart into the worksheet in field D2
worksheet.insert_chart('D2', chart)

# Close and save the Excel file
writer.save()

Result

The result should be similar to this.

The resulting Excel sheet.

That is how it can be done.

Pandas + GeoPandas + OpenCV: Create a Video of COVID-19 World Map

What will we cover?

How to create a video like the one below using Pandas + GeoPandas + OpenCV in Python.

  1. How to collect newest COVID-19 data in Python using Pandas.
  2. Prepare data and calculate values needed to create Choropleth map
  3. Get Choropleth map from GeoPandas and prepare to combine it
  4. Get the data frame by frame to the video
  5. Combine it all to a video using OpenCV

Step 1: Get the daily reported COVID-19 data world wide

This data is available from the European Centre for Disease Prevention and Control and can be found here.

All we need to do is download the CSV file, which has all the historic data from all the reporting countries.

This can be done as follows.

import pandas as pd


# Just to get more rows, columns and display width
pd.set_option('display.max_rows', 300)
pd.set_option('display.max_columns', 300)
pd.set_option('display.width', 1000)

# Get the updated data
table = pd.read_csv("https://opendata.ecdc.europa.eu/covid19/casedistribution/csv")

print(table)

This will give us an idea of how the data is structured.

          dateRep  day  month  year  cases  deaths countriesAndTerritories geoId countryterritoryCode  popData2019 continentExp  Cumulative_number_for_14_days_of_COVID-19_cases_per_100000
0      01/10/2020    1     10  2020     14       0             Afghanistan    AF                  AFG   38041757.0         Asia                                           1.040961         
1      30/09/2020   30      9  2020     15       2             Afghanistan    AF                  AFG   38041757.0         Asia                                           1.048847         
2      29/09/2020   29      9  2020     12       3             Afghanistan    AF                  AFG   38041757.0         Asia                                           1.114565         
3      28/09/2020   28      9  2020      0       0             Afghanistan    AF                  AFG   38041757.0         Asia                                           1.343261         
4      27/09/2020   27      9  2020     35       0             Afghanistan    AF                  AFG   38041757.0         Asia                                           1.540413         
...           ...  ...    ...   ...    ...     ...                     ...   ...                  ...          ...          ...                                                ...         
46221  25/03/2020   25      3  2020      0       0                Zimbabwe    ZW                  ZWE   14645473.0       Africa                                                NaN         
46222  24/03/2020   24      3  2020      0       1                Zimbabwe    ZW                  ZWE   14645473.0       Africa                                                NaN         
46223  23/03/2020   23      3  2020      0       0                Zimbabwe    ZW                  ZWE   14645473.0       Africa                                                NaN         
46224  22/03/2020   22      3  2020      1       0                Zimbabwe    ZW                  ZWE   14645473.0       Africa                                                NaN         
46225  21/03/2020   21      3  2020      1       0                Zimbabwe    ZW                  ZWE   14645473.0       Africa                                                NaN         

[46226 rows x 12 columns]

First we want to convert dateRep to a date object (it cannot be seen above, but the dates are represented as strings). Then we use it as the index for easier access later.

import pandas as pd


# Just to get more rows, columns and display width
pd.set_option('display.max_rows', 300)
pd.set_option('display.max_columns', 300)
pd.set_option('display.width', 1000)

# Get the updated data
table = pd.read_csv("https://opendata.ecdc.europa.eu/covid19/casedistribution/csv")

# Convert dateRep to date object
table['date'] = pd.to_datetime(table['dateRep'], format='%d/%m/%Y')
# Use date for index
table = table.set_index('date')

Step 2: Prepare data and compute values needed for plot

What makes sense to plot?

Good question. In a Choropleth map you color countries according to a value; here, the higher the value a country is represented with, the darker red it is colored.

If we plotted based on the number of new COVID-19 cases, the values would be high for countries with large populations. Hence, the number of COVID-19 cases per 100,000 people is used.

New COVID-19 cases per 100,000 people can still be volatile and change drastically from day to day. To even that out, a 7-day rolling sum can be used. That is, you take the sum of the last 7 days and continue that process through your data.

To make it even less volatile, the average of the last 14 days of the 7-day rolling sum is used.

And no, it is not just something I invented. It is used by the authorities in my home country to decide which countries are open for travel or not.

Based on the data above, this can be calculated as follows.

def get_stat(country_code, table):
    data = table.loc[table['countryterritoryCode'] == country_code]
    data = data.reindex(index=data.index[::-1])
    data['7 days sum'] = data['cases'].rolling(7).sum()
    data['7ds/100000'] = data['7 days sum'] * 100000 / data['popData2019']
    data['14 mean'] = data['7ds/100000'].rolling(14).mean()
    return data

The above function takes the table we returned in Step 1 and extracts a country based on a country code. Then it reverses the data to have the dates in chronological order.

After that, it computes the 7-day rolling sum, scales it to the number of new cases per 100,000 people of the country's population, and finally computes the 14-day average (mean) of that.
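A quick way to sanity-check the function is to run it for a single country code (USA here is just an example) and look at the tail.

usa = get_stat('USA', table)
print(usa[['cases', '7ds/100000', '14 mean']].tail())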

Step 3: Get the Choropleth map data and prepare it

GeoPandas is an amazing library to create Choropleth maps. But it does need your attention when you combine it with other data.

Here we want to combine it with the country codes (ISO_A3). If you inspect the data, some of the countries are missing that data.

Other than that, the code is straightforward.

import pandas as pd
import geopandas


# Just to get more rows, columns and display width
pd.set_option('display.max_rows', 300)
pd.set_option('display.max_columns', 300)
pd.set_option('display.width', 1000)

# Get the updated data
table = pd.read_csv("https://opendata.ecdc.europa.eu/covid19/casedistribution/csv")

# Convert dateRep to date object
table['date'] = pd.to_datetime(table['dateRep'], format='%d/%m/%Y')
# Use date for index
table = table.set_index('date')


def get_stat(country_code, table):
    data = table.loc[table['countryterritoryCode'] == country_code]
    data = data.reindex(index=data.index[::-1])
    data['7 days sum'] = data['cases'].rolling(7).sum()
    data['7ds/100000'] = data['7 days sum'] * 100000 / data['popData2019']
    data['14 mean'] = data['7ds/100000'].rolling(14).mean()
    return data


# Read the data to make a choropleth map
world = geopandas.read_file(geopandas.datasets.get_path('naturalearth_lowres'))
world = world[(world.pop_est > 0) & (world.name != "Antarctica")]

# Store data per country to make it easier
data_by_country = {}

for index, row in world.iterrows():
    # The world data is not fully updated with ISO_A3 names
    if row['iso_a3'] == '-99':
        country = row['name']
        if country == "Norway":
            world.at[index, 'iso_a3'] = 'NOR'
            row['iso_a3'] = "NOR"
        elif country == "France":
            world.at[index, 'iso_a3'] = 'FRA'
            row['iso_a3'] = "FRA"
        elif country == 'Kosovo':
            world.at[index, 'iso_a3'] = 'XKX'
            row['iso_a3'] = "XKX"
        elif country == "Somaliland":
            world.at[index, 'iso_a3'] = '---'
            row['iso_a3'] = "---"
        elif country == "N. Cyprus":
            world.at[index, 'iso_a3'] = '---'
            row['iso_a3'] = "---"

    # Add the data for the country
    data_by_country[row['iso_a3']] = get_stat(row['iso_a3'], table)

This will create a dictionary (data_by_country) with the needed data for each country. Notice that we do it like this because not all countries have the same number of data points.

Step 4: Create a Choropleth map for each date and save it as an image

This can be achieved by using Matplotlib.

The idea is to go through all the dates and, for each country, check whether it has data for that date and use it if it does.

import pandas as pd
import geopandas
import matplotlib.pyplot as plt


# Just to get more rows, columns and display width
pd.set_option('display.max_rows', 300)
pd.set_option('display.max_columns', 300)
pd.set_option('display.width', 1000)

# Get the updated data
table = pd.read_csv("https://opendata.ecdc.europa.eu/covid19/casedistribution/csv")

# Convert dateRep to date object
table['date'] = pd.to_datetime(table['dateRep'], format='%d/%m/%Y')
# Use date for index
table = table.set_index('date')


def get_stat(country_code, table):
    data = table.loc[table['countryterritoryCode'] == country_code]
    data = data.reindex(index=data.index[::-1])
    data['7 days sum'] = data['cases'].rolling(7).sum()
    data['7ds/100000'] = data['7 days sum'] * 100000 / data['popData2019']
    data['14 mean'] = data['7ds/100000'].rolling(14).mean()
    return data


# Read the data to make a choropleth map
world = geopandas.read_file(geopandas.datasets.get_path('naturalearth_lowres'))
world = world[(world.pop_est > 0) & (world.name != "Antarctica")]

# Store data per country to make it easier
data_by_country = {}

for index, row in world.iterrows():
    # The world data is not fully updated with ISO_A3 names
    if row['iso_a3'] == '-99':
        country = row['name']
        if country == "Norway":
            world.at[index, 'iso_a3'] = 'NOR'
            row['iso_a3'] = "NOR"
        elif country == "France":
            world.at[index, 'iso_a3'] = 'FRA'
            row['iso_a3'] = "FRA"
        elif country == 'Kosovo':
            world.at[index, 'iso_a3'] = 'XKX'
            row['iso_a3'] = "XKX"
        elif country == "Somaliland":
            world.at[index, 'iso_a3'] = '---'
            row['iso_a3'] = "---"
        elif country == "N. Cyprus":
            world.at[index, 'iso_a3'] = '---'
            row['iso_a3'] = "---"

    # Add the data for the country
    data_by_country[row['iso_a3']] = get_stat(row['iso_a3'], table)

# Create an image per date
for day in pd.date_range('12-31-2019', '10-01-2020'):
    print(day)
    world['number'] = 0.0
    for index, row in world.iterrows():
        if day in data_by_country[row['iso_a3']].index:
            world.at[index, 'number'] = data_by_country[row['iso_a3']].loc[day]['14 mean']

    world.plot(column='number', legend=True, cmap='OrRd', figsize=(15, 5))
    plt.title(day.strftime("%Y-%m-%d"))
    plt.savefig(f'image-{day.strftime("%Y-%m-%d")}.png')
    plt.close()

This will create an image for each day. These images will be combined in the next step.

Step 5: Create a video from images with OpenCV

Using OpenCV to create a video from a sequence of images is quite easy. The only thing you need to ensure is that it reads the images in the correct order.

import cv2
import glob

img_array = []
filenames = glob.glob('image-*.png')
filenames.sort()
for filename in filenames:
    print(filename)
    img = cv2.imread(filename)
    height, width, layers = img.shape
    size = (width, height)
    img_array.append(img)

out = cv2.VideoWriter('covid.avi', cv2.VideoWriter_fourcc(*'DIVX'), 15, size)

for i in range(len(img_array)):
    out.write(img_array[i])
out.release()

Where we use the VideoWriter from OpenCV.

This results in this video.

Performance comparison of Numba vs Vectorization vs Lambda function with NumPy

What will we cover in this tutorial?

We will continue our investigation of Numba from this tutorial.

Numba is a just-in-time compiler for Python that works amazingly well with NumPy. As we saw in the last tutorial, built-in vectorization can, depending on the case and the size of the instance, be faster than Numba.

Here we will explore that further and also see how Numba compares with lambda functions. Lambda functions have the advantage that they can be passed as an argument down to a library, which can optimize the performance and not depend on slow Python code.

Step 1: Example of Vectorization slower than Numba

In the previous tutorial we only investigated an example of vectorization that was faster than Numba. Here we will see that this is not always the case.

import numpy as np
from numba import jit
import time

size = 100
x = np.random.rand(size, size)
y = np.random.rand(size, size)
iterations = 100000


@jit(nopython=True)
def add_numba(a, b):
    c = np.zeros(a.shape)
    for i in range(a.shape[0]):
        for j in range(a.shape[1]):
            c[i, j] = a[i, j] + b[i, j]
    return c


def add_vectorized(a, b):
    return a + b


# We call the function once, to precompile the code
z = add_numba(x, y)
start = time.time()
for _ in range(iterations):
    z = add_numba(x, y)
end = time.time()
print("Elapsed (numba, precompiled) = %s" % (end - start))

start = time.time()
for _ in range(iterations):
    z = add_vectorized(x, y)
end = time.time()
print("Elapsed (vectorized) = %s" % (end - start))

Varying the size of the NumPy array, we can compare the performance of the two in the graph below.

Where it is clear that the vectorized approach is slower.

Step 2: Try some more complex example comparing vectorized and Numba

An if-then-else can be expressed in vectorized form using the NumPy where function.

import numpy as np
from numba import jit
import time


size = 1000
x = np.random.rand(size, size)
iterations = 1000


@jit(nopython=True)
def numba(a):
    c = np.zeros(a.shape)
    for i in range(a.shape[0]):
        for j in range(a.shape[1]):
            if a[i, j] < 0.5:
                c[i, j] = 1
    return c


def vectorized(a):
    return np.where(a < 0.5, 1, 0)


# We call the numba function to precompile it before we measure it
z = numba(x)
start = time.time()
for _ in range(iterations):
    z = numba(x)
end = time.time()
print("Elapsed (numba, precompiled) = %s" % (end - start))

start = time.time()
for _ in range(iterations):
    z = vectorized(x)
end = time.time()
print("Elapsed (vectorized) = %s" % (end - start))

This results in the following comparison.

That is close, but the vectorized approach is a bit faster.

Step 3: Compare Numba with lambda functions

I am very curious about this. Lambda functions are controversial in Python, and many are not happy about them, as their syntax is not well aligned with the rest of Python. On the other hand, lambda functions have the advantage that you can send them down into a library that can optimize over the for-loops.

import numpy as np
import pandas as pd
from numba import jit
import time

size = 1000
x = np.random.rand(size, size)
# Wrap the array in a DataFrame, as the lambda example below uses Pandas' apply
df = pd.DataFrame(x)
iterations = 1000


@jit(nopython=True)
def numba(a):
    c = np.zeros((size, size))
    for i in range(a.shape[0]):
        for j in range(a.shape[1]):
            c[i, j] = a[i, j] + 1
    return c


def lambda_run(a):
    # Pass the lambda down to Pandas, which applies it to each column
    return a.apply(lambda col: col + 1)


# Call the numba function to precompile it before time measurement
z = numba(x)
start = time.time()
for _ in range(iterations):
    z = numba(x)
end = time.time()
print("Elapsed (numba, precompiled) = %s" % (end - start))

start = time.time()
for _ in range(iterations):
    z = lambda_run(df)
end = time.time()
print("Elapsed (lambda) = %s" % (end - start))

Resulting in the following performance comparison.

This is again tight, but the lambda approach is still a bit faster.

Remember, this is a simple lambda function, and we cannot conclude that lambda functions in general are faster than using Numba.

Conclusion

The learning since the last tutorial is that we have found an example where simple vectorization is slower than Numba. This still leads to the conclusion that performance highly depends on the task. Further, the lambda function approach seems to give promising performance. Again, all of this should be compared to the slow approach of a plain Python for-loop without Numba's just-in-time compiled machine code.

When to use Numba with Python NumPy: Vectorization vs Numba

What will we cover in this tutorial?

You just want your code to run fast, right? Numba is a just-in-time compiler for Python that works amazingly well with NumPy. Does that mean we should always use Numba?

Well, let’s try some examples out and learn. If you know about NumPy, you know you should use vectorization to get speed. Does Numba beat that?

Step 1: Let’s learn how Numba works

Numba will compile the Python code into machine code and run it. So what about the just-in-time part? It means that the first time the code you want turned into machine code is used, Numba compiles it and then runs it. Any time after that, it just runs it, as it is already compiled.

Let’s try that.

import numpy as np
from numba import jit
import time


@jit(nopython=True)
def full_sum_numba(a):
    sum = 0.0
    for i in range(a.shape[0]):
        for j in range(a.shape[1]):
            sum += a[i, j]
    return sum


iterations = 1000
size = 10000
x = np.random.rand(size, size)

start = time.time()
full_sum_numba(x)
end = time.time()
print("Elapsed (Numba) = %s" % (end - start))

start = time.time()
full_sum_numba(x)
end = time.time()
print("Elapsed (Numba) = %s" % (end - start))

Where you get.

Elapsed (Numba) = 0.41634082794189453
Elapsed (Numba) = 0.11176300048828125

Where you see a difference in runtime.

Oh, did you get what happened in the code? Well, if you put @jit(nopython=True) in front of a function, Numba will try to compile it and run it as machine code.

As you see above, the first call has an overhead in run-time, because it first compiles and then runs the code. The second time, the code is already compiled and runs immediately.

Step 2: Compare Numba just-in-time code to native Python code

So let us compare how much you gain by using Numba just-in-time (@jit) in our code.

import numpy as np
from numba import jit
import time


def full_sum(a):
    sum = 0.0
    for i in range(a.shape[0]):
        for j in range(a.shape[1]):
            sum += a[i, j]
    return sum


@jit(nopython=True)
def full_sum_numba(a):
    sum = 0.0
    for i in range(a.shape[0]):
        for j in range(a.shape[1]):
            sum += a[i, j]
    return sum


iterations = 1000
size = 10000
x = np.random.rand(size, size)

start = time.time()
full_sum(x)
end = time.time()
print("Elapsed (No Numba) = %s" % (end - start))

start = time.time()
full_sum_numba(x)
end = time.time()
print("Elapsed (Numba) = %s" % (end - start))

start = time.time()
full_sum_numba(x)
end = time.time()
print("Elapsed (Numba) = %s" % (end - start))

Here we added a native Python function without the @jit in front, and we will compare it with one that has it.

Elapsed (No Numba) = 38.08543515205383
Elapsed (Numba) = 0.41634082794189453
Elapsed (Numba) = 0.11176300048828125

That is some difference. Also, we have plotted a few more runs in the graph below.

It seems pretty evident.

Step 3: Comparing it with Vectorization

If you don’t know what vectorization is, we can recommend this tutorial. The reason to have vectorization is to move the expensive for-loops into the function call to have optimized code run it.

That sounds a lot like what Numba can do. It can change the expensive for-loops into fast machine code.

But which one is faster?

Well, I think there are two parameters to try out. First, the size of the problem. Second, to see if the number of iterations matter.

import numpy as np
from numba import jit
import time


@jit(nopython=True)
def full_sum_numba(a):
    sum = 0.0
    for i in range(a.shape[0]):
        for j in range(a.shape[1]):
            sum += a[i, j]
    return sum


def full_sum_vectorized(a):
    return a.sum()


iterations = 1000
size = 10000
x = np.random.rand(size, size)

start = time.time()
full_sum_vectorized(x)
end = time.time()
print("Elapsed (vectorized) = %s" % (end - start))

start = time.time()
full_sum_numba(x)
end = time.time()
print("Elapsed (Numba) = %s" % (end - start))

start = time.time()
full_sum_numba(x)
end = time.time()
print("Elapsed (Numba) = %s" % (end - start))

As a function of the size.

It is interesting that Numba is faster for small sizes of the problem, while the vectorized approach seems to outperform Numba for bigger sizes.

And not surprisingly, the number of iterations only makes the difference bigger.

This is not surprising, as the code in a vectorized call can be more specifically optimized than the more general-purpose Numba approach.

Conclusion

Does that mean that Numba does not pay off to use?

No, not at all. First of all, we have only tried it for one vectorized approach, which was obviously very easy to optimize. Secondly, not all loops can be turned into vectorized code. In general, it is difficult to keep a state in a vectorized approach. Hence, if you need to keep track of some internal state in a loop, it can be difficult to find a vectorized approach, as the sketch below shows.
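As a hedged sketch of such a stateful loop (the function and the cap value are made up for illustration): a running sum that resets once it passes a cap, so every step depends on the state left behind by the previous one.

import numpy as np
from numba import jit


@jit(nopython=True)
def capped_cumsum(a, cap):
    # Running sum that resets when it exceeds cap. The value at step i
    # depends on the state carried over from step i - 1, which is hard
    # to express as a single vectorized NumPy call.
    c = np.zeros(a.shape[0])
    total = 0.0
    for i in range(a.shape[0]):
        total += a[i]
        if total > cap:
            total = 0.0
        c[i] = total
    return c


print(capped_cumsum(np.random.rand(10), 2.0))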

Multiple Time Frame Analysis on a Stock using Pandas

What will we investigate in this tutorial?

A key element to success in trading is to understand the market and the trend of the stock before you buy it. In this tutorial we will not cover how to read the market, but take a top-down analysis approach to stock prices. We will use what is called Multiple Time Frame Analysis on a stock starting with a 1-month, 1-week, and 1-day perspective. Finally, we will compare that with a Simple Moving Average with a monthly view.

Step 1: Gather the data with different time frames

We will use the Pandas-datareader library to collect the time series of a stock. The library has an endpoint to read data from Yahoo! Finance, which we will use as it does not require registration and can deliver the data we need.

import pandas_datareader as pdr
import datetime as dt


ticker = "MSFT"
start = dt.datetime(2019, 1, 1)
end = dt.datetime.now()
day = pdr.get_data_yahoo(ticker, start, end, interval='d')
week = pdr.get_data_yahoo(ticker, start, end, interval='wk')
month = pdr.get_data_yahoo(ticker, start, end, interval='mo')

Where the key is to set the interval to 'd' (Day), 'wk' (Week), and 'mo' (Month).

This will give us 3 DataFrames, each indexed with different intervals.

Daily.

                  High         Low  ...      Volume   Adj Close
Date                                ...                        
2019-01-02  101.750000   98.940002  ...  35329300.0   98.860214
2019-01-03  100.190002   97.199997  ...  42579100.0   95.223351
2019-01-04  102.510002   98.930000  ...  44060600.0   99.652115
2019-01-07  103.269997  100.980003  ...  35656100.0   99.779205
2019-01-08  103.970001  101.709999  ...  31514400.0  100.502670

Weekly.

                  High         Low  ...       Volume   Adj Close
Date                                ...                         
2019-01-01  103.269997   97.199997  ...  157625100.0   99.779205
2019-01-08  104.879997  101.260002  ...  150614100.0   99.769432
2019-01-15  107.900002  101.879997  ...  127262100.0  105.302940
2019-01-22  107.879997  104.660004  ...  142112700.0  102.731720
2019-01-29  106.379997  102.169998  ...  203449600.0  103.376968

Monthly.

                  High         Low  ...        Volume   Adj Close
Date                                ...                          
2019-01-01  107.900002   97.199997  ...  7.142128e+08  102.096245
2019-02-01  113.239998  102.349998  ...  4.690959e+08  109.526405
2019-03-01  120.820000  108.800003  ...  5.890958e+08  115.796768
2019-04-01  131.369995  118.099998  ...  4.331577e+08  128.226700
2019-05-01  130.649994  123.040001  ...  5.472188e+08  121.432449
2019-06-01  138.399994  119.010002  ...  5.083165e+08  132.012497

Step 2: Combine data and interpolate missing points

The challenge in connecting the DataFrames is that they have different index entries. If we combine the data points from Daily with Weekly, there will be a lot of missing entries for the dates that Daily has but Weekly does not.

                   day        week
Date                              
2019-01-02  101.120003         NaN
2019-01-03   97.400002         NaN
2019-01-04  101.930000         NaN
2019-01-07  102.059998         NaN
2019-01-08  102.800003  102.050003
...                ...         ...
2020-08-13  208.699997         NaN
2020-08-14  208.899994         NaN
2020-08-17  210.279999         NaN
2020-08-18  211.490005  209.699997
2020-08-19  209.699997  209.699997

To deal with that we can choose to interpolate by using the DataFrame interpolate function.

import pandas_datareader as pdr
import datetime as dt
import pandas as pd


ticker = "MSFT"
start = dt.datetime(2019, 1, 1)
end = dt.datetime.now()
day = pdr.get_data_yahoo(ticker, start, end, interval='d')
week = pdr.get_data_yahoo(ticker, start, end, interval='wk')
month = pdr.get_data_yahoo(ticker, start, end, interval='mo')

data = pd.DataFrame()
data['day'] = day['Close']
data['week'] = week['Close']
data['week'] = data['week'].interpolate(method='linear')
print(data)

Which results in the following output.

                   day        week
Date                              
2019-01-02  101.120003         NaN
2019-01-03   97.400002         NaN
2019-01-04  101.930000         NaN
2019-01-07  102.059998         NaN
2019-01-08  102.800003  102.050003
...                ...         ...
2020-08-13  208.699997  210.047998
2020-08-14  208.899994  209.931998
2020-08-17  210.279999  209.815997
2020-08-18  211.490005  209.699997
2020-08-19  209.699997  209.699997

Where the missing points (except the leading entries) are filled in linearly. This can be done for months as well, but we need to be more careful because of three things. First, some dates (the 1st of the month) do not exist in the data DataFrame. To solve that we use an outer join, which will include them. Second, this introduces some extra dates, which are not trading dates. Hence, we need to delete them afterwards, which we can do by deleting the column (drop) and removing rows with NA values (dropna). Third, we also need to understand that the monthly view looks backwards: the value for the 1st of January is only finalized on the last day of January. Therefore we shift it back in the join.

import pandas_datareader as pdr
import datetime as dt
import pandas as pd


ticker = "MSFT"
start = dt.datetime(2019, 1, 1)
end = dt.datetime.now()
day = pdr.get_data_yahoo(ticker, start, end, interval='d')
week = pdr.get_data_yahoo(ticker, start, end, interval='wk')
month = pdr.get_data_yahoo(ticker, start, end, interval='mo')


data = pd.DataFrame()
data['day'] = day['Close']
data['week'] = week['Close']
data['week'] = data['week'].interpolate(method='index')
data = data.join(month['Close'].shift(), how='outer')
data['month'] = data['Close'].interpolate(method='index')
data = data.drop(columns=['Close']).dropna()
data['SMA20'] = data['day'].rolling(20).mean()

Step 3: Visualize the output and take a look at it

Visualizing it is straightforward using Matplotlib.

import pandas_datareader as pdr
import datetime as dt
import matplotlib.pyplot as plt
import pandas as pd


ticker = "MSFT"
start = dt.datetime(2019, 1, 1)
end = dt.datetime.now()
day = pdr.get_data_yahoo(ticker, start, end, interval='d')
week = pdr.get_data_yahoo(ticker, start, end, interval='wk')
month = pdr.get_data_yahoo(ticker, start, end, interval='mo')


data = pd.DataFrame()
data['day'] = day['Close']
data['week'] = week['Close']
data['week'] = data['week'].interpolate(method='index')
data = data.join(month['Close'].shift(), how='outer')
data['month'] = data['Close'].interpolate(method='index')
data = data.drop(columns=['Close']).dropna()

data.plot()
plt.show()

Which results in the following graph.

As expected, the monthly price is adjusted to be the closing price of the last day of the previous month. Hence, it looks like the monthly curve crosses the day curve on the 1st of every month (which is almost true).

To really appreciate Multiple Time Frame Analysis, it is better to keep the graphs separate and interpret each of them in isolation.

Step 4: How to use these different Multiple Time Frame Analysis

Given the picture it is a good idea to start top down. First look at the monthly picture, which shows the overall trend.

Month view of MSFT.

In the case of MSFT it is a clear growing trend, with the exception of two declines. But the overall impression is a company in growth that does not seem to slow down. Even the Dow theory (see this tutorial on it) suggests that there will be secondary movements in a general bull trend.

Secondly, we will look at the weekly view.

Weekly view of MSFT

Here the impression is a bit more volatile. It shows many smaller ups and downs, with a big one in March 2020. It could also indicate a small decline in the growth right at the end. Also, the Dow theory could suggest that it will turn, though it is not certain.

Finally, the daily view gives an even more volatile picture, which can be used to decide when to enter the market.

Day view of MSFT

Here you could also be a bit worried. Is this the start of a smaller bear market?

To sum up: in the month-view we have concluded growth. The week-view shows signs of possible change. Finally, the day-view also shows signs of a possible decline.

As an investor, and based on the above, I would not enter the market right now. If both the month-view and week-view showed growth while the day-view declined, that would be a good indicator. You want the top levels to show growth, while the day-view might show a small decline.

Finally, remember that you should not rely on just one method to decide whether to enter the market or not.

Step 5: Is monthly the same as a Simple Moving Average?

Good question, I am glad you asked. The Simple Moving Average (SMA) can be calculated easily with DataFrames using the rolling and mean functions.

The best way is to just try it.

import pandas_datareader as pdr
import datetime as dt
import matplotlib.pyplot as plt
import pandas as pd


ticker = "MSFT"
start = dt.datetime(2019, 1, 1)
end = dt.datetime.now()
day = pdr.get_data_yahoo(ticker, start, end, interval='d')
week = pdr.get_data_yahoo(ticker, start, end, interval='wk')
month = pdr.get_data_yahoo(ticker, start, end, interval='mo')


data = pd.DataFrame()
data['day'] = day['Close']
data['week'] = week['Close']
data['week'] = data['week'].interpolate(method='index')
data = data.join(month['Close'].shift(), how='outer')
data['month'] = data['Close'].interpolate(method='index')
data = data.drop(columns=['Close']).dropna()
data['SMA20'] = data['day'].rolling(20).mean()

data.plot()
plt.show()

As you see, the SMA is not as reactive to the crisis in March 2020 as the monthly view is. This does not exclude the one from the other, but shows a difference in how they react.

Comparing the month-view with a Simple Moving Average of a month (20 trading days)

Please remember that the monthly view is only updated at the end of a month, while the SMA is updated on a daily basis.

Another difference is that the SMA is an average of the last 20 days, while the monthly view is the actual value of the last day of a month (as we look at Close). This implies that the monthly view can be much more volatile than the SMA.

Conclusion

It is advised to start the analysis from bigger time frames and zoom in. This way you first look at overall trends and get a bigger picture of the market. This should prevent you from getting fixated on a small detail in the market, and instead help you understand it at a higher level.