Visualize Why Long-term Investing is Less Risky – Pandas and Matplotlib

What will we cover in this tutorial?

We will look at how you can use Pandas Datareader and Matplotlib to create a visualization of why long-term investing is less risky.

Here, risk simply means the risk of losing money.

Specifically, we will investigate how likely it is to lose money (and how much) if you invest with a 1-year horizon vs a 10-year horizon.

Step 1: Establish the data for the investigation

One of the most widely used indices is the S&P 500. It lists 500 large companies on US stock exchanges and is one of the most commonly followed equity indices.

We will use this index and retrieve data back from 1970 and up until today.

This can be done as follows.

import pandas_datareader as pdr
from datetime import datetime

data = pdr.get_data_yahoo('^GSPC', datetime(1970, 1, 1))

Then the DataFrame data will contain all data from 1970 up until today. The ^GSPC is the ticker for the S&P 500 index.

Step 2: Calculate the annual return from 1970 and forward using Pandas

The annual return for a year is calculated by taking the last trading value of the year divided by the first, subtracting 1, and multiplying by 100 to get it as a percentage.

Calculating it for all years, you can visualize it with a histogram as follows.

import pandas as pd
import pandas_datareader as pdr
from datetime import datetime
import matplotlib.pyplot as plt


data = pdr.get_data_yahoo('^GSPC', datetime(1970, 1, 1))

years = []
annual_return = []

for year in range(1970, 2021):
    years.append(year)
    data_year = data.loc[f'{year}']['Adj Close']
    annual_return.append((data_year.iloc[-1] / data_year.iloc[0] - 1) * 100)

df = pd.DataFrame(annual_return, index=years)
bins = [i for i in range(-40, 45, 5)]
df.plot.hist(bins=bins, title='1 year')
plt.show()

Notice that we create a new DataFrame with all the annual returns for each of the years and use it to make a histogram.

The result is as follows.

What you see is a histogram indicating how many years a given annual return occurred.

Hence, a -40 to -35% (negative) return occurred once, while a 0 to 5% return happened 6 times in the span of years from 1970 to 2020 (inclusive).

What does this tell us?

Well, you can lose up to 40%, but you can also gain up to 35% in one year. It also shows that a gain (positive return) is more likely than a loss.

But what if we invested the money for 10 years?

Step 3: Calculate the average annual return in 10 years spans starting from 1970 using Pandas

This is actually quite similar, but with a few changes.

First of all, the average return is calculated using the CAGR (Compound Annual Growth Rate) formula.
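To see what the CAGR formula does, it can be tried on made-up numbers (the figures below are purely illustrative, not from the S&P 500 data):

```python
# Worked example of the CAGR formula (illustrative numbers):
# an investment growing from 100 to 200 over 10 years.
start, end, years = 100.0, 200.0, 10
cagr = ((end / start) ** (1 / years) - 1) * 100
print(round(cagr, 2))  # 7.18, i.e. about 7.18% per year on average
```

Notice that doubling your money over 10 years corresponds to a much smaller annual rate than 10%, because the growth compounds.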

This results in the following code.

import pandas as pd
import pandas_datareader as pdr
from datetime import datetime
import matplotlib.pyplot as plt


data = pdr.get_data_yahoo('^GSPC', datetime(1970, 1, 1))

years = []
avg_annual_return = []
for year in range(1970, 2011):
    years.append(year)
    data_year = data.loc[f'{year}':f'{year + 9}']['Adj Close']
    avg_annual_return.append(((data_year.iloc[-1] / data_year.iloc[0]) ** (1 / 10) - 1) * 100)

df = pd.DataFrame(avg_annual_return, index=years)
bins = [i for i in range(-40, 45, 5)]
df.plot.hist(bins=bins, title='10 years')
plt.show()

There are a few changes. One is the formula for the average annual return (as stated above), and the other is that we use 10 years of data. Notice that we only add 9 to the year; this is because both endpoint years are inclusive.

This results in this histogram.

As you can see, in roughly one in three cases there was a negative return over a 10-year span. Also, the loss was only in the range -5 to 0%; otherwise, the return was positive.

Now isn’t that nice?

Fix get_data_yahoo from Pandas Datareader

What will we cover?

If you use get_data_yahoo from Pandas Datareader and it suddenly stopped working, we will look at how to fix it.

The Error and Problem

Consider this code.

import pandas_datareader as pdr
from datetime import datetime

data = pdr.get_data_yahoo('^GSPC', datetime(1970, 1, 1))

It had been working up until now, but suddenly it fails with the following error.

Traceback (most recent call last):
  File "/Users/rune/PycharmProjects/TEST/test_yahoo.py", line 4, in <module>
    data = pdr.get_data_yahoo('^GSPC', datetime(1970, 1, 1))
  File "/Users/rune/PycharmProjects/TEST/venv/lib/python3.8/site-packages/pandas_datareader/data.py", line 86, in get_data_yahoo
    return YahooDailyReader(*args, **kwargs).read()
  File "/Users/rune/PycharmProjects/TEST/venv/lib/python3.8/site-packages/pandas_datareader/base.py", line 253, in read
    df = self._read_one_data(self.url, params=self._get_params(self.symbols))
  File "/Users/rune/PycharmProjects/TEST/venv/lib/python3.8/site-packages/pandas_datareader/yahoo/daily.py", line 153, in _read_one_data
    resp = self._get_response(url, params=params)
  File "/Users/rune/PycharmProjects/TEST/venv/lib/python3.8/site-packages/pandas_datareader/base.py", line 181, in _get_response
    raise RemoteDataError(msg)
pandas_datareader._utils.RemoteDataError: Unable to read URL: https://finance.yahoo.com/quote/^GSPC/history?period1=10800&period2=1627523999&interval=1d&frequency=1d&filter=history
Response Text:
b'<!DOCTYPE html>\n  <html lang="en-us"><head>\n  <meta http-equiv="content-type" content="text/html; charset=UTF-8">\n      <meta charset="utf-8">\n      <title>Yahoo</title>\n      <meta name="viewport" content="width=device-width,initial-scale=1,minimal-ui">\n      <meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1">\n      <style>\n  html {\n      height: 100%;\n  }\n  body {\n      background: #fafafc url(https://s.yimg.com/nn/img/sad-panda-201402200631.png) 50% 50%;\n      background-size: cover;\n      height: 100%;\n      text-align: center;\n      font: 300 18px "helvetica neue", helvetica, verdana, tahoma, arial, sans-serif;\n  }\n  table {\n      height: 100%;\n      width: 100%;\n      table-layout: fixed;\n      border-collapse: collapse;\n      border-spacing: 0;\n      border: none;\n  }\n  h1 {\n      font-size: 42px;\n      font-weight: 400;\n      color: #400090;\n  }\n  p {\n      color: #1A1A1A;\n  }\n  #message-1 {\n      font-weight: bold;\n      margin: 0;\n  }\n  #message-2 {\n      display: inline-block;\n      *display: inline;\n      zoom: 1;\n      max-width: 17em;\n      _width: 17em;\n  }\n      </style>\n  <script>\n    document.write(\'<img src="//geo.yahoo.com/b?s=1197757129&t=\'+new Date().getTime()+\'&src=aws&err_url=\'+encodeURIComponent(document.URL)+\'&err=%<pssc>&test=\'+encodeURIComponent(\'%<{Bucket}cqh[:200]>\')+\'" width="0px" height="0px"/>\');var beacon = new Image();beacon.src="//bcn.fp.yahoo.com/p?s=1197757129&t="+new Date().getTime()+"&src=aws&err_url="+encodeURIComponent(document.URL)+"&err=%<pssc>&test="+encodeURIComponent(\'%<{Bucket}cqh[:200]>\');\n  </script>\n  </head>\n  <body>\n  <!-- status code : 404 -->\n  <!-- Not Found on Server -->\n  <table>\n  <tbody><tr>\n      <td>\n      <img src="https://s.yimg.com/rz/p/yahoo_frontpage_en-US_s_f_p_205x58_frontpage.png" alt="Yahoo Logo">\n      <h1 style="margin-top:20px;">Will be right back...</h1>\n      <p id="message-1">Thank you for your 
patience.</p>\n      <p id="message-2">Our engineers are working quickly to resolve the issue.</p>\n      </td>\n  </tr>\n  </tbody></table>\n  </body></html>'

What to do?

The fix

There has been a breaking change and you need to update your Pandas Datareader.

You can upgrade to the newest version as follows.

pip install pandas_datareader --upgrade

It should update it to version 0.10.0 or later.
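If you want to verify which version is installed, a quick check could look like this (a standard-library sketch; it queries the installed package metadata by its PyPI name):

```python
from importlib.metadata import version, PackageNotFoundError

# Print the installed pandas-datareader version, if any.
try:
    installed = version("pandas-datareader")
except PackageNotFoundError:
    installed = "not installed"
print(installed)
```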

Then the code should work again.

The Best Trading Strategy Analyzed with Pandas

What will we cover in this tutorial?

Believe it or not, there are many promises of the best trading strategies out there.

In this one we will analyze a promising strategy and see whether it holds up.

What is it?

The bulletproof strategy over a 20-year span

This looks amazing; how can it perform this well? Simply said, it combines short selling and a safety margin to avoid losing money.

On the surface, it seems to take precautions. But let’s try to examine it closer.

Step 1: Implementing the strategy

The strategy uses two Moving Averages, a 42-day (MA42) and a 252-day (MA252), together with a safety margin (or safety distance).

The strategy is as follows:

  • When MA42 – MA252 > safety distance, then go long.
  • When MA252 – MA42 > safety distance, then go short.

To analyze and implement the strategy, we need Pandas Datareader to read the historical ticker prices.

import pandas_datareader as pdr
import numpy as np
from datetime import datetime

data = pdr.get_data_yahoo('^GSPC', datetime(2000, 1, 1), datetime(2020, 1, 1))

data['MA42'] = data['Adj Close'].rolling(42).mean()
data['MA252'] = data['Adj Close'].rolling(252).mean()

data['42-252'] = data['MA42'] - data['MA252']

sd = 50

data['Regime'] = np.where(data['42-252'] > sd, 1, 0)
data['Regime'] = np.where(data['42-252'] < -sd, -1, data['Regime'])

print(data['Regime'].value_counts())

We retrieve data for the S&P 500 (ticker: ^GSPC) by using pdr.get_data_yahoo('^GSPC', datetime(2000, 1, 1), datetime(2020, 1, 1)) and get 20 years of data.

As you can see, the moving averages are calculated with data['Adj Close'].rolling(42).mean().

The safety distance (sd) is set to 50. This makes sure that we do not go long or short immediately, but only when the moving averages have some distance between them.

The Regime is the short and long signal: short when -1, long when 1. If 0, do nothing.
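The regime logic itself can be illustrated on a few made-up MA differences (the values below are hypothetical):

```python
import numpy as np

# Hypothetical MA42 - MA252 differences and a safety distance of 50.
diff = np.array([80.0, 10.0, -60.0])
regime = np.where(diff > 50, 1, 0)         # long when well above zero
regime = np.where(diff < -50, -1, regime)  # short when well below zero
print(regime.tolist())  # [1, 0, -1]
```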

We can see an overview of how often we go long and short.

 1    2608
 0    1422
-1    1001

Step 2: Test the strategy

This is straightforward.

import pandas_datareader as pdr
import numpy as np
from datetime import datetime
import matplotlib.pyplot as plt


data = pdr.get_data_yahoo('^GSPC', datetime(2000, 1, 1), datetime(2020, 1, 1))

data['MA42'] = data['Adj Close'].rolling(42).mean()
data['MA252'] = data['Adj Close'].rolling(252).mean()

data['42-252'] = data['MA42'] - data['MA252']

sd = 50

data['Regime'] = np.where(data['42-252'] > sd, 1, 0)
data['Regime'] = np.where(data['42-252'] < -sd, -1, data['Regime'])

print(data['Regime'].value_counts())

data['Market'] = np.log(data['Adj Close'] / data['Adj Close'].shift(1))

data['Strategy'] = data['Regime'].shift(1) * data['Market']

data[['Market', 'Strategy']].cumsum().apply(np.exp).plot(grid=True, figsize=(8, 5))
plt.show()

To test the strategy, you need something to compare it against. We compare it to the general market (the S&P 500).

We use the log-returns, which can be accumulated with cumsum and then passed through the exponential function to recover the gross return.

The Strategy uses the Regime, but shifted one day, as we need a day to react to the price signal, hence data['Regime'].shift(1) * data['Market'].
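Why log-returns can simply be summed is easy to verify on a toy price series (the prices below are made up):

```python
import numpy as np
import pandas as pd

# Made-up prices: 100 -> 110 -> 99 -> 120.
prices = pd.Series([100.0, 110.0, 99.0, 120.0])
log_ret = np.log(prices / prices.shift(1))
# Summing log-returns and exponentiating recovers the total price ratio.
recovered = np.exp(log_ret.cumsum().iloc[-1])
print(round(recovered, 4))  # 1.2, i.e. 120 / 100
```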

Finally, we plot the Strategy against the Market.

That looks good, right?

Well, let’s look at it closer.

Step 3: Deeper analysis of the winning strategy

First a few notes on short selling in general.

As noted, the strategy utilizes short selling, which is often not advised for beginning investors. First of all, it is often more expensive, as you borrow the shares you sell.

Often there is a fee to short sell, and interest on top of that. I have seen fees that are quite high and interest of 4% p.a. This makes it less attractive and needs to be considered.

But there is more, and this is why I always encourage people to do the analysis themselves. The result above depends on the starting point.

Let’s take a different period, 2010-2020; then this picture arises.

The winning strategy does not look winning if you start in 2010.

Really, you might ask?

Try it yourself. Often these extremely good strategies, which look too good to be true, are only good under certain circumstances, like here, where it depends on the starting point.

Okay, in the past, when you started in 2000, it would have been good. But not from 2010 and forward; then it looks like a losing strategy.

What about if you start today?

If you ask me, this is speculating.

What is next?

Want to learn more?

Check out my popular e-book on backtesting strategies. It will teach you how to make better analyses, what to look for, and how to evaluate them. This tutorial only touches on a few points.

Export DataFrames to Excel with Charts in Multiple Sheets

What will we cover in this tutorial?

In this tutorial we will learn how to export Financial data from DataFrames (Pandas/Python) into an Excel sheet. It will be in multiple sheets with colored rows and charts. And of course, all automated from Python.

Step 1: Read financial data into a DataFrame

First we need some data. We will use the data from our CSV file. Feel free to use any other data. The CSV used here is available in my GitHub.

import pandas as pd
 
data = pd.read_csv("AAPL.csv", index_col=0, parse_dates=True)

Step 2: Calculate the Moving Average, MACD, and Stochastic Oscillator with DataFrames

The Moving Average is calculated as done previously.

data['MA10'] = data['Close'].rolling(10).mean()

The MACD is calculated as follows.

exp1 = data['Close'].ewm(span=12, adjust=False).mean()
exp2 = data['Close'].ewm(span=26, adjust=False).mean()
data['MACD'] = macd = exp1 - exp2
data['Signal line'] = exp3 = macd.ewm(span=9, adjust=False).mean()

The Stochastic Oscillator.

high14 = data['High'].rolling(14).max()
low14 = data['Low'].rolling(14).min()
data['%K'] = pct_k = (data['Close'] - low14)*100/(high14 - low14)
data['%D'] = pct_d = data['%K'].rolling(3).mean()

Step 3: Adjust the time period and reverse the DataFrame

Adjust to the time period we need. This is needed because the first calculations will not be available (NaN) or will be incorrect.

data = data.loc['2020-01-01':]
data = data.iloc[::-1]

Also notice, we reverse the data by .iloc[::-1]. This is just to have the most recent data on the top of our Excel sheet.
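The reversal trick can be seen on a tiny made-up DataFrame:

```python
import pandas as pd

# iloc[::-1] returns the rows in reverse order.
df = pd.DataFrame({'Close': [1, 2, 3]}, index=['a', 'b', 'c'])
print(df.iloc[::-1].index.tolist())  # ['c', 'b', 'a']
```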

Step 4: Export DataFrame to Excel with multiple sheets and charts

For generating our Excel sheet we need XlsxWriter.

If you don’t have it installed already you can install it by running this in a cell: !pip install XlsxWriter 

The code that generated the Excel sheet.

writer = pd.ExcelWriter("technical.xlsx", 
                        engine='xlsxwriter', 
                        date_format = 'yyyy-mm-dd', 
                        datetime_format='yyyy-mm-dd')
 
workbook = writer.book
 
# Create a format for a green cell
green_cell = workbook.add_format({
    'bg_color': '#C6EFCE',
    'font_color': '#006100'
})
 
# Create a format for a red cell
red_cell = workbook.add_format({
    'bg_color': '#FFC7CE',                            
    'font_color': '#9C0006'
})
 
# **
# ** MA
# **
sheet_name = 'MA10'
data[['Close', 'MA10']].to_excel(writer, sheet_name=sheet_name)
worksheet = writer.sheets[sheet_name]
 
# Set column width of Date
worksheet.set_column(0, 0, 15)
 
 
for col in range(1, 3):
    # Create a conditional formatted of type formula
    worksheet.conditional_format(1, col, len(data), col, {
        'type': 'formula',                                    
        'criteria': '=B2>=C2',
        'format': green_cell
    })
 
    # Create a conditional formatted of type formula
    worksheet.conditional_format(1, col, len(data), col, {
        'type': 'formula',                                    
        'criteria': '=B2<C2',
        'format': red_cell
    })
 
 
# Create a new chart object.
chart1 = workbook.add_chart({'type': 'line'})
 
# Add a series to the chart.
chart1.add_series({
        'name': 'AAPL',
        'categories': [sheet_name, 1, 0, len(data), 0],
        'values': [sheet_name, 1, 1, len(data), 1],
})
 
# Create a new chart object.
chart2 = workbook.add_chart({'type': 'line'})
 
# Add a series to the chart.
chart2.add_series({
        'name': sheet_name,
        'categories': [sheet_name, 1, 0, len(data), 0],
        'values': [sheet_name, 1, 2, len(data), 2],
})
 
# Combine and insert title, axis names
chart1.combine(chart2)
chart1.set_title({'name': sheet_name + " AAPL"})
chart1.set_x_axis({'name': 'Date'})
chart1.set_y_axis({'name': 'Price'})
 
# Insert the chart into the worksheet.
worksheet.insert_chart('E2', chart1)
 
 
# **
# ** MACD
# **
 
sheet_name = 'MACD'
data[['Close', 'MACD', 'Signal line']].to_excel(writer, sheet_name=sheet_name)
worksheet = writer.sheets[sheet_name]
 
# Set column width of Date
worksheet.set_column(0, 0, 15)
 
for col in range(1, 4):
    # Create a conditional formatted of type formula
    worksheet.conditional_format(1, col, len(data), col, {
        'type': 'formula',                                    
        'criteria': '=C2>=D2',
        'format': green_cell
    })
 
    # Create a conditional formatted of type formula
    worksheet.conditional_format(1, col, len(data), col, {
        'type': 'formula',                                    
        'criteria': '=C2<D2',
        'format': red_cell
    })
 
# Create a new chart object.
chart1 = workbook.add_chart({'type': 'line'})
 
# Add a series to the chart.
chart1.add_series({
        'name': 'MACD',
        'categories': [sheet_name, 1, 0, len(data), 0],
        'values': [sheet_name, 1, 2, len(data), 2],
})
 
# Create a new chart object.
chart2 = workbook.add_chart({'type': 'line'})
 
# Add a series to the chart.
chart2.add_series({
        'name': 'Signal line',
        'categories': [sheet_name, 1, 0, len(data), 0],
        'values': [sheet_name, 1, 3, len(data), 3],
})
 
# Combine and insert title, axis names
chart1.combine(chart2)
chart1.set_title({'name': sheet_name + " AAPL"})
chart1.set_x_axis({'name': 'Date'})
chart1.set_y_axis({'name': 'Value'})
 
# To set the labels on x axis not on 0
chart1.set_x_axis({
    'label_position': 'low',
    'num_font':  {'rotation': 45}
})
 
# Insert the chart into the worksheet.
worksheet.insert_chart('F2', chart1)
 
 
# **
# ** Stochastic
# **
 
sheet_name = 'Stochastic'
data[['Close', '%K', '%D']].to_excel(writer, sheet_name=sheet_name)
worksheet = writer.sheets[sheet_name]
 
# Set column width of Date
worksheet.set_column(0, 0, 15)
 
for col in range(1, 4):
    # Create a conditional formatted of type formula
    worksheet.conditional_format(1, col, len(data), col, {
        'type': 'formula',                                    
        'criteria': '=C2>=D2',
        'format': green_cell
    })
 
    # Create a conditional formatted of type formula
    worksheet.conditional_format(1, col, len(data), col, {
        'type': 'formula',                                    
        'criteria': '=C2<D2',
        'format': red_cell
    })
 
 
# Create a new chart object.
chart1 = workbook.add_chart({'type': 'line'})
 
# Add a series to the chart.
chart1.add_series({
        'name': '%K',
        'categories': [sheet_name, 1, 0, len(data), 0],
        'values': [sheet_name, 1, 2, len(data), 2],
})
 
# Create a new chart object.
chart2 = workbook.add_chart({'type': 'line'})
 
# Add a series to the chart.
chart2.add_series({
        'name': '%D',
        'categories': [sheet_name, 1, 0, len(data), 0],
        'values': [sheet_name, 1, 3, len(data), 3],
})
 
# Combine and insert title, axis names
chart1.combine(chart2)
chart1.set_title({'name': sheet_name + " AAPL"})
chart1.set_x_axis({'name': 'Date'})
chart1.set_y_axis({'name': 'Value'})
 
# To set the labels on x axis not on 0
chart1.set_x_axis({
    'label_position': 'low',
    'num_font':  {'rotation': 45}
})
 
# Insert the chart into the worksheet.
worksheet.insert_chart('F2', chart1)
 
# End of sheets
 
 
# Close
writer.close()

For a walkthrough of the code, please see the video to this lesson.

This will generate an Excel file called technical.xlsx. It will contain 3 sheets (MA10, MACD, Stochastic).

A sheet will look similar to this.

Next steps?

Want to learn more?

This is part of the FREE online course on my page. No signup required and 2 hours of free video content with code and Jupyter Notebooks available on GitHub.

Follow the link and read more.

Stochastic Oscillator with Pandas DataFrames

What will we cover?

In this tutorial we will show how to calculate the Stochastic Oscillator with Pandas DataFrames.

Step 1: Retrieve the Data from CSV file with Pandas DataFrames

We first need to read the data from a CSV file into a DataFrame. You can get the CSV file from here or directly from Yahoo! Finance.

Alternatively, you can use Pandas Datareader as described in this tutorial.

import pandas as pd
import matplotlib.pyplot as plt
%matplotlib notebook
 
data = pd.read_csv("AAPL.csv", index_col=0, parse_dates=True)

Step 2: Calculate the Stochastic Oscillator with Pandas DataFrames

The Stochastic Oscillator is defined as follows.

  • 14-high: Maximum of last 14 trading days
  • 14-low: Minimum of last 14 trading days
  • %K: (Last Close – 14-low)*100 / (14-high – 14-low)
  • %D: Simple Moving Average of %K

That can be done as follows.

high14 = data['High'].rolling(14).max()
low14 = data['Low'].rolling(14).min()
data['%K'] = (data['Close'] - low14)*100/(high14 - low14)
data['%D'] = data['%K'].rolling(3).mean()

Notice, we only keep the %K and %D. The high14 and low14 are temporary variables to make our calculations easier to read.
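A quick sanity check of the %K formula on made-up numbers:

```python
# Hypothetical values: last close 9, 14-day low 5, 14-day high 15.
close, low14, high14 = 9.0, 5.0, 15.0
pct_k = (close - low14) * 100 / (high14 - low14)
print(pct_k)  # 40.0, i.e. the close sits 40% up the 14-day range
```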

Step 3: Visualize the Stochastic Oscillator with Matplotlib

To visualize it.

fig, ax = plt.subplots()
data[['%K', '%D']].loc['2020-11-01':].plot(ax=ax)
ax.axhline(80, c='r', alpha=0.3)
ax.axhline(20, c='r', alpha=0.3)
data['Close'].loc['2020-11-01':].plot(ax=ax, alpha=0.3, secondary_y=True)

Resulting in the following.

Next Steps?

Want to learn more?

This is part of the FREE online course on my page. No signup required and 2 hours of free video content with code and Jupyter Notebooks available on GitHub.

Follow the link and read more.

Calculate MACD with Pandas DataFrames

What will we cover?

In this tutorial we will calculate and visualize the MACD for a stock price.

Step 1: Retrieve stock prices into a DataFrame (Pandas)

Let’s get started. You can get the CSV file from here or get your own from Yahoo! Finance.

import pandas as pd
import matplotlib.pyplot as plt
%matplotlib notebook
 
data = pd.read_csv("AAPL.csv", index_col=0, parse_dates=True)

Step 2: Calculate the MACD indicator with Pandas DataFrame

First we want to calculate the MACD.

The calculation (12-26-9 MACD (default)) is defined as follows.

  • MACD: 12-Period EMA − 26-Period EMA
  • Signal line: 9-Period EMA of the MACD

Where EMA is the Exponential Moving Average we learned about in the last lesson.

exp1 = data['Close'].ewm(span=12, adjust=False).mean()
exp2 = data['Close'].ewm(span=26, adjust=False).mean()
data['MACD'] = exp1 - exp2
data['Signal line'] = data['MACD'].ewm(span=9, adjust=False).mean()

Now that was simple, right?
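If you want to see what ewm(span=..., adjust=False) actually computes, a tiny example helps. With span=3 the smoothing factor is alpha = 2/(span+1) = 0.5, and each value is alpha * x + (1 - alpha) * previous:

```python
import pandas as pd

# EMA with span=3 on a tiny series (alpha = 0.5 when adjust=False).
s = pd.Series([1.0, 2.0, 3.0])
ema = s.ewm(span=3, adjust=False).mean()
print(ema.tolist())  # [1.0, 1.5, 2.25]
```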

Step 3: Visualize the MACD with matplotlib

To visualize it you can use the following with Matplotlib.

fig, ax = plt.subplots()
data[['MACD', 'Signal line']].plot(ax=ax)
data['Close'].plot(ax=ax, alpha=0.25, secondary_y=True)

Resulting in an output similar to this one.

Next Steps?

Want to learn more?

This is part of the FREE online course on my page. No signup required and 2 hours of free video content with code and Jupyter Notebooks available on GitHub.

Follow the link and read more.

DataFrame Calculations for Financial Analysis

What will we cover?

In this lesson we will learn how to add new columns calculated from values in other columns in our DataFrame. This is similar to calculating on data from different columns in Excel.

Then we will demonstrate some useful functions when working with financial data.

Finally, we will show how to remove (or drop) columns from our DataFrame.

Step 1: Load the data.

As usual we need to load the data into our DataFrame. You can get the CSV file from here.

import pandas as pd
data = pd.read_csv("AAPL.csv", index_col=0, parse_dates=True)

It is always a good habit to inspect the data with data.head() (see the video lesson or the Notebook linked below the video for the expected output).

Step 2: Create new columns in a DataFrame (Pandas)

To create a new column in our data set simply write as follows.

data['Daily chg'] = data['Close'] - data['Open']

The above statement will create a new column named Daily chg with the difference between column Close and Open.

Similarly, you can create a column with the normalized data as follows.

data['Normalized'] = data['Close'] / data['Close'].iloc[0]

This is how easy it is to work with.
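A small made-up example shows what the normalization does:

```python
import pandas as pd

# Hypothetical closing prices.
close = pd.Series([50.0, 55.0, 60.0])
normalized = close / close.iloc[0]
print(normalized.tolist())  # [1.0, 1.1, 1.2]
```

Every value is now expressed relative to the first closing price, which makes different stocks directly comparable.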

Step 3: Get min and max in DataFrame columns

To find the minimum of a column.

data['Close'].min()

This will find the minimal value of the column Close.

To find the index of the minimum value use the following.

data['Close'].argmin()

You can do similar things as the following shows.

data['Normalized'].min()
data['Normalized'].argmin()
data['Close'].max()
data['Close'].argmax()
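Note that argmin() returns the position of the minimum, not its label. On a small made-up Series:

```python
import pandas as pd

s = pd.Series([3.0, 1.0, 2.0], index=['a', 'b', 'c'])
print(s.min())     # 1.0
print(s.argmin())  # 1 -- the position of the minimum
print(s.idxmin())  # b -- the index label of the minimum
```

If you need the date of the minimum in a time-indexed DataFrame, idxmin() gives it directly.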

Step 4: Get the mean value of a column in a DataFrame (Pandas)

To get the mean value of a column, simply use mean().

data['Close'].mean()

Step 5: Remove / Delete columns in a DataFrame

It is always good practice to remove the columns of data we do not intend to use anymore. This can be done by using drop().

data.drop(labels=['High', 'Low', 'Adj Close', 'Volume'], axis=1, inplace=True)

Where we use the following arguments.

  • labels=['High', 'Low', 'Adj Close', 'Volume'] sets the labels of the columns we want to remove.
  • axis=1 sets the axis of the labels. The default is 0, which looks for the labels in the index, while axis 1 refers to the column names.
  • inplace=True makes it remove the columns from the DataFrame we work on. Otherwise, it returns a new DataFrame without the columns.
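The difference between inplace=True and the default can be seen on a tiny made-up DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'A': [1], 'B': [2], 'C': [3]})
# Without inplace=True, drop returns a new DataFrame.
smaller = df.drop(labels=['B', 'C'], axis=1)
print(list(smaller.columns))  # ['A']
print(list(df.columns))       # ['A', 'B', 'C'] -- original unchanged
```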

What is next?

Want to learn more?

This is part of the FREE online course on my page. No signup required and 2 hours of free video content with code and Jupyter Notebooks available on GitHub.

Follow the link and read more.

DataFrame Columns and Series for Financial Analysis

What will we cover?

In the first lesson we learned how to load data into a DataFrame. This lesson will show how to work with each column in the DataFrame.

In this lesson we will learn how to make calculations on the columns. The columns are represented by a data type called Series.

Each column in a DataFrame is a Series and can be easily accessed. It is also easy to calculate new Series of data, similar to calculating new columns of data in an Excel sheet.

We will explore that and more in this lesson.

Step 1: Load the data

We will start by importing the data (CSV file available here).

import pandas as pd
data = pd.read_csv("AAPL.csv", index_col=0, parse_dates=True)

Step 2: Explore the data and data type

In the video we explore the data to ensure it is correct. You can do that by using data.head().

Then we investigate the data type of the columns of the DataFrame data.

data.dtypes

Which results in the following.

Open         float64
High         float64
Low          float64
Close        float64
Adj Close    float64
Volume         int64
dtype: object

This shows that each column has one data type. Here, Open is float64. This is one difference from Excel sheets, where each cell can have its own data type. The advantage of restricting a column to one data type is speed.

The data type of data is DataFrame.

type(data)

The built-in function type(…) gives you the type. It is handy to use when exploring data.

pandas.core.frame.DataFrame

Notice that it is given by the long string pandas.core.frame.DataFrame; this reflects the module structure of the Pandas library.

The data type of a column in a DataFrame can be found as follows.

type(data['Close'])

Where data['Close'] gives access to the column Close in the DataFrame data.

pandas.core.series.Series

Where we see a column is represented as a Series. A Series is similar to a DataFrame in that it has an index. E.g., the Series data['Close'] has the same index as the DataFrame data. This is handy when you need to work with the data, as you will see in a moment.

Step 3: Calculating with Series

To keep it simple, we will start with the daily difference between open and close.

daily_chg = data['Open'] - data['Close']

This calculates a Series daily_chg with the opening price minus the closing price. 

Please explore the full data in daily_chg alongside the data in data.

A more advanced calculation is this one.

daily_pct_chg = (data['Close'] - data['Open'])/data['Open']*100

Where we calculate the daily percentage change. In the calculation above we have limited ourselves to data on the same rows (same dates). Later we will learn how to use data from the previous day (the row above).
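Checking the percentage formula on made-up open and close prices:

```python
import pandas as pd

# Hypothetical open/close prices for two days.
open_ = pd.Series([100.0, 200.0])
close = pd.Series([110.0, 190.0])
daily_pct_chg = (close - open_) / open_ * 100
print(daily_pct_chg.tolist())  # [10.0, -5.0]
```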

Step 4: Normalize stock data

Now we will normalize the data by using the iloc we learned about in previous lesson.

norm = data['Close']/data['Close'].iloc[0]

The above statement calculates a Series norm where the Close price is normalized by dividing by the first available Close price, accessed by using iloc[0].

This means norm.iloc[0] will be 1.0000, while norm.iloc[-1] shows the return of this particular stock if bought on day 1 (index 0) and sold on the day of the last index (index -1); in the case of the video: 1.839521.

Next step?

Want to learn more?

This is part of the FREE online course on my page. No signup required and 2 hours of free video content with code and Jupyter Notebooks available on GitHub.

Follow the link and read more.

Pandas for Financial Stock Analysis

What will we cover?

In this tutorial we will get familiar with DataFrames – the primary data structure in Pandas.

We will learn how to read historical stock price data from Yahoo! Finance and load it into a DataFrame. This will be done by exporting a CSV file from Yahoo! Finance and loading the data. Later we will learn how to read the data directly from the Yahoo! Finance API.

A DataFrame is similar to an Excel sheet; it can contain data in a similar way, as we will see in this lesson.

Then we will learn how to use the dates as the index. This will be necessary when we make calculations later on.

The first part of the tutorial will give the foundation of what you need to know about DataFrames for financial analysis.

Step 1: Read the stock prices from Yahoo! Finance as CSV

In this first lesson we will download historical stock prices from Yahoo! Finance as CSV file and import them into our Jupyter notebook environment in a DataFrame.

If you are new to CSV files and DataFrames. Don’t worry, that is what we will cover here.

Let’s start by going to Yahoo! Finance and downloading the CSV file. In this course we have used Apple, but feel free to make similar calculations on a stock of your choice.

Go to Yahoo! Finance, search for AAPL (the ticker for Apple), press Historical Data, and download the CSV data file.

The CSV data file will contain Comma Separated Values (CSV) similar to this.

Date,Open,High,Low,Close,Adj Close,Volume
2020-03-02,70.570000,75.360001,69.430000,74.702499,74.127892,341397200
2020-03-03,75.917503,76.000000,71.449997,72.330002,71.773636,319475600
2020-03-04,74.110001,75.849998,73.282501,75.684998,75.102829,219178400
2020-03-05,73.879997,74.887497,72.852501,73.230003,72.666725,187572800
2020-03-06,70.500000,72.705002,70.307503,72.257500,71.701706,226176800

The first line shows the column names (Date, Open, High, Low, Close, Adj Close, Volume). Then each line contains a data entry for a given day.

Step 2: Read the stock prices from CSV to Pandas DataFrame

In your Jupyter Notebook, start by importing the Pandas library. This is needed in order to load the data into a DataFrame.

import pandas as pd
data = pd.read_csv("AAPL.csv", index_col=0, parse_dates=True)
data.head()

The read_csv(…) does all the magic for us. It will read the CSV file AAPL.csv. The AAPL.csv file is the one you downloaded from Yahoo! Finance (or from the zip-file downloaded above) and needs to be located in the same folder you are working from in your Jupyter notebook.

The arguments in read_csv(…) are the following.

  • index_col=0 this sets the first column of the CSV file to be the index. In this case, it is the Date column.
  • parse_dates=True this ensures that dates in the CSV file are interpreted as dates. This is important if you want to take advantage of the index being a datetime.

Step 3: Explore data types of columns and index

In the video lesson we explore the type of columns and index.

data.dtypes
data.index

These will reveal the data types of the columns and the index of the DataFrame. Notice that each column has its own data type.
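You can see this on a small made-up DataFrame mimicking two of the stock data columns (the values here are illustrative, not from the real file):

```python
import pandas as pd

# Hypothetical mini-DataFrame with the same kinds of columns as the stock data
data = pd.DataFrame({"Close": [74.70, 72.33],
                     "Volume": [341397200, 319475600]})

print(data.dtypes)
# Close     float64  (prices are floating point)
# Volume      int64  (share counts are integers)
```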

Step 4: Indexing and slicing with DataFrames

We can use loc to lookup an index with a date.

data.loc['2020-01-27']

This will show the data for that specific date. If you get an error it might be because your dataset does not contain the above date. Choose another one to see something similar to this.

Open         7.751500e+01
High         7.794250e+01
Low          7.622000e+01
Close        7.723750e+01
Adj Close    7.657619e+01
Volume       1.619400e+08
Name: 2020-01-27 00:00:00, dtype: float64

A more advanced option is to use an interval (or slice as it is called). Slicing with loc on a DataFrame is done by using a starting and ending index .loc[start:end] or an open ended index .loc[start:], which will take data beginning from start to the last data.

data.loc['2021-01-01':]

This will give all the data starting from 2021-01-01. Notice that there is no data on January 1st, but since the index is interpreted as a datetime, it can figure out the first date after.

            Open        High        Low         Close    Adj Close       Volume
Date                        
2021-01-04  133.520004  133.610001  126.760002  129.410004  129.410004  143301900
2021-01-05  128.889999  131.740005  128.429993  131.009995  131.009995  97664900
2021-01-06  127.720001  131.050003  126.379997  126.599998  126.599998  155088000
2021-01-07  128.360001  131.630005  127.860001  130.919998  130.919998  109578200
2021-01-08  132.429993  132.630005  130.229996  132.050003  132.050003  105158200
2021-01-11  129.190002  130.169998  128.500000  128.979996  128.979996  100620900

Similarly, you can slice with an open-ended start, which takes all data up to and including the end date.

data.loc[:'2020-07-01']

Another important way to index into DataFrames is with iloc[], which uses the integer position instead of the label.

data.iloc[0]
data.iloc[-1]

Where you can index from the start with index 0, 1, 2, 3, … Or from the end -1, -2, -3, -4, …
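Both lookups can be tried on a small hand-made price series (the values are made up for illustration) with a weekend gap in the DatetimeIndex:

```python
import pandas as pd

# Hypothetical mini price series with a weekend gap in the index
idx = pd.to_datetime(["2021-01-04", "2021-01-05", "2021-01-06",
                      "2021-01-07", "2021-01-08", "2021-01-11"])
data = pd.DataFrame({"Close": [129.41, 131.01, 126.60,
                               130.92, 132.05, 128.98]}, index=idx)

# loc slicing by label: 2021-01-01 is not in the index, so pandas
# starts from the first available date after it (2021-01-04)
subset = data.loc["2021-01-01":"2021-01-07"]
print(len(subset))  # 4 rows: Jan 4, 5, 6 and 7

# iloc by integer position: 0 is the first row, -1 the last
first_row = data.iloc[0]
last_row = data.iloc[-1]
```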

What is next?

Want to learn more?

This is part of the FREE online course on my page. No signup required and 2 hours of free video content with code and Jupyter Notebooks available on GitHub.

Follow the link and read more.

Calculate the Volatility of Historic Stock Prices with Pandas and Python

What will we cover in this tutorial?

We will calculate the volatility of historic stock prices with Python library Pandas.

Step 1: Read Historic Stock Prices with Pandas Datareader

We will use Pandas Datareader to read some historic stock prices. See this tutorial for details.

import pandas_datareader as pdr
import datetime as dt

ticker = "AAPL"
start = dt.datetime(2019, 1, 1)
end = dt.datetime(2020, 12, 31)

data = pdr.get_data_yahoo(ticker, start, end)

print(data.head())

Resulting in this.

                 High        Low       Open      Close       Volume  Adj Close
Date                                                                          
2019-01-02  39.712502  38.557499  38.722500  39.480000  148158800.0  38.505024
2019-01-03  36.430000  35.500000  35.994999  35.547501  365248800.0  34.669640
2019-01-04  37.137501  35.950001  36.132500  37.064999  234428400.0  36.149662
2019-01-07  37.207500  36.474998  37.174999  36.982498  219111200.0  36.069202
2019-01-08  37.955002  37.130001  37.389999  37.687500  164101200.0  36.756794

Step 2: Calculate the Volatility of an Asset

Let’s explore the difference between daily simple returns and daily log returns. Briefly explained, log returns have the advantage that they can be added together, while this is not the case for simple returns. Therefore log returns are used in most financial analysis.

To calculate the daily log returns we need the NumPy library. For the purpose here, we will not explore the depths of NumPy; all we need is to apply the log function to a full column of our DataFrame (see my other FREE course for more details on NumPy).

import numpy as np

data['Log returns'] = np.log(data['Close']/data['Close'].shift())

This creates a column called Log returns with the daily log return of the Close price.
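The additivity claim above can be checked on a tiny hypothetical price series (the prices are made up for illustration): summing the daily log returns gives exactly the log return over the whole period.

```python
import numpy as np
import pandas as pd

# Hypothetical closing prices over four days
close = pd.Series([100.0, 102.0, 99.0, 105.0])
log_returns = np.log(close / close.shift())

# Log returns are additive: the sum of daily log returns equals the
# log return over the whole period (simple returns do not add up like this)
total = np.log(close.iloc[-1] / close.iloc[0])
print(abs(log_returns.sum() - total) < 1e-12)  # True
```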

We need the standard deviation for the volatility of the stock.

This can be calculated from our Log returns as follows.

data['Log returns'].std()

The above gives the daily standard deviation. The volatility is defined as the annualized standard deviation, so we multiply the daily standard deviation by the square root of 252, the typical number of trading days in a year.

volatility = data['Log returns'].std()*252**.5

Notice that **.5 raises to the power of 1/2, which is the same as taking the square root.
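The annualization step can be sketched with synthetic data instead of the downloaded prices (the returns below are randomly generated with roughly 1% daily standard deviation, purely for illustration):

```python
import numpy as np
import pandas as pd

# Synthetic daily log returns with ~1% daily standard deviation,
# seeded so the result is reproducible (real data comes from the steps above)
rng = np.random.default_rng(42)
log_returns = pd.Series(rng.normal(0.0, 0.01, 252))

daily_std = log_returns.std()
volatility = daily_std * 252 ** .5  # annualize with sqrt(252 trading days)
```

With a 1% daily standard deviation this lands around 16% annualized, which is a plausible order of magnitude for a large-cap stock.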

Step 3: Visualize the Volatility of Historic Stock Prices

This can be visualized with Matplotlib.

import matplotlib.pyplot as plt

str_vol = str(round(volatility*100, 2))

fig, ax = plt.subplots()
data['Log returns'].hist(ax=ax, bins=50, alpha=0.6, color='b')
ax.set_xlabel("Log return")
ax.set_ylabel("Freq of log return")
ax.set_title("AAPL volatility: " + str_vol + "%")
plt.show()

Resulting in a histogram of the daily log returns with the annualized volatility shown in the title.
