Pandas: How to Sum Groups from HTML Tables

What will we cover in this tutorial?

  • How to collect data from an HTML table into a Pandas DataFrame.
  • The cleaning process and how to convert the data into the correct type.
  • Also, how to deal with data points that are not represented correctly.
  • Finally, how to sum up by countries.

Step 1: Collect the data from the table

Pandas is an amazing library with a lot of useful data analysis functionality right out of the box. The first step in any data analysis is to collect the data. In this tutorial we will collect the data from Wikipedia's page on the List of metro systems.

If you are new to the pandas library, we recommend you read this tutorial.

The objective is to find the sums of Stations, System length, and Annual ridership for each country.

From wikipedia.org

At first glance this looks simple, but looking further down we see that some countries have multiple rows.

From wikipedia.org

Also, some rows do not have all the values needed.

First challenge first: read the data from the table into a DataFrame, the main data structure of the pandas library. The read_html call from pandas returns a list of DataFrames.

If you use read_html for the first time, we recommend you read this tutorial.

import pandas as pd

url = 'https://en.wikipedia.org/wiki/List_of_metro_systems'
tables = pd.read_html(url)
table = tables[0]
print(table)

Which results in the following output (or the top of it).

                 City               Country                                Name        Yearopened Year of lastexpansion             Stations                       System length             Annual ridership(millions)
0             Algiers               Algeria                       Algiers Metro          2011[13]              2018[14]               19[14]               18.5 km (11.5 mi)[15]                       45.3 (2019)[R 1]
1        Buenos Aires             Argentina            Buenos Aires Underground        1926[Nb 1]              2019[16]               90[17]               56.7 km (35.2 mi)[17]                      337.7 (2018)[R 2]
2             Yerevan               Armenia                       Yerevan Metro          1981[18]              1996[19]               10[18]                13.4 km (8.3 mi)[18]                       18.7 (2018)[R 3]
3              Sydney             Australia                        Sydney Metro          2019[20]                     –               13[20]               36 km (22 mi)[20][21]              14.2 (2019) [R 4][R Nb 1]
4              Vienna               Austria                       Vienna U-Bahn    1976[22][Nb 2]              2017[23]               98[24]               83.3 km (51.8 mi)[22]                      463.1 (2018)[R 6]
5                Baku            Azerbaijan                          Baku Metro          1967[25]              2016[25]               25[25]               36.6 km (22.7 mi)[25]                      231.0 (2018)[R 3]

We now have the data in a DataFrame.

Step 2: Clean and convert the data

At first glance, we see that we do not need the columns City, Name, Yearopened, and Year of lastexpansion. To make it easier to work with the data, let's remove them and inspect the data again.

import pandas as pd

url = 'https://en.wikipedia.org/wiki/List_of_metro_systems'
tables = pd.read_html(url)
table = tables[0]
table = table.drop(['City', 'Name', 'Yearopened', 'Year of lastexpansion'], axis=1)
print(table)

Which results in the following output.

                  Country             Stations                       System length             Annual ridership(millions)
0                 Algeria               19[14]               18.5 km (11.5 mi)[15]                       45.3 (2019)[R 1]
1               Argentina               90[17]               56.7 km (35.2 mi)[17]                      337.7 (2018)[R 2]
2                 Armenia               10[18]                13.4 km (8.3 mi)[18]                       18.7 (2018)[R 3]
3               Australia               13[20]               36 km (22 mi)[20][21]              14.2 (2019) [R 4][R Nb 1]
4                 Austria               98[24]               83.3 km (51.8 mi)[22]                      463.1 (2018)[R 6]
5              Azerbaijan               25[25]               36.6 km (22.7 mi)[25]                      231.0 (2018)[R 3]
6                 Belarus               29[27]               37.3 km (23.2 mi)[27]                      283.4 (2018)[R 3]
7                 Belgium         59[28][Nb 5]               39.9 km (24.8 mi)[29]                      165.3 (2019)[R 7]

This makes it easier to see the next steps.

Let’s take them one by one. For Stations we need to remove the data after the ‘[‘ symbol and convert the number to an integer. This can be done by applying a lambda function to each row.

table['Stations'] = table.apply(lambda row: int(row['Stations'].split('[')[0]), axis=1)

If you are new to lambda functions we recommend you read this tutorial.

The next thing we need to do is convert the System length to floats. The length will be in km (I live in Denmark, where we use km and not mi). This can also be done by using a lambda function.

table['System length'] = table.apply(lambda row: float(row['System length'].split()[0]), axis=1)

Finally, and a bit trickier, we need to convert the Annual ridership column. The challenge is that some lines have n/a, which is converted to np.nan, but there are also some lines where the input is not easy to convert, as the images show.

From wikipedia.org
From wikipedia.org

These lines can be dealt with by using a helper function.

import numpy as np

def to_float(obj):
    try:
        return float(obj)
    except:
        return np.nan

index = 'Annual ridership(millions)'
table[index] = table.apply(lambda row: to_float(row[index].split()[0]) if row[index] is not np.nan else np.nan, axis=1)

Adding this all together we get the following code.

import pandas as pd
import numpy as np

def to_float(obj):
    try:
        return float(obj)
    except:
        return np.nan

url = 'https://en.wikipedia.org/wiki/List_of_metro_systems'
tables = pd.read_html(url)
table = tables[0]
table = table.drop(['City', 'Name', 'Yearopened', 'Year of lastexpansion'], axis=1)

table['Stations'] = table.apply(lambda row: int(row['Stations'].split('[')[0]), axis=1)
table['System length'] = table.apply(lambda row: float(row['System length'].split()[0]), axis=1)
index = 'Annual ridership(millions)'
table[index] = table.apply(lambda row: to_float(row[index].split()[0]) if row[index] is not np.nan else np.nan, axis=1)

print(table)

Which results in the following output (or the first few lines).

                  Country  Stations  System length  Annual ridership(millions)
0                 Algeria        19          18.50                       45.30
1               Argentina        90          56.70                      337.70
2                 Armenia        10          13.40                       18.70
3               Australia        13          36.00                       14.20
4                 Austria        98          83.30                      463.10
5              Azerbaijan        25          36.60                      231.00
6                 Belarus        29          37.30                      283.40
7                 Belgium        59          39.90                      165.30
8                  Brazil        19          28.10                       58.40
9                  Brazil        25          42.40                       42.80
10                 Brazil        22          43.80                       51.70
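Before summing, it can be worth confirming that the conversions actually produced numeric columns. A minimal check, assuming the table variable from the code above:

# Confirm that Stations is int and the other cleaned columns are float
print(table.dtypes)
# Count how many rows are missing an Annual ridership value
print(table['Annual ridership(millions)'].isna().sum())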

Step 3: Sum rows by country

Say we now want to get the country with the most metro stations. This can be achieved by using the groupby and sum functions of the pandas DataFrame.

import pandas as pd
import numpy as np

def to_float(obj):
    try:
        return float(obj)
    except:
        return np.nan

url = 'https://en.wikipedia.org/wiki/List_of_metro_systems'
tables = pd.read_html(url)
table = tables[0]
table = table.drop(['City', 'Name', 'Yearopened', 'Year of lastexpansion'], axis=1)

table['Stations'] = table.apply(lambda row: int(row['Stations'].split('[')[0]), axis=1)
table['System length'] = table.apply(lambda row: float(row['System length'].split()[0]), axis=1)
index = 'Annual ridership(millions)'
table[index] = table.apply(lambda row: to_float(row[index].split()[0]) if row[index] is not np.nan else np.nan, axis=1)

# Sum up
table_sum = table.groupby(['Country']).sum()

print(table_sum.sort_values(['Stations'], ascending=False))

Where the result will be China.

                      Stations  System length  Annual ridership(millions)
Country                                                                  
China                     3738        6312.16                    25519.23
United States             1005        1325.90                     2771.50
South Korea                714         839.90                     4054.90
Japan[Nb 34]               669         791.20                     6489.60
India                      499         675.97                     1377.00
France                     483         350.90                     2113.50
Spain                      438         474.40                     1197.90

If you want to sort by System length (in km), you only need to change the last line to the following.

print(table_sum.sort_values(['System length'], ascending=False))

Resulting in the following.

                      Stations  System length  Annual ridership(millions)
Country                                                                  
China                     3738        6312.16                    25519.23
United States             1005        1325.90                     2771.50
South Korea                714         839.90                     4054.90
Japan[Nb 34]               669         791.20                     6489.60
India                      499         675.97                     1377.00
Russia                     368         611.50                     3507.60
United Kingdom             390         523.90                     1555.30

Finally, if you want to sort by Annual ridership, change the last line to the following.

print(table_sum.sort_values([index], ascending=False))

Remember, we assigned that column name to the variable index. You should get the following output.

                      Stations  System length  Annual ridership(millions)
Country                                                                  
China                     3738        6312.16                    25519.23
Japan[Nb 34]               669         791.20                     6489.60
South Korea                714         839.90                     4054.90
Russia                     368         611.50                     3507.60
United States             1005        1325.90                     2771.50
France                     483         350.90                     2113.50
Brazil                     243         345.40                     2106.20
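If you only need the single top country rather than the full ranking, idxmax gives it directly. A minimal sketch, assuming the table_sum variable from the code above:

# Country with the most stations, the longest system, and the highest ridership
print(table_sum['Stations'].idxmax())
print(table_sum['System length'].idxmax())
print(table_sum['Annual ridership(millions)'].idxmax())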

Master Data Correlation with Pandas DataFrame in 3 Easy Steps

What will we cover in this tutorial?

  • How to get data using Pandas DataFrames.
  • Clean the data and merge it together.
  • Finally, how to see if there is any correlation between data columns.

Step 1: Get the data you want to correlate

As an example, let’s assume you get the idea that there might be a correlation between GDP per capita, Social Progress Index (SPI), and Human Development Index (HDI), but are not sure whether SPI or HDI is most closely correlated with GDP per capita.

Luckily, you have pandas to the rescue.

As the data is on three pages, you need to collect it separately and merge it later. First, let us collect the data and inspect it.

The GDP per capita data is located in the table on Wikipedia shown in the picture below.

From wikipedia.org

This is actually three tables. We will use the World Bank table in our example. It can be collected with a call to pandas read_html. If you are new to read_html, we recommend you read this tutorial.

import pandas as pd

# The URL we will read our data from
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)_per_capita'
# read_html returns a list of tables from the URL
tables = pd.read_html(url)

table = tables[3]
print(table)

Which will give an output similar to this.

    Rank                       Country/Territory     US$
0      1                           Monaco (2018)  185741
1      2                    Liechtenstein (2017)  173356
2      3                              Luxembourg  114705
3      —                                   Macau   84096
4      4                             Switzerland   81994
5      5                                 Ireland   78661
6      6                                  Norway   75420
7      7                                 Iceland   66945

The next table we need is the Social Progress Index (SPI), which looks like the picture below.

From wikipedia.org

This can be collected into a DataFrame with the following code.

import pandas as pd

url = 'https://en.wikipedia.org/wiki/Social_Progress_Index'
tables = pd.read_html(url)
print(tables[1])

Which will print the following to the screen (or the top of it).

                          Country 2019[9]               
                          Country    Rank  Score Score.1
0                          Norway       1  90.95     NaN
1                         Denmark       2  90.09     NaN
2                     Switzerland       3  89.89     NaN
3                         Finland       4  89.56     NaN
4                          Sweden       5  89.45     NaN

Finally, we need to read the Human Development Index (HDI), which can be seen on Wikipedia as the following picture shows.

From wikipedia.org

And can be collected with the following code.

import pandas as pd

url = 'https://en.wikipedia.org/wiki/List_of_countries_by_Human_Development_Index'
tables = pd.read_html(url)

print(tables[1])

Resulting in the following output.

                           Rank                                             Country or Territory                         HDI                                                    Unnamed: 5_level_0
    2018 data (2019 report)​[5] Change from previous year​[5]               Country or Territory 2018 data (2019 report)​[5] Average annual HDI growth (2010–2018)​[6]          Unnamed: 5_level_1
    Very high human development   Very high human development        Very high human development Very high human development               Very high human development Very high human development
0                             1                           NaN                             Norway                       0.954                                     0.16%                         NaN
1                             2                           NaN                        Switzerland                       0.946                                     0.18%                         NaN
2                             3                           NaN                            Ireland                       0.942                                     0.71%                         NaN
3                             4                           NaN                            Germany                       0.939                                     0.25%                         NaN
4                             4                           (2)                          Hong Kong                       0.939                                     0.51%                         NaN
5                             6                           (1)                          Australia                       0.938                                     0.17%                         NaN

A somewhat messier data table.

Now that we have gathered all the data, we need to clean it up and merge it together.

Step 2: Clean and merge the data into one DataFrame

Let us first inspect the GDP per capita data.

    Rank                       Country/Territory     US$
0      1                           Monaco (2018)  185741
1      2                    Liechtenstein (2017)  173356
2      3                              Luxembourg  114705
3      —                                   Macau   84096
4      4                             Switzerland   81994
5      5                                 Ireland   78661
6      6                                  Norway   75420
7      7                                 Iceland   66945

Notice that Country/Territory can have a year in parentheses, which will make it hard to merge. We need to clean that up. Also, we do not need the Rank column.
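As a quick illustration of the cleanup we will do below (a minimal sketch, not the final code), splitting on ' (' removes the trailing year:

# Stripping the year in parentheses from a country/territory name
name = 'Monaco (2018)'
print(name.split(' (')[0])  # Monaco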

If we inspect the data of SPI.

                          Country 2019[9]               
                          Country    Rank  Score Score.1
0                          Norway       1  90.95     NaN
1                         Denmark       2  90.09     NaN
2                     Switzerland       3  89.89     NaN
3                         Finland       4  89.56     NaN
4                          Sweden       5  89.45     NaN

Here we notice that there is an additional header level, which we can remove. Further, we do not need the Rank and Score.1 columns.

Let’s try to merge it together. Notice that we use a lambda function to clean up the Country/Territory names. If you are new to lambda functions, we recommend you read this tutorial.

import pandas as pd

# The URL we will read our data from
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)_per_capita'
# read_html returns a list of tables from the URL
tables = pd.read_html(url)

# The data is in table 3
table = tables[3]

# We need to clean the years in parenthesis from the country/territory field
table['Country'] = table.apply(lambda row: row['Country/Territory'].split(' (')[0], axis=1)
# We do not need the Rank and Country/Territory columns any more
table = table.drop(['Rank', 'Country/Territory'], axis=1)

url = 'https://en.wikipedia.org/wiki/Social_Progress_Index'
tables = pd.read_html(url)

merge_table = tables[1]
# The first level of the table can be dropped
merge_table.columns = merge_table.columns.droplevel(0)
# We do not need the Rank and Score.1 columns
merge_table = merge_table.drop(['Rank', 'Score.1'], axis=1)
# Need to rename the second column to SPI = Social Progress Index
merge_table.columns = ['Country', 'SPI']

# Ready to merge the tables
table = table.merge(merge_table, how="left", left_on=['Country'], right_on=['Country'])

print(table)

Which will result in an output like this.

        US$                            Country    SPI
0    185741                             Monaco    NaN
1    173356                      Liechtenstein    NaN
2    114705                         Luxembourg  87.66
3     84096                              Macau    NaN
4     81994                        Switzerland  89.89
5     78661                            Ireland  87.97
6     75420                             Norway  90.95

First, validate that Monaco, Liechtenstein, and Macau do not have any SPI value. That seems to be correct.
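If you want to see exactly which countries did not get an SPI value after the merge, a minimal sketch assuming the table variable from the code above:

# Rows where the left join found no matching SPI entry
print(table[table['SPI'].isna()]['Country'])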

Then we can proceed to the next table of HDI. Let us first inspect the data.

                           Rank                                             Country or Territory                         HDI                                                    Unnamed: 5_level_0
    2018 data (2019 report)​[5] Change from previous year​[5]               Country or Territory 2018 data (2019 report)​[5] Average annual HDI growth (2010–2018)​[6]          Unnamed: 5_level_1
    Very high human development   Very high human development        Very high human development Very high human development               Very high human development Very high human development
0                             1                           NaN                             Norway                       0.954                                     0.16%                         NaN
1                             2                           NaN                        Switzerland                       0.946                                     0.18%                         NaN
2                             3                           NaN                            Ireland                       0.942                                     0.71%                         NaN
3                             4                           NaN                            Germany                       0.939                                     0.25%                         NaN
4                             4                           (2)                          Hong Kong                       0.939                                     0.51%                         NaN
5                             6                           (1)                          Australia                       0.938                                     0.17%                         NaN

It has quite messy top-level column naming in three layers. Dropping the levels will make some column names identical. To deal with that, we can rename them and delete those we do not need.

import pandas as pd

# The URL we will read our data from
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)_per_capita'
# read_html returns a list of tables from the URL
tables = pd.read_html(url)

# The data is in table 3
table = tables[3]

# We need to clean the years in parenthesis from the country/territory field
table['Country'] = table.apply(lambda row: row['Country/Territory'].split(' (')[0], axis=1)
# We do not need the Rank and Country/Territory columns any more
table = table.drop(['Rank', 'Country/Territory'], axis=1)

url = 'https://en.wikipedia.org/wiki/Social_Progress_Index'
tables = pd.read_html(url)

merge_table = tables[1]
# The first level of the table can be dropped
merge_table.columns = merge_table.columns.droplevel(0)
# We do not need the Rank and Score.1 columns
merge_table = merge_table.drop(['Rank', 'Score.1'], axis=1)
# Need to rename the second column to SPI = Social Progress Index
merge_table.columns = ['Country', 'SPI']

# Ready to merge the tables
table = table.merge(merge_table, how="left", left_on=['Country'], right_on=['Country'])

url = 'https://en.wikipedia.org/wiki/List_of_countries_by_Human_Development_Index'
tables = pd.read_html(url)


merge_table = tables[1]
# Delete the additional column levels
merge_table.columns = merge_table.columns.droplevel(1)
merge_table.columns = merge_table.columns.droplevel(1)
# Rename the columns
merge_table.columns = ['Rank1', 'Rank2', 'Country', 'HDI', 'HDI-1', 'None']
# Delete the columns we do not need
merge_table = merge_table.drop(['Rank1', 'Rank2', 'HDI-1', 'None'], axis=1)
# Merge the tables
table = table.merge(merge_table, how="left", left_on=['Country'], right_on=['Country'])

print(table)

Which will result in the following output (or the top of it).

        US$                            Country    SPI    HDI
0    185741                             Monaco    NaN    NaN
1    173356                      Liechtenstein    NaN  0.917
2    114705                         Luxembourg  87.66  0.909
3     84096                              Macau    NaN    NaN
4     81994                        Switzerland  89.89  0.946
5     78661                            Ireland  87.97  0.942
6     75420                             Norway  90.95  0.954

Notice that here Liechtenstein has HDI data, while Monaco and Macau do not have any. While it is not visible, the HDI column is not of type float. This can be seen with a call to table.dtypes, which will output the following.

US$          int64
Country     object
SPI        float64
HDI         object
dtype: object

This states that HDI is an object, which in this case means a string. We need to convert it to float for our final correlation computations. This can be done by using a lambda function.

table['HDI'] = table.apply(lambda row: float(row['HDI']) if row['HDI'] is not np.nan else np.nan, axis=1) # HDI = Human Development Index
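An alternative to the lambda function is pandas’ built-in pd.to_numeric, which has the same effect here (a sketch, assuming the table variable from above):

# errors='coerce' turns anything that cannot be parsed as a number into np.nan
table['HDI'] = pd.to_numeric(table['HDI'], errors='coerce')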

This makes the data ready to check whether there is any correlation between GDP per capita and SPI and/or HDI.

Step 3: Calculate the correlations

This is where pandas DataFrames shine. They can do all the work for you with one call to corr().

The full code is given below.

import pandas as pd
import numpy as np


pd.set_option('display.max_rows', 300)
pd.set_option('display.max_columns', 10)
pd.set_option('display.width', 1000)


# The URL we will read our data from
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)_per_capita'
# read_html returns a list of tables from the URL
tables = pd.read_html(url)

# The data is in table 3
table = tables[3]

# We need to clean the years in parenthesis from the country/territory field
table['Country'] = table.apply(lambda row: row['Country/Territory'].split(' (')[0], axis=1)
# We do not need the Rank and Country/Territory columns any more
table = table.drop(['Rank', 'Country/Territory'], axis=1)

url = 'https://en.wikipedia.org/wiki/Social_Progress_Index'
tables = pd.read_html(url)

merge_table = tables[1]
# The first level of the table can be dropped
merge_table.columns = merge_table.columns.droplevel(0)
# We do not need the Rank and Score.1 columns
merge_table = merge_table.drop(['Rank', 'Score.1'], axis=1)
# Need to rename the second column to SPI = Social Progress Index
merge_table.columns = ['Country', 'SPI']

# Ready to merge the tables
table = table.merge(merge_table, how="left", left_on=['Country'], right_on=['Country'])

url = 'https://en.wikipedia.org/wiki/List_of_countries_by_Human_Development_Index'
tables = pd.read_html(url)


merge_table = tables[1]
# Delete the additional column levels
merge_table.columns = merge_table.columns.droplevel(1)
merge_table.columns = merge_table.columns.droplevel(1)
# Rename the columns
merge_table.columns = ['Rank1', 'Rank2', 'Country', 'HDI', 'HDI-1', 'None']
# Delete the columns we do not need
merge_table = merge_table.drop(['Rank1', 'Rank2', 'HDI-1', 'None'], axis=1)
# Merge the tables
table = table.merge(merge_table, how="left", left_on=['Country'], right_on=['Country'])

# Convert to floats
table['HDI'] = table.apply(lambda row: float(row['HDI']) if row['HDI'] is not np.nan else np.nan, axis=1) # HDI = Human Development Index

# Calculate the correlation
table_corr = table[['US$', 'SPI', 'HDI']].corr()  # only the numeric columns

# Print the correlation to GDP per capita (stored in US$).
print(table_corr['US$'].sort_values(ascending=False))

Which will result in the following output.

US$    1.000000
SPI    0.713946
HDI    0.663183
Name: US$, dtype: float64

Hence, the strongest correlation with GDP per capita is found for SPI.

Notice that the calculation ignores all Not a Number (np.nan) values.
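If you want to know how many rows are ignored in this way, a minimal sketch assuming the table variable from the code above:

# Number of countries missing an SPI or HDI value (excluded from the correlation)
print(table[['SPI', 'HDI']].isna().sum())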

How to Visualize Time Series Financial Data with Python in 3 Easy Steps

What will we cover

  • The easy way to visualize financial data with Python
  • How to fetch data from stocks in Python
  • The easy way to visualize it on a plot
  • How to enrich the data with valuable analysis

Step 1: Fetch the data on your favorite stock

As with most things in Python, somebody made an easy-to-use library to do all the hard work for you. This is also true if you want to visualize financial data with Python.

The pandas-datareader library lets you fetch financial data from various sources and returns it in a Pandas DataFrame.

If you do not know what a Pandas DataFrame is, do not worry. We will cover what is necessary here.

The full list of data sources available can be found here; it includes Tiingo, IEX, Alpha Vantage, Enigma, Quandl, St. Louis FED (FRED), Kenneth French’s data library, World Bank, and many more.

import pandas_datareader as pdr
import datetime 


aapl = pdr.get_data_yahoo('AAPL', 
                          start=datetime.datetime(2010, 1, 1), 
                          end=datetime.datetime(2020, 1, 1))

print(aapl.head())

Which will result in 10 years of data from Apple. See below.

                 High        Low  ...       Volume  Adj Close
Date                              ...                        
2010-01-04  30.642857  30.340000  ...  123432400.0  26.466835
2010-01-05  30.798571  30.464285  ...  150476200.0  26.512596
2010-01-06  30.747143  30.107143  ...  138040000.0  26.090879
2010-01-07  30.285715  29.864286  ...  119282800.0  26.042646
2010-01-08  30.285715  29.865715  ...  111902700.0  26.215786

For each trading day you will be presented with the following data.

High         2.936800e+02
Low          2.895200e+02
Open         2.899300e+02
Close        2.936500e+02
Volume       2.520140e+07
Adj Close    2.921638e+02
Name: 2019-12-31 00:00:00, dtype: float64
  • High: The highest price traded during that day.
  • Low: The lowest price traded during that day.
  • Open: The opening price that day.
  • Close: The closing price that day, that is the price of the last trade that day.
  • Volume: The number of shares that exchange hands for the stock that day.
  • Adj Close: The adjusted closing price, which reflects the stock’s value after accounting for any corporate actions. It is considered to be the true price of the stock and is often used when examining historical returns.

As you can see, for further investigations of the data you should use the Adj Close.
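For example, if you want to start looking at returns, here is a minimal sketch assuming the aapl DataFrame from the code above (the column name Daily Return is just for illustration):

# Daily percentage returns based on the adjusted closing price
aapl['Daily Return'] = aapl['Adj Close'].pct_change()
print(aapl['Daily Return'].describe())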

Step 2: Visualize the data

This is where DataFrames come in handy. They integrate easily with matplotlib, which is a comprehensive library for creating static, animated, and interactive visualizations in Python.

import pandas_datareader as pdr
import datetime 
import matplotlib.pyplot as plt


aapl = pdr.get_data_yahoo('AAPL', 
                          start=datetime.datetime(2010, 1, 1), 
                          end=datetime.datetime(2020, 1, 1))

aapl['Adj Close'].plot()
plt.show()

Which will result in a graph like this.

Apple stock price.

That was easy.

You can see further ways to visualize the data in the documentation here.

Step 3: Enrich the data

That is another great advantage of DataFrames: it is easy to enrich them with valuable data.

A good example is to enrich the data with running mean values of the stock price.

import pandas_datareader as pdr
import datetime 
import matplotlib.pyplot as plt


aapl = pdr.get_data_yahoo('AAPL', 
                          start=datetime.datetime(2015, 1, 1), 
                          end=datetime.datetime(2020, 1, 1))

aapl['Mean Short'] = aapl['Adj Close'].rolling(window=20).mean()
aapl['Mean Long'] = aapl['Adj Close'].rolling(window=100).mean()
aapl[['Adj Close', 'Mean Short', 'Mean Long']].plot()
plt.show()

Which will result in the following graph.

Apple stock with rolling mean values with window of 20 and 100 days.

Now that was simple. The code simply rolls a window of 20 (or 100) days and takes the average (mean) value of that window. This kind of analysis is used to see the average trend of the stock, and to see when the short-term trend (20 days) crosses the long-term trend (100 days).

A simple trading strategy decides to buy and sell based on when these averages cross. This strategy is called the dual moving average crossover. You can read more about it here.
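A minimal sketch of how such a crossover could be detected with the columns computed above (the signal and crossings names are just for illustration):

# 1 when the short mean is above the long mean, 0 otherwise
signal = (aapl['Mean Short'] > aapl['Mean Long']).astype(int)
# +1 marks a crossover upwards (buy), -1 a crossover downwards (sell)
crossings = signal.diff()
print(crossings[crossings != 0].dropna())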

A volatility analysis is used to see how “stable” the share is. The higher the volatility value is, the riskier it is.

import pandas_datareader as pdr
import datetime 
import matplotlib.pyplot as plt
import numpy as np


aapl = pdr.get_data_yahoo('AAPL', 
                          start=datetime.datetime(2015, 1, 1), 
                          end=datetime.datetime(2020, 1, 1))


daily_close = aapl['Close']
daily_pc = daily_close.pct_change()

# See the volatility
vol = daily_pc.rolling(75).std()*np.sqrt(75)
vol.plot()
plt.show()

Which results in the following graph.

Volatility of Apple stock

This is a good indication of how risky the stock is.
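If you prefer a single number instead of the rolling plot, a common convention is to annualize the standard deviation of the daily returns (a sketch, assuming daily_pc from the code above; 252 is the usual number of trading days in a year):

# Annualized volatility from the daily percentage changes
annual_vol = daily_pc.std() * np.sqrt(252)
print(f"Annualized volatility: {annual_vol:.1%}")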

Conclusion

This is just a simple introduction to how to retrieve financial stock data in Python and visualize it. Also, how easy it is to enrich it with more valuable analysis.

There is so much more to explore and learn.

Build a Financial Trading Algorithm in Python in 5 Easy Steps

What will we cover in this tutorial?

Step 1: Get time series data on your favorite stock

To build a financial trading algorithm in Python, it needs to be fed with data. Hence, the first step you need to master is how to collect time series data on your favorite stock. Sounds like it is difficult, right?

Luckily, someone made an awesome library, pandas-datareader, which does all the hard work for you. Let’s take a look at how easy it is.

import datetime as dt
import pandas_datareader as pdr
import matplotlib.pyplot as plt
from dateutil.relativedelta import relativedelta


# A Danish jewellery manufacturer and retailer
stock = 'PNDORA.CO'
end = dt.datetime.now()
start = end - relativedelta(years=10)
pndora = pdr.get_data_yahoo(stock, start, end)

pndora['Close'].plot()

plt.show()

Which results in a time series of the closing price as shown here.

Pandora A/S time series stock price

The stock is probably quite unknown, considering it is a Danish company. But it proves the point that you can get data for almost any stock, including Pandora.

In the code you see that you send a start and end date to the call fetching the data. Here we have taken the last 10 years. The object returned integrates well with the matplotlib library to make the plot.

To understand the data better, we need to explore further.

Step 2: Understand the data available

Let us explore the data object returned by the call (pndora).

To get an overview, you can run the following code using an iloc call on the DataFrame object (pndora, returned by pandas_datareader).

pndora.iloc[-1]

This will show what the last item of the object looks like.

High            365.000000
Low             355.600006
Open            360.000000
Close           356.500000
Volume       446004.000000
Adj Close       356.500000
Name: 2020-06-26 00:00:00, dtype: float64

Where you have the following items.

  • High: The highest price traded during that day.
  • Low: The lowest price traded during that day.
  • Open: The opening price that day.
  • Close: The closing price that day, that is the price of the last trade that day.
  • Volume: The number of shares that exchange hands for the stock that day.
  • Adj Close: The adjusted closing price, which reflects the stock’s value after accounting for any corporate actions. It is considered to be the true price of the stock and is often used when examining historical returns.

That means it would be natural to use Adj Close in our calculations. Hence, for each day we have the above information available.

Step 3: Learning how to enrich the data (Pandas)

Pandas? Yes, you read that correctly. But not a panda like the one in the picture.

Pandas the Python library.

There is an awesome library in Python called Pandas that makes data analysis easy.

Let us explore some useful things we can do.

import datetime as dt
import pandas_datareader as pdr
import matplotlib.pyplot as plt
from dateutil.relativedelta import relativedelta


# A Danish jewellery manufacturer and retailer
stock = 'PNDORA.CO'
end = dt.datetime.now()
start = end - relativedelta(years=10)
pndora = pdr.get_data_yahoo(stock, start, end)

pndora['Short'] = pndora['Adj Close'].rolling(window=20).mean()
pndora['Long'] = pndora['Adj Close'].rolling(window=100).mean()

pndora[['Adj Close', 'Short', 'Long']].plot()
plt.show()

Which will result in the following graph.

Pandora A/S stock prices with Short mean average and Long mean average.

If you inspect the code above, you can see that you easily added two new columns (Short and Long) and computed them as the mean value of the previous 20 and 100 days, respectively.

Further, you can add the daily percentage change for the various entries.

import datetime as dt
import pandas_datareader as pdr
import matplotlib.pyplot as plt
from dateutil.relativedelta import relativedelta


# A Danish jewellery manufacturer and retailer
stock = 'PNDORA.CO'
end = dt.datetime.now()
start = end - relativedelta(years=10)
pndora = pdr.get_data_yahoo(stock, start, end)

pndora['Short'] = pndora['Adj Close'].rolling(window=20).mean()
pndora['Long'] = pndora['Adj Close'].rolling(window=100).mean()

pndora['Pct Change'] = pndora['Adj Close'].pct_change()
pndora['Pct Short'] = pndora['Short'].pct_change()
pndora['Pct Long'] = pndora['Long'].pct_change()

pndora[['Pct Change', 'Pct Short', 'Pct Long']].loc['2020'].plot()
plt.show()

Which will result in this graph.

Pandora A/S stock’s daily percentage change for Adj Close, short and long mean.

Again you can see how easy it is to add new columns to the DataFrame object provided by the Pandas library.

Step 4: Building your strategy to buy and sell stocks

For the example we will keep it simple and only focus on one stock. The strategy we will use is called the dual moving average crossover.

Simply explained, you want to buy stocks when the short moving average is higher than the long moving average.

In the figure above, this translates to the following.

  • Buy when the yellow crosses above the green line.
  • Sell when the yellow crosses below the green line.

To implement the simplest version of that, it would be as follows.

import datetime as dt
import pandas_datareader as pdr
import matplotlib.pyplot as plt
import numpy as np
from dateutil.relativedelta import relativedelta


# A Danish jewellery manufacturer and retailer
stock = 'PNDORA.CO'
end = dt.datetime.now()
start = end - relativedelta(years=10)
pndora = pdr.get_data_yahoo(stock, start, end)

short_window = 20
long_window = 100

pndora['Short'] = pndora['Adj Close'].rolling(window=short_window).mean()
pndora['Long'] = pndora['Adj Close'].rolling(window=long_window).mean()

# Let us create some signals
pndora['signal'] = 0.0
pndora.loc[pndora.index[short_window:], 'signal'] = np.where(
    pndora['Short'].iloc[short_window:] > pndora['Long'].iloc[short_window:], 1.0, 0.0)

# The diff marks the crossovers: 1.0 is a buy signal, -1.0 is a sell signal
pndora['positions'] = pndora['signal'].diff()

To visually see where to buy and sell, you can run the following code afterwards on pndora.

fig = plt.figure()

ax1 = fig.add_subplot(111, ylabel='Price')

pndora[['Adj Close', 'Short', 'Long']].plot(ax=ax1)

ax1.plot(pndora.loc[pndora.positions == 1.0].index,
         pndora.Short[pndora.positions == 1.0],
         '^', markersize=10, color='m')

ax1.plot(pndora.loc[pndora.positions == -1.0].index,
         pndora.Short[pndora.positions == -1.0],
         'v', markersize=10, color='k')

plt.show()

Which would result in the following graph.

Finally, you need to see how your algorithm performs.

Step 5: Testing your trading algorithm

There are many ways to test an algorithm. Here we go all in on each cycle: we buy as many shares as we can on a buy signal and sell them all on a sell signal.

import datetime as dt
import pandas_datareader as pdr
import matplotlib.pyplot as plt
import numpy as np
from dateutil.relativedelta import relativedelta


# A Danish jewellery manufacturer and retailer
stock = 'PNDORA.CO'
end = dt.datetime.now()
start = end - relativedelta(years=10)
pndora = pdr.get_data_yahoo(stock, start, end)

short_window = 20
long_window = 100

pndora['Short'] = pndora['Adj Close'].rolling(window=short_window).mean()
pndora['Long'] = pndora['Adj Close'].rolling(window=long_window).mean()

# Let us create some signals
pndora['signal'] = 0.0
pndora.loc[pndora.index[short_window:], 'signal'] = np.where(
    pndora['Short'].iloc[short_window:] > pndora['Long'].iloc[short_window:], 1.0, 0.0)

# The diff marks the crossovers: 1.0 is a buy signal, -1.0 is a sell signal
pndora['positions'] = pndora['signal'].diff()

cash = 1000000
stocks = 0
for index, row in pndora.iterrows():
    if row['positions'] == 1.0:
        stocks = int(cash//row['Adj Close'])
        cash -= stocks*row['Adj Close']
    elif row['positions'] == -1.0:
        cash += stocks*row['Adj Close']
        stocks = 0

print("Total:", cash + stocks*pndora.iloc[-1]['Adj Close'])

Which results in the following.

Total: 2034433.8826065063

That is roughly a doubling in 10 years, which is less than 8% per year. Not so good.
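To check that claim, a quick calculation of the implied compound annual growth rate (a sketch based on the printed total):

# Turning 1,000,000 into about 2,034,434 over 10 years
total = 2034433.88
cagr = (total / 1000000) ** (1 / 10) - 1
print(f"{cagr:.1%}")  # roughly 7.4% per year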

As this is one specific stock, it is not fair to judge the algorithm as poor; it could be the stock that was not performing well, or the variables could be adjusted further.

Compared with the scenario where you buy the stock on day one and sell it all on the last day, your total would be 1,876,232. Hence, the algorithm beats that.
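For reference, that buy-and-hold comparison can be computed like this (a minimal sketch, assuming the pndora DataFrame from the code above):

# Buy as many shares as possible on the first day and sell them all on the last day
cash = 1000000
first_price = pndora['Adj Close'].iloc[0]
last_price = pndora['Adj Close'].iloc[-1]
stocks = int(cash // first_price)
cash -= stocks * first_price
print("Buy and hold total:", cash + stocks * last_price)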

Conclusion and further work

This is a simple financial trading algorithm in Python, and there are variables that can be adjusted. The algorithm performed better than the naive strategy of buying on day one and selling on the last day.

It could be interesting to add more data into the algorithm's decision, which might be future work. Also, could it be combined with some Machine Learning?