Pandas and Folium: Categorize GDP Growth by Country and Visualize on Map in 3 Easy Steps

What will we cover in this tutorial?

  • We will gather data from wikipedia.org List of countries by past and projected GDP using pandas.
  • First step will be get the data and merge the correct tables together.
  • Next step is using Machine Learning with Linear regression model to estimate the growth of each country GDP.
  • Final step is to visualize the growth rates on a leaflet map using folium.

Step 1: Get the data and merge it

The data is available on wikipedia on List of countries by past and projected GDP. We will focus on data from 1990 to 2019.

At first glance on the page you notice that the date is not gathered in one table.

From wikipedia.org

The first task will be to merge the three tables with the data from 1990-1999, 2000-2009, and 2010-2019.

The data can be collected by pandas read_html function. If you are new to this you can read this tutorial.

import pandas as pd

# The URL we will read our data from
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_past_and_projected_GDP_(nominal)'
# read_html returns a list of tables from the URL
tables = pd.read_html(url)

# Merge the tables into one table
merge_index = 'Country (or dependent territory)'
table = tables[9].merge(tables[12], how="left", left_on=[merge_index], right_on=[merge_index])
table = table.merge(tables[15], how="left", left_on=[merge_index], right_on=[merge_index])

print(table)

The call to read_html will return all the tables in a list. By inspecting the results you will notice that we are interested in table 9, 12 and 15 and merge them. The output of the above will be.

     Country (or dependent territory)       1990       1991       1992       1993       1994       1995       1996       1997       1998       1999        2000        2001        2002        2003        2004        2005        2006        2007        2008        2009        2010        2011        2012        2013        2014        2015        2016        2017        2018        2019
0                         Afghanistan        NaN        NaN        NaN        NaN        NaN        NaN        NaN        NaN        NaN        NaN         NaN         NaN      4367.0      4514.0      5146.0      6167.0      6925.0      8556.0     10297.0     12066.0     15325.0     17890.0     20296.0     20170.0     20352.0     19687.0     19454.0     20235.0     19585.0     19990.0
1                             Albania     2221.0     1333.0      843.0     1461.0     2361.0     2882.0     3200.0     2259.0     2560.0     3209.0      3483.0      3928.0      4348.0      5611.0      7185.0      8052.0      8905.0     10675.0     12901.0     12093.0     11938.0     12896.0     12323.0     12784.0     13238.0     11393.0     11865.0     13055.0     15202.0     15960.0
2                             Algeria    61892.0    46670.0    49217.0    50963.0    42426.0    42066.0    46941.0    48178.0    48188.0    48845.0     54749.0     54745.0     56761.0     67864.0     85327.0    103198.0    117027.0    134977.0    171001.0    137054.0    161207.0    199394.0    209005.0    209703.0    213518.0    164779.0    159049.0    167555.0    180441.0    183687.0
3                              Angola    11236.0    10891.0     8398.0     6095.0     4438.0     5539.0     6535.0     7675.0     6506.0     6153.0      9130.0      8936.0     12497.0     14189.0     19641.0     28234.0     41789.0     60449.0     84178.0     75492.0     82471.0    104116.0    115342.0    124912.0    126777.0    102962.0     95337.0    122124.0    107316.0     92191.0
4                 Antigua and Barbuda      459.0      482.0      499.0      535.0      589.0      577.0      634.0      681.0      728.0      766.0       825.0       796.0       810.0       850.0       912.0      1013.0      1147.0      1299.0      1358.0      1216.0      1146.0      1140.0      1214.0      1194.0      1273.0      1353.0      1460.0      1516.0      1626.0      1717.0
5                           Argentina   153205.0   205515.0   247987.0   256365.0   279150.0   280080.0   295120.0   317549.0   324242.0   307673.0    308491.0    291738.0    108731.0    138151.0    164922.0    199273.0    232892.0    287920.0    363545.0    334633.0    424728.0    527644.0    579666.0    611471.0    563614.0    631621.0    554107.0    642928.0    518092.0    477743.0
6                             Armenia        NaN        NaN      108.0      835.0      648.0     1287.0     1597.0     1639.0     1892.0     1845.0      1912.0      2118.0      2376.0      2807.0      3577.0      4900.0      6384.0      9206.0     11662.0      8648.0      9260.0     10142.0     10619.0     11121.0     11610.0     10529.0     10572.0     11537.0     12411.0     13105.0

Step 2: Use linear regression to estimate the growth over the last 30 years

In this section we will use Linear regression from the scikit-learn library, which is a simple prediction tool.

If you are new to Machine Learning we recommend you read this tutorial on Linear regression.

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

import numpy as np

# The URL we will read our data from
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_past_and_projected_GDP_(nominal)'
# read_html returns a list of tables from the URL
tables = pd.read_html(url)

# Merge the tables into one table
merge_index = 'Country (or dependent territory)'
table = tables[9].merge(tables[12], how="left", left_on=[merge_index], right_on=[merge_index])
table = table.merge(tables[15], how="left", left_on=[merge_index], right_on=[merge_index])

row = table.iloc[1]
X = table.columns[1:].to_numpy().reshape(-1, 1)
X = X.astype(int)
Y = 1 + row.iloc[1:].pct_change()
Y = Y.cumprod().fillna(1.0).to_numpy()
Y = Y.reshape(-1, 1)

regr = LinearRegression()
regr.fit(X, Y)

Y_pred = regr.predict(X)

plt.scatter(X, Y)
plt.plot(X, Y_pred, color='red')
plt.show()

Which will result in the following plot.

Linear regression model applied on data from wikipedia.org

Which shows that the model approximates a line through the 30 years of data to estimate the growth of the country’s GDP.

Notice that we use the product (cumprod) of pct_change to be able to compare the data. If we used the data directly, we would not be possible to compare it.

We will do that for all countries to get a view of the growth. We are using the coefficient of the line, which indicates the growth rate.

import pandas as pd
from sklearn.linear_model import LinearRegression
import numpy as np

# The URL we will read our data from
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_past_and_projected_GDP_(nominal)'
# read_html returns a list of tables from the URL
tables = pd.read_html(url)

# Merge the tables into one table
merge_index = 'Country (or dependent territory)'
table = tables[9].merge(tables[12], how="left", left_on=[merge_index], right_on=[merge_index])
table = table.merge(tables[15], how="left", left_on=[merge_index], right_on=[merge_index])

coef = []
countries = []

for index, row in table.iterrows():
    #print(row)
    X = table.columns[1:].to_numpy().reshape(-1, 1)
    X = X.astype(int)
    Y = 1 + row.iloc[1:].pct_change()
    Y = Y.cumprod().fillna(1.0).to_numpy()
    Y = Y.reshape(-1, 1)

    regr = LinearRegression()
    regr.fit(X, Y)

    coef.append(regr.coef_[0][0])
    countries.append(row[merge_index])

data = pd.DataFrame(list(zip(countries, coef)), columns=['Country', 'Coef'])

print(data)

Which results in the following output (or the first few lines).

                              Country      Coef
0                         Afghanistan  0.161847
1                             Albania  0.243493
2                             Algeria  0.103907
3                              Angola  0.423919
4                 Antigua and Barbuda  0.087863
5                           Argentina  0.090837
6                             Armenia  4.699598

Step 3: Merge the data to a leaflet map using folium

The last step is to merge the data together with the leaflet map using the folium library. If you are new to folium we recommend you read this tutorial.

import pandas as pd
import folium
import geopandas
from sklearn.linear_model import LinearRegression
import numpy as np

# The URL we will read our data from
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_past_and_projected_GDP_(nominal)'
# read_html returns a list of tables from the URL
tables = pd.read_html(url)

# Merge the tables into one table
merge_index = 'Country (or dependent territory)'
table = tables[9].merge(tables[12], how="left", left_on=[merge_index], right_on=[merge_index])
table = table.merge(tables[15], how="left", left_on=[merge_index], right_on=[merge_index])

coef = []
countries = []

for index, row in table.iterrows():
    X = table.columns[1:].to_numpy().reshape(-1, 1)
    X = X.astype(int)
    Y = 1 + row.iloc[1:].pct_change()
    Y = Y.cumprod().fillna(1.0).to_numpy()
    Y = Y.reshape(-1, 1)

    regr = LinearRegression()
    regr.fit(X, Y)

    coef.append(regr.coef_[0][0])
    countries.append(row[merge_index])

data = pd.DataFrame(list(zip(countries, coef)), columns=['Country', 'Coef'])

# Read the geopandas dataset
world = geopandas.read_file(geopandas.datasets.get_path('naturalearth_lowres'))
# Replace United States of America to United States to fit the naming in the table
world = world.replace('United States of America', 'United States')

# Merge the two DataFrames together
table = world.merge(data, how="left", left_on=['name'], right_on=['Country'])


# Clean data: remove rows with no data
table = table.dropna(subset=['Coef'])

# We have 10 colors available resulting into 9 cuts.
table['Cat'] = pd.qcut(table['Coef'], 9, labels=[0, 1, 2, 3, 4, 5, 6, 7, 8])

print(table)

# Create a map
my_map = folium.Map()

# Add the data
folium.Choropleth(
    geo_data=table,
    name='choropleth',
    data=table,
    columns=['Country', 'Cat'],
    key_on='feature.properties.name',
    fill_color='YlGn',
    fill_opacity=0.7,
    line_opacity=0.2,
    legend_name='Growth of GDP since 1990',
    threshold_scale=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
).add_to(my_map)
my_map.save('gdp_growth.html')

There is a twist in the way it is done. Instead of using a linear model to represent the growth rate on the map, we chose to add them in categories. The reason is that otherwise most countries group in small segment.

Here we have used the qcut to add them in each equal sized group.

This should result in an interactive html page looking something like this.

End result.

Leave a Reply