Pandas and Folium: Categorize GDP Growth by Country and Visualize on Map in 3 Easy Steps

What will we cover in this tutorial?

  • We will gather data from wikipedia.org's List of countries by past and projected GDP using pandas.
  • The first step will be to get the data and merge the correct tables together.
  • The next step is to use Machine Learning with a linear regression model to estimate the growth of each country's GDP.
  • The final step is to visualize the growth rates on a leaflet map using folium.

Step 1: Get the data and merge it

The data is available on Wikipedia's page List of countries by past and projected GDP. We will focus on data from 1990 to 2019.

At first glance at the page you will notice that the data is not gathered in one table.

From wikipedia.org

The first task will be to merge the three tables with the data from 1990-1999, 2000-2009, and 2010-2019.

The data can be collected with the pandas read_html function. If you are new to this, you can read this tutorial.

import pandas as pd

# The URL we will read our data from
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_past_and_projected_GDP_(nominal)'
# read_html returns a list of tables from the URL
tables = pd.read_html(url)

# Merge the tables into one table
merge_index = 'Country (or dependent territory)'
table = tables[9].merge(tables[12], how="left", left_on=[merge_index], right_on=[merge_index])
table = table.merge(tables[15], how="left", left_on=[merge_index], right_on=[merge_index])

print(table)

The call to read_html returns all the tables on the page in a list. By inspecting the results you will notice that we are interested in tables 9, 12, and 15, which we then merge. The output of the above will be.

     Country (or dependent territory)       1990       1991       1992       1993       1994       1995       1996       1997       1998       1999        2000        2001        2002        2003        2004        2005        2006        2007        2008        2009        2010        2011        2012        2013        2014        2015        2016        2017        2018        2019
0                         Afghanistan        NaN        NaN        NaN        NaN        NaN        NaN        NaN        NaN        NaN        NaN         NaN         NaN      4367.0      4514.0      5146.0      6167.0      6925.0      8556.0     10297.0     12066.0     15325.0     17890.0     20296.0     20170.0     20352.0     19687.0     19454.0     20235.0     19585.0     19990.0
1                             Albania     2221.0     1333.0      843.0     1461.0     2361.0     2882.0     3200.0     2259.0     2560.0     3209.0      3483.0      3928.0      4348.0      5611.0      7185.0      8052.0      8905.0     10675.0     12901.0     12093.0     11938.0     12896.0     12323.0     12784.0     13238.0     11393.0     11865.0     13055.0     15202.0     15960.0
2                             Algeria    61892.0    46670.0    49217.0    50963.0    42426.0    42066.0    46941.0    48178.0    48188.0    48845.0     54749.0     54745.0     56761.0     67864.0     85327.0    103198.0    117027.0    134977.0    171001.0    137054.0    161207.0    199394.0    209005.0    209703.0    213518.0    164779.0    159049.0    167555.0    180441.0    183687.0
3                              Angola    11236.0    10891.0     8398.0     6095.0     4438.0     5539.0     6535.0     7675.0     6506.0     6153.0      9130.0      8936.0     12497.0     14189.0     19641.0     28234.0     41789.0     60449.0     84178.0     75492.0     82471.0    104116.0    115342.0    124912.0    126777.0    102962.0     95337.0    122124.0    107316.0     92191.0
4                 Antigua and Barbuda      459.0      482.0      499.0      535.0      589.0      577.0      634.0      681.0      728.0      766.0       825.0       796.0       810.0       850.0       912.0      1013.0      1147.0      1299.0      1358.0      1216.0      1146.0      1140.0      1214.0      1194.0      1273.0      1353.0      1460.0      1516.0      1626.0      1717.0
5                           Argentina   153205.0   205515.0   247987.0   256365.0   279150.0   280080.0   295120.0   317549.0   324242.0   307673.0    308491.0    291738.0    108731.0    138151.0    164922.0    199273.0    232892.0    287920.0    363545.0    334633.0    424728.0    527644.0    579666.0    611471.0    563614.0    631621.0    554107.0    642928.0    518092.0    477743.0
6                             Armenia        NaN        NaN      108.0      835.0      648.0     1287.0     1597.0     1639.0     1892.0     1845.0      1912.0      2118.0      2376.0      2807.0      3577.0      4900.0      6384.0      9206.0     11662.0      8648.0      9260.0     10142.0     10619.0     11121.0     11610.0     10529.0     10572.0     11537.0     12411.0     13105.0
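
Note that Wikipedia pages are edited continuously, so the table indices 9, 12 and 15 may shift over time. Should that happen, a small sketch like the following can locate the right tables again.

import pandas as pd

url = 'https://en.wikipedia.org/wiki/List_of_countries_by_past_and_projected_GDP_(nominal)'
tables = pd.read_html(url)

# Print the position, dimensions and first column name of every table,
# so the GDP tables can be identified if the page layout has changed
for i, t in enumerate(tables):
    print(i, t.shape, t.columns[0])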

Step 2: Use linear regression to estimate the growth over the last 30 years

In this section we will use Linear regression from the scikit-learn library, which is a simple prediction tool.

If you are new to Machine Learning we recommend you read this tutorial on Linear regression.

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

import numpy as np

# The URL we will read our data from
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_past_and_projected_GDP_(nominal)'
# read_html returns a list of tables from the URL
tables = pd.read_html(url)

# Merge the tables into one table
merge_index = 'Country (or dependent territory)'
table = tables[9].merge(tables[12], how="left", left_on=[merge_index], right_on=[merge_index])
table = table.merge(tables[15], how="left", left_on=[merge_index], right_on=[merge_index])

# Use Albania (row 1) as an example
row = table.iloc[1]
# The years (the column names) are the explanatory variable
X = table.columns[1:].to_numpy().reshape(-1, 1)
X = X.astype(int)
# Turn the absolute GDP figures into an index starting at 1.0 in 1990
Y = 1 + row.iloc[1:].pct_change()
Y = Y.cumprod().fillna(1.0).to_numpy()
Y = Y.reshape(-1, 1)

regr = LinearRegression()
regr.fit(X, Y)

Y_pred = regr.predict(X)

plt.scatter(X, Y)
plt.plot(X, Y_pred, color='red')
plt.show()

Which will result in the following plot.

Linear regression model applied on data from wikipedia.org

Which shows how the model fits a line through the 30 years of data to estimate the growth of the country's GDP.

Notice that we use the cumulative product (cumprod) of pct_change to be able to compare the data. If we used the raw GDP figures directly, it would not be possible to compare countries of different sizes.
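
To see why this works, consider a small made-up series. The cumulative product of one plus the percentage changes turns absolute GDP figures into an index that starts at 1.0, so countries of very different sizes end up on the same scale.

import pandas as pd

# Made-up GDP figures for illustration only
gdp = pd.Series([100.0, 110.0, 121.0])

growth = 1 + gdp.pct_change()         # NaN, 1.10, 1.10
index = growth.cumprod().fillna(1.0)  # 1.00, 1.10, 1.21
print(index)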

We will do that for all countries to get an overview of the growth. We use the coefficient (the slope) of the fitted line, which indicates the growth rate.

import pandas as pd
from sklearn.linear_model import LinearRegression
import numpy as np

# The URL we will read our data from
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_past_and_projected_GDP_(nominal)'
# read_html returns a list of tables from the URL
tables = pd.read_html(url)

# Merge the tables into one table
merge_index = 'Country (or dependent territory)'
table = tables[9].merge(tables[12], how="left", left_on=[merge_index], right_on=[merge_index])
table = table.merge(tables[15], how="left", left_on=[merge_index], right_on=[merge_index])

coef = []
countries = []

for index, row in table.iterrows():
    X = table.columns[1:].to_numpy().reshape(-1, 1)
    X = X.astype(int)
    Y = 1 + row.iloc[1:].pct_change()
    Y = Y.cumprod().fillna(1.0).to_numpy()
    Y = Y.reshape(-1, 1)

    regr = LinearRegression()
    regr.fit(X, Y)

    coef.append(regr.coef_[0][0])
    countries.append(row[merge_index])

data = pd.DataFrame(list(zip(countries, coef)), columns=['Country', 'Coef'])

print(data)

Which results in the following output (showing only the first few lines).

                              Country      Coef
0                         Afghanistan  0.161847
1                             Albania  0.243493
2                             Algeria  0.103907
3                              Angola  0.423919
4                 Antigua and Barbuda  0.087863
5                           Argentina  0.090837
6                             Armenia  4.699598
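
With the coefficients collected, you can for instance sort the DataFrame to see the extremes (assuming the data DataFrame from the code above).

# Countries with the steepest fitted growth lines first
print(data.sort_values('Coef', ascending=False).head(10))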

Step 3: Merge the data to a leaflet map using folium

The last step is to merge the data together with the leaflet map using the folium library. If you are new to folium we recommend you read this tutorial.

import pandas as pd
import folium
import geopandas
from sklearn.linear_model import LinearRegression
import numpy as np

# The URL we will read our data from
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_past_and_projected_GDP_(nominal)'
# read_html returns a list of tables from the URL
tables = pd.read_html(url)

# Merge the tables into one table
merge_index = 'Country (or dependent territory)'
table = tables[9].merge(tables[12], how="left", left_on=[merge_index], right_on=[merge_index])
table = table.merge(tables[15], how="left", left_on=[merge_index], right_on=[merge_index])

coef = []
countries = []

for index, row in table.iterrows():
    X = table.columns[1:].to_numpy().reshape(-1, 1)
    X = X.astype(int)
    Y = 1 + row.iloc[1:].pct_change()
    Y = Y.cumprod().fillna(1.0).to_numpy()
    Y = Y.reshape(-1, 1)

    regr = LinearRegression()
    regr.fit(X, Y)

    coef.append(regr.coef_[0][0])
    countries.append(row[merge_index])

data = pd.DataFrame(list(zip(countries, coef)), columns=['Country', 'Coef'])

# Read the geopandas dataset
world = geopandas.read_file(geopandas.datasets.get_path('naturalearth_lowres'))
# Rename 'United States of America' to 'United States' to match the naming in the table
world = world.replace('United States of America', 'United States')

# Merge the two DataFrames together
table = world.merge(data, how="left", left_on=['name'], right_on=['Country'])


# Clean data: remove rows with no data
table = table.dropna(subset=['Coef'])

# We have 10 colors available, resulting in 9 cuts.
table['Cat'] = pd.qcut(table['Coef'], 9, labels=[0, 1, 2, 3, 4, 5, 6, 7, 8])

print(table)

# Create a map
my_map = folium.Map()

# Add the data
folium.Choropleth(
    geo_data=table,
    name='choropleth',
    data=table,
    columns=['Country', 'Cat'],
    key_on='feature.properties.name',
    fill_color='YlGn',
    fill_opacity=0.7,
    line_opacity=0.2,
    legend_name='Growth of GDP since 1990',
    threshold_scale=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
).add_to(my_map)
my_map.save('gdp_growth.html')

There is a twist in the way it is done. Instead of mapping the raw growth coefficients directly, we chose to group them into categories. The reason is that otherwise most countries would cluster in a small segment of the color scale.

Here we have used qcut to split the countries into equally sized groups.
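
As a minimal sketch of what qcut with labels does, here applied to made-up numbers (the map code uses 9 groups; 3 keep the example small).

import pandas as pd

values = pd.Series([0.1, 0.2, 0.3, 0.9, 1.5, 4.7])
# Three quantile-based categories labeled 0, 1 and 2
cats = pd.qcut(values, 3, labels=[0, 1, 2])
print(cats)

# qcut with labels returns a Categorical; should folium reject that dtype,
# converting to plain integers first is a safe workaround
print(cats.astype(int))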

This should result in an interactive html page looking something like this.

End result.

How to Create Choropleth Maps with Evenly Distributed Colors in 3 Easy Steps

What will we cover in this tutorial?

  • You create your default map using Choropleth from folium (or any other library) and the color distribution is poor: most countries are distributed between two colors, which makes the map less informative.
  • Understand the issue
  • How to solve it
  • …and putting it all together.

Example of poor color distribution.

Step 1: Understand the issue

In this example we use the divorce rates reported for various countries on Wikipedia's page on Divorce demography.

A first inspection of the wikipedia page gives you an idea of what the problem is.

From wikipedia.org

The divorce rate in Denmark is too high (just kidding, I am from Denmark and am not proud that Denmark ranks number 6 by percentage of marriages that end in divorce).

The issue is that the distribution is not even. The highest is Tunisia with 97.14 percent, and Portugal is second with 70.97 percent. Compare this to the color coding: the last two colors are by default assigned the ranges 67 to 82 and 82 to 97, and each contains only one country.

To investigate this further, we need to retrieve the data and inspect it.

To retrieve the data we can use pandas – read this tutorial for details or see the code below.

import pandas as pd
import numpy as np

# The URL we will read our data from
url = 'https://en.wikipedia.org/wiki/Divorce_demography'
# read_html returns a list of tables from the URL
tables = pd.read_html(url)

# The data is in the first table
table = tables[0]
# Drop level 0 of the column MultiIndex to get simple column names
table.columns = table.columns.droplevel(0)

def is_float(value):
    try:
        float(value)
        return True
    except (ValueError, TypeError):
        return False

# We need to convert the data to floats
index = 'Divorce_float'
table[index] = table.apply(lambda row: float(row['Percent']) if is_float(row['Percent']) else np.nan, axis=1)

print(pd.cut(table[index], 6).value_counts(sort=False))

If we inspect the output we see that our suspicion was right.

(6.93, 22.04]     25
(22.04, 37.06]    26
(37.06, 52.08]    22
(52.08, 67.1]      9
(67.1, 82.12]      1
(82.12, 97.14]     1
Name: Divorce_float, dtype: int64

The last two color codes are only used by one country each, while the first three are used by 20+ countries.

Step 2: Distribute the countries into evenly distributed bins

This requires understanding the difference between cut and qcut in the pandas library.

  • cut by default returns bins of equal width.
  • qcut by default tries to return bins with the same number of items in each.

See this example to understand it better.

import pandas as pd
import numpy as np

# The URL we will read our data from
url = 'https://en.wikipedia.org/wiki/Divorce_demography'
# read_html returns a list of tables from the URL
tables = pd.read_html(url)

# The data is in the first table
table = tables[0]
# Drop level 0 of the column MultiIndex to get simple column names
table.columns = table.columns.droplevel(0)

def is_float(value):
    try:
        float(value)
        return True
    except (ValueError, TypeError):
        return False

# We need to convert the data to floats
index = 'Divorce_float'
table[index] = table.apply(lambda row: float(row['Percent']) if is_float(row['Percent']) else np.nan, axis=1)

print(pd.qcut(table[index], 6).value_counts(sort=False))

Where the only difference is that we changed cut to qcut on the last line. This will result in the following output.

(7.018999999999999, 17.303]    14
(17.303, 23.957]               14
(23.957, 31.965]               14
(31.965, 40.0]                 15
(40.0, 47.078]                 13
(47.078, 97.14]                14
Name: Divorce_float, dtype: int64

Where we see that each bucket now contains approximately the same number of countries.

Hence, we will use qcut to distribute the colors on our map.

Step 3: Putting it all together on the map

If you are new to folium and how to make leaflet maps with ease, I recommend reading this tutorial, or inspecting the code below.

import pandas as pd
import folium
import geopandas
import numpy as np

# The URL we will read our data from
url = 'https://en.wikipedia.org/wiki/Divorce_demography'
# read_html returns a list of tables from the URL
tables = pd.read_html(url)

# The data is in the first table
table = tables[0]
# Drop level 0 of the column MultiIndex to get simple column names
table.columns = table.columns.droplevel(0)

# Clean the country names: keep only the part before any ' (' suffix
table['Country'] = table.apply(lambda row: row['Country/region'].split(' (')[0] if type(row['Country/region']) == str else row['Country/region'], axis=1)

# Read the geopandas dataset
world = geopandas.read_file(geopandas.datasets.get_path('naturalearth_lowres'))
# Rename 'United States of America' to 'United States' to match the naming in the table
world = world.replace('United States of America', 'United States')

# Merge the two DataFrames together
table = world.merge(table, how="left", left_on=['name'], right_on=['Country'])

def is_float(value):
    try:
        float(value)
        return True
    except (ValueError, TypeError):
        return False

# We need to convert the data to floats
index = 'Divorce_float'
table[index] = table.apply(lambda row: float(row['Percent']) if is_float(row['Percent']) else np.nan, axis=1)

# Clean data: remove rows with no data
table = table.dropna(subset=[index])

# We have 10 colors available, resulting in 9 cuts.
bins_data = pd.qcut(table[index], 9).value_counts(sort=False)
print(bins_data)

# Build the threshold scale: start at 0, use the rounded right edge of each
# bin, and extend the final edge to 100 to cover the maximum value
bins = [0]
for i in range(9):
    bins.append(int(round(bins_data.index.values[i].right)))
bins[9] = 100

# Create a map
my_map = folium.Map()

# Add the data
folium.Choropleth(
    geo_data=table,
    name='choropleth',
    data=table,
    columns=['Country', index],
    key_on='feature.properties.name',
    fill_color='OrRd',
    fill_opacity=0.7,
    line_opacity=0.2,
    legend_name=index,
    threshold_scale=bins
).add_to(my_map)
my_map.save('divorce_rates.html')

Where we combine the two DataFrames and take advantage of the fact that we have 10 colors available.
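
As a side note, qcut can hand back the bin edges directly through its retbins argument, which would shorten the loop above. A sketch, assuming the table and index variables from the code above.

import pandas as pd

# retbins=True returns the 10 bin edges alongside the categories
_, edges = pd.qcut(table[index], 9, retbins=True)
# Keep the 8 inner edges and pin the endpoints to 0 and 100 as before
bins = [0] + [int(round(e)) for e in edges[1:-1]] + [100]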

It should result in a map like this one.

Final output

Plot World Data to Map Using Python in 3 Easy Steps

What will we cover in this tutorial?

  • As an example we will use an HTML table from a Wikipedia page, in this case the one listing countries by meat consumption.
  • We will see how to read the table data into a Pandas DataFrame with a single call.
  • Then how to merge it with a geopandas DataFrame containing the country geometry.
  • Finally, how to add the colors to a leaflet map using the folium library.

Step 1: Read the data to a Pandas DataFrame

We need to inspect the page we are going to parse. In this case it is the list of countries by meat consumption from wikipedia.

From wikipedia.

What we want to do is to gather the data from the table and plot it to a world map using colors to indicate the meat consumption.

End result

The easiest way to work with data is by using pandas DataFrames. The Pandas library has a read_html function, which returns all tables from a webpage.

This can be achieved by the following code. If you use read_html for the first time, you will need to install lxml; see this tutorial for details.

import pandas as pd

# The URL we will read our data from
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_meat_consumption'
# read_html returns a list of tables from the URL
tables = pd.read_html(url)

# The data is in the first table - this changes from time to time - wikipedia is updated all the time.
table = tables[0]

print(table.head())

Resulting in the following output.

               Country  Kg/person (2002)[9][note 1] Kg/person (2009)[10]
0              Albania                         38.2                  NaN
1              Algeria                         18.3                 19.5
2       American Samoa                         24.9                 26.8
3               Angola                         19.0                 22.4
4  Antigua and Barbuda                         56.0                 84.3

Step 2: Merge the data onto a world map

The next thing we want to do is to map it onto a world map that we can color.

This can be done by using geopandas.

import pandas as pd
import geopandas


# The URL we will read our data from
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_meat_consumption'
# read_html returns a list of tables from the URL
tables = pd.read_html(url)

# The data is in the first table - this changes from time to time - wikipedia is updated all the time.
table = tables[0]

print(table.head())

# Read the geopandas dataset
world = geopandas.read_file(geopandas.datasets.get_path('naturalearth_lowres'))

print(world.head())

Which results in the following output.

               Country  Kg/person (2002)[9][note 1] Kg/person (2009)[10]
0              Albania                         38.2                  NaN
1              Algeria                         18.3                 19.5
2       American Samoa                         24.9                 26.8
3               Angola                         19.0                 22.4
4  Antigua and Barbuda                         56.0                 84.3
     pop_est      continent                      name iso_a3  gdp_md_est                                           geometry
0     920938        Oceania                      Fiji    FJI      8374.0  MULTIPOLYGON (((180.00000 -16.06713, 180.00000...
1   53950935         Africa                  Tanzania    TZA    150600.0  POLYGON ((33.90371 -0.95000, 34.07262 -1.05982...
2     603253         Africa                 W. Sahara    ESH       906.5  POLYGON ((-8.66559 27.65643, -8.66512 27.58948...
3   35623680  North America                    Canada    CAN   1674000.0  MULTIPOLYGON (((-122.84000 49.00000, -122.9742...
4  326625791  North America  United States of America    USA  18560000.0  MULTIPOLYGON (((-122.84000 49.00000, -120.0000...

Where we can see that the Country column of the table DataFrame should be merged with the name column in the world DataFrame.

Let’s do the merge on that.

import pandas as pd
import geopandas


# The URL we will read our data from
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_meat_consumption'
# read_html returns a list of tables from the URL
tables = pd.read_html(url)

# The data is in the first table - this changes from time to time - wikipedia is updated all the time.
table = tables[0]

# Read the geopandas dataset
world = geopandas.read_file(geopandas.datasets.get_path('naturalearth_lowres'))

# Merge the two DataFrames together
table = world.merge(table, how="left", left_on=['name'], right_on=['Country'])

print(table.head())

Which results in the following output.

     pop_est      continent  ... kg/person (2009)[10] kg/person (2017)[11]
0     920938        Oceania  ...                 38.8                  NaN
1   53950935         Africa  ...                  9.6                 6.82
2     603253         Africa  ...                  NaN                  NaN
3   35623680  North America  ...                 94.3                69.99
4  326625791  North America  ...                120.2                98.60

[5 rows x 10 columns]

Where we also notice that some rows have no data from the table, resulting in NaN values. To get a clearer view we will remove those rows.

import pandas as pd
import geopandas


# The URL we will read our data from
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_meat_consumption'
# read_html returns a list of tables from the URL
tables = pd.read_html(url)

# The data is in the first table - this changes from time to time - wikipedia is updated all the time.
table = tables[0]

# Read the geopandas dataset
world = geopandas.read_file(geopandas.datasets.get_path('naturalearth_lowres'))

# Merge the two DataFrames together
table = world.merge(table, how="left", left_on=['name'], right_on=['Country'])

# Clean data: remove rows with no data
# Note: the exact column header (here 'Kg/person (2002)[9][note 1]') changes
# as Wikipedia is edited - check table.columns if this raises a KeyError
table = table.dropna(subset=['Kg/person (2002)[9][note 1]'])

The rows can be removed by using dropna.
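
Be aware that dropna also removes countries whose names did not match between the Wikipedia table and the geopandas dataset. Before dropping, a quick check along these lines (a sketch, assuming the merged table from above) can reveal naming mismatches worth fixing with replace.

# Map rows that found no matching Wikipedia row - some are genuine gaps,
# others are naming mismatches such as 'United States of America'
unmatched = table[table['Country'].isna()]['name']
print(unmatched.tolist())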

Step 3: Add the data by colors on an interactive world map

Finally, you can use folium to create a leaflet map.

import pandas as pd
import folium
import geopandas


# The URL we will read our data from
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_meat_consumption'
# read_html returns a list of tables from the URL
tables = pd.read_html(url)

# The data is in the first table - this changes from time to time - wikipedia is updated all the time.
table = tables[0]

# Read the geopandas dataset
world = geopandas.read_file(geopandas.datasets.get_path('naturalearth_lowres'))

# Merge the two DataFrames together
table = world.merge(table, how="left", left_on=['name'], right_on=['Country'])

# Clean data: remove rows with no data
# Note: the exact column header (here 'Kg/person (2002)[9][note 1]') changes
# as Wikipedia is edited - check table.columns if this raises a KeyError
table = table.dropna(subset=['Kg/person (2002)[9][note 1]'])

# Create a map
my_map = folium.Map()

# Add the data
folium.Choropleth(
    geo_data=table,
    name='choropleth',
    data=table,
    columns=['Country', 'Kg/person (2002)[9][note 1]'],
    key_on='feature.properties.name',
    fill_color='OrRd',
    fill_opacity=0.7,
    line_opacity=0.2,
    legend_name='Meat consumption in kg/person'
).add_to(my_map)
my_map.save('meat.html')

Resulting in an HTML webpage like this one.
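
Since the Choropleth layer was given a name, a layer control can optionally be added as well, letting the viewer toggle the layer on and off.

# Optional: a toggle for the named choropleth layer (add before saving)
folium.LayerControl().add_to(my_map)
my_map.save('meat.html')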