Pandas: Explore Datasets by Visualization – Exploring the Holland Code (RIASEC) Test – Part II

What will we cover in this tutorial?

We will continue our journey to explore a big dataset of 145,000+ respondents to a RIASEC test. If you want to explore the full journey, we recommend you read this tutorial first.

In this tutorial we will find some data points that are not correct and look at a potential way to deal with them.

Step 1: Explore the family sizes from the respondents

In the first tutorial we looked at how the respondents were distributed around the world. Surprisingly, most countries were represented.

From previous tutorial.

In this tutorial we will explore the dataset further. The dataset is available here.

import pandas as pd

# Only to get a broader summary
pd.set_option('display.max_rows', 300)
pd.set_option('display.max_columns', 30)
pd.set_option('display.width', 1000)


data = pd.read_csv('riasec.csv', delimiter='\t', low_memory=False)
print(data)

Which will output the following.

        R1  R2  R3  R4  R5  R6  R7  R8  I1  I2  I3  I4  I5  I6  I7  ...  gender  engnat  age  hand  religion  orientation  race  voted  married  familysize  uniqueNetworkLocation  country  source                major  Unnamed: 93
0        3   4   3   1   1   4   1   3   5   5   4   3   4   5   4  ...       1       1   14     1         7            1     1      2        1           1                      1       US       2                  NaN          NaN
1        1   1   2   4   1   2   2   1   5   5   5   4   4   4   4  ...       1       1   29     1         7            3     4      1        2           3                      1       US       1              Nursing          NaN
2        2   1   1   1   1   1   1   1   4   1   1   1   1   1   1  ...       2       1   23     1         7            1     4      2        1           1                      1       US       1                  NaN          NaN
3        3   1   1   2   2   2   2   2   4   1   2   4   3   2   3  ...       2       2   17     1         0            1     1      2        1           1                      1       CN       0                  NaN          NaN
4        4   1   1   2   1   1   1   2   5   5   5   3   5   5   5  ...       2       2   18     1         4            3     1      2        1           4                      1       PH       0            education          NaN

Scrolling through the columns with the slider, I got curious about how family sizes vary around the world. This dataset obviously does not provide conclusive data on the subject, but it could be interesting to see whether there is any connection between where you are located in the world and family size.

Step 2: Explore the distribution of family sizes

What often happens in a dataset is that some of the data is inaccurate.

To get a feel for the data in the familysize column, you can explore it by running this.

import pandas as pd


data = pd.read_csv('riasec.csv', delimiter='\t', low_memory=False)

print(data['familysize'].describe())
print(pd.cut(data['familysize'], bins=[0,1,2,3,4,5,6,7,10,100, 1000000000]).value_counts())

Resulting in the following from the describe output.

count    1.458280e+05
mean     1.255801e+05
std      1.612271e+07
min      0.000000e+00
25%      2.000000e+00
50%      3.000000e+00
75%      3.000000e+00
max      2.147484e+09
Name: familysize, dtype: float64

The mean family size is 125,580. Well, maybe we don't all count family size the same way, but something is clearly wrong there.
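This is a textbook example of how a few extreme values drag the mean far away from the typical value, while the median stays robust. A minimal sketch with made-up numbers illustrates the effect.

import pandas as pd

# Hypothetical family sizes: nine plausible values and one bogus outlier
sizes = pd.Series([2, 3, 3, 4, 2, 5, 3, 4, 2, 2147483647])

print(sizes.mean())    # 214748367.5 - completely dominated by the outlier
print(sizes.median())  # 3.0 - unaffected by the outlier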

Grouping the data into bins (using the cut function combined with value_counts) you get this output.

(1, 2]               51664
(2, 3]               38653
(3, 4]               18729
(0, 1]               15901
(4, 5]                8265
(5, 6]                3932
(6, 7]                1928
(7, 10]               1904
(10, 100]              520
(100, 1000000000]       23
Name: familysize, dtype: int64

This shows 23 family sizes between 100 and 1,000,000,000 (values above the last bin edge fall outside the bins and are not counted). Let's investigate all the sizes above 100.

print(data[data['familysize'] > 100]['familysize'])

Giving us this output.

1212      2147483647
3114      2147483647
5770      2147483647
8524             104
9701             103
21255     2147483647
24003            999
26247     2147483647
27782     2147483647
31451           9999
39294           9045
39298          84579
49033            900
54592            232
58773     2147483647
74745      999999999
78643            123
92457            999
95916            908
102680           666
109429           989
111488       9234785
120489          5000
120505     123456789
122580          5000
137141           394
139226          3425
140377           934
142870    2147483647
145686           377
145706           666
Name: familysize, dtype: int64

The integer 2147483647 is interesting, as it is the maximum value of a signed 32-bit integer. I think it is safe to say that most of the family sizes given above 100 are not realistic.
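You can verify this value directly in Python: 2147483647 is exactly 2^31 - 1, the largest value a signed 32-bit integer can hold, which suggests an overflow or a deliberately bogus answer rather than a real family size.

import numpy as np

print(2**31 - 1)               # 2147483647
print(np.iinfo(np.int32).max)  # 2147483647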

Step 3: Clean the data

You need to make a decision about these data points, which skew your data badly.

Say you decide to visualize it without any adjustment; it would give a misleading picture.

Iceland? What’s up?

It seems like Iceland has a tradition for big families.

Let’s investigate that.

print(data[data['country'] == 'IS']['familysize'])

Interestingly it gives only one line, which does not seem correct.

74745     999999999

But as there are only a few respondents from Iceland, this single outlier makes its average the highest.

To clean the data fully, we can decide that family sizes of 10 or more are not correct. I know that might be set a bit low, and you can choose a different threshold.

Cleaning the data is simple.

data = data[data['familysize'] < 10]

Magic, right? You simply write a boolean condition, which is evaluated element-wise (vectorized), and keep only the rows that fulfill it.
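If boolean indexing is new to you, here is a minimal sketch on a made-up DataFrame showing what the condition does under the hood.

import pandas as pd

df = pd.DataFrame({'familysize': [2, 5, 999, 3]})

mask = df['familysize'] < 10  # a boolean Series: True, True, False, True
print(mask)
print(df[mask])               # keeps only the rows where the mask is True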

Step 4: Visualize the data

We will use geopandas, matplotlib and pycountry to visualize it. The process is similar to the one in the previous tutorial, where you can find more details.

import geopandas
import pandas as pd
import matplotlib.pyplot as plt
import pycountry

# Helper function to map country names to alpha_3 representation - though some are not known by library
def lookup_country_code(country):
    try:
        return pycountry.countries.lookup(country).alpha_3
    except LookupError:
        return country


data = pd.read_csv('riasec.csv', delimiter='\t', low_memory=False)


data['alpha3'] = data.apply(lambda row: lookup_country_code(row['country']), axis=1)
data = data[data['familysize'] < 10]

# Compute the mean of the numeric columns (including familysize) per country
country_mean = data.groupby(['alpha3']).mean(numeric_only=True)

world = geopandas.read_file(geopandas.datasets.get_path("naturalearth_lowres"))
world_map = world.merge(country_mean, how='left', left_on=['iso_a3'], right_on=['alpha3'])
world_map.plot('familysize', figsize=(12,4), legend=True)
plt.show()

Resulting in the following output.

Family sizes of the respondents

Looks like the one-child policy in China shows in the data? Again, do not draw any conclusions from this data, as it is very narrow in this respect.
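If you are curious which countries end up with the highest averages after cleaning, you can sort the grouped means before plotting. A small sketch, assuming data has been cleaned and country_mean computed as in the code above:

print(country_mean['familysize'].sort_values(ascending=False).head(10))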

Read the next part here:

Master Data Correlation with Pandas DataFrame in 3 Easy Steps

What will we cover in this tutorial?

  • How to get data using Pandas DataFrames.
  • Clean the data and merge it together.
  • Finally, how to see if there is any correlation between data columns.

Step 1: Get the data you want to correlate

As an example, let’s assume you get the idea that there might be a correlation between GDP per capita, Social Progress Index (SPI), and Human Development Index (HDI), but you are not sure whether SPI or HDI is most closely correlated to GDP per capita.

Luckily, you have pandas to the rescue.

As the data is spread over three pages, you need to collect it separately and merge it later. First, let us collect the data and inspect it.

The GDP per capita is located in a table on Wikipedia, shown in the picture below.

From wikipedia.org

There are actually three tables. We will use the World Bank table in our example. It can be collected with a call to pandas read_html. If you are new to read_html, we recommend you read this tutorial.

import pandas as pd

# The URL we will read our data from
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)_per_capita'
# read_html returns a list of tables from the URL
tables = pd.read_html(url)

table = tables[3]
print(table)

Which will give an output similar to this.

    Rank                       Country/Territory     US$
0      1                           Monaco (2018)  185741
1      2                    Liechtenstein (2017)  173356
2      3                              Luxembourg  114705
3      —                                   Macau   84096
4      4                             Switzerland   81994
5      5                                 Ireland   78661
6      6                                  Norway   75420
7      7                                 Iceland   66945

The next table we need is the Social Progress Index (SPI), which looks like the picture below.

From wikipedia.org

This can be collected to a DataFrame with the following code.

import pandas as pd

url = 'https://en.wikipedia.org/wiki/Social_Progress_Index'
tables = pd.read_html(url)
print(tables[1])

Which will print the following to the screen (or the top of it).

                          Country 2019[9]               
                          Country    Rank  Score Score.1
0                          Norway       1  90.95     NaN
1                         Denmark       2  90.09     NaN
2                     Switzerland       3  89.89     NaN
3                         Finland       4  89.56     NaN
4                          Sweden       5  89.45     NaN

Finally we need to read the Human Development Index (HDI), shown on Wikipedia in the following picture.

From wikipedia.org

And can be collected with the following code.

import pandas as pd

url = 'https://en.wikipedia.org/wiki/List_of_countries_by_Human_Development_Index'
tables = pd.read_html(url)

print(tables[1])

Resulting in the following output.

                           Rank                                             Country or Territory                         HDI                                                    Unnamed: 5_level_0
    2018 data (2019 report)​[5] Change from previous year​[5]               Country or Territory 2018 data (2019 report)​[5] Average annual HDI growth (2010–2018)​[6]          Unnamed: 5_level_1
    Very high human development   Very high human development        Very high human development Very high human development               Very high human development Very high human development
0                             1                           NaN                             Norway                       0.954                                     0.16%                         NaN
1                             2                           NaN                        Switzerland                       0.946                                     0.18%                         NaN
2                             3                           NaN                            Ireland                       0.942                                     0.71%                         NaN
3                             4                           NaN                            Germany                       0.939                                     0.25%                         NaN
4                             4                           (2)                          Hong Kong                       0.939                                     0.51%                         NaN
5                             6                           (1)                          Australia                       0.938                                     0.17%                         NaN

A somewhat messier table.

Now we have gathered all the data we need to clean it up and merge it together.

Step 2: Clean and merge the data into one DataFrame

If we first inspect the data from the GDP per capita.

    Rank                       Country/Territory     US$
0      1                           Monaco (2018)  185741
1      2                    Liechtenstein (2017)  173356
2      3                              Luxembourg  114705
3      —                                   Macau   84096
4      4                             Switzerland   81994
5      5                                 Ireland   78661
6      6                                  Norway   75420
7      7                                 Iceland   66945

Notice that Country/Territory can have a year in parentheses, which will make it hard to merge. We need to clean that up. Also, we do not need the Rank column.
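The cleaning itself is just a string split on ' (' where we keep the first part; a quick sketch of the idea on single values:

print('Monaco (2018)'.split(' (')[0])  # Monaco
print('Luxembourg'.split(' (')[0])     # Luxembourg - unchanged when there is no parenthesis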

If we inspect the SPI data.

                          Country 2019[9]               
                          Country    Rank  Score Score.1
0                          Norway       1  90.95     NaN
1                         Denmark       2  90.09     NaN
2                     Switzerland       3  89.89     NaN
3                         Finland       4  89.56     NaN
4                          Sweden       5  89.45     NaN

Here we notice that the columns have an extra header level, which we can drop. Further, we do not need the Rank and Score.1 columns.
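The extra header level is a pandas MultiIndex on the columns; dropping the top level leaves the usual flat column names. A minimal sketch with made-up values:

import pandas as pd

columns = pd.MultiIndex.from_tuples([('Country 2019[9]', 'Country'), ('Country 2019[9]', 'Score')])
df = pd.DataFrame([['Norway', 90.95], ['Denmark', 90.09]], columns=columns)

print(df.columns)  # MultiIndex with two levels
df.columns = df.columns.droplevel(0)
print(df.columns)  # Index(['Country', 'Score'], dtype='object')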

Let’s try to merge it together. Notice that we use a lambda function to clean up the Country/Territory names. If you are new to lambda functions, we recommend you read this tutorial.

import pandas as pd

# The URL we will read our data from
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)_per_capita'
# read_html returns a list of tables from the URL
tables = pd.read_html(url)

# The data is in table 3
table = tables[3]

# We need to clean the years in parenthesis from the country/territory field
table['Country'] = table.apply(lambda row: row['Country/Territory'].split(' (')[0], axis=1)
# We do not need the Rank and Country/Territory for more data
table = table.drop(['Rank', 'Country/Territory'], axis=1)

url = 'https://en.wikipedia.org/wiki/Social_Progress_Index'
tables = pd.read_html(url)

merge_table = tables[1]
# The first level of the table can be dropped
merge_table.columns = merge_table.columns.droplevel(0)
# We do not need the Rank and Score.1 columns
merge_table = merge_table.drop(['Rank', 'Score.1'], axis=1)
# Need to rename the second column to SPI = Social Progress Index
merge_table.columns = ['Country', 'SPI']

# Ready to merge the tables
table = table.merge(merge_table, how="left", left_on=['Country'], right_on=['Country'])

print(table)

Which will result in an output like this.

        US$                            Country    SPI
0    185741                             Monaco    NaN
1    173356                      Liechtenstein    NaN
2    114705                         Luxembourg  87.66
3     84096                              Macau    NaN
4     81994                        Switzerland  89.89
5     78661                            Ireland  87.97
6     75420                             Norway  90.95

First validate that Monaco, Liechtenstein, and Macau do not have any SPI value. That seems to be correct.
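You can check this programmatically instead of by eye, for example by listing the countries where SPI is missing. A small sketch, assuming the merged table from above:

print(table[table['SPI'].isna()]['Country'].head(10))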

Then we can proceed to the next table of HDI. Let us first inspect the data.

                           Rank                                             Country or Territory                         HDI                                                    Unnamed: 5_level_0
    2018 data (2019 report)​[5] Change from previous year​[5]               Country or Territory 2018 data (2019 report)​[5] Average annual HDI growth (2010–2018)​[6]          Unnamed: 5_level_1
    Very high human development   Very high human development        Very high human development Very high human development               Very high human development Very high human development
0                             1                           NaN                             Norway                       0.954                                     0.16%                         NaN
1                             2                           NaN                        Switzerland                       0.946                                     0.18%                         NaN
2                             3                           NaN                            Ireland                       0.942                                     0.71%                         NaN
3                             4                           NaN                            Germany                       0.939                                     0.25%                         NaN
4                             4                           (2)                          Hong Kong                       0.939                                     0.51%                         NaN
5                             6                           (1)                          Australia                       0.938                                     0.17%                         NaN

It has quite messy column naming in three levels. Dropping the extra levels makes some column names identical, so we rename the columns and delete the ones we do not need.

import pandas as pd

# The URL we will read our data from
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)_per_capita'
# read_html returns a list of tables from the URL
tables = pd.read_html(url)

# The data is in table 3
table = tables[3]

# We need to clean the years in parenthesis from the country/territory field
table['Country'] = table.apply(lambda row: row['Country/Territory'].split(' (')[0], axis=1)
# We do not need the Rank and Country/Territory for more data
table = table.drop(['Rank', 'Country/Territory'], axis=1)

url = 'https://en.wikipedia.org/wiki/Social_Progress_Index'
tables = pd.read_html(url)

merge_table = tables[1]
# The first level of the table can be dropped
merge_table.columns = merge_table.columns.droplevel(0)
# We do not need the Rank and Score.1 columns
merge_table = merge_table.drop(['Rank', 'Score.1'], axis=1)
# Need to rename the second column to SPI = Social Progress Index
merge_table.columns = ['Country', 'SPI']

# Ready to merge the tables
table = table.merge(merge_table, how="left", left_on=['Country'], right_on=['Country'])

url = 'https://en.wikipedia.org/wiki/List_of_countries_by_Human_Development_Index'
tables = pd.read_html(url)


merge_table = tables[1]
# Delete the additional column levels
merge_table.columns = merge_table.columns.droplevel(1)
merge_table.columns = merge_table.columns.droplevel(1)
# Rename the columns
merge_table.columns = ['Rank1', 'Rank2', 'Country', 'HDI', 'HDI-1', 'None']
# Delete the columns we do not need
merge_table = merge_table.drop(['Rank1', 'Rank2', 'HDI-1', 'None'], axis=1)
# Merge the tables
table = table.merge(merge_table, how="left", left_on=['Country'], right_on=['Country'])

print(table)

Which will result in the following output (or the top of it).

        US$                            Country    SPI    HDI
0    185741                             Monaco    NaN    NaN
1    173356                      Liechtenstein    NaN  0.917
2    114705                         Luxembourg  87.66  0.909
3     84096                              Macau    NaN    NaN
4     81994                        Switzerland  89.89  0.946
5     78661                            Ireland  87.97  0.942
6     75420                             Norway  90.95  0.954

Notice that here Liechtenstein has HDI data, while Monaco and Macau do not. While it is not visible, the HDI column does not consist of floats. This can be seen with a call to table.dtypes, which will output the following.

US$          int64
Country     object
SPI        float64
HDI         object
dtype: object

Which states that HDI is object, which in this case means strings. That means we need to convert it to float before our final correlation computation. This can be done with a lambda function.

table['HDI'] = table.apply(lambda row: float(row['HDI']) if pd.notna(row['HDI']) else np.nan, axis=1)  # HDI = Human Development Index
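Alternatively, pandas has a built-in conversion for this: pd.to_numeric with errors='coerce' turns anything unparseable into NaN, which achieves the same in a single vectorized call.

# Equivalent conversion without a lambda function
table['HDI'] = pd.to_numeric(table['HDI'], errors='coerce')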

This makes the data ready to see if there is any correlation between GDP per capita and SPI and/or HDI.

Step 3: Calculate the correlations

This is where pandas DataFrames shine. They can do the entire work for you with one call to corr().

The full code is given below.

import pandas as pd
import numpy as np


pd.set_option('display.max_rows', 300)
pd.set_option('display.max_columns', 10)
pd.set_option('display.width', 1000)


# The URL we will read our data from
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)_per_capita'
# read_html returns a list of tables from the URL
tables = pd.read_html(url)

# The data is in table 3
table = tables[3]

# We need to clean the years in parenthesis from the country/territory field
table['Country'] = table.apply(lambda row: row['Country/Territory'].split(' (')[0], axis=1)
# We do not need the Rank and Country/Territory for more data
table = table.drop(['Rank', 'Country/Territory'], axis=1)

url = 'https://en.wikipedia.org/wiki/Social_Progress_Index'
tables = pd.read_html(url)

merge_table = tables[1]
# The first level of the table can be dropped
merge_table.columns = merge_table.columns.droplevel(0)
# We do not need the Rank and Score.1 columns
merge_table = merge_table.drop(['Rank', 'Score.1'], axis=1)
# Need to rename the second column to SPI = Social Progress Index
merge_table.columns = ['Country', 'SPI']

# Ready to merge the tables
table = table.merge(merge_table, how="left", left_on=['Country'], right_on=['Country'])

url = 'https://en.wikipedia.org/wiki/List_of_countries_by_Human_Development_Index'
tables = pd.read_html(url)


merge_table = tables[1]
# Delete the additional column levels
merge_table.columns = merge_table.columns.droplevel(1)
merge_table.columns = merge_table.columns.droplevel(1)
# Rename the columns
merge_table.columns = ['Rank1', 'Rank2', 'Country', 'HDI', 'HDI-1', 'None']
# Delete the columns we do not need
merge_table = merge_table.drop(['Rank1', 'Rank2', 'HDI-1', 'None'], axis=1)
# Merge the tables
table = table.merge(merge_table, how="left", left_on=['Country'], right_on=['Country'])

# Convert to floats
table['HDI'] = table.apply(lambda row: float(row['HDI']) if pd.notna(row['HDI']) else np.nan, axis=1)  # HDI = Human Development Index

# Calculate the correlation
table_corr = table.corr(numeric_only=True)

# Print the correlation to GDP per capita (stored in US$).
print(table_corr['US$'].sort_values(ascending=False))

Which will result in the following output.

US$    1.000000
SPI    0.713946
HDI    0.663183
Name: US$, dtype: float64

Hence, it seems that GDP per capita is more strongly correlated with SPI than with HDI.

Notice that the calculation ignores all Not a Number (NaN) values.
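You can see this NaN handling on a toy example: corr() computes each pairwise correlation from the rows where both columns have values and simply skips the rest.

import pandas as pd
import numpy as np

df = pd.DataFrame({'a': [1.0, 2.0, 3.0, 4.0],
                   'b': [2.0, np.nan, 6.0, 8.0]})

print(df.corr())  # correlation of 1.0, computed from the three complete rows only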

Pandas and Folium: Categorize GDP Growth by Country and Visualize on Map in 3 Easy Steps

What will we cover in this tutorial?

  • We will gather data from the wikipedia.org page List of countries by past and projected GDP using pandas.
  • The first step will be to get the data and merge the correct tables together.
  • The next step is to use Machine Learning with a linear regression model to estimate the growth of each country's GDP.
  • The final step is to visualize the growth rates on a leaflet map using folium.

Step 1: Get the data and merge it

The data is available on wikipedia on List of countries by past and projected GDP. We will focus on data from 1990 to 2019.

At first glance at the page, you notice that the data is not gathered in one table.

From wikipedia.org

The first task will be to merge the three tables with the data from 1990-1999, 2000-2009, and 2010-2019.

The data can be collected with the pandas read_html function. If you are new to this, you can read this tutorial.

import pandas as pd

# The URL we will read our data from
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_past_and_projected_GDP_(nominal)'
# read_html returns a list of tables from the URL
tables = pd.read_html(url)

# Merge the tables into one table
merge_index = 'Country (or dependent territory)'
table = tables[9].merge(tables[12], how="left", left_on=[merge_index], right_on=[merge_index])
table = table.merge(tables[15], how="left", left_on=[merge_index], right_on=[merge_index])

print(table)

The call to read_html returns all the tables in a list. By inspecting the results you will notice that we are interested in tables 9, 12 and 15, which we merge. The output of the above will be.

     Country (or dependent territory)       1990       1991       1992       1993       1994       1995       1996       1997       1998       1999        2000        2001        2002        2003        2004        2005        2006        2007        2008        2009        2010        2011        2012        2013        2014        2015        2016        2017        2018        2019
0                         Afghanistan        NaN        NaN        NaN        NaN        NaN        NaN        NaN        NaN        NaN        NaN         NaN         NaN      4367.0      4514.0      5146.0      6167.0      6925.0      8556.0     10297.0     12066.0     15325.0     17890.0     20296.0     20170.0     20352.0     19687.0     19454.0     20235.0     19585.0     19990.0
1                             Albania     2221.0     1333.0      843.0     1461.0     2361.0     2882.0     3200.0     2259.0     2560.0     3209.0      3483.0      3928.0      4348.0      5611.0      7185.0      8052.0      8905.0     10675.0     12901.0     12093.0     11938.0     12896.0     12323.0     12784.0     13238.0     11393.0     11865.0     13055.0     15202.0     15960.0
2                             Algeria    61892.0    46670.0    49217.0    50963.0    42426.0    42066.0    46941.0    48178.0    48188.0    48845.0     54749.0     54745.0     56761.0     67864.0     85327.0    103198.0    117027.0    134977.0    171001.0    137054.0    161207.0    199394.0    209005.0    209703.0    213518.0    164779.0    159049.0    167555.0    180441.0    183687.0
3                              Angola    11236.0    10891.0     8398.0     6095.0     4438.0     5539.0     6535.0     7675.0     6506.0     6153.0      9130.0      8936.0     12497.0     14189.0     19641.0     28234.0     41789.0     60449.0     84178.0     75492.0     82471.0    104116.0    115342.0    124912.0    126777.0    102962.0     95337.0    122124.0    107316.0     92191.0
4                 Antigua and Barbuda      459.0      482.0      499.0      535.0      589.0      577.0      634.0      681.0      728.0      766.0       825.0       796.0       810.0       850.0       912.0      1013.0      1147.0      1299.0      1358.0      1216.0      1146.0      1140.0      1214.0      1194.0      1273.0      1353.0      1460.0      1516.0      1626.0      1717.0
5                           Argentina   153205.0   205515.0   247987.0   256365.0   279150.0   280080.0   295120.0   317549.0   324242.0   307673.0    308491.0    291738.0    108731.0    138151.0    164922.0    199273.0    232892.0    287920.0    363545.0    334633.0    424728.0    527644.0    579666.0    611471.0    563614.0    631621.0    554107.0    642928.0    518092.0    477743.0
6                             Armenia        NaN        NaN      108.0      835.0      648.0     1287.0     1597.0     1639.0     1892.0     1845.0      1912.0      2118.0      2376.0      2807.0      3577.0      4900.0      6384.0      9206.0     11662.0      8648.0      9260.0     10142.0     10619.0     11121.0     11610.0     10529.0     10572.0     11537.0     12411.0     13105.0

Step 2: Use linear regression to estimate the growth over the last 30 years

In this section we will use Linear regression from the scikit-learn library, which is a simple prediction tool.

If you are new to Machine Learning we recommend you read this tutorial on Linear regression.

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

import numpy as np

# The URL we will read our data from
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_past_and_projected_GDP_(nominal)'
# read_html returns a list of tables from the URL
tables = pd.read_html(url)

# Merge the tables into one table
merge_index = 'Country (or dependent territory)'
table = tables[9].merge(tables[12], how="left", left_on=[merge_index], right_on=[merge_index])
table = table.merge(tables[15], how="left", left_on=[merge_index], right_on=[merge_index])

row = table.iloc[1]
X = table.columns[1:].to_numpy().reshape(-1, 1)
X = X.astype(int)
Y = 1 + row.iloc[1:].pct_change()
Y = Y.cumprod().fillna(1.0).to_numpy()
Y = Y.reshape(-1, 1)

regr = LinearRegression()
regr.fit(X, Y)

Y_pred = regr.predict(X)

plt.scatter(X, Y)
plt.plot(X, Y_pred, color='red')
plt.show()

Which will result in the following plot.

Linear regression model applied on data from wikipedia.org

Which shows that the model approximates a line through the 30 years of data to estimate the growth of the country’s GDP.

Notice that we use the cumulative product (cumprod) of pct_change to be able to compare countries. If we used the raw GDP figures directly, it would not be possible to compare countries of very different sizes.
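A small sketch on made-up numbers shows what the transformation does: each country's series is turned into an index that starts at 1.0, so a value of 2.0 means the GDP has doubled relative to the first year, regardless of the country's absolute size.

import pandas as pd

gdp = pd.Series([2000.0, 2200.0, 2420.0])  # hypothetical GDP figures

growth_index = (1 + gdp.pct_change()).cumprod().fillna(1.0)
print(growth_index)  # 1.00, 1.10, 1.21 - relative to the first year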

We will do that for all countries to get a view of the growth. We are using the coefficient of the line, which indicates the growth rate.

import pandas as pd
from sklearn.linear_model import LinearRegression
import numpy as np

# The URL we will read our data from
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_past_and_projected_GDP_(nominal)'
# read_html returns a list of tables from the URL
tables = pd.read_html(url)

# Merge the tables into one table
merge_index = 'Country (or dependent territory)'
table = tables[9].merge(tables[12], how="left", left_on=[merge_index], right_on=[merge_index])
table = table.merge(tables[15], how="left", left_on=[merge_index], right_on=[merge_index])

coef = []
countries = []

for index, row in table.iterrows():
    X = table.columns[1:].to_numpy().reshape(-1, 1)
    X = X.astype(int)
    Y = 1 + row.iloc[1:].pct_change()
    Y = Y.cumprod().fillna(1.0).to_numpy()
    Y = Y.reshape(-1, 1)

    regr = LinearRegression()
    regr.fit(X, Y)

    coef.append(regr.coef_[0][0])
    countries.append(row[merge_index])

data = pd.DataFrame(list(zip(countries, coef)), columns=['Country', 'Coef'])

print(data)

Which results in the following output (or the first few lines).

                              Country      Coef
0                         Afghanistan  0.161847
1                             Albania  0.243493
2                             Algeria  0.103907
3                              Angola  0.423919
4                 Antigua and Barbuda  0.087863
5                           Argentina  0.090837
6                             Armenia  4.699598

Step 3: Merge the data to a leaflet map using folium

The last step is to merge the data together with the leaflet map using the folium library. If you are new to folium we recommend you read this tutorial.

import pandas as pd
import folium
import geopandas
from sklearn.linear_model import LinearRegression
import numpy as np

# The URL we will read our data from
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_past_and_projected_GDP_(nominal)'
# read_html returns a list of tables from the URL
tables = pd.read_html(url)

# Merge the tables into one table
merge_index = 'Country (or dependent territory)'
table = tables[9].merge(tables[12], how="left", left_on=[merge_index], right_on=[merge_index])
table = table.merge(tables[15], how="left", left_on=[merge_index], right_on=[merge_index])

coef = []
countries = []

for index, row in table.iterrows():
    X = table.columns[1:].to_numpy().reshape(-1, 1)
    X = X.astype(int)
    Y = 1 + row.iloc[1:].pct_change()
    Y = Y.cumprod().fillna(1.0).to_numpy()
    Y = Y.reshape(-1, 1)

    regr = LinearRegression()
    regr.fit(X, Y)

    coef.append(regr.coef_[0][0])
    countries.append(row[merge_index])

data = pd.DataFrame(list(zip(countries, coef)), columns=['Country', 'Coef'])

# Read the geopandas dataset
world = geopandas.read_file(geopandas.datasets.get_path('naturalearth_lowres'))
# Replace United States of America to United States to fit the naming in the table
world = world.replace('United States of America', 'United States')

# Merge the two DataFrames together
table = world.merge(data, how="left", left_on=['name'], right_on=['Country'])


# Clean data: remove rows with no data
table = table.dropna(subset=['Coef'])

# The threshold scale below has 10 values (9 intervals), so we cut the coefficients into 9 quantile groups.
table['Cat'] = pd.qcut(table['Coef'], 9, labels=[0, 1, 2, 3, 4, 5, 6, 7, 8])

print(table)

# Create a map
my_map = folium.Map()

# Add the data
folium.Choropleth(
    geo_data=table,
    name='choropleth',
    data=table,
    columns=['Country', 'Cat'],
    key_on='feature.properties.name',
    fill_color='YlGn',
    fill_opacity=0.7,
    line_opacity=0.2,
    legend_name='Growth of GDP since 1990',
    threshold_scale=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
).add_to(my_map)
my_map.save('gdp_growth.html')

There is a twist in the way it is done. Instead of putting the raw growth rates on the map, we chose to group them into categories. The reason is that otherwise most countries would be squeezed into a small segment of the color scale.

Here we have used qcut to place them in equally sized groups (quantiles).
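The difference between cut and qcut matters here: cut splits the value range into equal-width intervals, while qcut splits at quantiles, so each bin gets roughly the same number of countries. A minimal sketch on skewed made-up values:

import pandas as pd

values = pd.Series([0.1, 0.12, 0.15, 0.2, 0.25, 0.3, 0.4, 1.0, 2.0, 4.7])

print(pd.cut(values, 3).value_counts().sort_index())   # most values land in the first bin
print(pd.qcut(values, 3).value_counts().sort_index())  # roughly equal-sized bins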

This should result in an interactive html page looking something like this.

End result.

Plot World Data to Map Using Python in 3 Easy Steps

What will we cover in this tutorial

  • As an example we will use an HTML table from a Wikipedia page, in this case the one listing countries by meat consumption.
  • We will see how to read the table data into a Pandas DataFrame with a single call.
  • Then how to merge it with a DataFrame containing data to color countries.
  • Finally, how to add the colors to leaflet map using a Python library.

Step 1: Read the data to a Pandas DataFrame

We need to inspect the page we are going to parse. In this case it is the list of countries by meat consumption from Wikipedia.

From wikipedia.

What we want to do is to gather the data from the table and plot it to a world map using colors to indicate the meat consumption.

End result

The easiest way to work with data is by using pandas DataFrames. The Pandas library has a read_html function, which returns all tables from a webpage.

This can be achieved with the following code. If you use read_html for the first time, you will need to install lxml; see this tutorial for details.

import pandas as pd

# The URL we will read our data from
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_meat_consumption'
# read_html returns a list of tables from the URL
tables = pd.read_html(url)

# The data is in the first table - this changes from time to time - wikipedia is updated all the time.
table = tables[0]

print(table.head())

Resulting in the following output.

               Country  Kg/person (2002)[9][note 1] Kg/person (2009)[10]
0              Albania                         38.2                  NaN
1              Algeria                         18.3                 19.5
2       American Samoa                         24.9                 26.8
3               Angola                         19.0                 22.4
4  Antigua and Barbuda                         56.0                 84.3

Step 2: Merging the data to world map

The next thing we want to do is to map it onto a world map that we can color.

This can be done by using geopandas.

import pandas as pd
import geopandas


# The URL we will read our data from
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_meat_consumption'
# read_html returns a list of tables from the URL
tables = pd.read_html(url)

# The data is in the first table - this changes from time to time - wikipedia is updated all the time.
table = tables[0]

print(table.head())

# Read the geopandas dataset
world = geopandas.read_file(geopandas.datasets.get_path('naturalearth_lowres'))

print(world.head())

Which results in the following output.

               Country  Kg/person (2002)[9][note 1] Kg/person (2009)[10]
0              Albania                         38.2                  NaN
1              Algeria                         18.3                 19.5
2       American Samoa                         24.9                 26.8
3               Angola                         19.0                 22.4
4  Antigua and Barbuda                         56.0                 84.3
     pop_est      continent                      name iso_a3  gdp_md_est                                           geometry
0     920938        Oceania                      Fiji    FJI      8374.0  MULTIPOLYGON (((180.00000 -16.06713, 180.00000...
1   53950935         Africa                  Tanzania    TZA    150600.0  POLYGON ((33.90371 -0.95000, 34.07262 -1.05982...
2     603253         Africa                 W. Sahara    ESH       906.5  POLYGON ((-8.66559 27.65643, -8.66512 27.58948...
3   35623680  North America                    Canada    CAN   1674000.0  MULTIPOLYGON (((-122.84000 49.00000, -122.9742...
4  326625791  North America  United States of America    USA  18560000.0  MULTIPOLYGON (((-122.84000 49.00000, -120.0000...

Here we can see that the column Country in the table DataFrame should be merged with the column name in the world DataFrame.

Let’s do the merge on that.

import pandas as pd
import geopandas


# The URL we will read our data from
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_meat_consumption'
# read_html returns a list of tables from the URL
tables = pd.read_html(url)

# The data is in the first table - this changes from time to time - wikipedia is updated all the time.
table = tables[0]

# Read the geopandas dataset
world = geopandas.read_file(geopandas.datasets.get_path('naturalearth_lowres'))

# Merge the two DataFrames together
table = world.merge(table, how="left", left_on=['name'], right_on=['Country'])

print(table.head())

Which results in the following output.

     pop_est      continent  ... kg/person (2009)[10] kg/person (2017)[11]
0     920938        Oceania  ...                 38.8                  NaN
1   53950935         Africa  ...                  9.6                 6.82
2     603253         Africa  ...                  NaN                  NaN
3   35623680  North America  ...                 94.3                69.99
4  326625791  North America  ...                120.2                98.60

[5 rows x 10 columns]

Here we also notice that some rows do not have any data from the table, resulting in NaN values. To get a clearer view we will remove those rows.

import pandas as pd
import geopandas


# The URL we will read our data from
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_meat_consumption'
# read_html returns a list of tables from the URL
tables = pd.read_html(url)

# The data is in the first table - this changes from time to time - wikipedia is updated all the time.
table = tables[0]

# Read the geopandas dataset
world = geopandas.read_file(geopandas.datasets.get_path('naturalearth_lowres'))

# Merge the two DataFrames together
table = world.merge(table, how="left", left_on=['name'], right_on=['Country'])

# Clean data: remove rows with no data
table = table.dropna(subset=['kg/person (2002)[9][note 1]'])

The rows can be removed by using dropna.

Step 3: Add the data by colors on an interactive world map

Finally, you can use folium to create a leaflet map.

import pandas as pd
import folium
import geopandas


# The URL we will read our data from
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_meat_consumption'
# read_html returns a list of tables from the URL
tables = pd.read_html(url)

# The data is in the first table - this changes from time to time - wikipedia is updated all the time.
table = tables[0]

# Read the geopandas dataset
world = geopandas.read_file(geopandas.datasets.get_path('naturalearth_lowres'))

# Merge the two DataFrames together
table = world.merge(table, how="left", left_on=['name'], right_on=['Country'])

# Clean data: remove rows with no data
table = table.dropna(subset=['kg/person (2002)[9][note 1]'])

# Create a map
my_map = folium.Map()

# Add the data
folium.Choropleth(
    geo_data=table,
    name='choropleth',
    data=table,
    columns=['Country', 'kg/person (2002)[9][note 1]'],
    key_on='feature.properties.name',
    fill_color='OrRd',
    fill_opacity=0.7,
    line_opacity=0.2,
    legend_name='Meat consumption in kg/person'
).add_to(my_map)
my_map.save('meat.html')

Resulting in an HTML webpage like this one.