Pandas Correlation Methods Explained: Pearson, Kendall, and Spearman

What will we cover in this tutorial?

In this tutorial we will use a live example to investigate and understand the differences between the three methods for calculating correlation with the Pandas DataFrame corr() function.

The purpose of this tutorial is to get a better understanding of these correlations, while working on real data.

Step 1: Getting some data to play with

The data we want to investigate for correlations is the US GDP, the S&P 500, and Gold and Oil prices. We will only focus on recent times (2000-2020), as the prices for Gold and Oil are not available further back on Yahoo! Finance. We will get the US GDP from the World Bank and the rest from Yahoo! Finance.

We will be using pandas-datareader to retrieve the data. For a more in-depth introduction to how to use it, we refer you to this tutorial.

import pandas as pd
import pandas_datareader as pdf
import datetime as dt
from pandas_datareader import wb


start = dt.datetime(2000, 1, 1)
end = dt.datetime.now()
# Adjusted close prices for the S&P 500 index, Gold futures and Crude Oil futures
tickers = pdf.get_data_yahoo(["^GSPC", "GC=F", "CL=F"], start, end)['Adj Close']

# Yearly US GDP (current US$) from the World Bank
gdp = wb.download(indicator='NY.GDP.MKTP.CD', country='US', start=2000, end=2019)
gdp = gdp.reset_index(1).set_index('year')
gdp.index = pd.to_datetime(gdp.index, format="%Y")

# Join on the date index and interpolate the yearly GDP values between data points
data = gdp.join(tickers, how='outer')
data = data.interpolate(method='linear')
data = data.dropna()

data.columns = ["US GDP", "S&P 500", "Gold", "Oil"]

print(data)

Resulting in the following output.

                  US GDP      S&P 500         Gold        Oil
2000-08-30  1.047113e+13  1502.589966   273.899994  33.400002
2000-08-31  1.047243e+13  1517.680054   278.299988  33.099998
2000-09-01  1.047373e+13  1520.770020   277.000000  33.380001
2000-09-05  1.047503e+13  1507.079956   275.799988  33.799999
2000-09-06  1.047634e+13  1492.250000   274.200012  34.950001
...                  ...          ...          ...        ...
2020-08-05  2.142770e+13  3327.770020  2031.099976  42.189999
2020-08-06  2.142770e+13  3349.159912  2051.500000  41.950001
2020-08-07  2.142770e+13  3351.280029  2046.099976  41.599998
2020-08-09  2.142770e+13  3351.280029  2037.099976  41.590000
2020-08-10  2.142770e+13  3351.280029  2043.900024  41.889999

Where we see the data we want to investigate for correlations.

Step 2: Investigate Pearson correlation coefficients

The corr() function on a DataFrame calculates the pairwise correlation between columns and returns a correlation matrix.

The default method is the Pearson correlation coefficient. As we will see in this tutorial, correlation can be calculated in different ways. The Pearson method measures how well the relationship between the variables can be described by a straight line.

The best way to understand that is by using an example.

Let’s first calculate the correlation matrix using the Pearson method and then try to visualize it to understand it better. You can get the correlation matrix simply by calling corr() on the DataFrame.

print(data.corr())

As it is the default method, you do not need to set it to pearson. The output will be.

           US GDP   S&P 500      Gold       Oil
US GDP   1.000000  0.897376  0.817294  0.237426
S&P 500  0.897376  1.000000  0.581576 -0.015951
Gold     0.817294  0.581576  1.000000  0.534163
Oil      0.237426 -0.015951  0.534163  1.000000

A few words on the correlation matrix. Each entry is a number from -1 to 1. Some high-level interpretations of the values:

  • -1: A full negative correlation. Meaning that if one variable goes up, the other goes down; they are fully (negatively) correlated.
  • 0: No correlation at all. Meaning that the two variables show no (linear) dependence: if one goes up, it tells you nothing about what will happen to the other.
  • 1: A full positive correlation. Meaning that if one variable goes up, so does the other.

Values in between indicate how strongly the variables are related.

Looking at the above output, you see that US GDP fully correlates to US GDP. This is obvious, as it is the same variable. Next we have a 0.897376 correlation between US GDP and the S&P 500 stock market index. This tells us that there is a high correlation.

Now to be a bit more specific. This correlation is linear.

That means it can be fitted well with a straight line. Let’s try to visualize that.

import matplotlib.pyplot as plt
import numpy as np


# Pearson fit (default method)
fit = np.polyfit(x=data['US GDP'], y=data['S&P 500'], deg=1)
line_fit = np.poly1d(fit)
plt.plot(data['US GDP'], line_fit(data['US GDP']))
plt.scatter(x=data['US GDP'], y=data['S&P 500'], color='red', alpha=0.1)
plt.title("Pearson correlation")
plt.show()

Resulting in the following fit.

Also, let’s investigate something that does not fit well, the US GDP with Oil prices.

import matplotlib.pyplot as plt
import numpy as np


# Pearson fit (default method)
fit = np.polyfit(x=data['US GDP'], y=data['Oil'], deg=1)
line_fit = np.poly1d(fit)
plt.plot(data['US GDP'], line_fit(data['US GDP']))
plt.scatter(x=data['US GDP'], y=data['Oil'], color='red', alpha=0.1)
plt.title("Pearson correlation")
plt.show()

As you can see visually, this does not fit a straight line as well as the example above. The closer the markers are to a fitted straight line, the higher the Pearson correlation score. The score is independent of the slope of the line, except for its sign: a positive slope gives positive values and a negative slope gives negative values.
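To illustrate that only how tightly the points follow a straight line matters, and not how steep the line is, here is a minimal synthetic sketch (the numbers are made up for illustration and are not part of the data above):

import numpy as np
import pandas as pd


# Synthetic example: same x, two different slopes and one noisy relationship
x = np.arange(100, dtype=float)
df = pd.DataFrame({
    'x': x,
    'steep': 50 * x + 3,                                  # steep straight line
    'flat': 0.01 * x - 2,                                 # almost flat straight line
    'noisy': x + np.random.normal(0, 25, size=len(x)),    # straight line plus noise
})

# Both straight lines give a Pearson correlation of 1.0 with x, regardless of slope.
# The noisy column scores lower, because the points are further from a straight line.
print(df.corr(method='pearson')['x'])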

A few notes to consider about the Pearson correlation coefficient. The requirement that the variables be normally distributed is controversial and outside the scope of this tutorial. That said, be careful drawing conclusions from the result. It can be an indicator, but do not conclude that a linear correlation does or does not exist based on this number alone.

Step 3: Investigating the Kendall rank correlation coefficients

The Kendall rank correlation coefficient does not assume a normal distribution of the variables and looks for a monotonic relationship between two variables.

Two variables are monotonically correlated if a greater value of one variable corresponds to a greater value of the other. If they are negatively monotonically correlated, the opposite holds.
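To make the idea of a monotonic relationship concrete, here is a minimal synthetic sketch (made-up numbers, not part of our data): y = x**3 grows whenever x grows, but it is not a straight line, so Kendall rates it as a perfect monotonic correlation while Pearson rates it below 1.

import numpy as np
import pandas as pd


x = np.arange(1, 101, dtype=float)
df = pd.DataFrame({'x': x, 'y': x**3})

# Perfectly monotonic, but not linear
print(df.corr(method='pearson'))   # x-y correlation below 1.0
print(df.corr(method='kendall'))   # x-y correlation exactly 1.0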

The correlation can be calculated as follows.

print(data.corr(method="kendall"))

Resulting in the following output.

           US GDP   S&P 500      Gold       Oil
US GDP   1.000000  0.703141  0.685002  0.249430
S&P 500  0.703141  1.000000  0.426406  0.122434
Gold     0.685002  0.426406  1.000000  0.413298
Oil      0.249430  0.122434  0.413298  1.000000

Which interestingly shows that the Pearson correlation coefficient of US GDP and S&P 500 is higher than the Kendall rank correlation.

As a rule of thumb, a correlation less than 0.8 (or greater than -0.8) is considered insignificant, not a strong correlation. This means that US GDP and the S&P 500 seem to have a linear correlation but not a strong monotonic correlation.

Remember that these are two different measures and cannot be compared directly. As they measure different aspects, the difference is not surprising. The Pearson method can be thought of as how close the points are to a fitted straight line, while the Kendall method asks: when one variable grows, does the other grow as well? As you can see on the plot, this is not always the case. There are many instances where it does not happen.

Step 4: Investigating the Spearman rank correlation

Spearman is closely related to Kendall, and measures whether the variables are monotonically correlated.

The Spearman rank correlation can be computed by the following.

print(data.corr(method="spearman"))

And results in the following output.

           US GDP   S&P 500      Gold       Oil
US GDP   1.000000  0.846197  0.837650  0.317295
S&P 500  0.846197  1.000000  0.609104  0.178937
Gold     0.837650  0.609104  1.000000  0.558569
Oil      0.317295  0.178937  0.558569  1.000000

Which actually is a bit more optimistic about the monotonic correlation between the US GDP and S&P 500.

Can we then conclude that when US GDP goes up, the S&P 500 goes up? Good question. The short answer is no. An example might make it more understandable. In summer, ice cream sales go up. But in summer, sunglasses sales also go up. Does that mean that higher ice cream sales imply higher sunglasses sales? Not really. It is the common factor, more sun, that affects both.

The same can be true for correlations you find in data. Just think of it as an indicator that they somehow might be connected (or not, if the value is close to 0).
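A useful way to see how Spearman relates to the other methods: Spearman is essentially the Pearson correlation computed on the ranks of the values. A minimal sketch, assuming the data DataFrame from Step 1 is still available:

# Spearman is the Pearson correlation of the ranked values,
# so these two matrices should match (up to how ties are ranked).
print(data.corr(method='spearman'))
print(data.rank().corr(method='pearson'))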

Step 5: When to use what?

This is a good question.

  • Pearson correlation coefficient is in general considered the stronger measure, as it makes stronger assumptions about the data. On the negative side, it only considers a full linear dependence (fitting to a straight line) and (in theory) requires the variables to be normally distributed. It is also very fragile to outliers (single points far away from the norm); see the sketch after this list.
  • Kendall rank correlation coefficient should be more efficient on smaller datasets. It measures the monotonic relationship between two variables, and it is a bit slower to calculate, O(n^2). It does not require the variables to be normally distributed.
  • Spearman rank correlation coefficient also measures the monotonic relationship between two variables. It is faster to compute, O(n log(n)), and often gives a slightly higher value than Kendall’s. It also does not require the variables to be normally distributed.
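Here is a minimal sketch of the outlier fragility mentioned above (synthetic, made-up numbers): a single extreme point drags the Pearson coefficient down far more than the two rank-based methods.

import numpy as np
import pandas as pd


# A clean linear relationship ...
x = np.arange(50, dtype=float)
y = 2 * x + 1
# ... plus a single extreme outlier
x = np.append(x, 60.0)
y = np.append(y, -500.0)

df = pd.DataFrame({'x': x, 'y': y})
for method in ['pearson', 'kendall', 'spearman']:
    # Pearson drops sharply, while Kendall and Spearman stay high
    print(method, df['x'].corr(df['y'], method=method))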

Pandas: Explore Datasets by Visualization – Exploring the Holland Code (RIASEC) Test – Part III

What will we cover in this tutorial?

We will continue our journey to explore a big dataset of 145,000+ respondents to a RIASEC test. If you want to explore the full journey, we recommend you read this tutorial first as well as the second part of the tutorial.

In this part we are going to combine the data into the 6 personality-type dimensions of RIASEC and see if there is any correlation with the educational level.

Step 1: Understand the dataset better

The dataset is built by letting the respondents rate themselves on statements related to the 6 personality types in RIASEC. The personality types are as follows (see Wikipedia for a deeper description).

  • Realistic (R): People that like to work with things. They tend to be “assertive and competitive, and are interested in activities requiring motor coordination, skill and strength”. They approach problem solving “by doing something, rather than talking about it, or sitting and thinking about it”. They also prefer “concrete approaches to problem solving, rather than abstract theory”. Finally, their interests tend to focus on “scientific or mechanical rather than cultural and aesthetic areas”.
  • Investigative (I): People who prefer to work with “data.” They like to “think and observe rather than act, to organize and understand information rather than to persuade”. They also prefer “individual rather than people oriented activities”.
  • Artistic (A): People who like to work with “ideas and things”. They tend to be “creative, open, inventive, original, perceptive, sensitive, independent and emotional”. They rebel against “structure and rules”, but enjoy “tasks involving people or physical skills”. They tend to be more emotional than the other types.
  • Social (S): People who like to work with “people” and who “seem to satisfy their needs in teaching or helping situations”. They tend to be “drawn more to seek close relationships with other people and are less apt to want to be really intellectual or physical”.
  • Enterprising (E): People who like to work with “people and data”. They tend to be “good talkers, and use this skill to lead or persuade others”. They “also value reputation, power, money and status”.
  • Conventional (C): People who prefer to work with “data” and who “like rules and regulations and emphasize self-control … they like structure and order, and dislike unstructured or unclear work and interpersonal situations”. They also “place value on reputation, power, or status”.

In the test they have rated themselves from 1 to 5 (1=Dislike, 3=Neutral, 5=Enjoy) on statements related to these 6 personality types.

That way each respondent can be rated on these 6 dimensions.

Step 2: Prepare the dataset

We want to score the respondent according to how they have rated themselves on the 8 statements for each of the 6 personality types.

This can be achieved by the following code.

import pandas as pd


# Sum the ratings of the 8 statements (e.g. R1..R8) for one personality dimension
def sum_dimension(data, letter):
    return data[letter + '1'] + data[letter + '2'] + data[letter + '3'] + data[letter + '4'] + data[letter + '5'] + data[letter + '6'] + data[letter + '7'] + data[letter + '8']


data = pd.read_csv('riasec.csv', delimiter='\t', low_memory=False)
data['R'] = sum_dimension(data, 'R')
data['I'] = sum_dimension(data, 'I')
data['A'] = sum_dimension(data, 'A')
data['S'] = sum_dimension(data, 'S')
data['E'] = sum_dimension(data, 'E')
data['C'] = sum_dimension(data, 'C')

view = data.loc[:,['education', 'R', 'I', 'A', 'S', 'E', 'C']]
print(view)

In the view we make, we keep the education with the dimension ratings we have calculated, because we want to see if there is any correlation between education level and personality type.

We get the following output.

        education   R   I   A   S   E   C
0               2  20  33  27  37  16  12
1               2  14  35  19  22  10  10
2               2   9  11  11  30  24  16
3               1  15  21  27  20  25  19
4               3  13  36  34  37  20  26
...           ...  ..  ..  ..  ..  ..  ..
145823          3  10  19  28  28  20  13
145824          3  11  18  39  35  24  16
145825          2   8   8   8  36  12  21
145826          3  29  29  29  34  16  19
145827          2  21  33  19  30  27  24

Where we see the dimension ratings and the corresponding education level.

Step 3: Compute the correlations

The education is given by the following scale.

  • 1: Less than high school
  • 2: High school
  • 3: University degree
  • 4: Graduate degree
  • 0: No answer

Hence, we need to remove the no-answer group (0) from the data to not skew the results.

import pandas as pd


def sum_dimension(data, letter):
    return data[letter + '1'] + data[letter + '2'] + data[letter + '3'] + data[letter + '4'] + data[letter + '5'] + data[letter + '6'] + data[letter + '7'] + data[letter + '8']


data = pd.read_csv('riasec.csv', delimiter='\t', low_memory=False)
data['R'] = sum_dimension(data, 'R')
data['I'] = sum_dimension(data, 'I')
data['A'] = sum_dimension(data, 'A')
data['S'] = sum_dimension(data, 'S')
data['E'] = sum_dimension(data, 'E')
data['C'] = sum_dimension(data, 'C')

view = data.loc[:,['education', 'R', 'I', 'A', 'S', 'E', 'C']]

view = view[view['education'] != 0]

print(view.mean())
print(view.groupby('education').mean())
print(view.corr())

The output of the mean is given here.

education     2.394318
R            16.651624
I            23.994637
A            22.887701
S            26.079349
E            20.490080
C            19.105188
dtype: float64

Which says that the average educational level of the 145,000+ respondents was 2.394318. You can also see that, on average, the respondents rated themselves highest on Social, then Investigative. The lowest rated dimension was Realistic.

The output of educational group by mean is given here.

                   R          I          A          S          E          C
education                                                                  
1          15.951952  23.103728  21.696007  23.170792  19.897772  17.315641
2          16.775297  23.873645  22.379625  25.936032  20.864591  19.551138
3          16.774487  24.302158  23.634034  27.317784  20.468160  19.606312
4          16.814534  24.769829  24.347250  27.382699  20.038501  18.762395

Where you can see that those with less than high school actually rate themselves lower on all dimensions, while the highest educated rate themselves highest on Realistic, Investigative, Artistic, and Social.

Does that mean that the higher your education, the more social, artistic, or realistic you are?

The output of the correlation is given here.

           education         R         I         A         S         E         C
education   1.000000  0.029008  0.057466  0.105946  0.168640 -0.006115  0.044363
R           0.029008  1.000000  0.303895  0.206085  0.109370  0.340535  0.489504
I           0.057466  0.303895  1.000000  0.334159  0.232608  0.080878  0.126554
A           0.105946  0.206085  0.334159  1.000000  0.350631  0.322099  0.056576
S           0.168640  0.109370  0.232608  0.350631  1.000000  0.411564  0.213413
E          -0.006115  0.340535  0.080878  0.322099  0.411564  1.000000  0.526813
C           0.044363  0.489504  0.126554  0.056576  0.213413  0.526813  1.000000

As you can see, you should not conclude that. Take Social: it is only 0.168640 correlated to education, which in other words is a very low correlation. The same holds for Realistic and Artistic: very low correlations.

Step 4: Visualize our findings

A way to visualize the data is by using pandas’ great integration with Matplotlib.

import pandas as pd
import matplotlib.pyplot as plt


def sum_dimension(data, letter):
    return data[letter + '1'] + data[letter + '2'] + data[letter + '3'] + data[letter + '4'] + data[letter + '5'] + data[letter + '6'] + data[letter + '7'] + data[letter + '8']


data = pd.read_csv('riasec.csv', delimiter='\t', low_memory=False)
data['R'] = sum_dimension(data, 'R')
data['I'] = sum_dimension(data, 'I')
data['A'] = sum_dimension(data, 'A')
data['S'] = sum_dimension(data, 'S')
data['E'] = sum_dimension(data, 'E')
data['C'] = sum_dimension(data, 'C')

view = data.loc[:,['education', 'R', 'I', 'A', 'S', 'E', 'C']]

view = view[view['education'] != 0]

edu = view.groupby('education').mean()
edu.index = ['< high school', 'high school', 'university', 'graduate']
edu.plot(kind='barh', figsize=(10,4))
plt.show()

Resulting in the following graph.


Finally, the correlation to education can be visualized similarly.

Note that education itself is kept in the view, so the chart includes a reference bar with full (1.0) correlation.
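A minimal sketch of that chart, assuming the view DataFrame and the Matplotlib import from the code above are still available (the exact styling of the original figure is not known):

# Correlation of each dimension (and education itself, as a 1.0 reference) to education
corr_edu = view.corr()['education']
corr_edu.plot(kind='barh', figsize=(10,4))
plt.title('Correlation to education')
plt.show()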

Continue to read how to explore the dataset in the next tutorial.

Pandas DataFrame Merge: Inner, Outer, Left, and Right

What will we cover in this tutorial?

A key process in Data Science is to merge data from various sources. This can be challenging and often needs clarity. Here we will take a simple example and explain the different ways to merge data using the merge function on the pandas library’s DataFrame object.

The key ways to merge are inner, outer, left, and right.

In this example we are going to explore what correlates the most to GDP per capita: yearly meat consumption, yearly beer consumption, or long-term unemployment.

What is your educated guess? (no cheating, the result is down below)

Step 1: The data we want to merge

That means we need to gather the specified data.

The GDP per capita can be found on wikipedia.org. As we are going to do a lot of the same code again and again, let’s make a helper function to get the data, index the correct table, and drop the data we do not use in our analysis.

This can be done like this.

import pandas as pd


# This is simply used to display all the data and not get a small window of it
pd.set_option('display.max_rows', 300)
pd.set_option('display.max_columns', 15)
pd.set_option('display.width', 1000)


# This is a helper function, read the URL, get the right table, drop some columns
def read_table(url, table_number, drop_columns):
    tables = pd.read_html(url)
    table = tables[table_number]
    table = table.drop(drop_columns, axis=1)
    return table


# GDP per capita
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)_per_capita'
table = read_table(url, 3, ['Rank'])
table.rename(columns={'Country/Territory': 'Country'}, inplace=True)

print(table)

Which results in this output (or the first few lines of it).

                                    Country     US$
0                             Monaco (2018)  185741
1                      Liechtenstein (2017)  173356
2                                Luxembourg  114705
3                                     Macau   84096
4                               Switzerland   81994
5                                   Ireland   78661
6                                    Norway   75420
7                                   Iceland   66945

Comparing this to wikipedia.org.

From wikipedia.org

We can identify that this is the middle table of estimates, the one based on World Bank figures.

Then we need data from the other sources. Here we get it for long-term unemployment (defined as being unemployed for 1 year or more).

# Long-term unemployment
url = 'https://en.wikipedia.org/wiki/List_of_OECD_countries_by_long-term_unemployment_rate'
table_join = read_table(url, 0, ['Long-term unemployment rate (2012)[1]'])
table_join.rename(columns={'Country/Territory': 'Country', 'Long-term unemployment rate (2016)[1]': 'Long-term unemployment rate'}, inplace=True)
index = 'Long-term unemployment rate'
table_join[index] = table_join[index].str[:-1].astype(float)

print(table_join)

Resulting in the following output.

           Country  Long-term unemployment rate
0        Australia                         1.32
1          Austria                         1.53
2          Belgium                         4.26
3           Brazil                         0.81
4           Canada                         0.89
5            Chile                         1.67
6   Czech Republic                         2.72
7          Denmark                         1.66

This can be done for the two other dimensions we want to explore as well. We will skip it here, as the full code comes later.

Step 2: Simply merge it together

What happens if we merge the data together without considering which type of merge to use?

Let’s also skip reading the documentation, and just do it.

import pandas as pd


# This is simply used to display all the data and not get a small window of it
pd.set_option('display.max_rows', 300)
pd.set_option('display.max_columns', 15)
pd.set_option('display.width', 1000)


# This is a helper function, read the URL, get the right table, drop some columns
def read_table(url, table_number, drop_columns):
    tables = pd.read_html(url)
    table = tables[table_number]
    table = table.drop(drop_columns, axis=1)
    return table


# GDP per capita
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)_per_capita'
table = read_table(url, 3, ['Rank'])
table.rename(columns={'Country/Territory': 'Country'}, inplace=True)

# Long-term unemployment
url = 'https://en.wikipedia.org/wiki/List_of_OECD_countries_by_long-term_unemployment_rate'
table_join = read_table(url, 0, ['Long-term unemployment rate (2012)[1]'])
table_join.rename(columns={'Country/Territory': 'Country', 'Long-term unemployment rate (2016)[1]': 'Long-term unemployment rate'}, inplace=True)
index = 'Long-term unemployment rate'
table_join[index] = table_join[index].str[:-1].astype(float)

table = pd.merge(table, table_join)

# Meat consumption
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_meat_consumption'
table_join = read_table(url, 1, ['Kg/person (2009)[10]'])
table_join.rename(columns={'Kg/person (2002)[9][note 1]': 'Kg meat/person'}, inplace=True)

table = pd.merge(table, table_join)

# Beer consumption
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_beer_consumption_per_capita'
table_join = read_table(url, 2, ['2018change(litres per year)', 'Total nationalconsumption[a](million litresper year)', 'Year', 'Sources'])
table_join.rename(columns={'Consumptionper capita[1](litres per year)': 'Liter beer/person'}, inplace=True)

table = pd.merge(table, table_join)


print(table)

# Calculate the correlation
table_corr = table.corr()

# Print the correlation to GDP per capita (stored in US$).
print(table_corr['US$'].sort_values(ascending=False))

Which results in the following output from the first print statement (this is the full output).

           Country    US$  Long-term unemployment rate  Kg meat/person  Liter beer/person
0      Switzerland  81994                         1.71            72.9               55.5
1          Ireland  78661                         6.68           106.3               95.8
2          Denmark  59822                         1.66           145.9               59.6
3        Australia  54907                         1.32           108.2               76.3
4      Netherlands  52448                         2.98            89.3               78.1
5          Austria  50277                         1.53            94.1              107.6
6          Finland  48686                         1.97            67.4               76.7
7          Germany  46259                         2.21            82.1              101.1
8           Canada  46195                         0.89           108.1               55.7
9          Belgium  46117                         4.26            86.1               67.0
10          Israel  43641                         0.63            97.1               17.4
11  United Kingdom  42300                         2.22            79.6               72.9
12     New Zealand  42084                         0.78           142.1               65.5
13          France  40494                         4.21           101.1               33.0
14           Japan  40247                         1.36            45.9               41.4
15           Italy  33190                         7.79            90.4               31.0
16           Spain  29614                        12.92           118.6               86.0
17        Slovenia  25739                         5.27            88.0               80.2
18  Czech Republic  23102                         2.72            77.3              191.8
19        Slovakia  19329                         8.80            67.4               83.5
20         Hungary  16476                         3.78           100.7               76.8
21          Poland  15595                         3.26            78.1               98.2
22          Mexico   9863                         0.06            58.6               68.7
23          Turkey   9043                         2.04            19.3               13.0
24          Brazil   8717                         0.81            82.4               60.0

Strange, you might think? There are only 25 countries (rows 0 through 24). Also, let’s look at the actual correlation between columns, which is the output of the second print statement.

US$                            1.000000
Kg meat/person                 0.392070
Liter beer/person             -0.021863
Long-term unemployment rate   -0.086968
Name: US$, dtype: float64

The correlations are quite low. GDP per capita correlates the most with meat consumption, but still not by much.

Step 3: Let’s read the types of merge available

Reading the documentation of merge, you will notice there are four types of merge.

  • left: use only keys from left frame, similar to a SQL left outer join; preserve key order.
  • right: use only keys from right frame, similar to a SQL right outer join; preserve key order.
  • outer: use union of keys from both frames, similar to a SQL full outer join; sort keys lexicographically.
  • inner: use intersection of keys from both frames, similar to a SQL inner join; preserve the order of the left keys.

We also see that inner merge is the default. So what does inner merge do?

It means that it will only merge on keys that exist in both DataFrames. Translated to our tables, it means that the only remaining rows in the final merged table are the ones that exist in all 4 tables.

You can check that: it is the 25 countries listed above.
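To see the difference on something smaller than our country tables, here is a minimal toy sketch (values taken from the tables above) of inner versus outer merge on a shared Country column:

import pandas as pd


left = pd.DataFrame({'Country': ['Denmark', 'Germany', 'France'],
                     'US$': [59822, 46259, 40494]})
right = pd.DataFrame({'Country': ['Germany', 'France', 'Japan'],
                      'Liter beer/person': [101.1, 33.0, 41.4]})

# inner (the default): only countries present in both frames survive
print(pd.merge(left, right))

# outer: union of keys from both frames; missing values become NaN
print(pd.merge(left, right, how='outer'))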

Step 4: Understand what we should do

What we are doing in the end is correlating to GDP per capita. Hence, it only makes sense to keep the rows that have a GDP value.

If we used an outer merge, we would keep the union of all keys, including countries without a GDP value. That would not give any additional value to the calculations we want to do.

But let’s just try it and investigate the output.

import pandas as pd


# This is simply used to display all the data and not get a small window of it
pd.set_option('display.max_rows', 300)
pd.set_option('display.max_columns', 15)
pd.set_option('display.width', 1000)


# This is a helper function, read the URL, get the right table, drop some columns
def read_table(url, table_number, drop_columns):
    tables = pd.read_html(url)
    table = tables[table_number]
    table = table.drop(drop_columns, axis=1)
    return table


# GDP per capita
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)_per_capita'
table = read_table(url, 3, ['Rank'])
table.rename(columns={'Country/Territory': 'Country'}, inplace=True)

# Long-term unemployment
url = 'https://en.wikipedia.org/wiki/List_of_OECD_countries_by_long-term_unemployment_rate'
table_join = read_table(url, 0, ['Long-term unemployment rate (2012)[1]'])
table_join.rename(columns={'Country/Territory': 'Country', 'Long-term unemployment rate (2016)[1]': 'Long-term unemployment rate'}, inplace=True)
index = 'Long-term unemployment rate'
table_join[index] = table_join[index].str[:-1].astype(float)

table = pd.merge(table, table_join, how='outer')

# Meat consumption
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_meat_consumption'
table_join = read_table(url, 1, ['Kg/person (2009)[10]'])
table_join.rename(columns={'Kg/person (2002)[9][note 1]': 'Kg meat/person'}, inplace=True)

table = pd.merge(table, table_join, how='outer')

# Beer consumption
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_beer_consumption_per_capita'
table_join = read_table(url, 2, ['2018change(litres per year)', 'Total nationalconsumption[a](million litresper year)', 'Year', 'Sources'])
table_join.rename(columns={'Consumptionper capita[1](litres per year)': 'Liter beer/person'}, inplace=True)

table = pd.merge(table, table_join, how='outer')


print(table)

# Calculate the correlation
table_corr = table.corr()

# Print the correlation to GDP per capita (stored in US$).
print(table_corr['US$'].sort_values(ascending=False))

First of all, this keeps all the rows. I will not put the full output here, but only show a few lines.

                                    Country       US$  Long-term unemployment rate  Kg meat/person  Liter beer/person
0                             Monaco (2018)  185741.0                          NaN             NaN                NaN
1                      Liechtenstein (2017)  173356.0                          NaN             NaN                NaN
2                                Luxembourg  114705.0                         1.60           141.7                NaN
222                United States of America       NaN                          NaN           124.8                NaN
223            United States Virgin Islands       NaN                          NaN             6.6                NaN
224                               Venezuela       NaN                          NaN            56.6                NaN
225                                  Taiwan       NaN                          NaN             NaN               23.2

As the sample lines above show, we now get a row even when only some of the columns have values (the rest are NaN). Before, with the inner merge, we only got rows where all columns had values.

The output of the correlation is now.

US$                            1.000000
Kg meat/person                 0.706692
Liter beer/person              0.305120
Long-term unemployment rate   -0.249958
Name: US$, dtype: float64

These are different values than in the previous example. Surprised? Not really. Now we have more data to correlate.

Step 5: Do the correct thing

If we inspect the code, we can see that we start with the GDP table on the left side, and this growing table is always kept on the left side of each merge. Hence, we should be able to merge with left. Notice that this should not affect the final result.

Let’s try it.

import pandas as pd


# This is simply used to display all the data and not get a small window of it
pd.set_option('display.max_rows', 300)
pd.set_option('display.max_columns', 15)
pd.set_option('display.width', 1000)


# This is a helper function, read the URL, get the right table, drop some columns
def read_table(url, table_number, drop_columns):
    tables = pd.read_html(url)
    table = tables[table_number]
    table = table.drop(drop_columns, axis=1)
    return table


# GDP per capita
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)_per_capita'
table = read_table(url, 3, ['Rank'])
table.rename(columns={'Country/Territory': 'Country'}, inplace=True)

# Long-term unemployment
url = 'https://en.wikipedia.org/wiki/List_of_OECD_countries_by_long-term_unemployment_rate'
table_join = read_table(url, 0, ['Long-term unemployment rate (2012)[1]'])
table_join.rename(columns={'Country/Territory': 'Country', 'Long-term unemployment rate (2016)[1]': 'Long-term unemployment rate'}, inplace=True)
index = 'Long-term unemployment rate'
table_join[index] = table_join[index].str[:-1].astype(float)

table = pd.merge(table, table_join, how='left')

# Meat consumption
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_meat_consumption'
table_join = read_table(url, 1, ['Kg/person (2009)[10]'])
table_join.rename(columns={'Kg/person (2002)[9][note 1]': 'Kg meat/person'}, inplace=True)

table = pd.merge(table, table_join, how='left')

# Beer consumption
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_beer_consumption_per_capita'
table_join = read_table(url, 2, ['2018change(litres per year)', 'Total nationalconsumption[a](million litresper year)', 'Year', 'Sources'])
table_join.rename(columns={'Consumptionper capita[1](litres per year)': 'Liter beer/person'}, inplace=True)

table = pd.merge(table, table_join, how='left')


print(table)

# Calculate the correlation
table_corr = table.corr()

# Print the correlation to GDP per capita (stored in US$).
print(table_corr['US$'].sort_values(ascending=False))

Resulting in the same output from the final print statement.

US$                            1.000000
Kg meat/person                 0.706692
Liter beer/person              0.305120
Long-term unemployment rate   -0.249958
Name: US$, dtype: float64

Question: What does the data tell us?

Good question. What does our finding tell us? Let’s inspect the final output.

US$                            1.000000
Kg meat/person                 0.706692
Liter beer/person              0.305120
Long-term unemployment rate   -0.249958
Name: US$, dtype: float64

The row with US$ shows the correlation of GDP per capita to itself, which obviously is a full (1.000000) correlation, as it is the same number.

The second row tells us that eating a lot of meat is highly correlated to GDP per capita. Does that then mean that a country should encourage all citizens to eat more meat to become richer? No, you cannot conclude that. It is probably the other way around. The richer a country is, the more meat they eat.

The last line tells us that long-term unemployment is negatively related to GDP per capita. That is not surprising: the more long-term unemployed people, the lower the GDP per capita. But it is not highly correlated, only approximately -25%.

Surprisingly, drinking a lot of beer seems to have a bigger positive relation to GDP per capita than long-term unemployment has a negative one.

What a wonderful world.
