## What will we cover in this tutorial?

In this tutorial we will use a live example to investigate and understand the differences between the three methods of calculating correlation with the Pandas DataFrame corr() function.

The purpose of this tutorial is to get a better understanding of these correlations, while working on real data.

## Step 1: Getting some data to play with

The data we want to investigate for correlations is the US GDP, S&P 500, Gold and Oil prices. We will only focus on recent time (from 2000-2020), as the prices for Gold and Oil are not available further back on Yahoo! finance. We will get the US GDP from World Bank and the rest from Yahoo! finance.

We will be using the Pandas-datareader to retrieve the data. For a more in-depth introduction to how to use them, we will refer you to this tutorial.

```python
import pandas as pd
import pandas_datareader as pdf
from pandas_datareader import wb
import datetime as dt

start = dt.datetime(2000, 1, 1)
end = dt.datetime.now()
tickers = pdf.get_data_yahoo(["^GSPC", "GC=F", "CL=F"], start, end)['Adj Close']
# US GDP in current US$ from the World Bank (indicator NY.GDP.MKTP.CD)
gdp = wb.download(indicator='NY.GDP.MKTP.CD', country='US', start=2000, end=2020)
gdp = gdp.reset_index(1).set_index('year')
gdp.index = pd.to_datetime(gdp.index, format="%Y")
data = gdp.join(tickers, how='outer')
data = data.interpolate(method='linear')
data = data.dropna()
data.columns = ["US GDP", "S&P 500", "Gold", "Oil"]
print(data)
```

Resulting in the following output.

```
US GDP      S&P 500         Gold        Oil
2000-08-30  1.047113e+13  1502.589966   273.899994  33.400002
2000-08-31  1.047243e+13  1517.680054   278.299988  33.099998
2000-09-01  1.047373e+13  1520.770020   277.000000  33.380001
2000-09-05  1.047503e+13  1507.079956   275.799988  33.799999
2000-09-06  1.047634e+13  1492.250000   274.200012  34.950001
...                  ...          ...          ...        ...
2020-08-05  2.142770e+13  3327.770020  2031.099976  42.189999
2020-08-06  2.142770e+13  3349.159912  2051.500000  41.950001
2020-08-07  2.142770e+13  3351.280029  2046.099976  41.599998
2020-08-09  2.142770e+13  3351.280029  2037.099976  41.590000
2020-08-10  2.142770e+13  3351.280029  2043.900024  41.889999
```

Here we see the data we want to investigate for correlations.
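To see what the join and interpolation in Step 1 actually do, here is a minimal sketch on made-up numbers: the yearly GDP points are stretched linearly across the dates where only prices exist.

```python
import pandas as pd

# Made-up numbers: GDP is known yearly, the price on more dates
gdp = pd.DataFrame({'GDP': [100.0, 104.0]},
                   index=pd.to_datetime(['2020-01-01', '2021-01-01']))
price = pd.DataFrame({'Price': [10.0, 11.0, 12.0]},
                     index=pd.to_datetime(['2020-01-01', '2020-07-02', '2021-01-01']))

# The outer join keeps all dates; GDP is NaN on the date it has no value for
data = gdp.join(price, how='outer')
# Linear interpolation fills that gap between the two known GDP points
data = data.interpolate(method='linear')
print(data)
```

The mid-year row gets a GDP of 102.0, halfway between the two yearly values, which is exactly how the yearly GDP gets spread across the daily price dates above.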

## Step 2: Investigate Pearson correlation coefficients

The corr() function on a DataFrame calculates the pairwise correlation between columns and returns a correlation matrix.

The default method is the Pearson correlation coefficient method. As we will see in this tutorial, correlations can be calculated differently. The Pearson method measures how well the relationship between two variables can be described by a straight line.

The best way to understand that is by using an example.

Let’s first calculate the correlation matrix using the Pearson method and then try to visualize it to understand it better. You can get the correlation matrix simply by calling corr() on the DataFrame.

```python
print(data.corr())
```

As Pearson is the default method, you do not need to set method="pearson" explicitly. The output will be as follows.

```
           US GDP   S&P 500      Gold       Oil
US GDP   1.000000  0.897376  0.817294  0.237426
S&P 500  0.897376  1.000000  0.581576 -0.015951
Gold     0.817294  0.581576  1.000000  0.534163
Oil      0.237426 -0.015951  0.534163  1.000000
```

A few words on the correlation matrix. Each entry is a number from -1 to 1. Some high-level interpretations of the values:

• -1: A full negative correlation. If one variable goes up, the other goes down; they are fully inversely correlated.
• 0: No correlation at all. The two variables do not move together: if one goes up, that tells you nothing about what the other will do.
• 1: A full positive correlation. If one variable goes up, so does the other.

Numbers in between indicate how strongly the two variables are related.
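A quick way to see the endpoints of that scale is a tiny DataFrame where the relationships are exact (all numbers and column names made up):

```python
import pandas as pd

# Made-up columns where the relationships are exact
df = pd.DataFrame({
    'up':     [1, 2, 3, 4, 5],
    'double': [2, 4, 6, 8, 10],  # always moves with 'up'
    'down':   [5, 4, 3, 2, 1],   # always moves against 'up'
})
corr = df.corr()
print(corr)
```

'up' vs 'double' gives 1.0 and 'up' vs 'down' gives -1.0; real data lands somewhere in between.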

Looking at the above output, you see that US GDP fully correlates with US GDP. This is obvious, as it is the same variable. Next, we have a 0.897376 correlation between US GDP and the S&P 500 stock market index, which tells us the two are highly correlated.

Now to be a bit more specific. This correlation is linear.

That means it can be fitted well with a straight line. Let’s try to visualize that.

```python
import matplotlib.pyplot as plt
import numpy as np

# Pearson fit (default method)
fit = np.polyfit(x=data['US GDP'], y=data['S&P 500'], deg=1)
line_fit = np.poly1d(fit)
plt.plot(data['US GDP'], line_fit(data['US GDP']))
plt.scatter(x=data['US GDP'], y=data['S&P 500'], color='red', alpha=0.1)
plt.title("Pearson correlation")
plt.show()
```

Resulting in the following fit.

Also, let’s investigate something that does not fit well, the US GDP with Oil prices.

```python
import matplotlib.pyplot as plt
import numpy as np

# Pearson fit (default method)
fit = np.polyfit(x=data['US GDP'], y=data['Oil'], deg=1)
line_fit = np.poly1d(fit)
plt.plot(data['US GDP'], line_fit(data['US GDP']))
plt.scatter(x=data['US GDP'], y=data['Oil'], color='red', alpha=0.1)
plt.title("Pearson correlation")
plt.show()
```

As you can see, this does not fit a straight line as well as the previous example. The closer the markers are to the fitted line, the higher the Pearson correlation score. This is independent of the slope of the line, except for its sign: a positive slope gives positive values, a negative slope negative values.

A few notes of caution about the Pearson correlation coefficient. The requirement that the variables be normally distributed is debated and outside the scope of this tutorial. In any case, be careful drawing conclusions from the result: treat it as an indicator, not as proof of a linear relationship (or the lack of one).

## Step 3: Investigating the Kendall rank correlation coefficients

The Kendall rank correlation coefficient does not assume a normal distribution of the variables and is looking for a monotonic relationship between two variables.

Two variables are monotonically correlated if a greater value of one variable always goes with a greater value of the other. If the variables are negatively monotonically correlated, the opposite holds.
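Under the hood, Kendall's tau counts pairs of observations that move in the same direction (concordant) versus the opposite direction (discordant). A naive sketch, ignoring tie handling, could look like this (the helper name is ours):

```python
from itertools import combinations

def kendall_tau(x, y):
    # Count pairs moving in the same direction (concordant) vs
    # the opposite direction (discordant); no tie handling.
    n = len(x)
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        direction = (x[i] - x[j]) * (y[i] - y[j])
        if direction > 0:
            concordant += 1
        elif direction < 0:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

# x is strictly increasing; y mostly follows but swaps two neighbours twice
print(kendall_tau([1, 2, 3, 4, 5], [1, 3, 2, 5, 4]))  # -> 0.6
```

For data without ties this matches corr(method="kendall"); the pandas version additionally adjusts for ties (the tau-b variant).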

The correlation can be calculated as follows.

```python
print(data.corr(method="kendall"))
```

Resulting in the following output.

```
           US GDP   S&P 500      Gold       Oil
US GDP   1.000000  0.703141  0.685002  0.249430
S&P 500  0.703141  1.000000  0.426406  0.122434
Gold     0.685002  0.426406  1.000000  0.413298
Oil      0.249430  0.122434  0.413298  1.000000
```

Interestingly, this shows that the Pearson correlation coefficient between US GDP and the S&P 500 is higher than their Kendall rank correlation.

As a rule of thumb, a correlation between -0.8 and 0.8 is considered weak, and only values beyond that range strongly correlated. This suggests that US GDP and the S&P 500 have a linear correlation but not a strong monotonic correlation.

Remember that these are two different measures and cannot be compared directly. As they measure different aspects, the difference is not surprising. The Pearson method can be thought of as measuring how close the points are to a fitted straight line, while the Kendall method asks: when one variable grows, does the other grow too? As you can see on the plot, this is often not the case here.

## Step 4: Investigating the Spearman rank correlation

Spearman is closely related to Kendall, and measures whether the variables are monotonically correlated.
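A useful way to think about Spearman: it is the Pearson coefficient computed on the ranks of the values rather than the values themselves. A small sketch with made-up numbers, where the relationship is perfectly monotonic but not linear:

```python
import pandas as pd

# Made-up data: y is x squared, so perfectly monotonic but not linear
df = pd.DataFrame({'x': [1, 2, 3, 4, 5], 'y': [1, 4, 9, 16, 25]})

print(df.corr().loc['x', 'y'])                     # Pearson on raw values: below 1
print(df.rank().corr().loc['x', 'y'])              # Pearson on the ranks
print(df.corr(method='spearman').loc['x', 'y'])    # Spearman: same as Pearson on ranks
```

Pearson on the raw values stays below 1 because the points curve away from any straight line, while the rank-based Spearman coefficient is exactly 1.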

The Spearman rank correlation can be computed by the following.

```python
print(data.corr(method="spearman"))
```

And results in the following output.

```
           US GDP   S&P 500      Gold       Oil
US GDP   1.000000  0.846197  0.837650  0.317295
S&P 500  0.846197  1.000000  0.609104  0.178937
Gold     0.837650  0.609104  1.000000  0.558569
Oil      0.317295  0.178937  0.558569  1.000000
```

Interestingly, Spearman is a bit more optimistic about the monotonic correlation between the US GDP and the S&P 500.

Can we then conclude that when US GDP goes up, the S&P 500 goes up? Good question. The short answer is no. An example might make this clearer. In summer, ice cream sales go up. Sunglasses sales also go up in summer. Does that mean higher ice cream sales imply higher sunglasses sales? Not really. A third factor, more sunshine, drives both.

The same can be true for correlations you find in data. Just think of them as an indicator that the variables might somehow be connected (or not, if the value is close to 0).

## Step 5: When to use what?

This is a good question.

• The Pearson correlation coefficient is generally considered the stronger measure, as it makes stronger assumptions about the data. On the downside, it only captures linear dependence (a fit to a straight line), in theory requires the variables to be normally distributed, and is very fragile to outliers (single points far away from the norm).
• The Kendall rank correlation coefficient should be more efficient on smaller data sets. It measures the monotonic relationship between two variables, does not require the variables to be normally distributed, and is a bit slower to calculate: O(n^2).
• The Spearman rank correlation coefficient also measures the monotonic relationship between two variables and does not require them to be normally distributed. It is faster, O(n log(n)), and often gives a slightly higher value than Kendall's.
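The fragility to outliers mentioned above can be sketched on made-up data: a single extreme point drags the Pearson coefficient down, while a rank-based method like Spearman only sees that the ordering is still intact.

```python
import pandas as pd

# Made-up data: a clean linear relation, then the same data with one
# extreme (but still order-preserving) outlier
clean = pd.DataFrame({'x': range(10), 'y': range(10)})
dirty = clean.copy()
dirty.loc[9, 'y'] = 1000  # one point far away from the others

for method in ('pearson', 'spearman'):
    print(method,
          round(clean.corr(method=method).loc['x', 'y'], 3),
          round(dirty.corr(method=method).loc['x', 'y'], 3))
```

The outlier pulls Pearson from 1.0 down to roughly 0.5, while Spearman stays at 1.0 because the order is unchanged; Kendall, also being rank-based, is unaffected in the same way.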

## What will we cover in this tutorial?

A key process in Data Science is merging data from various sources. This can be challenging, and it pays to be clear about what each merge type does. Here we will take a simple example and explain the differences between the ways to merge data using the pandas library's DataFrame merge function.

The key ways to merge are inner, outer, left, and right.

In this example we are going to explore what correlates the most to GDP per capita: yearly meat consumption, yearly beer consumption, or long-term unemployment.

What is your educated guess? (no cheating, the result is down below)

## Step 1: The data we want to merge

That means we need to gather the specified data.

The GDP per capita can be found on wikipedia.org. As we are going to do a lot of the same code again and again, let’s make a helper function to get the data, index the correct table, and drop the data we do not use in our analysis.

This can be done like this.

```python
import pandas as pd

# This is simply used to display all the data and not get a small window of it
pd.set_option('display.max_rows', 300)
pd.set_option('display.max_columns', 15)
pd.set_option('display.width', 1000)

# This is a helper function: read the URL, get the right table, drop some columns
def read_table(url, table_number, drop_columns):
    tables = pd.read_html(url)
    table = tables[table_number]
    table = table.drop(drop_columns, axis=1)
    return table

# GDP per capita (the World Bank figures are in table 3)
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)_per_capita'
table = read_table(url, 3, ['Rank'])
table.rename(columns={'Country/Territory': 'Country'}, inplace=True)
print(table)
```

Which results in this output (or the few first lines of it).

```
                                    Country     US$
0                             Monaco (2018)  185741
1                      Liechtenstein (2017)  173356
2                                Luxembourg  114705
3                                     Macau   84096
4                               Switzerland   81994
5                                   Ireland   78661
6                                    Norway   75420
7                                   Iceland   66945
```

Comparing this to wikipedia.org.

We can identify that this is the middle table on the page: the GDP per capita figures from the World Bank.

Then we need data from the other sources. Here we get the long-term unemployment data (long-term unemployment is defined as being unemployed for 1 year or more).

```python
# Long-term unemployment
url = 'https://en.wikipedia.org/wiki/List_of_OECD_countries_by_long-term_unemployment_rate'
table_join = read_table(url, 0, ['Long-term unemployment rate (2012)[1]'])
table_join.rename(columns={'Country/Territory': 'Country', 'Long-term unemployment rate (2016)[1]': 'Long-term unemployment rate'}, inplace=True)
index = 'Long-term unemployment rate'
# The rates are strings like '1.32%': strip the '%' and convert to float
table_join[index] = table_join[index].str[:-1].astype(float)
print(table_join)
```

Resulting in the following output

```
           Country  Long-term unemployment rate
0        Australia                         1.32
1          Austria                         1.53
2          Belgium                         4.26
3           Brazil                         0.81
5            Chile                         1.67
6   Czech Republic                         2.72
7          Denmark                         1.66
```
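The str[:-1].astype(float) step above strips the trailing percent sign and converts the remaining text to a number. In isolation, on made-up values:

```python
import pandas as pd

# Toy values shaped like the scraped column (strings ending in '%')
rates = pd.Series(['1.32%', '4.26%', '0.81%'])
# str[:-1] drops the last character ('%'); astype(float) converts the rest
cleaned = rates.str[:-1].astype(float)
print(cleaned.tolist())  # -> [1.32, 4.26, 0.81]
```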

This can be done for the two other dimensions we want to explore as well. We will skip it here, as the full code comes later.

## Step 2: Simple merge it together

What happens if we merge the data together without considering which type of merge to use?

Let's skip reading the documentation and just do it.

```python
import pandas as pd

# This is simply used to display all the data and not get a small window of it
pd.set_option('display.max_rows', 300)
pd.set_option('display.max_columns', 15)
pd.set_option('display.width', 1000)

# This is a helper function: read the URL, get the right table, drop some columns
def read_table(url, table_number, drop_columns):
    tables = pd.read_html(url)
    table = tables[table_number]
    table = table.drop(drop_columns, axis=1)
    return table

# GDP per capita (the World Bank figures are in table 3)
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)_per_capita'
table = read_table(url, 3, ['Rank'])
table.rename(columns={'Country/Territory': 'Country'}, inplace=True)
# Long-term unemployment
url = 'https://en.wikipedia.org/wiki/List_of_OECD_countries_by_long-term_unemployment_rate'
table_join = read_table(url, 0, ['Long-term unemployment rate (2012)[1]'])
table_join.rename(columns={'Country/Territory': 'Country', 'Long-term unemployment rate (2016)[1]': 'Long-term unemployment rate'}, inplace=True)
index = 'Long-term unemployment rate'
table_join[index] = table_join[index].str[:-1].astype(float)
table = pd.merge(table, table_join)
# Meat consumption
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_meat_consumption'
table_join = read_table(url, 1, ['Kg/person (2009)[10]'])
table_join.rename(columns={'Kg/person (2002)[9][note 1]': 'Kg meat/person'}, inplace=True)
table = pd.merge(table, table_join)
# Beer consumption
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_beer_consumption_per_capita'
table_join = read_table(url, 2, ['2018change(litres per year)', 'Total nationalconsumption[a](million litresper year)', 'Year', 'Sources'])
table_join.rename(columns={'Consumptionper capita[1](litres per year)': 'Liter beer/person'}, inplace=True)
table = pd.merge(table, table_join)

print(table)
# Calculate the correlation
table_corr = table.corr()
# Print the correlation to GDP per capita (stored in the US$ column)
print(table_corr['US$'].sort_values(ascending=False))
```

This results in the following output from the first print statement (this is the full output).

```
           Country    US$  Long-term unemployment rate  Kg meat/person  Liter beer/person
0      Switzerland  81994                         1.71            72.9               55.5
1          Ireland  78661                         6.68           106.3               95.8
2          Denmark  59822                         1.66           145.9               59.6
3        Australia  54907                         1.32           108.2               76.3
4      Netherlands  52448                         2.98            89.3               78.1
5          Austria  50277                         1.53            94.1              107.6
6          Finland  48686                         1.97            67.4               76.7
7          Germany  46259                         2.21            82.1              101.1
8           Canada  46195                         0.89           108.1               55.7
9          Belgium  46117                         4.26            86.1               67.0
10          Israel  43641                         0.63            97.1               17.4
11  United Kingdom  42300                         2.22            79.6               72.9
12     New Zealand  42084                         0.78           142.1               65.5
13          France  40494                         4.21           101.1               33.0
14           Japan  40247                         1.36            45.9               41.4
15           Italy  33190                         7.79            90.4               31.0
16           Spain  29614                        12.92           118.6               86.0
17        Slovenia  25739                         5.27            88.0               80.2
18  Czech Republic  23102                         2.72            77.3              191.8
19        Slovakia  19329                         8.80            67.4               83.5
20         Hungary  16476                         3.78           100.7               76.8
21          Poland  15595                         3.26            78.1               98.2
22          Mexico   9863                         0.06            58.6               68.7
23          Turkey   9043                         2.04            19.3               13.0
24          Brazil   8717                         0.81            82.4               60.0
```

Strange, you might think? There are only 25 countries (counting from 0). Also, let's look at the actual correlation between the columns, which is the output of the second print statement.

```
US$                            1.000000
Kg meat/person                 0.392070
Liter beer/person             -0.021863
Long-term unemployment rate   -0.086968
Name: US$, dtype: float64
```

The correlations are quite low. GDP per capita correlates the most with meat consumption, but even that is not much.

## Step 3: Let’s read the types of merge available

Reading the documentation of merge, you will notice there are four types of merge.

• left: use only keys from left frame, similar to a SQL left outer join; preserve key order.
• right: use only keys from right frame, similar to a SQL right outer join; preserve key order.
• outer: use union of keys from both frames, similar to a SQL full outer join; sort keys lexicographically.
• inner: use intersection of keys from both frames, similar to a SQL inner join; preserve the order of the left keys.

We also see that inner merge is the default. So what does inner merge do?

It means the merge only keeps keys that exist in both DataFrames. Translated to our tables: the only rows remaining in the final merged table are the countries that exist in all 4 tables.

You can check that: it is the 25 countries listed above.
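A tiny example with made-up tables makes the difference visible: an inner merge keeps only the overlapping keys, an outer merge keeps the union of keys.

```python
import pandas as pd

# Made-up tables whose country keys only partially overlap
gdp = pd.DataFrame({'Country': ['Denmark', 'Sweden', 'Norway'], 'GDP': [1, 2, 3]})
beer = pd.DataFrame({'Country': ['Denmark', 'Sweden', 'Germany'], 'Beer': [4, 5, 6]})

inner = pd.merge(gdp, beer)               # default: how='inner'
outer = pd.merge(gdp, beer, how='outer')
print(len(inner), len(outer))  # -> 2 4
```

Norway and Germany only survive the outer merge, with NaN in the column they have no value for.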

## Step 4: Understand what we should do

What we do in the end is correlate against GDP per capita. Hence, it only makes sense to keep the rows that have a GDP value.

If we used an outer merge instead, we would keep the union of all keys. That would not add any value to the calculations we want to do.

But let’s just try it and investigate the output.

```python
import pandas as pd

# This is simply used to display all the data and not get a small window of it
pd.set_option('display.max_rows', 300)
pd.set_option('display.max_columns', 15)
pd.set_option('display.width', 1000)

# This is a helper function: read the URL, get the right table, drop some columns
def read_table(url, table_number, drop_columns):
    tables = pd.read_html(url)
    table = tables[table_number]
    table = table.drop(drop_columns, axis=1)
    return table

# GDP per capita (the World Bank figures are in table 3)
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)_per_capita'
table = read_table(url, 3, ['Rank'])
table.rename(columns={'Country/Territory': 'Country'}, inplace=True)
# Long-term unemployment
url = 'https://en.wikipedia.org/wiki/List_of_OECD_countries_by_long-term_unemployment_rate'
table_join = read_table(url, 0, ['Long-term unemployment rate (2012)[1]'])
table_join.rename(columns={'Country/Territory': 'Country', 'Long-term unemployment rate (2016)[1]': 'Long-term unemployment rate'}, inplace=True)
index = 'Long-term unemployment rate'
table_join[index] = table_join[index].str[:-1].astype(float)
table = pd.merge(table, table_join, how='outer')
# Meat consumption
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_meat_consumption'
table_join = read_table(url, 1, ['Kg/person (2009)[10]'])
table_join.rename(columns={'Kg/person (2002)[9][note 1]': 'Kg meat/person'}, inplace=True)
table = pd.merge(table, table_join, how='outer')
# Beer consumption
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_beer_consumption_per_capita'
table_join = read_table(url, 2, ['2018change(litres per year)', 'Total nationalconsumption[a](million litresper year)', 'Year', 'Sources'])
table_join.rename(columns={'Consumptionper capita[1](litres per year)': 'Liter beer/person'}, inplace=True)
table = pd.merge(table, table_join, how='outer')

print(table)
# Calculate the correlation
table_corr = table.corr()
# Print the correlation to GDP per capita (stored in the US$ column)
print(table_corr['US$'].sort_values(ascending=False))
```

First of all, this keeps all the rows. I will not put the full output here, only a few sample lines.

```
                                    Country       US$  Long-term unemployment rate  Kg meat/person  Liter beer/person
0                             Monaco (2018)  185741.0                          NaN             NaN                NaN
1                      Liechtenstein (2017)  173356.0                          NaN             NaN                NaN
2                                Luxembourg  114705.0                         1.60           141.7                NaN
222                United States of America       NaN                          NaN           124.8                NaN
223            United States Virgin Islands       NaN                          NaN             6.6                NaN
224                               Venezuela       NaN                          NaN            56.6                NaN
225                                  Taiwan       NaN                          NaN             NaN               23.2
```

As the sample lines above show, we now get a row if at least one of the columns is not NaN. Before, with the inner merge, we only got rows where all columns had a value.

The output of the correlation is now.

```
US$                            1.000000
Kg meat/person                 0.706692
Liter beer/person              0.305120
Long-term unemployment rate   -0.249958
Name: US$, dtype: float64
```

These values are different from the previous example. Surprised? Not really: now we have more data to correlate.
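Part of the reason the numbers change is that DataFrame.corr() handles missing values pairwise: each pair of columns is correlated over the rows where both have a value. A minimal sketch with made-up numbers:

```python
import pandas as pd
import numpy as np

# Made-up columns with holes in different rows
df = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, 4.0, np.nan],
    'b': [2.0, 4.0, 6.0, np.nan, 10.0],
})
# Only rows where both 'a' and 'b' are present (the first three) are used,
# and those three form a perfect linear relation
print(df.corr().loc['a', 'b'])
```

So the outer merge does not break the correlation: the extra NaN-holding rows simply contribute to whichever column pairs they are complete for.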

## Step 5: Do the correct thing

If we inspect the code, we can see that we start with the GDP table on the left side, and the growing merged table is always kept on the left side. Hence, we should be able to merge with how='left'. Notice that this should not affect the final result.

Let’s try it.

```python
import pandas as pd

# This is simply used to display all the data and not get a small window of it
pd.set_option('display.max_rows', 300)
pd.set_option('display.max_columns', 15)
pd.set_option('display.width', 1000)

# This is a helper function: read the URL, get the right table, drop some columns
def read_table(url, table_number, drop_columns):
    tables = pd.read_html(url)
    table = tables[table_number]
    table = table.drop(drop_columns, axis=1)
    return table

# GDP per capita (the World Bank figures are in table 3)
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)_per_capita'
table = read_table(url, 3, ['Rank'])
table.rename(columns={'Country/Territory': 'Country'}, inplace=True)
# Long-term unemployment
url = 'https://en.wikipedia.org/wiki/List_of_OECD_countries_by_long-term_unemployment_rate'
table_join = read_table(url, 0, ['Long-term unemployment rate (2012)[1]'])
table_join.rename(columns={'Country/Territory': 'Country', 'Long-term unemployment rate (2016)[1]': 'Long-term unemployment rate'}, inplace=True)
index = 'Long-term unemployment rate'
table_join[index] = table_join[index].str[:-1].astype(float)
table = pd.merge(table, table_join, how='left')
# Meat consumption
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_meat_consumption'
table_join = read_table(url, 1, ['Kg/person (2009)[10]'])
table_join.rename(columns={'Kg/person (2002)[9][note 1]': 'Kg meat/person'}, inplace=True)
table = pd.merge(table, table_join, how='left')
# Beer consumption
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_beer_consumption_per_capita'
table_join = read_table(url, 2, ['2018change(litres per year)', 'Total nationalconsumption[a](million litresper year)', 'Year', 'Sources'])
table_join.rename(columns={'Consumptionper capita[1](litres per year)': 'Liter beer/person'}, inplace=True)
table = pd.merge(table, table_join, how='left')

print(table)
# Calculate the correlation
table_corr = table.corr()
# Print the correlation to GDP per capita (stored in the US$ column)
print(table_corr['US$'].sort_values(ascending=False))
```

Resulting in the same final print statement.

```
US$                            1.000000
Kg meat/person                 0.706692
Liter beer/person              0.305120
Long-term unemployment rate   -0.249958
Name: US$, dtype: float64
```

## Question: What does the data tell us?

Good question. What does our finding tell us? Let’s inspect the final output.

```
US$                            1.000000
Kg meat/person                 0.706692
Liter beer/person              0.305120
Long-term unemployment rate   -0.249958
Name: US$, dtype: float64
```

The row with US$ shows the correlation of GDP per capita with itself, which obviously is a full 100% (1.000000) correlation, as it is the same number.

The second row tells us that eating a lot of meat is highly correlated with GDP per capita. Does that mean a country should encourage all citizens to eat more meat to become richer? No, you cannot conclude that. It is probably the other way around: the richer a country is, the more meat its citizens eat.

The last line tells us that long-term unemployment is negatively correlated with GDP per capita, which is not surprising: more long-term unemployment goes with lower GDP per capita. But the correlation is weak, only approximately -25%.

Surprisingly, drinking a lot of beer seems to have a bigger positive association with GDP per capita than long-term unemployment has a negative one.

What a wonderful world.

## What will we cover in this tutorial?

• How to get data using Pandas DataFrames.
• Clean the data and merge it together.
• Finally, how to see if there is any correlation between data columns.

## Step 1: Get the data you want to correlate

As an example, let’s assume you get the idea that there might be a correlation between GDP per capita, the Social Progress Index (SPI), and the Human Development Index (HDI), but you are not sure whether SPI or HDI is more closely correlated with GDP per capita.

Luckily, you have pandas to the rescue.

As the data is on three pages, you need to collect it separately and merge it later. First, let us collect the data and inspect it.

The GDP per capita is located in the table on wikipedia presented in the picture below.

The page actually contains three tables. We will use the World Bank table in our example. It can be collected by using a call to pandas read_html. If you are new to read_html we recommend you read this tutorial.

```python
import pandas as pd
# The URL we will read our data from
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)_per_capita'
# read_html returns a list of tables from the URL
tables = pd.read_html(url)
# The World Bank figures are in table 3
table = tables[3]
print(table)
```

Which will give an output similar to this.

```
    Rank                       Country/Territory     US$
0      1                           Monaco (2018)  185741
1      2                    Liechtenstein (2017)  173356
2      3                              Luxembourg  114705
3      —                                   Macau   84096
4      4                             Switzerland   81994
5      5                                 Ireland   78661
6      6                                  Norway   75420
7      7                                 Iceland   66945
```

The next table we need to get is the Social Progress Index (SPI) and looks like the picture shows below.

This can be collected to a DataFrame with the following code.

```python
import pandas as pd
url = 'https://en.wikipedia.org/wiki/Social_Progress_Index'
tables = pd.read_html(url)
print(tables[1])
```

Which will print the following to the screen (or the top of it).

```
                          Country 2019[9]
Country    Rank  Score Score.1
0                          Norway       1  90.95     NaN
1                         Denmark       2  90.09     NaN
2                     Switzerland       3  89.89     NaN
3                         Finland       4  89.56     NaN
4                          Sweden       5  89.45     NaN
```

Finally we need to read the Human Development Index (HDI), which can be seen on wikipedia as the following picture shows.

And can be collected with the following code.

```python
import pandas as pd
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_Human_Development_Index'
tables = pd.read_html(url)
print(tables[1])
```

Resulting in the following output.

```
                           Rank                                             Country or Territory                         HDI                                                    Unnamed: 5_level_0
2018 data (2019 report)​[5] Change from previous year​[5]               Country or Territory 2018 data (2019 report)​[5] Average annual HDI growth (2010–2018)​[6]          Unnamed: 5_level_1
Very high human development   Very high human development        Very high human development Very high human development               Very high human development Very high human development
0                             1                           NaN                             Norway                       0.954                                     0.16%                         NaN
1                             2                           NaN                        Switzerland                       0.946                                     0.18%                         NaN
2                             3                           NaN                            Ireland                       0.942                                     0.71%                         NaN
3                             4                           NaN                            Germany                       0.939                                     0.25%                         NaN
4                             4                           (2)                          Hong Kong                       0.939                                     0.51%                         NaN
5                             6                           (1)                          Australia                       0.938                                     0.17%                         NaN
```

A somewhat messier data table.

Now we have gathered all the data we need to clean it up and merge it together.

## Step 2: Clean and merge the data into one DataFrame

If we first inspect the data from the GDP per capita.

```
    Rank                       Country/Territory     US$
0      1                           Monaco (2018)  185741
1      2                    Liechtenstein (2017)  173356
2      3                              Luxembourg  114705
3      —                                   Macau   84096
4      4                             Switzerland   81994
5      5                                 Ireland   78661
6      6                                  Norway   75420
7      7                                 Iceland   66945
```

Notice that Country/Territory can have a year in parentheses, which will make it hard to merge. We need to clean that up. Also, we do not need the Rank column.
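One way to clean this is to split each name on ' (' and keep the first part, which is the approach used in the code further down. On made-up names:

```python
import pandas as pd

# Toy names shaped like the scraped column (the years are illustrative)
countries = pd.Series(['Monaco (2018)', 'Liechtenstein (2017)', 'Luxembourg'])
# Keep only the part before ' ('; names without a parenthesis pass through unchanged
cleaned = countries.apply(lambda name: name.split(' (')[0])
print(cleaned.tolist())  # -> ['Monaco', 'Liechtenstein', 'Luxembourg']
```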

If we inspect the data of HDI.

```
                          Country 2019[9]
Country    Rank  Score Score.1
0                          Norway       1  90.95     NaN
1                         Denmark       2  90.09     NaN
2                     Switzerland       3  89.89     NaN
3                         Finland       4  89.56     NaN
4                          Sweden       5  89.45     NaN
```

Here we notice that the first row is an additional description row, which we can remove. Further, we do not need the Rank and Score.1 columns.

Let’s try to merge it together. Notice that we use a lambda function to clean up the Country/Territory names. If you are new to lambda functions, we recommend you read this tutorial.

```import pandas as pd
# The URL we will read our data from
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)_per_capita'
# read_html returns a list of tables from the URL
tables = pd.read_html(url)
# The data is in table 3
table = tables[3]
# We need to clean the years in parentheses from the Country/Territory field
table['Country'] = table.apply(lambda row: row['Country/Territory'].split(' (')[0], axis=1)
# We no longer need the Rank and Country/Territory columns
table = table.drop(['Rank', 'Country/Territory'], axis=1)
url = 'https://en.wikipedia.org/wiki/Social_Progress_Index'
# Again, read all tables from the URL; the scores are in table 1
tables = pd.read_html(url)
merge_table = tables[1]
# The first level of the table can be dropped
merge_table.columns = merge_table.columns.droplevel(0)
# We do not need the Rank and Score.1 columns
merge_table = merge_table.drop(['Rank', 'Score.1'], axis=1)
# Need to rename the second column to SPI = Social Progress Index
merge_table.columns = ['Country', 'SPI']
# Ready to merge the tables
table = table.merge(merge_table, how="left", left_on=['Country'], right_on=['Country'])
print(table)
```

Which will result in an output like this.

```         US$                            Country    SPI
0    185741                             Monaco    NaN
1    173356                      Liechtenstein    NaN
2    114705                         Luxembourg  87.66
3     84096                              Macau    NaN
4     81994                        Switzerland  89.89
5     78661                            Ireland  87.97
6     75420                             Norway  90.95
```

First validate that Monaco, Liechtenstein, and Macau do not have any SPI value. That seems to be correct.
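That behavior comes from the left merge: rows without a match in the right table get NaN. A minimal sketch with hypothetical data:

```python
import pandas as pd

# Hypothetical data: Monaco has no SPI entry in the right table
left = pd.DataFrame({'Country': ['Monaco', 'Norway'], 'US$': [185741, 75420]})
right = pd.DataFrame({'Country': ['Norway'], 'SPI': [90.95]})
# A left merge keeps all rows of left and fills missing matches with NaN
merged = left.merge(right, how='left', on='Country')
print(merged['SPI'].isna().tolist())  # [True, False]
```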

Then we can proceed to the next table of HDI. Let us first inspect the data.

```                           Rank                                             Country or Territory                         HDI                                                    Unnamed: 5_level_0
2018 data (2019 report)​[5] Change from previous year​[5]               Country or Territory 2018 data (2019 report)​[5] Average annual HDI growth (2010–2018)​[6]          Unnamed: 5_level_1
Very high human development   Very high human development        Very high human development Very high human development               Very high human development Very high human development
0                             1                           NaN                             Norway                       0.954                                     0.16%                         NaN
1                             2                           NaN                        Switzerland                       0.946                                     0.18%                         NaN
2                             3                           NaN                            Ireland                       0.942                                     0.71%                         NaN
3                             4                           NaN                            Germany                       0.939                                     0.25%                         NaN
4                             4                           (2)                          Hong Kong                       0.939                                     0.51%                         NaN
5                             6                           (1)                          Australia                       0.938                                     0.17%                         NaN
```

It has quite messy column naming spread over three header levels. Dropping all the levels would leave several columns with identical names. To deal with that, we drop two levels, rename the remaining columns, and delete those we do not need.
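Before tackling the real table, here is a minimal sketch (with hypothetical header values) of how droplevel flattens such a multi-level header:

```python
import pandas as pd

# Hypothetical mini-example of a three-level column header
cols = pd.MultiIndex.from_tuples([
    ('Rank', '2018 data', 'Very high'),
    ('Country or Territory', 'Country or Territory', 'Very high'),
    ('HDI', '2018 data', 'Very high'),
])
df = pd.DataFrame([[1, 'Norway', 0.954]], columns=cols)
# Dropping the two extra levels leaves a flat header we can rename
df.columns = df.columns.droplevel(1)
df.columns = df.columns.droplevel(1)
df.columns = ['Rank', 'Country', 'HDI']
print(df.columns.tolist())  # ['Rank', 'Country', 'HDI']
```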

```import pandas as pd
# The URL we will read our data from
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)_per_capita'
# read_html returns a list of tables from the URL
tables = pd.read_html(url)
# The data is in table 3
table = tables[3]
# We need to clean the years in parentheses from the Country/Territory field
table['Country'] = table.apply(lambda row: row['Country/Territory'].split(' (')[0], axis=1)
# We no longer need the Rank and Country/Territory columns
table = table.drop(['Rank', 'Country/Territory'], axis=1)
url = 'https://en.wikipedia.org/wiki/Social_Progress_Index'
# Again, read all tables from the URL; the scores are in table 1
tables = pd.read_html(url)
merge_table = tables[1]
# The first level of the table can be dropped
merge_table.columns = merge_table.columns.droplevel(0)
# We do not need the Rank and Score.1 columns
merge_table = merge_table.drop(['Rank', 'Score.1'], axis=1)
# Need to rename the second column to SPI = Social Progress Index
merge_table.columns = ['Country', 'SPI']
# Ready to merge the tables
table = table.merge(merge_table, how="left", left_on=['Country'], right_on=['Country'])
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_Human_Development_Index'
# Read all tables from the URL; the data is in table 1
tables = pd.read_html(url)
merge_table = tables[1]
# Delete the additional column levels
merge_table.columns = merge_table.columns.droplevel(1)
merge_table.columns = merge_table.columns.droplevel(1)
# Rename the columns
merge_table.columns = ['Rank1', 'Rank2', 'Country', 'HDI', 'HDI-1', 'None']
# Delete the columns we do not need
merge_table = merge_table.drop(['Rank1', 'Rank2', 'HDI-1', 'None'], axis=1)
# Merge the tables
table = table.merge(merge_table, how="left", left_on=['Country'], right_on=['Country'])
print(table)
```

Which will result in the following output (only the top is shown).

```         US$                            Country    SPI    HDI
0    185741                             Monaco    NaN    NaN
1    173356                      Liechtenstein    NaN  0.917
2    114705                         Luxembourg  87.66  0.909
3     84096                              Macau    NaN    NaN
4     81994                        Switzerland  89.89  0.946
5     78661                            Ireland  87.97  0.942
6     75420                             Norway  90.95  0.954
```

Notice that here Liechtenstein has HDI data, while Monaco and Macau do not. While it is not visible in the output, the HDI column does not contain floats. This can be seen with a call to table.dtypes, which will output the following.

```US$          int64
Country     object
SPI        float64
HDI         object
dtype: object
```

This states that HDI is of type object, which in this case means string. We need to convert it to float before we can compute the final correlations. This can be done with a lambda function.

import numpy as np

# Convert HDI from string to float, keeping missing values as NaN
table['HDI'] = table.apply(lambda row: float(row['HDI']) if row['HDI'] is not np.nan else np.nan, axis=1) # HDI = Human Development Index
```
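As an alternative to the lambda, pandas' built-in pd.to_numeric does the same conversion; errors='coerce' turns anything unparseable into NaN. A sketch on hypothetical data:

```python
import pandas as pd

# Hypothetical object column of numeric strings with a missing value
table = pd.DataFrame({'HDI': ['0.954', '0.946', None]})
table['HDI'] = pd.to_numeric(table['HDI'], errors='coerce')
print(table['HDI'].dtype)  # float64
```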

The data is now ready, and we can check whether there is any correlation between GDP per capita and SPI and/or HDI.

## Step 3: Calculate the correlations

This is where pandas DataFrames shine: a single call to corr() does the entire work for you.
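corr() supports all three correlation methods from this tutorial through its method parameter. A minimal sketch on hypothetical data:

```python
import pandas as pd

# Hypothetical monotonic but non-linear data
df = pd.DataFrame({'x': [1, 2, 3, 4], 'y': [10, 20, 35, 70]})
# The rank-based methods (kendall, spearman) give 1.0 for any
# monotonically increasing relation; pearson measures linearity only
for method in ('pearson', 'kendall', 'spearman'):
    print(method, df.corr(method=method).loc['x', 'y'])
```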

The full code is given below.

```import pandas as pd
import numpy as np

pd.set_option('display.max_rows', 300)
pd.set_option('display.max_columns', 10)
pd.set_option('display.width', 1000)

# The URL we will read our data from
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)_per_capita'
# read_html returns a list of tables from the URL
tables = pd.read_html(url)
# The data is in table 3
table = tables[3]
# We need to clean the years in parentheses from the Country/Territory field
table['Country'] = table.apply(lambda row: row['Country/Territory'].split(' (')[0], axis=1)
# We no longer need the Rank and Country/Territory columns
table = table.drop(['Rank', 'Country/Territory'], axis=1)
url = 'https://en.wikipedia.org/wiki/Social_Progress_Index'
# Again, read all tables from the URL; the scores are in table 1
tables = pd.read_html(url)
merge_table = tables[1]
# The first level of the table can be dropped
merge_table.columns = merge_table.columns.droplevel(0)
# We do not need the Rank and Score.1 columns
merge_table = merge_table.drop(['Rank', 'Score.1'], axis=1)
# Need to rename the second column to SPI = Social Progress Index
merge_table.columns = ['Country', 'SPI']
# Ready to merge the tables
table = table.merge(merge_table, how="left", left_on=['Country'], right_on=['Country'])
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_Human_Development_Index'
# Read all tables from the URL; the data is in table 1
tables = pd.read_html(url)
merge_table = tables[1]
# Delete the additional column levels
merge_table.columns = merge_table.columns.droplevel(1)
merge_table.columns = merge_table.columns.droplevel(1)
# Rename the columns
merge_table.columns = ['Rank1', 'Rank2', 'Country', 'HDI', 'HDI-1', 'None']
# Delete the columns we do not need
merge_table = merge_table.drop(['Rank1', 'Rank2', 'HDI-1', 'None'], axis=1)
# Merge the tables
table = table.merge(merge_table, how="left", left_on=['Country'], right_on=['Country'])
# Convert to floats
table['HDI'] = table.apply(lambda row: float(row['HDI']) if row['HDI'] is not np.nan else np.nan, axis=1) # HDI = Human Development Index
# Calculate the correlation
table_corr = table.corr()
# Print the correlations to GDP per capita (stored in the US$ column)
print(table_corr['US$'].sort_values(ascending=False))
```

Which will result in the following output.

```US$    1.000000
SPI    0.713946
HDI    0.663183
Name: US$, dtype: float64
```

Hence, GDP per capita seems to correlate most strongly with SPI.

Notice that the calculation ignores all Not a Number (np.nan) values.
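A quick sketch of that behavior on hypothetical data: the row containing a NaN is simply excluded from the pairwise computation.

```python
import pandas as pd
import numpy as np

# Row 1 contains a NaN and is ignored pairwise; the remaining
# points satisfy b = 2 * a exactly, so the correlation is 1.0
df = pd.DataFrame({'a': [1.0, 2.0, 3.0, 4.0], 'b': [2.0, np.nan, 6.0, 8.0]})
print(df.corr().loc['a', 'b'])  # 1.0
```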