What will we cover in this tutorial?
A key task in Data Science is merging data from various sources. This can be challenging, and the options are often unclear. Here we will take a simple example and explain the differences between the ways to merge data using the merge function of the pandas DataFrame.
The key ways to merge are inner, outer, left, and right.
In this example we are going to explore what correlates the most with GDP per capita: yearly meat consumption, yearly beer consumption, or long-term unemployment.
What is your educated guess? (no cheating, the result is down below)
Step 1: The data we want to merge
That means we first need to gather the data from the different sources.
The GDP per capita can be found on wikipedia.org. As we are going to write a lot of the same code again and again, let's make a helper function to get the data, index the correct table, and drop the data we do not use in our analysis.
This can be done as follows.
import pandas as pd
# This is simply used to display all the data and not get a small window of it
pd.set_option('display.max_rows', 300)
pd.set_option('display.max_columns', 15)
pd.set_option('display.width', 1000)
# Helper function: read the URL, get the right table, drop some columns
# (pd.read_html needs a parser such as lxml installed)
def read_table(url, table_number, drop_columns):
    tables = pd.read_html(url)
    table = tables[table_number]
    table = table.drop(drop_columns, axis=1)
    return table
# GDP per capita
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)_per_capita'
table = read_table(url, 3, ['Rank'])
table.rename(columns={'Country/Territory': 'Country'}, inplace=True)
print(table)
Which results in this output (or the first few lines of it).
Country US$
0 Monaco (2018) 185741
1 Liechtenstein (2017) 173356
2 Luxembourg 114705
3 Macau 84096
4 Switzerland 81994
5 Ireland 78661
6 Norway 75420
7 Iceland 66945
Comparing this to wikipedia.org, we can identify that this is the middle table, the one with GDP per capita estimates based on the World Bank.
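In case you are wondering how to find the right table_number for read_table: one way is to print the shape and the first columns of every table pd.read_html returns, and pick the one that matches the page. A quick exploratory sketch (not part of the final script):
import pandas as pd
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)_per_capita'
# List every table read_html finds on the page
tables = pd.read_html(url)
for i, t in enumerate(tables):
    print(i, t.shape, list(t.columns)[:4])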
Then we need data from the other sources. Here we get it for long-term unemployment (defined as being unemployed for one year or more).
# Long-term unemployment
url = 'https://en.wikipedia.org/wiki/List_of_OECD_countries_by_long-term_unemployment_rate'
table_join = read_table(url, 0, ['Long-term unemployment rate (2012)[1]'])
table_join.rename(columns={'Country/Territory': 'Country', 'Long-term unemployment rate (2016)[1]': 'Long-term unemployment rate'}, inplace=True)
# Strip the trailing '%' and convert the rate to float
column = 'Long-term unemployment rate'
table_join[column] = table_join[column].str[:-1].astype(float)
print(table_join)
Resulting in the following output (only the first lines are shown).
Country Long-term unemployment rate
0 Australia 1.32
1 Austria 1.53
2 Belgium 4.26
3 Brazil 0.81
4 Canada 0.89
5 Chile 1.67
6 Czech Republic 2.72
7 Denmark 1.66
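One note on the conversion above: table_join[column].str[:-1] simply cuts off the last character, assuming every cell ends with a '%'. A slightly more defensive variant (a sketch with made-up values) strips the sign explicitly and coerces unparseable cells to NaN:
import pandas as pd
# Made-up rates in the same format as the Wikipedia column
rates = pd.Series(['1.32%', '4.26%', 'N/A'])
# rstrip('%') removes a trailing percent sign; errors='coerce' turns bad cells into NaN
clean = pd.to_numeric(rates.str.rstrip('%'), errors='coerce')
print(clean)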
This can be done for the other two dimensions we want to explore as well. We will skip it here, as the full code comes later.
Step 2: Simply merge it together
What happens if we merge the data together without considering which type of merge to use?
Let's skip reading the documentation and just do it.
import pandas as pd
# This is simply used to display all the data and not get a small window of it
pd.set_option('display.max_rows', 300)
pd.set_option('display.max_columns', 15)
pd.set_option('display.width', 1000)
# Helper function: read the URL, get the right table, drop some columns
def read_table(url, table_number, drop_columns):
    tables = pd.read_html(url)
    table = tables[table_number]
    table = table.drop(drop_columns, axis=1)
    return table
# GDP per capita
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)_per_capita'
table = read_table(url, 3, ['Rank'])
table.rename(columns={'Country/Territory': 'Country'}, inplace=True)
# Long-term unemployment
url = 'https://en.wikipedia.org/wiki/List_of_OECD_countries_by_long-term_unemployment_rate'
table_join = read_table(url, 0, ['Long-term unemployment rate (2012)[1]'])
table_join.rename(columns={'Country/Territory': 'Country', 'Long-term unemployment rate (2016)[1]': 'Long-term unemployment rate'}, inplace=True)
# Strip the trailing '%' and convert the rate to float
column = 'Long-term unemployment rate'
table_join[column] = table_join[column].str[:-1].astype(float)
table = pd.merge(table, table_join)
# Meat consumption
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_meat_consumption'
table_join = read_table(url, 1, ['Kg/person (2009)[10]'])
table_join.rename(columns={'Kg/person (2002)[9][note 1]': 'Kg meat/person'}, inplace=True)
table = pd.merge(table, table_join)
# Beer consumption
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_beer_consumption_per_capita'
table_join = read_table(url, 2, ['2018change(litres per year)', 'Total nationalconsumption[a](million litresper year)', 'Year', 'Sources'])
table_join.rename(columns={'Consumptionper capita[1](litres per year)': 'Liter beer/person'}, inplace=True)
table = pd.merge(table, table_join)
print(table)
# Calculate the correlation matrix
# (numeric_only=True skips the non-numeric Country column; required in pandas 2.0+, available from 1.5)
table_corr = table.corr(numeric_only=True)
# Print the correlation to GDP per capita (stored in the US$ column)
print(table_corr['US$'].sort_values(ascending=False))
Which results in the following output from the first print statement (this is the full output).
Country US$ Long-term unemployment rate Kg meat/person Liter beer/person
0 Switzerland 81994 1.71 72.9 55.5
1 Ireland 78661 6.68 106.3 95.8
2 Denmark 59822 1.66 145.9 59.6
3 Australia 54907 1.32 108.2 76.3
4 Netherlands 52448 2.98 89.3 78.1
5 Austria 50277 1.53 94.1 107.6
6 Finland 48686 1.97 67.4 76.7
7 Germany 46259 2.21 82.1 101.1
8 Canada 46195 0.89 108.1 55.7
9 Belgium 46117 4.26 86.1 67.0
10 Israel 43641 0.63 97.1 17.4
11 United Kingdom 42300 2.22 79.6 72.9
12 New Zealand 42084 0.78 142.1 65.5
13 France 40494 4.21 101.1 33.0
14 Japan 40247 1.36 45.9 41.4
15 Italy 33190 7.79 90.4 31.0
16 Spain 29614 12.92 118.6 86.0
17 Slovenia 25739 5.27 88.0 80.2
18 Czech Republic 23102 2.72 77.3 191.8
19 Slovakia 19329 8.80 67.4 83.5
20 Hungary 16476 3.78 100.7 76.8
21 Poland 15595 3.26 78.1 98.2
22 Mexico 9863 0.06 58.6 68.7
23 Turkey 9043 2.04 19.3 13.0
24 Brazil 8717 0.81 82.4 60.0
Strange, you might think? There are only 25 countries (the index counts from 0). Also, let's look at the actual correlation between the columns, which is the output of the second print statement.
US$ 1.000000
Kg meat/person 0.392070
Liter beer/person -0.021863
Long-term unemployment rate -0.086968
Name: US$, dtype: float64
The correlations are quite low. GDP per capita correlates the most with meat consumption, but still not that strongly.
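For context, these numbers are Pearson correlation coefficients, ranging from -1 to 1: the covariance of two columns divided by the product of their standard deviations. A toy sketch with made-up numbers (not the real GDP data) illustrates what pandas computes for each pair of columns:
import numpy as np
# Toy data, not the real GDP/meat numbers
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 6.0])
# Pearson r = cov(x, y) / (std(x) * std(y)), using sample statistics (ddof=1)
r = np.cov(x, y)[0, 1] / (np.std(x, ddof=1) * np.std(y, ddof=1))
print(r)
print(np.corrcoef(x, y)[0, 1])  # the same value computed by NumPy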
Step 3: Let’s read the types of merge available
Reading the documentation of merge, you will notice there are four types of merge.
- left: use only keys from left frame, similar to a SQL left outer join; preserve key order.
- right: use only keys from right frame, similar to a SQL right outer join; preserve key order.
- outer: use union of keys from both frames, similar to a SQL full outer join; sort keys lexicographically.
- inner: use intersection of keys from both frames, similar to a SQL inner join; preserve the order of the left keys.
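To see the four types side by side, here is a minimal sketch with two toy DataFrames (the country names and numbers are made up for illustration):
import pandas as pd
left = pd.DataFrame({'Country': ['Denmark', 'Norway', 'Sweden'], 'GDP': [1, 2, 3]})
right = pd.DataFrame({'Country': ['Denmark', 'Sweden', 'Finland'], 'Beer': [4, 5, 6]})
# Merge on the shared Country column with each of the four types
for how in ['inner', 'outer', 'left', 'right']:
    print(how)
    print(pd.merge(left, right, how=how))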
We also see that inner merge is the default. So what does an inner merge do?
It only merges on keys that exist in both DataFrames. Translated to our tables, it means that the only rows remaining in the final merged table are those that exist in all 4 tables.
You can check that: it is exactly the 25 countries listed above. One way to inspect which rows a merge keeps or drops is the indicator argument, as sketched below.
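merge takes an indicator argument that adds a _merge column telling whether each key was found in the left frame, the right frame, or both. A small sketch, reusing the toy frames from above:
import pandas as pd
left = pd.DataFrame({'Country': ['Denmark', 'Norway', 'Sweden'], 'GDP': [1, 2, 3]})
right = pd.DataFrame({'Country': ['Denmark', 'Sweden', 'Finland'], 'Beer': [4, 5, 6]})
# indicator=True adds a _merge column: 'both', 'left_only' or 'right_only'
merged = pd.merge(left, right, how='outer', indicator=True)
print(merged)
# These are the rows an inner merge would drop
print(merged[merged['_merge'] != 'both'])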
Step 4: Understand what we should do
What we are doing in the end is correlating everything to the GDP per capita. Hence, it only makes sense to keep the rows that have a GDP value.
If we used an outer merge, we would keep the union of all keys, including countries with no GDP figure. That would not add any value to the calculations we want to do.
But let’s just try it and investigate the output.
import pandas as pd
# This is simply used to display all the data and not get a small window of it
pd.set_option('display.max_rows', 300)
pd.set_option('display.max_columns', 15)
pd.set_option('display.width', 1000)
# Helper function: read the URL, get the right table, drop some columns
def read_table(url, table_number, drop_columns):
    tables = pd.read_html(url)
    table = tables[table_number]
    table = table.drop(drop_columns, axis=1)
    return table
# GDP per capita
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)_per_capita'
table = read_table(url, 3, ['Rank'])
table.rename(columns={'Country/Territory': 'Country'}, inplace=True)
# Long-term unemployment
url = 'https://en.wikipedia.org/wiki/List_of_OECD_countries_by_long-term_unemployment_rate'
table_join = read_table(url, 0, ['Long-term unemployment rate (2012)[1]'])
table_join.rename(columns={'Country/Territory': 'Country', 'Long-term unemployment rate (2016)[1]': 'Long-term unemployment rate'}, inplace=True)
# Strip the trailing '%' and convert the rate to float
column = 'Long-term unemployment rate'
table_join[column] = table_join[column].str[:-1].astype(float)
table = pd.merge(table, table_join, how='outer')
# Meat consumption
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_meat_consumption'
table_join = read_table(url, 1, ['Kg/person (2009)[10]'])
table_join.rename(columns={'Kg/person (2002)[9][note 1]': 'Kg meat/person'}, inplace=True)
table = pd.merge(table, table_join, how='outer')
# Beer consumption
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_beer_consumption_per_capita'
table_join = read_table(url, 2, ['2018change(litres per year)', 'Total nationalconsumption[a](million litresper year)', 'Year', 'Sources'])
table_join.rename(columns={'Consumptionper capita[1](litres per year)': 'Liter beer/person'}, inplace=True)
table = pd.merge(table, table_join, how='outer')
print(table)
# Calculate the correlation matrix
# (numeric_only=True skips the non-numeric Country column; required in pandas 2.0+, available from 1.5)
table_corr = table.corr(numeric_only=True)
# Print the correlation to GDP per capita (stored in the US$ column)
print(table_corr['US$'].sort_values(ascending=False))
First of all, this keeps all the rows. I will not reproduce the full output here, only a few lines.
Country US$ Long-term unemployment rate Kg meat/person Liter beer/person
0 Monaco (2018) 185741.0 NaN NaN NaN
1 Liechtenstein (2017) 173356.0 NaN NaN NaN
2 Luxembourg 114705.0 1.60 141.7 NaN
222 United States of America NaN NaN 124.8 NaN
223 United States Virgin Islands NaN NaN 6.6 NaN
224 Venezuela NaN NaN 56.6 NaN
225 Taiwan NaN NaN NaN 23.2
As the sample lines above show, we get a row as long as at least one of the columns has a value. Before, with the inner merge, we only got rows where all columns were present.
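To quantify how sparse the outer-merged table is, you could add a single check after the final merge in the script above:
# Count the missing values per column in the outer-merged table
print(table.isna().sum())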
The output of the correlation is now as follows.
US$ 1.000000
Kg meat/person 0.706692
Liter beer/person 0.305120
Long-term unemployment rate -0.249958
Name: US$, dtype: float64
These values differ from the previous example. Surprised? Not really: now we have more data to correlate.
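The reason the values change is that DataFrame.corr() uses pairwise-complete observations: for each pair of columns, it ignores rows where either value is NaN. Every extra country the outer merge adds with two overlapping values therefore feeds into the correlation. A toy sketch:
import numpy as np
import pandas as pd
df = pd.DataFrame({'a': [1.0, 2.0, 3.0, np.nan], 'b': [2.0, 4.0, np.nan, 8.0]})
# Only the rows where both 'a' and 'b' are present (the first two) are used
print(df.corr())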
Step 5: Do the correct thing
If we inspect the code, we can see that we start with the GDP table on the left side, and this growing table is always kept on the left side. Hence, we should be able to merge with left. Notice that this should not affect the final result.
Let’s try it.
import pandas as pd
# This is simply used to display all the data and not get a small window of it
pd.set_option('display.max_rows', 300)
pd.set_option('display.max_columns', 15)
pd.set_option('display.width', 1000)
# Helper function: read the URL, get the right table, drop some columns
def read_table(url, table_number, drop_columns):
    tables = pd.read_html(url)
    table = tables[table_number]
    table = table.drop(drop_columns, axis=1)
    return table
# GDP per capita
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)_per_capita'
table = read_table(url, 3, ['Rank'])
table.rename(columns={'Country/Territory': 'Country'}, inplace=True)
# Long-term unemployment
url = 'https://en.wikipedia.org/wiki/List_of_OECD_countries_by_long-term_unemployment_rate'
table_join = read_table(url, 0, ['Long-term unemployment rate (2012)[1]'])
table_join.rename(columns={'Country/Territory': 'Country', 'Long-term unemployment rate (2016)[1]': 'Long-term unemployment rate'}, inplace=True)
# Strip the trailing '%' and convert the rate to float
column = 'Long-term unemployment rate'
table_join[column] = table_join[column].str[:-1].astype(float)
table = pd.merge(table, table_join, how='left')
# Meat consumption
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_meat_consumption'
table_join = read_table(url, 1, ['Kg/person (2009)[10]'])
table_join.rename(columns={'Kg/person (2002)[9][note 1]': 'Kg meat/person'}, inplace=True)
table = pd.merge(table, table_join, how='left')
# Beer consumption
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_beer_consumption_per_capita'
table_join = read_table(url, 2, ['2018change(litres per year)', 'Total nationalconsumption[a](million litresper year)', 'Year', 'Sources'])
table_join.rename(columns={'Consumptionper capita[1](litres per year)': 'Liter beer/person'}, inplace=True)
table = pd.merge(table, table_join, how='left')
print(table)
# Calculate the correlation matrix
# (numeric_only=True skips the non-numeric Country column; required in pandas 2.0+, available from 1.5)
table_corr = table.corr(numeric_only=True)
# Print the correlation to GDP per capita (stored in the US$ column)
print(table_corr['US$'].sort_values(ascending=False))
Resulting in the same correlation values in the final print statement.
US$ 1.000000
Kg meat/person 0.706692
Liter beer/person 0.305120
Long-term unemployment rate -0.249958
Name: US$, dtype: float64
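If you want to convince yourself why left and outer merges give the same correlations against US$ here, note that every extra row an outer merge adds has NaN in the US$ column, so it never forms a complete pair with it. A toy check with made-up frames:
import pandas as pd
left = pd.DataFrame({'Country': ['A', 'B', 'C'], 'gdp': [1.0, 2.0, 3.0]})
right = pd.DataFrame({'Country': ['B', 'C', 'D'], 'beer': [5.0, 6.0, 7.0]})
corr_left = pd.merge(left, right, how='left')[['gdp', 'beer']].corr()['gdp']
corr_outer = pd.merge(left, right, how='outer')[['gdp', 'beer']].corr()['gdp']
print(corr_left.equals(corr_outer))  # True: the extra 'D' row has gdp == NaN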
Question: What does the data tell us?
Good question. What does our finding tell us? Let’s inspect the final output.
US$ 1.000000
Kg meat/person 0.706692
Liter beer/person 0.305120
Long-term unemployment rate -0.249958
Name: US$, dtype: float64
The US$ row shows the correlation of GDP per capita with itself, which is obviously 100% (1.000000), as it is the number itself.
The second row tells us that eating a lot of meat is strongly correlated with GDP per capita. Does that mean a country should encourage all citizens to eat more meat to become richer? No, you cannot conclude that. It is probably the other way around: the richer a country is, the more meat its citizens eat.
The last line tells us that long-term unemployment is negatively correlated with GDP per capita. That is not surprising: the more long-term unemployed people, the lower the GDP per capita. But the correlation is weak, only approximately -25%.
Surprisingly, drinking a lot of beer seems to have a bigger positive impact than long-term unemployment has a negative one.
What a wonderful world.