Pandas DataFrame Merge: Inner, Outer, Left, and Right

What will we cover in this tutorial?

A key process in Data Science is to merge data from various sources. This can be challenging and often needs clarity. Here we will take some simple example and explain the differences of how to merge data using the pandas library‘s DataFrame object merge function.

The key ways to merge is by inner, outer, left, and right.

In this example we are going to explore what correlates the most to GDP per capita: yearly meet consumption, yearly beer consumption, or long-term unemployment.

What is your educated guess? (no cheating, the result is down below)

Step 1: The data we want to merge

That means we need to gather the specified data.

The GDP per capita can be found on wikipedia.org. As we are going to do a lot of the same code again and again, let’s make a helper function to get the data, index the correct table, and drop the data we do not use in our analysis.

This can be done like this.

import pandas as pd


# This is simply used to display all the data and not get a small window of it
pd.set_option('display.max_rows', 300)
pd.set_option('display.max_columns', 15)
pd.set_option('display.width', 1000)


# This is a helper function, read the URL, get the right table, drop some columns
def read_table(url, table_number, drop_columns):
    tables = pd.read_html(url)
    table = tables[table_number]
    table = table.drop(drop_columns, axis=1)
    return table


# GDP per capita
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)_per_capita'
table = read_table(url, 3, ['Rank'])
table.rename(columns={'Country/Territory': 'Country'}, inplace=True)

print(table)

Which results in this output (or the few first lines of it).

                                    Country     US$
0                             Monaco (2018)  185741
1                      Liechtenstein (2017)  173356
2                                Luxembourg  114705
3                                     Macau   84096
4                               Switzerland   81994
5                                   Ireland   78661
6                                    Norway   75420
7                                   Iceland   66945

Comparing this to wikipedia.org.

From wikipedia.org

We can identify that this is the middle GDP, based on the World Bank.

Then we need data from the other sources. Here we get it for long-term unemployment (long-term unemployment is defined to be unemployed for 1 year or more).

# Long-term unemployement
url = 'https://en.wikipedia.org/wiki/List_of_OECD_countries_by_long-term_unemployment_rate'
table_join = read_table(url, 0, ['Long-term unemployment rate (2012)[1]'])
table_join.rename(columns={'Country/Territory': 'Country', 'Long-term unemployment rate (2016)[1]': 'Long-term unemployment rate'}, inplace=True)
index = 'Long-term unemployment rate'
table_join[index] = table_join[index].str[:-1].astype(float)

print(table_join)

Resulting in the following output

           Country  Long-term unemployment rate
0        Australia                         1.32
1          Austria                         1.53
2          Belgium                         4.26
3           Brazil                         0.81
4           Canada                         0.89
5            Chile                         1.67
6   Czech Republic                         2.72
7          Denmark                         1.66

This can be done for the two other dimensions we want to explore as well. We will skip it here, as the full code comes later.

Step 2: Simple merge it together

What happens if we merge the data together without considering which type or merge?

Skip reading the documentation also. Let’s just do it.

import pandas as pd


# This is simply used to display all the data and not get a small window of it
pd.set_option('display.max_rows', 300)
pd.set_option('display.max_columns', 15)
pd.set_option('display.width', 1000)


# This is a helper function, read the URL, get the right table, drop some columns
def read_table(url, table_number, drop_columns):
    tables = pd.read_html(url)
    table = tables[table_number]
    table = table.drop(drop_columns, axis=1)
    return table


# GDP per capita
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)_per_capita'
table = read_table(url, 3, ['Rank'])
table.rename(columns={'Country/Territory': 'Country'}, inplace=True)

# Long-term unemployement
url = 'https://en.wikipedia.org/wiki/List_of_OECD_countries_by_long-term_unemployment_rate'
table_join = read_table(url, 0, ['Long-term unemployment rate (2012)[1]'])
table_join.rename(columns={'Country/Territory': 'Country', 'Long-term unemployment rate (2016)[1]': 'Long-term unemployment rate'}, inplace=True)
index = 'Long-term unemployment rate'
table_join[index] = table_join[index].str[:-1].astype(float)

table = pd.merge(table, table_join)

# Meat consumption
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_meat_consumption'
table_join = read_table(url, 1, ['Kg/person (2009)[10]'])
table_join.rename(columns={'Kg/person (2002)[9][note 1]': 'Kg meat/person'}, inplace=True)

table = pd.merge(table, table_join)

# Beer consumption
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_beer_consumption_per_capita'
table_join = read_table(url, 2, ['2018change(litres per year)', 'Total nationalconsumption[a](million litresper year)', 'Year', 'Sources'])
table_join.rename(columns={'Consumptionper capita[1](litres per year)': 'Liter beer/person'}, inplace=True)

table = pd.merge(table, table_join)


print(table)

# Calculate the correlation
table_corr = table.corr()

# Print the correlation to GDP per capita (stored in US$).
print(table_corr['US$'].sort_values(ascending=False))

Which result in the output from the first print statement to be (this is the full output).

           Country    US$  Long-term unemployment rate  Kg meat/person  Liter beer/person
0      Switzerland  81994                         1.71            72.9               55.5
1          Ireland  78661                         6.68           106.3               95.8
2          Denmark  59822                         1.66           145.9               59.6
3        Australia  54907                         1.32           108.2               76.3
4      Netherlands  52448                         2.98            89.3               78.1
5          Austria  50277                         1.53            94.1              107.6
6          Finland  48686                         1.97            67.4               76.7
7          Germany  46259                         2.21            82.1              101.1
8           Canada  46195                         0.89           108.1               55.7
9          Belgium  46117                         4.26            86.1               67.0
10          Israel  43641                         0.63            97.1               17.4
11  United Kingdom  42300                         2.22            79.6               72.9
12     New Zealand  42084                         0.78           142.1               65.5
13          France  40494                         4.21           101.1               33.0
14           Japan  40247                         1.36            45.9               41.4
15           Italy  33190                         7.79            90.4               31.0
16           Spain  29614                        12.92           118.6               86.0
17        Slovenia  25739                         5.27            88.0               80.2
18  Czech Republic  23102                         2.72            77.3              191.8
19        Slovakia  19329                         8.80            67.4               83.5
20         Hungary  16476                         3.78           100.7               76.8
21          Poland  15595                         3.26            78.1               98.2
22          Mexico   9863                         0.06            58.6               68.7
23          Turkey   9043                         2.04            19.3               13.0
24          Brazil   8717                         0.81            82.4               60.0

Strange, you might think? There is only 25 countries (counting from 0). Also, let’s look at the actual correlation between columns, which is the output of the second print statement.

S$                            1.000000
Kg meat/person                 0.392070
Liter beer/person             -0.021863
Long-term unemployment rate   -0.086968
Name: US$, dtype: float64

Correlations are quite low. It correlates the most with meat, but still not that much.

Step 3: Let’s read the types of merge available

Reading the documentation of merge, you will notice there are four types of merge.

  • left: use only keys from left frame, similar to a SQL left outer join; preserve key order.
  • right: use only keys from right frame, similar to a SQL right outer join; preserve key order.
  • outer: use union of keys from both frames, similar to a SQL full outer join; sort keys lexicographically.
  • inner: use intersection of keys from both frames, similar to a SQL inner join; preserve the order of the left keys.

We also see that inner merge is the default. So what does inner merge do?

It means, it will only merge on keys which exists for both DataFrames. Translated to our tables, it means, that the only remaining rows in the final merged table is the ones which exists for all 4 tables.

You can check that, it is the 25 countries listed there.

Step 4: Understand what we should do

What we are doing in the end is correlate to the GDP per capita. Hence, it only makes sense to keep the values that have a GDP.

Consider we used outer merge, then we will keep all combinations. That would not give any additional value to the calculations we want to do.

But let’s just try it and investigate the output.

import pandas as pd


# This is simply used to display all the data and not get a small window of it
pd.set_option('display.max_rows', 300)
pd.set_option('display.max_columns', 15)
pd.set_option('display.width', 1000)


# This is a helper function, read the URL, get the right table, drop some columns
def read_table(url, table_number, drop_columns):
    tables = pd.read_html(url)
    table = tables[table_number]
    table = table.drop(drop_columns, axis=1)
    return table


# GDP per capita
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)_per_capita'
table = read_table(url, 3, ['Rank'])
table.rename(columns={'Country/Territory': 'Country'}, inplace=True)

# Long-term unemployement
url = 'https://en.wikipedia.org/wiki/List_of_OECD_countries_by_long-term_unemployment_rate'
table_join = read_table(url, 0, ['Long-term unemployment rate (2012)[1]'])
table_join.rename(columns={'Country/Territory': 'Country', 'Long-term unemployment rate (2016)[1]': 'Long-term unemployment rate'}, inplace=True)
index = 'Long-term unemployment rate'
table_join[index] = table_join[index].str[:-1].astype(float)

table = pd.merge(table, table_join, how='outer')

# Meat consumption
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_meat_consumption'
table_join = read_table(url, 1, ['Kg/person (2009)[10]'])
table_join.rename(columns={'Kg/person (2002)[9][note 1]': 'Kg meat/person'}, inplace=True)

table = pd.merge(table, table_join, how='outer')

# Beer consumption
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_beer_consumption_per_capita'
table_join = read_table(url, 2, ['2018change(litres per year)', 'Total nationalconsumption[a](million litresper year)', 'Year', 'Sources'])
table_join.rename(columns={'Consumptionper capita[1](litres per year)': 'Liter beer/person'}, inplace=True)

table = pd.merge(table, table_join, how='outer')


print(table)

# Calculate the correlation
table_corr = table.corr()

# Print the correlation to GDP per capita (stored in US$).
print(table_corr['US$'].sort_values(ascending=False))

First of all, this keeps all the output. I will not put it here, but only show a few lines.

                                    Country       US$  Long-term unemployment rate  Kg meat/person  Liter beer/person
0                             Monaco (2018)  185741.0                          NaN             NaN                NaN
1                      Liechtenstein (2017)  173356.0                          NaN             NaN                NaN
2                                Luxembourg  114705.0                         1.60           141.7                NaN
222                United States of America       NaN                          NaN           124.8                NaN
223            United States Virgin Islands       NaN                          NaN             6.6                NaN
224                               Venezuela       NaN                          NaN            56.6                NaN
225                                  Taiwan       NaN                          NaN             NaN               23.2

As the sample lines above shows, we get a row if one of them column is not NaN. Before when we used inner we would only get lines when all columns were not NaN.

The output of the correlation is now.

US$                            1.000000
Kg meat/person                 0.706692
Liter beer/person              0.305120
Long-term unemployment rate   -0.249958
Name: US$, dtype: float64

This is different values than from the previous example. Surprised? Not really. Now we have more data to correlate.

Step 5: Do the correct thing

If we inspect the code, we can see that the we start by having the GDP table on the left side. This growing table is always kept on the left side. Hence, we should be able to merge with left. Notice that this should not affect the final result.

Let’s try it.

import pandas as pd


# This is simply used to display all the data and not get a small window of it
pd.set_option('display.max_rows', 300)
pd.set_option('display.max_columns', 15)
pd.set_option('display.width', 1000)


# This is a helper function, read the URL, get the right table, drop some columns
def read_table(url, table_number, drop_columns):
    tables = pd.read_html(url)
    table = tables[table_number]
    table = table.drop(drop_columns, axis=1)
    return table


# GDP per capita
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)_per_capita'
table = read_table(url, 3, ['Rank'])
table.rename(columns={'Country/Territory': 'Country'}, inplace=True)

# Long-term unemployement
url = 'https://en.wikipedia.org/wiki/List_of_OECD_countries_by_long-term_unemployment_rate'
table_join = read_table(url, 0, ['Long-term unemployment rate (2012)[1]'])
table_join.rename(columns={'Country/Territory': 'Country', 'Long-term unemployment rate (2016)[1]': 'Long-term unemployment rate'}, inplace=True)
index = 'Long-term unemployment rate'
table_join[index] = table_join[index].str[:-1].astype(float)

table = pd.merge(table, table_join, how='left')

# Meat consumption
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_meat_consumption'
table_join = read_table(url, 1, ['Kg/person (2009)[10]'])
table_join.rename(columns={'Kg/person (2002)[9][note 1]': 'Kg meat/person'}, inplace=True)

table = pd.merge(table, table_join, how='left')

# Beer consumption
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_beer_consumption_per_capita'
table_join = read_table(url, 2, ['2018change(litres per year)', 'Total nationalconsumption[a](million litresper year)', 'Year', 'Sources'])
table_join.rename(columns={'Consumptionper capita[1](litres per year)': 'Liter beer/person'}, inplace=True)

table = pd.merge(table, table_join, how='left')


print(table)

# Calculate the correlation
table_corr = table.corr()

# Print the correlation to GDP per capita (stored in US$).
print(table_corr['US$'].sort_values(ascending=False))

Resulting in the same final print statement.

US$                            1.000000
Kg meat/person                 0.706692
Liter beer/person              0.305120
Long-term unemployment rate   -0.249958
Name: US$, dtype: float64

Question: What does the data tell us?

Good question. What does our finding tell us? Let’s inspect the final output.

US$                            1.000000
Kg meat/person                 0.706692
Liter beer/person              0.305120
Long-term unemployment rate   -0.249958
Name: US$, dtype: float64

The row with US$ shows the full correlation to GDP per capita, which obviously has 100% (1.00000) correlation to GDP per capita, as it is the number itself.

The second row tells us that eating a lot of meat is highly correlated to GDP per capita. Does that then mean that a country should encourage all citizens to eat more meat to become richer? No, you cannot conclude that. It is probably the other way around. The richer a country is, the more meat they eat.

The last line tells us that long-term unemployment is negative related to GDP per capita. It is not surprising. It means, the more long-term unemployed people, the less GDP per capita. But it is not highly correlated, only (approximately) -25%.

Surprisingly, it seems to have bigger positive impact to drink a lot of beer, then it has negative impact of long-term unemployment.

What a wonderful world.

Leave a Reply