
    Pandas DataFrame Merge: Inner, Outer, Left, and Right

    What will we cover in this tutorial?

    A key process in Data Science is merging data from various sources. This can be challenging and is often a source of confusion. Here we will take a simple example and explain the differences between the ways to merge data using the merge function of the pandas library's DataFrame object.

    The key ways to merge are inner, outer, left, and right.
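    The differences are easiest to see on a toy example. The following sketch (with made-up country names and numbers, not the data from this tutorial) merges two small DataFrames using each of the four types:

```python
import pandas as pd

# Two small frames with partially overlapping keys (made-up numbers)
left = pd.DataFrame({'Country': ['Denmark', 'Germany', 'Japan'],
                     'GDP': [59822, 46259, 40247]})
right = pd.DataFrame({'Country': ['Germany', 'Japan', 'Brazil'],
                      'Beer': [101.1, 41.4, 60.0]})

for how in ['inner', 'outer', 'left', 'right']:
    merged = pd.merge(left, right, how=how)
    print(how, '->', sorted(merged['Country']))
```

    Inner keeps only Germany and Japan (the keys present in both frames), outer keeps all four countries, and left and right keep the three keys of the respective frame.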

    In this example we are going to explore what correlates the most to GDP per capita: yearly meat consumption, yearly beer consumption, or long-term unemployment.

    What is your educated guess? (no cheating, the result is down below)

    Step 1: The data we want to merge

    That means we need to gather the specified data.

    The GDP per capita can be found on wikipedia.org. As we are going to do a lot of the same code again and again, let’s make a helper function to get the data, index the correct table, and drop the data we do not use in our analysis.

    This can be done like this.

    import pandas as pd
    
    # This is simply used to display all the data and not get a small window of it
    pd.set_option('display.max_rows', 300)
    pd.set_option('display.max_columns', 15)
    pd.set_option('display.width', 1000)
    
    # This is a helper function, read the URL, get the right table, drop some columns
    def read_table(url, table_number, drop_columns):
        tables = pd.read_html(url)
        table = tables[table_number]
        table = table.drop(drop_columns, axis=1)
        return table
    
    # GDP per capita
    url = 'https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)_per_capita'
    table = read_table(url, 3, ['Rank'])
    table.rename(columns={'Country/Territory': 'Country'}, inplace=True)
    print(table)
    

    Which results in this output (or the first few lines of it).

                                        Country     US$
    0                             Monaco (2018)  185741
    1                      Liechtenstein (2017)  173356
    2                                Luxembourg  114705
    3                                     Macau   84096
    4                               Switzerland   81994
    5                                   Ireland   78661
    6                                    Norway   75420
    7                                   Iceland   66945
    

    Comparing this to wikipedia.org, we can identify that this is the middle of the three estimate tables on the page, the GDP per capita figures based on the World Bank.

    Then we need data from the other sources. Here we get it for long-term unemployment (defined as being unemployed for 1 year or more).

    # Long-term unemployment
    url = 'https://en.wikipedia.org/wiki/List_of_OECD_countries_by_long-term_unemployment_rate'
    table_join = read_table(url, 0, ['Long-term unemployment rate (2012)[1]'])
    table_join.rename(columns={'Country/Territory': 'Country', 'Long-term unemployment rate (2016)[1]': 'Long-term unemployment rate'}, inplace=True)
    index = 'Long-term unemployment rate'
    table_join[index] = table_join[index].str[:-1].astype(float)
    print(table_join)
    

    Resulting in the following output

               Country  Long-term unemployment rate
    0        Australia                         1.32
    1          Austria                         1.53
    2          Belgium                         4.26
    3           Brazil                         0.81
    4           Canada                         0.89
    5            Chile                         1.67
    6   Czech Republic                         2.72
    7          Denmark                         1.66
    

    This can be done for the two other dimensions we want to explore as well. We will skip it here, as the full code comes later.
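    One detail worth a second look is the line `table_join[index] = table_join[index].str[:-1].astype(float)`: the scraped rates arrive as strings (presumably ending in a percent sign), so we slice off the last character with the `.str` accessor and cast the rest to float. A minimal sketch with made-up values:

```python
import pandas as pd

# Made-up percentage strings, as they might come out of pd.read_html
rates = pd.Series(['1.32%', '4.26%', '0.81%'])

# Drop the trailing '%' character, then cast the remainder to float
numeric = rates.str[:-1].astype(float)
print(numeric.tolist())  # [1.32, 4.26, 0.81]
```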

    Step 2: Simple merge it together

    What happens if we merge the data together without considering which type of merge to use?

    Let's also skip reading the documentation and just do it.

    import pandas as pd
    
    # This is simply used to display all the data and not get a small window of it
    pd.set_option('display.max_rows', 300)
    pd.set_option('display.max_columns', 15)
    pd.set_option('display.width', 1000)
    
    # This is a helper function, read the URL, get the right table, drop some columns
    def read_table(url, table_number, drop_columns):
        tables = pd.read_html(url)
        table = tables[table_number]
        table = table.drop(drop_columns, axis=1)
        return table
    
    # GDP per capita
    url = 'https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)_per_capita'
    table = read_table(url, 3, ['Rank'])
    table.rename(columns={'Country/Territory': 'Country'}, inplace=True)
    # Long-term unemployment
    url = 'https://en.wikipedia.org/wiki/List_of_OECD_countries_by_long-term_unemployment_rate'
    table_join = read_table(url, 0, ['Long-term unemployment rate (2012)[1]'])
    table_join.rename(columns={'Country/Territory': 'Country', 'Long-term unemployment rate (2016)[1]': 'Long-term unemployment rate'}, inplace=True)
    index = 'Long-term unemployment rate'
    table_join[index] = table_join[index].str[:-1].astype(float)
    table = pd.merge(table, table_join)
    # Meat consumption
    url = 'https://en.wikipedia.org/wiki/List_of_countries_by_meat_consumption'
    table_join = read_table(url, 1, ['Kg/person (2009)[10]'])
    table_join.rename(columns={'Kg/person (2002)[9][note 1]': 'Kg meat/person'}, inplace=True)
    table = pd.merge(table, table_join)
    # Beer consumption
    url = 'https://en.wikipedia.org/wiki/List_of_countries_by_beer_consumption_per_capita'
    table_join = read_table(url, 2, ['2018change(litres per year)', 'Total nationalconsumption[a](million litresper year)', 'Year', 'Sources'])
    table_join.rename(columns={'Consumptionper capita[1](litres per year)': 'Liter beer/person'}, inplace=True)
    table = pd.merge(table, table_join)
    
    print(table)
    # Calculate the correlation
    table_corr = table.corr(numeric_only=True)
    # Print the correlation to GDP per capita (stored in US$).
    print(table_corr['US$'].sort_values(ascending=False))
    

    Which results in the following output from the first print statement (this is the full output).

               Country    US$  Long-term unemployment rate  Kg meat/person  Liter beer/person
    0      Switzerland  81994                         1.71            72.9               55.5
    1          Ireland  78661                         6.68           106.3               95.8
    2          Denmark  59822                         1.66           145.9               59.6
    3        Australia  54907                         1.32           108.2               76.3
    4      Netherlands  52448                         2.98            89.3               78.1
    5          Austria  50277                         1.53            94.1              107.6
    6          Finland  48686                         1.97            67.4               76.7
    7          Germany  46259                         2.21            82.1              101.1
    8           Canada  46195                         0.89           108.1               55.7
    9          Belgium  46117                         4.26            86.1               67.0
    10          Israel  43641                         0.63            97.1               17.4
    11  United Kingdom  42300                         2.22            79.6               72.9
    12     New Zealand  42084                         0.78           142.1               65.5
    13          France  40494                         4.21           101.1               33.0
    14           Japan  40247                         1.36            45.9               41.4
    15           Italy  33190                         7.79            90.4               31.0
    16           Spain  29614                        12.92           118.6               86.0
    17        Slovenia  25739                         5.27            88.0               80.2
    18  Czech Republic  23102                         2.72            77.3              191.8
    19        Slovakia  19329                         8.80            67.4               83.5
    20         Hungary  16476                         3.78           100.7               76.8
    21          Poland  15595                         3.26            78.1               98.2
    22          Mexico   9863                         0.06            58.6               68.7
    23          Turkey   9043                         2.04            19.3               13.0
    24          Brazil   8717                         0.81            82.4               60.0
    

    Strange, you might think? There are only 25 countries (the index counts from 0). Also, let's look at the actual correlation between the columns, which is the output of the second print statement.

    US$                            1.000000
    Kg meat/person                 0.392070
    Liter beer/person             -0.021863
    Long-term unemployment rate   -0.086968
    Name: US$, dtype: float64
    

    The correlations are quite low. GDP per capita correlates the most with meat consumption, but still not that much.
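    As a side note on what `corr()` actually computes: `DataFrame.corr()` returns the pairwise Pearson correlation between the numeric columns, where values near 1 or -1 indicate a strong linear relationship and values near 0 indicate little relationship. A tiny sketch with made-up numbers:

```python
import pandas as pd

df = pd.DataFrame({
    'gdp':  [10, 20, 30, 40],
    'meat': [12, 18, 33, 41],   # roughly tracks 'gdp', so correlation is close to 1
    'rain': [10, 40, 35, 15],   # unrelated to 'gdp', so correlation is close to 0
})
print(df.corr()['gdp'].sort_values(ascending=False))
```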

    Step 3: Let’s read the types of merge available

    Reading the documentation of merge, you will notice there are four types of merge.

    • left: use only keys from left frame, similar to a SQL left outer join; preserve key order.
    • right: use only keys from right frame, similar to a SQL right outer join; preserve key order.
    • outer: use union of keys from both frames, similar to a SQL full outer join; sort keys lexicographically.
    • inner: use intersection of keys from both frames, similar to a SQL inner join; preserve the order of the left keys.

    We also see that inner merge is the default. So what does inner merge do?

    It means that it will only merge on keys which exist in both DataFrames. Translated to our tables, it means that the only rows remaining in the final merged table are the ones which exist in all 4 tables.

    You can check that: it is the 25 countries listed above.
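    To see how chained inner merges shrink the result to the intersection of all sources, here is a sketch with made-up country lists and numbers:

```python
import pandas as pd

gdp   = pd.DataFrame({'Country': ['Monaco', 'Denmark', 'Japan', 'Brazil'],
                      'US$': [185741, 59822, 40247, 8717]})
unemp = pd.DataFrame({'Country': ['Denmark', 'Japan', 'Brazil'],
                      'Unemployment': [1.66, 1.36, 0.81]})
beer  = pd.DataFrame({'Country': ['Denmark', 'Japan'],
                      'Beer': [59.6, 41.4]})

# how='inner' is the default: each merge keeps only keys present in both frames
merged = pd.merge(pd.merge(gdp, unemp), beer)
print(list(merged['Country']))  # ['Denmark', 'Japan']
```

    Monaco drops out in the first merge (no unemployment figure) and Brazil in the second (no beer figure).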

    Step 4: Understand what we should do

    What we are doing in the end is correlating to the GDP per capita. Hence, it only makes sense to keep the rows that have a GDP value.

    Suppose we used an outer merge instead; then we would keep all keys from both frames. That would not add any value to the calculations we want to do.

    But let’s just try it and investigate the output.

    import pandas as pd
    
    # This is simply used to display all the data and not get a small window of it
    pd.set_option('display.max_rows', 300)
    pd.set_option('display.max_columns', 15)
    pd.set_option('display.width', 1000)
    
    # This is a helper function, read the URL, get the right table, drop some columns
    def read_table(url, table_number, drop_columns):
        tables = pd.read_html(url)
        table = tables[table_number]
        table = table.drop(drop_columns, axis=1)
        return table
    
    # GDP per capita
    url = 'https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)_per_capita'
    table = read_table(url, 3, ['Rank'])
    table.rename(columns={'Country/Territory': 'Country'}, inplace=True)
    # Long-term unemployment
    url = 'https://en.wikipedia.org/wiki/List_of_OECD_countries_by_long-term_unemployment_rate'
    table_join = read_table(url, 0, ['Long-term unemployment rate (2012)[1]'])
    table_join.rename(columns={'Country/Territory': 'Country', 'Long-term unemployment rate (2016)[1]': 'Long-term unemployment rate'}, inplace=True)
    index = 'Long-term unemployment rate'
    table_join[index] = table_join[index].str[:-1].astype(float)
    table = pd.merge(table, table_join, how='outer')
    # Meat consumption
    url = 'https://en.wikipedia.org/wiki/List_of_countries_by_meat_consumption'
    table_join = read_table(url, 1, ['Kg/person (2009)[10]'])
    table_join.rename(columns={'Kg/person (2002)[9][note 1]': 'Kg meat/person'}, inplace=True)
    table = pd.merge(table, table_join, how='outer')
    # Beer consumption
    url = 'https://en.wikipedia.org/wiki/List_of_countries_by_beer_consumption_per_capita'
    table_join = read_table(url, 2, ['2018change(litres per year)', 'Total nationalconsumption[a](million litresper year)', 'Year', 'Sources'])
    table_join.rename(columns={'Consumptionper capita[1](litres per year)': 'Liter beer/person'}, inplace=True)
    table = pd.merge(table, table_join, how='outer')
    
    print(table)
    # Calculate the correlation
    table_corr = table.corr(numeric_only=True)
    # Print the correlation to GDP per capita (stored in US$).
    print(table_corr['US$'].sort_values(ascending=False))
    

    First of all, this keeps all the rows. I will not show the full output here, only a few lines.

                                        Country       US$  Long-term unemployment rate  Kg meat/person  Liter beer/person
    0                             Monaco (2018)  185741.0                          NaN             NaN                NaN
    1                      Liechtenstein (2017)  173356.0                          NaN             NaN                NaN
    2                                Luxembourg  114705.0                         1.60           141.7                NaN
    222                United States of America       NaN                          NaN           124.8                NaN
    223            United States Virgin Islands       NaN                          NaN             6.6                NaN
    224                               Venezuela       NaN                          NaN            56.6                NaN
    225                                  Taiwan       NaN                          NaN             NaN               23.2
    

    As the sample lines above show, we get a row if at least one of the columns is not NaN. Before, when we used inner, we would only get rows where all columns were not NaN.
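    If you want to see where each row of an outer merge came from, `merge` accepts `indicator=True`, which adds a `_merge` column marking each row as `'left_only'`, `'right_only'`, or `'both'`. A sketch with made-up rows:

```python
import pandas as pd

gdp  = pd.DataFrame({'Country': ['Monaco', 'Denmark'], 'US$': [185741, 59822]})
beer = pd.DataFrame({'Country': ['Denmark', 'Taiwan'], 'Beer': [59.6, 23.2]})

# indicator=True records which frame(s) each key was found in
merged = pd.merge(gdp, beer, how='outer', indicator=True)
print(merged)
```

    Here Denmark is marked 'both', Monaco 'left_only' (its Beer is NaN), and Taiwan 'right_only' (its US$ is NaN).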

    The output of the correlation is now.

    US$                            1.000000
    Kg meat/person                 0.706692
    Liter beer/person              0.305120
    Long-term unemployment rate   -0.249958
    Name: US$, dtype: float64
    

    These values are different from the previous example. Surprised? Not really: now we have more data to correlate.

    Step 5: Do the correct thing

    If we inspect the code, we can see that we start with the GDP table on the left side, and the growing merged table is always kept on the left side. Hence, we should be able to merge with left. Notice that this should not affect the final correlations, since rows without a GDP value do not contribute to the correlation with US$ anyway.
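    The reason left and outer give the same correlations against US$ is that `corr()` handles NaN pairwise: rows where US$ is NaN simply never enter any correlation with the US$ column. A small check with made-up numbers:

```python
import pandas as pd
import numpy as np

# 'outer_version' has one extra row with no US$ value,
# as an outer merge would produce
left_version  = pd.DataFrame({'US$': [10.0, 20.0, 30.0],
                              'Beer': [1.0, 3.0, 2.0]})
outer_version = pd.DataFrame({'US$': [10.0, 20.0, 30.0, np.nan],
                              'Beer': [1.0, 3.0, 2.0, 99.0]})

# The NaN row is ignored pairwise, so both correlations are identical
print(left_version.corr().loc['US$', 'Beer'])   # 0.5
print(outer_version.corr().loc['US$', 'Beer'])  # 0.5
```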

    Let’s try it.

    import pandas as pd
    
    # This is simply used to display all the data and not get a small window of it
    pd.set_option('display.max_rows', 300)
    pd.set_option('display.max_columns', 15)
    pd.set_option('display.width', 1000)
    
    # This is a helper function, read the URL, get the right table, drop some columns
    def read_table(url, table_number, drop_columns):
        tables = pd.read_html(url)
        table = tables[table_number]
        table = table.drop(drop_columns, axis=1)
        return table
    
    # GDP per capita
    url = 'https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)_per_capita'
    table = read_table(url, 3, ['Rank'])
    table.rename(columns={'Country/Territory': 'Country'}, inplace=True)
    # Long-term unemployment
    url = 'https://en.wikipedia.org/wiki/List_of_OECD_countries_by_long-term_unemployment_rate'
    table_join = read_table(url, 0, ['Long-term unemployment rate (2012)[1]'])
    table_join.rename(columns={'Country/Territory': 'Country', 'Long-term unemployment rate (2016)[1]': 'Long-term unemployment rate'}, inplace=True)
    index = 'Long-term unemployment rate'
    table_join[index] = table_join[index].str[:-1].astype(float)
    table = pd.merge(table, table_join, how='left')
    # Meat consumption
    url = 'https://en.wikipedia.org/wiki/List_of_countries_by_meat_consumption'
    table_join = read_table(url, 1, ['Kg/person (2009)[10]'])
    table_join.rename(columns={'Kg/person (2002)[9][note 1]': 'Kg meat/person'}, inplace=True)
    table = pd.merge(table, table_join, how='left')
    # Beer consumption
    url = 'https://en.wikipedia.org/wiki/List_of_countries_by_beer_consumption_per_capita'
    table_join = read_table(url, 2, ['2018change(litres per year)', 'Total nationalconsumption[a](million litresper year)', 'Year', 'Sources'])
    table_join.rename(columns={'Consumptionper capita[1](litres per year)': 'Liter beer/person'}, inplace=True)
    table = pd.merge(table, table_join, how='left')
    
    print(table)
    # Calculate the correlation
    table_corr = table.corr(numeric_only=True)
    # Print the correlation to GDP per capita (stored in US$).
    print(table_corr['US$'].sort_values(ascending=False))
    

    Resulting in the same final print statement.

    US$                            1.000000
    Kg meat/person                 0.706692
    Liter beer/person              0.305120
    Long-term unemployment rate   -0.249958
    Name: US$, dtype: float64
    

    Question: What does the data tell us?

    Good question. What does our finding tell us? Let’s inspect the final output.

    US$                            1.000000
    Kg meat/person                 0.706692
    Liter beer/person              0.305120
    Long-term unemployment rate   -0.249958
    Name: US$, dtype: float64
    

    The row with US$ shows the correlation to GDP per capita itself, which is obviously 100% (1.000000), as it is the same column.

    The second row tells us that eating a lot of meat is highly correlated with GDP per capita. Does that mean a country should encourage all citizens to eat more meat to become richer? No, you cannot conclude that. It is probably the other way around: the richer a country is, the more meat its citizens eat.

    The last line tells us that long-term unemployment is negatively correlated with GDP per capita. That is not surprising: the more long-term unemployed people, the lower the GDP per capita. But the correlation is not strong, only approximately -25%.

    Surprisingly, drinking a lot of beer seems to have a bigger positive correlation with GDP per capita than long-term unemployment has a negative one.

    What a wonderful world.
