Pandas DataFrame Merge: Inner, Outer, Left, and Right

What will we cover in this tutorial?

A key process in Data Science is to merge data from various sources. This can be challenging and often needs clarity. Here we will take some simple example and explain the differences of how to merge data using the pandas library‘s DataFrame object merge function.

The key ways to merge is by inner, outer, left, and right.

In this example we are going to explore what correlates the most to GDP per capita: yearly meet consumption, yearly beer consumption, or long-term unemployment.

What is your educated guess? (no cheating, the result is down below)

Step 1: The data we want to merge

That means we need to gather the specified data.

The GDP per capita can be found on wikipedia.org. As we are going to do a lot of the same code again and again, let’s make a helper function to get the data, index the correct table, and drop the data we do not use in our analysis.

This can be done like this.

import pandas as pd


# This is simply used to display all the data and not get a small window of it
pd.set_option('display.max_rows', 300)
pd.set_option('display.max_columns', 15)
pd.set_option('display.width', 1000)


# This is a helper function, read the URL, get the right table, drop some columns
def read_table(url, table_number, drop_columns):
    tables = pd.read_html(url)
    table = tables[table_number]
    table = table.drop(drop_columns, axis=1)
    return table


# GDP per capita
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)_per_capita'
table = read_table(url, 3, ['Rank'])
table.rename(columns={'Country/Territory': 'Country'}, inplace=True)

print(table)

Which results in this output (or the few first lines of it).

                                    Country     US$
0                             Monaco (2018)  185741
1                      Liechtenstein (2017)  173356
2                                Luxembourg  114705
3                                     Macau   84096
4                               Switzerland   81994
5                                   Ireland   78661
6                                    Norway   75420
7                                   Iceland   66945

Comparing this to wikipedia.org.

From wikipedia.org

We can identify that this is the middle GDP, based on the World Bank.

Then we need data from the other sources. Here we get it for long-term unemployment (long-term unemployment is defined to be unemployed for 1 year or more).

# Long-term unemployement
url = 'https://en.wikipedia.org/wiki/List_of_OECD_countries_by_long-term_unemployment_rate'
table_join = read_table(url, 0, ['Long-term unemployment rate (2012)[1]'])
table_join.rename(columns={'Country/Territory': 'Country', 'Long-term unemployment rate (2016)[1]': 'Long-term unemployment rate'}, inplace=True)
index = 'Long-term unemployment rate'
table_join[index] = table_join[index].str[:-1].astype(float)

print(table_join)

Resulting in the following output

           Country  Long-term unemployment rate
0        Australia                         1.32
1          Austria                         1.53
2          Belgium                         4.26
3           Brazil                         0.81
4           Canada                         0.89
5            Chile                         1.67
6   Czech Republic                         2.72
7          Denmark                         1.66

This can be done for the two other dimensions we want to explore as well. We will skip it here, as the full code comes later.

Step 2: Simple merge it together

What happens if we merge the data together without considering which type or merge?

Skip reading the documentation also. Let’s just do it.

import pandas as pd


# This is simply used to display all the data and not get a small window of it
pd.set_option('display.max_rows', 300)
pd.set_option('display.max_columns', 15)
pd.set_option('display.width', 1000)


# This is a helper function, read the URL, get the right table, drop some columns
def read_table(url, table_number, drop_columns):
    tables = pd.read_html(url)
    table = tables[table_number]
    table = table.drop(drop_columns, axis=1)
    return table


# GDP per capita
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)_per_capita'
table = read_table(url, 3, ['Rank'])
table.rename(columns={'Country/Territory': 'Country'}, inplace=True)

# Long-term unemployement
url = 'https://en.wikipedia.org/wiki/List_of_OECD_countries_by_long-term_unemployment_rate'
table_join = read_table(url, 0, ['Long-term unemployment rate (2012)[1]'])
table_join.rename(columns={'Country/Territory': 'Country', 'Long-term unemployment rate (2016)[1]': 'Long-term unemployment rate'}, inplace=True)
index = 'Long-term unemployment rate'
table_join[index] = table_join[index].str[:-1].astype(float)

table = pd.merge(table, table_join)

# Meat consumption
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_meat_consumption'
table_join = read_table(url, 1, ['Kg/person (2009)[10]'])
table_join.rename(columns={'Kg/person (2002)[9][note 1]': 'Kg meat/person'}, inplace=True)

table = pd.merge(table, table_join)

# Beer consumption
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_beer_consumption_per_capita'
table_join = read_table(url, 2, ['2018change(litres per year)', 'Total nationalconsumption[a](million litresper year)', 'Year', 'Sources'])
table_join.rename(columns={'Consumptionper capita[1](litres per year)': 'Liter beer/person'}, inplace=True)

table = pd.merge(table, table_join)


print(table)

# Calculate the correlation
table_corr = table.corr()

# Print the correlation to GDP per capita (stored in US$).
print(table_corr['US$'].sort_values(ascending=False))

Which result in the output from the first print statement to be (this is the full output).

           Country    US$  Long-term unemployment rate  Kg meat/person  Liter beer/person
0      Switzerland  81994                         1.71            72.9               55.5
1          Ireland  78661                         6.68           106.3               95.8
2          Denmark  59822                         1.66           145.9               59.6
3        Australia  54907                         1.32           108.2               76.3
4      Netherlands  52448                         2.98            89.3               78.1
5          Austria  50277                         1.53            94.1              107.6
6          Finland  48686                         1.97            67.4               76.7
7          Germany  46259                         2.21            82.1              101.1
8           Canada  46195                         0.89           108.1               55.7
9          Belgium  46117                         4.26            86.1               67.0
10          Israel  43641                         0.63            97.1               17.4
11  United Kingdom  42300                         2.22            79.6               72.9
12     New Zealand  42084                         0.78           142.1               65.5
13          France  40494                         4.21           101.1               33.0
14           Japan  40247                         1.36            45.9               41.4
15           Italy  33190                         7.79            90.4               31.0
16           Spain  29614                        12.92           118.6               86.0
17        Slovenia  25739                         5.27            88.0               80.2
18  Czech Republic  23102                         2.72            77.3              191.8
19        Slovakia  19329                         8.80            67.4               83.5
20         Hungary  16476                         3.78           100.7               76.8
21          Poland  15595                         3.26            78.1               98.2
22          Mexico   9863                         0.06            58.6               68.7
23          Turkey   9043                         2.04            19.3               13.0
24          Brazil   8717                         0.81            82.4               60.0

Strange, you might think? There is only 25 countries (counting from 0). Also, let’s look at the actual correlation between columns, which is the output of the second print statement.

S$                            1.000000
Kg meat/person                 0.392070
Liter beer/person             -0.021863
Long-term unemployment rate   -0.086968
Name: US$, dtype: float64

Correlations are quite low. It correlates the most with meat, but still not that much.

Step 3: Let’s read the types of merge available

Reading the documentation of merge, you will notice there are four types of merge.

  • left: use only keys from left frame, similar to a SQL left outer join; preserve key order.
  • right: use only keys from right frame, similar to a SQL right outer join; preserve key order.
  • outer: use union of keys from both frames, similar to a SQL full outer join; sort keys lexicographically.
  • inner: use intersection of keys from both frames, similar to a SQL inner join; preserve the order of the left keys.

We also see that inner merge is the default. So what does inner merge do?

It means, it will only merge on keys which exists for both DataFrames. Translated to our tables, it means, that the only remaining rows in the final merged table is the ones which exists for all 4 tables.

You can check that, it is the 25 countries listed there.

Step 4: Understand what we should do

What we are doing in the end is correlate to the GDP per capita. Hence, it only makes sense to keep the values that have a GDP.

Consider we used outer merge, then we will keep all combinations. That would not give any additional value to the calculations we want to do.

But let’s just try it and investigate the output.

import pandas as pd


# This is simply used to display all the data and not get a small window of it
pd.set_option('display.max_rows', 300)
pd.set_option('display.max_columns', 15)
pd.set_option('display.width', 1000)


# This is a helper function, read the URL, get the right table, drop some columns
def read_table(url, table_number, drop_columns):
    tables = pd.read_html(url)
    table = tables[table_number]
    table = table.drop(drop_columns, axis=1)
    return table


# GDP per capita
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)_per_capita'
table = read_table(url, 3, ['Rank'])
table.rename(columns={'Country/Territory': 'Country'}, inplace=True)

# Long-term unemployement
url = 'https://en.wikipedia.org/wiki/List_of_OECD_countries_by_long-term_unemployment_rate'
table_join = read_table(url, 0, ['Long-term unemployment rate (2012)[1]'])
table_join.rename(columns={'Country/Territory': 'Country', 'Long-term unemployment rate (2016)[1]': 'Long-term unemployment rate'}, inplace=True)
index = 'Long-term unemployment rate'
table_join[index] = table_join[index].str[:-1].astype(float)

table = pd.merge(table, table_join, how='outer')

# Meat consumption
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_meat_consumption'
table_join = read_table(url, 1, ['Kg/person (2009)[10]'])
table_join.rename(columns={'Kg/person (2002)[9][note 1]': 'Kg meat/person'}, inplace=True)

table = pd.merge(table, table_join, how='outer')

# Beer consumption
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_beer_consumption_per_capita'
table_join = read_table(url, 2, ['2018change(litres per year)', 'Total nationalconsumption[a](million litresper year)', 'Year', 'Sources'])
table_join.rename(columns={'Consumptionper capita[1](litres per year)': 'Liter beer/person'}, inplace=True)

table = pd.merge(table, table_join, how='outer')


print(table)

# Calculate the correlation
table_corr = table.corr()

# Print the correlation to GDP per capita (stored in US$).
print(table_corr['US$'].sort_values(ascending=False))

First of all, this keeps all the output. I will not put it here, but only show a few lines.

                                    Country       US$  Long-term unemployment rate  Kg meat/person  Liter beer/person
0                             Monaco (2018)  185741.0                          NaN             NaN                NaN
1                      Liechtenstein (2017)  173356.0                          NaN             NaN                NaN
2                                Luxembourg  114705.0                         1.60           141.7                NaN
222                United States of America       NaN                          NaN           124.8                NaN
223            United States Virgin Islands       NaN                          NaN             6.6                NaN
224                               Venezuela       NaN                          NaN            56.6                NaN
225                                  Taiwan       NaN                          NaN             NaN               23.2

As the sample lines above shows, we get a row if one of them column is not NaN. Before when we used inner we would only get lines when all columns were not NaN.

The output of the correlation is now.

US$                            1.000000
Kg meat/person                 0.706692
Liter beer/person              0.305120
Long-term unemployment rate   -0.249958
Name: US$, dtype: float64

This is different values than from the previous example. Surprised? Not really. Now we have more data to correlate.

Step 5: Do the correct thing

If we inspect the code, we can see that the we start by having the GDP table on the left side. This growing table is always kept on the left side. Hence, we should be able to merge with left. Notice that this should not affect the final result.

Let’s try it.

import pandas as pd


# This is simply used to display all the data and not get a small window of it
pd.set_option('display.max_rows', 300)
pd.set_option('display.max_columns', 15)
pd.set_option('display.width', 1000)


# This is a helper function, read the URL, get the right table, drop some columns
def read_table(url, table_number, drop_columns):
    tables = pd.read_html(url)
    table = tables[table_number]
    table = table.drop(drop_columns, axis=1)
    return table


# GDP per capita
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)_per_capita'
table = read_table(url, 3, ['Rank'])
table.rename(columns={'Country/Territory': 'Country'}, inplace=True)

# Long-term unemployement
url = 'https://en.wikipedia.org/wiki/List_of_OECD_countries_by_long-term_unemployment_rate'
table_join = read_table(url, 0, ['Long-term unemployment rate (2012)[1]'])
table_join.rename(columns={'Country/Territory': 'Country', 'Long-term unemployment rate (2016)[1]': 'Long-term unemployment rate'}, inplace=True)
index = 'Long-term unemployment rate'
table_join[index] = table_join[index].str[:-1].astype(float)

table = pd.merge(table, table_join, how='left')

# Meat consumption
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_meat_consumption'
table_join = read_table(url, 1, ['Kg/person (2009)[10]'])
table_join.rename(columns={'Kg/person (2002)[9][note 1]': 'Kg meat/person'}, inplace=True)

table = pd.merge(table, table_join, how='left')

# Beer consumption
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_beer_consumption_per_capita'
table_join = read_table(url, 2, ['2018change(litres per year)', 'Total nationalconsumption[a](million litresper year)', 'Year', 'Sources'])
table_join.rename(columns={'Consumptionper capita[1](litres per year)': 'Liter beer/person'}, inplace=True)

table = pd.merge(table, table_join, how='left')


print(table)

# Calculate the correlation
table_corr = table.corr()

# Print the correlation to GDP per capita (stored in US$).
print(table_corr['US$'].sort_values(ascending=False))

Resulting in the same final print statement.

US$                            1.000000
Kg meat/person                 0.706692
Liter beer/person              0.305120
Long-term unemployment rate   -0.249958
Name: US$, dtype: float64

Question: What does the data tell us?

Good question. What does our finding tell us? Let’s inspect the final output.

US$                            1.000000
Kg meat/person                 0.706692
Liter beer/person              0.305120
Long-term unemployment rate   -0.249958
Name: US$, dtype: float64

The row with US$ shows the full correlation to GDP per capita, which obviously has 100% (1.00000) correlation to GDP per capita, as it is the number itself.

The second row tells us that eating a lot of meat is highly correlated to GDP per capita. Does that then mean that a country should encourage all citizens to eat more meat to become richer? No, you cannot conclude that. It is probably the other way around. The richer a country is, the more meat they eat.

The last line tells us that long-term unemployment is negative related to GDP per capita. It is not surprising. It means, the more long-term unemployed people, the less GDP per capita. But it is not highly correlated, only (approximately) -25%.

Surprisingly, it seems to have bigger positive impact to drink a lot of beer, then it has negative impact of long-term unemployment.

What a wonderful world.

Python Twitter Bot to Unfollow Friends that do not Follow Back – 3 Easy Steps

What will we cover in this tutorial?

  • How to unfollow friends in Twitter Python that do not follow back.
  • The process to get the access tokens to use the Twitter API
  • How to connect to the twitter API
  • The actual implementation of the code.

This is updated to Tweepy version 4.4

Step 1: Setup up environment

In order to get a connection to twitter you need to have access tokens and consumer keys. If you don’t already have that, or you do not know what it is, then I recommend you follow this tutorial.

You also need the tweepy library. You can install it by typing the following command in the command line or see here for more details.

pip install tweepy

Then you are ready to connect to the Twitter API.’

Step 2: Connecting to Twitter API

The first thing your code should do is to connect to the Twitter API and return the tweepy api to your program.

import tweepy
 
def get_twitter_api():
    # personal details
    consumer_key = "__USE YOUR KEY HERE__"
    consumer_secret = "__USE YOUR KEY HERE__"
    access_token = "__USE YOUR KEY HERE__"
    access_token_secret = "__USE YOUR KEY HERE__"
 
    # authentication of consumer key and secret
    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
 
    # authentication of access token and secret
    auth.set_access_token(access_token, access_token_secret)
    api = tweepy.API(auth, wait_on_rate_limit=True)
    return api

This code will authenticate and return the tweepy api.

Step 3: List friends and followers to unfollow non-following friends

Confused by the headline? Me, too. But here is where the magic happens. 

The code simply explained.

  • Retrieves all the users that follow you (followers)
  • Retrieves those you follow (friends)
  • Loops through friends and check if they follow you
    • If not, unfollow them with a call to destroy_friendship
def process():
    api = get_twitter_api()
 
    followers = api.get_followers_ids(user_id="PythonWithRune")
    print("Followers", len(followers))
    friends = api.get_friends_ids(user_id="PythonWithRune")
    print("You follow:", len(friends))
 
    for friend in friends:
        if friend not in followers:
            api.destroy_friendship(friend)

Full code example here

You can see the full code here to unfollow friends that do not follow back in Twitter using Python

import tweepy
 
 
def get_twitter_api():
    # personal details
    consumer_key = "__USE YOUR KEY HERE__"
    consumer_secret = "__USE YOUR KEY HERE__"
    access_token = "__USE YOUR KEY HERE__"
    access_token_secret = "__USE YOUR KEY HERE__"
 
    # authentication of consumer key and secret
    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
 
    # authentication of access token and secret
    auth.set_access_token(access_token, access_token_secret)
    api = tweepy.API(auth, wait_on_rate_limit=True)
    return api
 
 
def process():
    api = get_twitter_api()
 
    followers = api.get_followers_ids(user_id="PythonWithRune")
    print("Followers", len(followers))
    friends = api.get_friends_ids(user_id="PythonWithRune")
    print("You follow:", len(friends))
 
    for friend in friends:
        if friend not in followers:
            api.destroy_friendship(friend) 
 
if __name__ == "__main__":
    process()

Next step

  • Deploy it to an cron job so it runs every hour.
  • You can use PythonAnywhere (not sponsored by them)

Python Twitter Bot to Follow Followers – 3 Easy Steps

What will we cover in this tutorial?

  • To build a Bot to Follow Followers in Twitter using Python
  • Link to how you can get your access tokens and consumer keys to get access to the Twitter API (needed)
  • How to access the Twitter API
  • Finally, full code example of a Python Twitter Bot to follow the followers your account does not follow already.

Step 1: Setup up environment

In order to get a connection to twitter you need to have access tokens and consumer keys. If you don’t already have that, or you do not know what it is, then I recommend you follow this tutorial.

You also need the tweepy library. You can install it by typing the following command in the command line or see here for more details.

pip install tweepy

Then you are ready to connect to the Twitter API.

Step 2: Connecting to Twitter API

The first thing your code should do is to connect to the Twitter API and return the tweepy api to your program.

import tweepy

def get_twitter_api():
    # personal details
    consumer_key = "__USE YOUR KEY HERE__"
    consumer_secret = "__USE YOUR KEY HERE__"
    access_token = "__USE YOUR KEY HERE__"
    access_token_secret = "__USE YOUR KEY HERE__"

    # authentication of consumer key and secret
    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)

    # authentication of access token and secret
    auth.set_access_token(access_token, access_token_secret)
    api = tweepy.API(auth, wait_on_rate_limit=True)
    return api

This code will authenticate and return the tweepy api.

Step 3: List followers and friends to follow back

Confused by the headline? Me, too. But here is where the magic happens.

The code simply explained.

  • Retrieves all the users that follow you (followers)
  • Retrieves those you follow (friends)
  • Loops through followers and check if you follow them.
    • If not, follow them back
def process():
    api = get_twitter_api()

    followers = api.followers_ids(api.me().id)
    print("Followers", len(followers))
    friends = api.friends_ids(api.me().id)
    print("You follow:", len(friends))

    for follower in followers:
        if follower not in friends:
            api.create_friendship(follower)

Full code example here

You can see the full code here.

import tweepy


def get_twitter_api():
    # personal details
    consumer_key = "__USE YOUR KEY HERE__"
    consumer_secret = "__USE YOUR KEY HERE__"
    access_token = "__USE YOUR KEY HERE__"
    access_token_secret = "__USE YOUR KEY HERE__"

    # authentication of consumer key and secret
    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)

    # authentication of access token and secret
    auth.set_access_token(access_token, access_token_secret)
    api = tweepy.API(auth, wait_on_rate_limit=True)
    return api


def process():
    api = get_twitter_api()

    followers = api.followers_ids(api.me().id)
    print("Followers", len(followers))
    friends = api.friends_ids(api.me().id)
    print("You follow:", len(friends))

    for follower in followers:
        if follower not in friends:
            api.create_friendship(follower)


if __name__ == "__main__":
    process()

Next steps

  • Deploy it to an cron job so it runs every hour.
  • You can use PythonAnywhere (not sponsored by them)
Exit mobile version