Learn how you can become a Python programmer in just 12 weeks.

    We respect your privacy. Unsubscribe at anytime.

    Master Data Correlation with Pandas DataFrame in 3 Easy Steps

    What will we cover in this tutorial?

    • How to get data using Pandas DataFrames.
    • Clean the data and merge it together.
    • Finally, how to see if there is any correlation between data columns.

    Step 1: Get the data you want to correlate

    As an example, let’s assume you get the idea that there might be a correlation between GDP per capita, Social Progress Index (SPI), and Human Development Index (HDI), but is not sure whether SPI or HDI is closets correlated to GDP per capita.

    Luckily, you have pandas to the rescue.

    As the data is in three pages, you need to collect it by separately and merge it later. First, let us collect the data and inspect it.

    The GDP per capita is located in the table on wikipedia presented in the picture below.

    From wikipedia.org

    Which is actually three tables. We will use the World Bank table in our example. It can be collected by using a call to pandas read_html. If you are new to read_html we recommend you read this tutorial.

    import pandas as pd
    # The URL we will read our data from
    url = 'https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)_per_capita'
    # read_html returns a list of tables from the URL
    tables = pd.read_html(url)
    table = tables[3]
    print(table)
    

    Which will give an output similar to this.

        Rank                       Country/Territory     US$
    0      1                           Monaco (2018)  185741
    1      2                    Liechtenstein (2017)  173356
    2      3                              Luxembourg  114705
    3      —                                   Macau   84096
    4      4                             Switzerland   81994
    5      5                                 Ireland   78661
    6      6                                  Norway   75420
    7      7                                 Iceland   66945
    

    The next table we need to get is the Social Progress Index (SPI) and looks like the picture shows below.

    From wikipedia.org

    This can be collected to a DataFrame with the following code.

    import pandas as pd
    url = 'https://en.wikipedia.org/wiki/Social_Progress_Index'
    tables = pd.read_html(url)
    print(tables[1])
    

    Which will print the following to the screen (or the top of it).

                              Country 2019[9]               
                              Country    Rank  Score Score.1
    0                          Norway       1  90.95     NaN
    1                         Denmark       2  90.09     NaN
    2                     Switzerland       3  89.89     NaN
    3                         Finland       4  89.56     NaN
    4                          Sweden       5  89.45     NaN
    

    Finally we need to read the Human Development Index (HDI), which can be seen on wikipedia as the following picture shows.

    From wikipedia.org

    And can be collected with the following code.

    import pandas as pd
    url = 'https://en.wikipedia.org/wiki/List_of_countries_by_Human_Development_Index'
    tables = pd.read_html(url)
    print(tables[1])
    

    Resulting in the following output.

                               Rank                                             Country or Territory                         HDI                                                    Unnamed: 5_level_0
        2018 data (2019 report)​[5] Change from previous year​[5]               Country or Territory 2018 data (2019 report)​[5] Average annual HDI growth (2010–2018)​[6]          Unnamed: 5_level_1
        Very high human development   Very high human development        Very high human development Very high human development               Very high human development Very high human development
    0                             1                           NaN                             Norway                       0.954                                     0.16%                         NaN
    1                             2                           NaN                        Switzerland                       0.946                                     0.18%                         NaN
    2                             3                           NaN                            Ireland                       0.942                                     0.71%                         NaN
    3                             4                           NaN                            Germany                       0.939                                     0.25%                         NaN
    4                             4                           (2)                          Hong Kong                       0.939                                     0.51%                         NaN
    5                             6                           (1)                          Australia                       0.938                                     0.17%                         NaN
    

    A bit more messy data table.

    Now we have gathered all the data we need to clean it up and merge it together.

    Step 2: Clean and merge the data into one DataFrame

    If we first inspect the data from the GDP per capita.

        Rank                       Country/Territory     US$
    0      1                           Monaco (2018)  185741
    1      2                    Liechtenstein (2017)  173356
    2      3                              Luxembourg  114705
    3      —                                   Macau   84096
    4      4                             Switzerland   81994
    5      5                                 Ireland   78661
    6      6                                  Norway   75420
    7      7                                 Iceland   66945
    

    Notice that Country/Territory can have a year in parentheses, which will make it hard to merge. We need to clean that up. Also, we do not need the Rank column.

    If we inspect the data of HDI.

                              Country 2019[9]               
                              Country    Rank  Score Score.1
    0                          Norway       1  90.95     NaN
    1                         Denmark       2  90.09     NaN
    2                     Switzerland       3  89.89     NaN
    3                         Finland       4  89.56     NaN
    4                          Sweden       5  89.45     NaN
    

    Here we notice that the first row is an additional description row, which we can remove. Further, we do not need the Rank and Score.1 columns.

    Let’s try to merge it together. Notice that we use a lambda function to clean up the Country/Territory names. If you are new to lambda functions, we recommend you read this tutorial.

    import pandas as pd
    # The URL we will read our data from
    url = 'https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)_per_capita'
    # read_html returns a list of tables from the URL
    tables = pd.read_html(url)
    # The data is in table 3
    table = tables[3]
    # We need to clean the years in parenthesis from the country/territory field
    table['Country'] = table.apply(lambda row: row['Country/Territory'].split(' (')[0], axis=1)
    # We do not need the Rank and Country/Territory for more data
    table = table.drop(['Rank', 'Country/Territory'], axis=1)
    url = 'https://en.wikipedia.org/wiki/Social_Progress_Index'
    tables = pd.read_html(url)
    merge_table = tables[1]
    # The first level of the table can be dropped
    merge_table.columns = merge_table.columns.droplevel(0)
    # We do not need the Rank and Score.1 columns
    merge_table = merge_table.drop(['Rank', 'Score.1'], axis=1)
    # Need to rename the second column to SPI = Social Progress Index
    merge_table.columns = ['Country', 'SPI']
    # Ready to merge the tables
    table = table.merge(merge_table, how="left", left_on=['Country'], right_on=['Country'])
    print(table)
    

    Which will result in an output like this.

            US$                            Country    SPI
    0    185741                             Monaco    NaN
    1    173356                      Liechtenstein    NaN
    2    114705                         Luxembourg  87.66
    3     84096                              Macau    NaN
    4     81994                        Switzerland  89.89
    5     78661                            Ireland  87.97
    6     75420                             Norway  90.95
    

    First validate that Monaco, Liechtenstein, and Macau do not have any SPI value. That seems to be correct.

    Then we can proceed to the next table of HDI. Let us first inspect the data.

                               Rank                                             Country or Territory                         HDI                                                    Unnamed: 5_level_0
        2018 data (2019 report)​[5] Change from previous year​[5]               Country or Territory 2018 data (2019 report)​[5] Average annual HDI growth (2010–2018)​[6]          Unnamed: 5_level_1
        Very high human development   Very high human development        Very high human development Very high human development               Very high human development Very high human development
    0                             1                           NaN                             Norway                       0.954                                     0.16%                         NaN
    1                             2                           NaN                        Switzerland                       0.946                                     0.18%                         NaN
    2                             3                           NaN                            Ireland                       0.942                                     0.71%                         NaN
    3                             4                           NaN                            Germany                       0.939                                     0.25%                         NaN
    4                             4                           (2)                          Hong Kong                       0.939                                     0.51%                         NaN
    5                             6                           (1)                          Australia                       0.938                                     0.17%                         NaN
    

    It has a quite messy top level column naming in 3 layers. Dropping them will make some identical. To deal with that, we can rename them and delete those we do not need.

    import pandas as pd
    # The URL we will read our data from
    url = 'https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)_per_capita'
    # read_html returns a list of tables from the URL
    tables = pd.read_html(url)
    # The data is in table 3
    table = tables[3]
    # We need to clean the years in parenthesis from the country/territory field
    table['Country'] = table.apply(lambda row: row['Country/Territory'].split(' (')[0], axis=1)
    # We do not need the Rank and Country/Territory for more data
    table = table.drop(['Rank', 'Country/Territory'], axis=1)
    url = 'https://en.wikipedia.org/wiki/Social_Progress_Index'
    tables = pd.read_html(url)
    merge_table = tables[1]
    # The first level of the table can be dropped
    merge_table.columns = merge_table.columns.droplevel(0)
    # We do not need the Rank and Score.1 columns
    merge_table = merge_table.drop(['Rank', 'Score.1'], axis=1)
    # Need to rename the second column to SPI = Social Progress Index
    merge_table.columns = ['Country', 'SPI']
    # Ready to merge the tables
    table = table.merge(merge_table, how="left", left_on=['Country'], right_on=['Country'])
    url = 'https://en.wikipedia.org/wiki/List_of_countries_by_Human_Development_Index'
    tables = pd.read_html(url)
    
    merge_table = tables[1]
    # Delete the additional column levels
    merge_table.columns = merge_table.columns.droplevel(1)
    merge_table.columns = merge_table.columns.droplevel(1)
    # Rename the columns
    merge_table.columns = ['Rank1', 'Rank2', 'Country', 'HDI', 'HDI-1', 'None']
    # Delete the columns we do not need
    merge_table = merge_table.drop(['Rank1', 'Rank2', 'HDI-1', 'None'], axis=1)
    # Merge the tables
    table = table.merge(merge_table, how="left", left_on=['Country'], right_on=['Country'])
    print(table)
    

    Which will result in the following output (or the top of it).

            US$                            Country    SPI    HDI
    0    185741                             Monaco    NaN    NaN
    1    173356                      Liechtenstein    NaN  0.917
    2    114705                         Luxembourg  87.66  0.909
    3     84096                              Macau    NaN    NaN
    4     81994                        Switzerland  89.89  0.946
    5     78661                            Ireland  87.97  0.942
    6     75420                             Norway  90.95  0.954
    

    Notice, that here Liechtenstein has HDI data, while Monaco and Macau do not have any data. While it is not visible, the HDI column is not made of float. It can be seen with a call to table.dtypes, which will output the following.

    US$          int64
    Country     object
    SPI        float64
    HDI         object
    dtype: object
    

    Which states that HDI is object, which in this case is a string. That means we need to convert it to float to make our final correlation computations. This can be done by using a lambda function.

    table['HDI'] = table.apply(lambda row: float(row['HDI']) if row['HDI'] is not np.nan else np.nan, axis=1) # HDI = Human Development Index
    

    This actually makes the data ready to see if there is any correlations between GDP per capita and SPI and/or HDI.

    Step 3: Calculate the correlations

    This is where the DataFrames from pandas come strong. It can do the entire work for you with one call to corr().

    The full code is given below.

    import pandas as pd
    import numpy as np
    
    pd.set_option('display.max_rows', 300)
    pd.set_option('display.max_columns', 10)
    pd.set_option('display.width', 1000)
    
    # The URL we will read our data from
    url = 'https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)_per_capita'
    # read_html returns a list of tables from the URL
    tables = pd.read_html(url)
    # The data is in table 3
    table = tables[3]
    # We need to clean the years in parenthesis from the country/territory field
    table['Country'] = table.apply(lambda row: row['Country/Territory'].split(' (')[0], axis=1)
    # We do not need the Rank and Country/Territory for more data
    table = table.drop(['Rank', 'Country/Territory'], axis=1)
    url = 'https://en.wikipedia.org/wiki/Social_Progress_Index'
    tables = pd.read_html(url)
    merge_table = tables[1]
    # The first level of the table can be dropped
    merge_table.columns = merge_table.columns.droplevel(0)
    # We do not need the Rank and Score.1 columns
    merge_table = merge_table.drop(['Rank', 'Score.1'], axis=1)
    # Need to rename the second column to SPI = Social Progress Index
    merge_table.columns = ['Country', 'SPI']
    # Ready to merge the tables
    table = table.merge(merge_table, how="left", left_on=['Country'], right_on=['Country'])
    url = 'https://en.wikipedia.org/wiki/List_of_countries_by_Human_Development_Index'
    tables = pd.read_html(url)
    
    merge_table = tables[1]
    # Delete the additional column levels
    merge_table.columns = merge_table.columns.droplevel(1)
    merge_table.columns = merge_table.columns.droplevel(1)
    # Rename the columns
    merge_table.columns = ['Rank1', 'Rank2', 'Country', 'HDI', 'HDI-1', 'None']
    # Delete the columns we do not need
    merge_table = merge_table.drop(['Rank1', 'Rank2', 'HDI-1', 'None'], axis=1)
    # Merge the tables
    table = table.merge(merge_table, how="left", left_on=['Country'], right_on=['Country'])
    # Convert to floats
    table['HDI'] = table.apply(lambda row: float(row['HDI']) if row['HDI'] is not np.nan else np.nan, axis=1) # HDI = Human Development Index
    # Calculate the correlation
    table_corr = table.corr()
    # Print the correlation to GDP per capita (stored in US$).
    print(table_corr['US$'].sort_values(ascending=False))
    

    Which will result in the following output.

    US$    1.000000
    SPI    0.713946
    HDI    0.663183
    Name: US$, dtype: float64
    

    Hence, it seems that there is the biggest correlation between GDP per capita and SPI.

    Notice, that the calculations ignores all Not a Number (np.nan).

    Python Circle

    Do you know what the 5 key success factors every programmer must have?

    How is it possible that some people become programmer so fast?

    While others struggle for years and still fail.

    Not only do they learn python 10 times faster they solve complex problems with ease.

    What separates them from the rest?

    I identified these 5 success factors that every programmer must have to succeed:

    1. Collaboration: sharing your work with others and receiving help with any questions or challenges you may have.
    2. Networking: the ability to connect with the right people and leverage their knowledge, experience, and resources.
    3. Support: receive feedback on your work and ask questions without feeling intimidated or judged.
    4. Accountability: stay motivated and accountable to your learning goals by surrounding yourself with others who are also committed to learning Python.
    5. Feedback from the instructor: receiving feedback and support from an instructor with years of experience in the field.

    I know how important these success factors are for growth and progress in mastering Python.

    That is why I want to make them available to anyone struggling to learn or who just wants to improve faster.

    With the Python Circle community, you can take advantage of 5 key success factors every programmer must have.

    Python Circle
    Python Circle

    Be part of something bigger and join the Python Circle community.

    Leave a Comment