    Pandas Correlation Methods Explained: Pearson, Kendall, and Spearman

    What will we cover in this tutorial?

    In this tutorial we will use a live example to investigate and understand the differences between the three correlation methods available in the Pandas DataFrame corr() function.

    The purpose of this tutorial is to get a better understanding of these correlations, while working on real data.

    Step 1: Getting some data to play with

    The data we want to investigate for correlations is the US GDP, S&P 500, Gold and Oil prices. We will only focus on recent times (2000-2020), as the prices for Gold and Oil are not available further back on Yahoo! Finance. We will get the US GDP from the World Bank and the rest from Yahoo! Finance.

    We will be using the Pandas-datareader to retrieve the data. For a more in-depth introduction to how to use it, we refer you to this tutorial.

    import pandas as pd
    import pandas_datareader as pdf
    import datetime as dt
    from pandas_datareader import wb
    
    start = dt.datetime(2000, 1, 1)
    end = dt.datetime.now()
    
    # Daily adjusted close prices for the S&P 500, Gold and Oil from Yahoo! Finance
    tickers = pdf.get_data_yahoo(["^GSPC", "GC=F", "CL=F"], start, end)['Adj Close']
    
    # Yearly US GDP (current US$) from the World Bank
    gdp = wb.download(indicator='NY.GDP.MKTP.CD', country='US', start=2000, end=2019)
    gdp = gdp.reset_index(1).set_index('year')
    gdp.index = pd.to_datetime(gdp.index, format="%Y")
    
    # Join the yearly GDP onto the daily prices and fill the gaps by linear interpolation
    data = gdp.join(tickers, how='outer')
    data = data.interpolate(method='linear')
    data = data.dropna()
    data.columns = ["US GDP", "S&P 500", "Gold", "Oil"]
    print(data)
    
    

    Resulting in the following output.

                      US GDP      S&P 500         Gold        Oil
    2000-08-30  1.047113e+13  1502.589966   273.899994  33.400002
    2000-08-31  1.047243e+13  1517.680054   278.299988  33.099998
    2000-09-01  1.047373e+13  1520.770020   277.000000  33.380001
    2000-09-05  1.047503e+13  1507.079956   275.799988  33.799999
    2000-09-06  1.047634e+13  1492.250000   274.200012  34.950001
    ...                  ...          ...          ...        ...
    2020-08-05  2.142770e+13  3327.770020  2031.099976  42.189999
    2020-08-06  2.142770e+13  3349.159912  2051.500000  41.950001
    2020-08-07  2.142770e+13  3351.280029  2046.099976  41.599998
    2020-08-09  2.142770e+13  3351.280029  2037.099976  41.590000
    2020-08-10  2.142770e+13  3351.280029  2043.900024  41.889999
    

    Where we see the data we want to investigate for correlations.

    Step 2: Investigate Pearson correlation coefficients

    The corr() function on a DataFrame calculates the pairwise correlation between columns and returns a correlation matrix.

    The default method is the Pearson correlation coefficient. As we will see in this tutorial, correlations can be calculated in different ways. The Pearson method measures how well the relationship between the variables can be described by a straight line.

    The best way to understand that is by using an example.

    Let’s first calculate the correlation matrix using the Pearson method and then try to visualize it to understand it better. You can get the correlation matrix simply by calling corr() on the DataFrame.

    print(data.corr())
    

    As it is the default method, you do not need to set the method to pearson explicitly. The output will be the following.

               US GDP   S&P 500      Gold       Oil
    US GDP   1.000000  0.897376  0.817294  0.237426
    S&P 500  0.897376  1.000000  0.581576 -0.015951
    Gold     0.817294  0.581576  1.000000  0.534163
    Oil      0.237426 -0.015951  0.534163  1.000000
    

    A few words on the correlation matrix. Each entry is a number from -1 to 1. Some high-level interpretations of the output:

    • -1: A full negative correlation. If one variable goes up, the other goes down; they are fully (inversely) correlated.
    • 0: No correlation at all. The two variables are not dependent on each other: if one goes up, you cannot predict with any probability what will happen to the other.
    • 1: A full positive correlation. If one variable goes up, so does the other.

    Numbers in between indicate how strongly the two variables are related, as the sketch below illustrates.
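
    To make the endpoints concrete, here is a minimal sketch on synthetic data (not part of the tutorial's data set; the series are only illustrative) that produces correlations of 1, -1 and roughly 0.

    import pandas as pd
    import numpy as np
    
    x = pd.Series(range(100))
    up = 2*x + 5                  # moves exactly with x
    down = -3*x + 10              # moves exactly opposite to x
    noise = pd.Series(np.random.default_rng(0).normal(size=100))  # unrelated to x
    
    print(x.corr(up))     # 1.0
    print(x.corr(down))   # -1.0
    print(x.corr(noise))  # close to 0 (not exactly 0, because of randomness)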

    Looking at the above output, you see that US GDP fully correlates with US GDP. This is obvious, as it is the same variable. Next, we have a 0.897376 correlation between US GDP and the S&P 500 stock market index, which tells us there is a high correlation.

    Now to be a bit more specific. This correlation is linear.

    That means it can be fitted well with a straight line. Let’s try to visualize that.

    import matplotlib.pyplot as plt
    import numpy as np
    
    # Pearson fit (default method)
    fit = np.polyfit(x=data['US GDP'], y=data['S&P 500'], deg=1)
    line_fit = np.poly1d(fit)
    plt.plot(data['US GDP'], line_fit(data['US GDP']))
    plt.scatter(x=data['US GDP'], y=data['S&P 500'], color='red', alpha=0.1)
    plt.title("Pearson correlation")
    plt.show()
    

    Resulting in the following fit.

    Also, let’s investigate something that does not fit as well: US GDP against Oil prices.

    import matplotlib.pyplot as plt
    import numpy as np
    
    # Pearson fit (default method)
    fit = np.polyfit(x=data['US GDP'], y=data['Oil'], deg=1)
    line_fit = np.poly1d(fit)
    plt.plot(data['US GDP'], line_fit(data['US GDP']))
    plt.scatter(x=data['US GDP'], y=data['Oil'], color='red', alpha=0.1)
    plt.title("Pearson correlation")
    plt.show()
    

    As you can see visually, this does not fit a straight line nearly as well as the example above. The closer the markers are to a fitted straight line, the higher the Pearson correlation score. This is independent of the slope of the line, except that a positive slope results in positive values and a negative slope in negative values, as the sketch below illustrates.
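
    Here is a small sketch on synthetic data (the variables are made up for illustration and are not part of the tutorial's data): both a steep and a shallow line with little noise score close to 1, while a falling line scores close to -1.

    import pandas as pd
    import numpy as np
    
    rng = np.random.default_rng(1)
    x = pd.Series(rng.normal(size=500))
    noise = 0.1*pd.Series(rng.normal(size=500))
    
    steep = 10*x + noise     # steep positive slope
    shallow = 0.5*x + noise  # shallow positive slope
    falling = -2*x + noise   # negative slope
    
    print(x.corr(steep))    # close to 1
    print(x.corr(shallow))  # also close to 1 - the slope does not change the score
    print(x.corr(falling))  # close to -1 - only the sign changes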

    A few notes to consider about the Pearson correlation coefficient. The requirement that the variables be normally distributed is debated and outside the scope of this tutorial. That said, be careful drawing conclusions from the result alone. It can be an indicator, but you should not conclude that a linear correlation exists (or does not) based on this number by itself.

    Step 3: Investigating the Kendall rank correlation coefficients

    The Kendall rank correlation coefficient does not assume a normal distribution of the variables and looks for a monotonic relationship between two variables.

    Two variables are monotonically correlated if a greater value of one variable always corresponds to a greater value of the other. If the variables are negatively monotonically correlated, it is the opposite. The sketch below shows how this differs from Pearson.
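
    Here is a minimal sketch on synthetic data (not the GDP data): a cubic relationship is perfectly monotonic, so Kendall reports 1.0, while Pearson reports less than 1 because the points do not lie on a straight line.

    import pandas as pd
    import numpy as np
    
    x = pd.Series(np.linspace(1, 10, 100))
    y = x**3   # strictly increasing, but clearly not a straight line
    
    print(x.corr(y, method="kendall"))   # 1.0, the relationship is perfectly monotonic
    print(x.corr(y, method="pearson"))   # below 1.0, the points do not lie on a straight line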

    The correlation can be calculated as follows.

    print(data.corr(method="kendall"))
    

    Resulting in the following output.

               US GDP   S&P 500      Gold       Oil
    US GDP   1.000000  0.703141  0.685002  0.249430
    S&P 500  0.703141  1.000000  0.426406  0.122434
    Gold     0.685002  0.426406  1.000000  0.413298
    Oil      0.249430  0.122434  0.413298  1.000000
    

    Interestingly, this shows that the Pearson correlation coefficient of US GDP and S&P 500 (0.897) is higher than the Kendall rank correlation (0.703).

    As a rule of thumb, a correlation between -0.8 and 0.8 is not considered strong. This means that US GDP and the S&P 500 seem to have a linear correlation but not a strong monotonic correlation.

    Remember that these are two different measures and cannot be compared directly. As they measure different aspects, the difference is not surprising. The Pearson method can be thought of as measuring how close the points are to a fitted straight line, while the Kendall method asks: if one variable grows, does the other grow too? As you can see on the plot, this is not always the case; there are many instances where it does not happen.

    Step 4: Investigating the Spearman rank correlation

    Spearman is closely related to Kendall, and measures whether the variables are monotonically correlated.

    The Spearman rank correlation can be computed by the following.

    print(data.corr(method="spearman"))
    

    And results in the following output.

               US GDP   S&P 500      Gold       Oil
    US GDP   1.000000  0.846197  0.837650  0.317295
    S&P 500  0.846197  1.000000  0.609104  0.178937
    Gold     0.837650  0.609104  1.000000  0.558569
    Oil      0.317295  0.178937  0.558569  1.000000
    

    This is actually a bit more optimistic about the monotonic correlation between US GDP and the S&P 500 (0.846, versus 0.703 with Kendall).
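
    A way to build intuition for Spearman (a small sketch, assuming the data DataFrame from Step 1 is still in scope): Spearman is essentially the Pearson correlation computed on the ranks of the values rather than on the values themselves.

    # Rank every value within its column, then compute Pearson on the ranks
    print(data.rank().corr(method="pearson"))
    print(data.corr(method="spearman"))   # the two matrices should match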

    Can we then conclude that when US GDP goes up, the S&P 500 goes up? Good question. The short answer is no. An example might make it more understandable. In summer, ice cream sales go up. But in summer, sunglasses sales also go up. Does that mean that higher ice cream sales imply higher sunglasses sales? Not really. It is the extra sunshine that affects both.

    The same can be true for correlations you find in data. Just think of it as an indicator that they somehow might be connected (or not, if value is close to 0).

    Step 5: When to use what?

    This is a good question.

    • Pearson correlation coefficient is in general considered stronger, as it makes higher assumptions on the data. On the negative side, it only considers a full linear dependence (fitting to a straight line) and (in theory) requires the variables to be normally distributed. It is also very sensitive to outliers (single points far away from the rest), as the sketch after this list illustrates.
    • Kendall rank correlation coefficient should be more efficient with smaller data sets. It measures the monotonic relationship between two variables, and it is a bit slower to calculate, O(n^2). It does not require the variables to be normally distributed.
    • Spearman rank correlation coefficient also measures the monotonic relationship between two variables. It is faster to calculate, O(n log(n)), and often gives a slightly higher value than Kendall's. It also does not require the variables to be normally distributed.
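
    To illustrate the point about outliers, here is a minimal sketch on synthetic data (the numbers are made up for illustration): a single extreme point is enough to drag the Pearson coefficient down, while the rank-based methods barely move.

    import pandas as pd
    import numpy as np
    
    rng = np.random.default_rng(42)
    x = pd.Series(rng.normal(size=200))
    y = x + 0.1*pd.Series(rng.normal(size=200))   # nearly perfect linear relationship
    df = pd.DataFrame({"x": x, "y": y})
    
    print(df["x"].corr(df["y"], method="pearson"))    # close to 1
    
    # Add a single extreme outlier
    df.loc[len(df)] = [0.0, 100.0]
    print(df["x"].corr(df["y"], method="pearson"))    # drops sharply
    print(df["x"].corr(df["y"], method="spearman"))   # rank based, barely affected
    print(df["x"].corr(df["y"], method="kendall"))    # rank based, barely affected

    In short, if your data may contain outliers or is not normally distributed, the rank-based methods are often the safer choice.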
