Pandas Correlation Methods Explained: Pearson, Kendall, and Spearman

What will we cover in this tutorial?

In this tutorial we will on a live example investigate and understand the differences between the 3 methods to calculate correlation using Pandas DataFrame corr() function.

The purpose of this tutorial is to get a better understanding of these correlations, while working on real data.

Step 1: Getting some data to play with

The data we want to investigate for correlations is the US GDP, S&P 500, Gold and Oil prices. We will only focus on recent time (from 2000-2020), as the prices for Gold and Oil are not available further back on Yahoo! finance. We will get the US GDP from World Bank and the rest from Yahoo! finance.

We will be using the Pandas-datareader to retrieve the data. For a more in-depth introduction to how to use them, we will refer you to this tutorial.

import pandas_datareader as pdf
import datetime as dt
from pandas_datareader import wb


start = dt.datetime(2000, 1, 1)
end = dt.datetime.now()
tickers = pdf.get_data_yahoo(["^GSPC", "GC=F", "CL=F"], start, end)['Adj Close']

gdp = wb.download(indicator='NY.GDP.MKTP.CD', country='US', start=2000, end=2019)
gdp = gdp.reset_index(1).set_index('year')
gdp.index = pd.to_datetime(gdp.index, format="%Y")

data = gdp.join(tickers, how='outer')
data = data.interpolate(method='linear')
data = data.dropna()

data.columns = ["US GDP", "S&P 500", "Gold", "Oil"]

print(data)

Resulting in the following output.

Python 3.8.2 (default, Feb 26 2020, 02:56:10)
                  US GDP      S&P 500         Gold        Oil
2000-08-30  1.047113e+13  1502.589966   273.899994  33.400002
2000-08-31  1.047243e+13  1517.680054   278.299988  33.099998
2000-09-01  1.047373e+13  1520.770020   277.000000  33.380001
2000-09-05  1.047503e+13  1507.079956   275.799988  33.799999
2000-09-06  1.047634e+13  1492.250000   274.200012  34.950001
...                  ...          ...          ...        ...
2020-08-05  2.142770e+13  3327.770020  2031.099976  42.189999
2020-08-06  2.142770e+13  3349.159912  2051.500000  41.950001
2020-08-07  2.142770e+13  3351.280029  2046.099976  41.599998
2020-08-09  2.142770e+13  3351.280029  2037.099976  41.590000
2020-08-10  2.142770e+13  3351.280029  2043.900024  41.889999

Where we see the data we want to investigate for correlations.

Step 2: Investigate Pearson correlation coefficients

Looking at the corr() function on DataFrames it calculate the pairwise correlation between columns and returns a correlation matrix.

The default method is the Pearson correlation coefficient method. As we will see in this tutorial, correlations can be calculated differently. The Pearson is trying to correlate through a straight line between the variables.

The best way to understand that is by using an example.

Let’s first calculate the correlation matrix using the Pearson method and then try to visualize it to understand it better. You can get the correlation method simply by calling corr() on the DataFrame.

print(data.corr())

As it is the default method you do not need to set it be pearson. The output will be.

           US GDP   S&P 500      Gold       Oil
US GDP   1.000000  0.897376  0.817294  0.237426
S&P 500  0.897376  1.000000  0.581576 -0.015951
Gold     0.817294  0.581576  1.000000  0.534163
Oil      0.237426 -0.015951  0.534163  1.000000

A few words on a correlation matrix. The output of the correlation function is a number from -1 to 1. Some high-level interpretations of the output.

  • -1: A full negative correlation. Meaning if variable goes up, the other variable goes down and they are fully correlated.
  • 0: No correlation at all. Meaning that the two variables are not dependent at all. If one goes up, you cannot predict with any probability what will happen to the other.
  • 1: A full correlation. Meaning if the one variable goes up, so will the other.

Numbers between are just indication how much they are dependet.

Looking at the above output, you see that US GDP fully correlates to US GDP. This is obvious, as it is the same variable. Next we have a 0.897376 correlation between US GDP and S&P 500 stock market index. This tells us that there is a high correlation.

Now to be a bit more specific. This correlation is linear.

That means it can be fitted well with a straight line. Let’s try to visualize that.

import matplotlib.pyplot as plt
import numpy as np


# Pearson fit (default method)
fit = np.polyfit(x=data['US GDP'], y=data['S&P 500'], deg=1)
line_fit = np.poly1d(fit)
plt.plot(data['US GDP'], line_fit(data['US GDP']))
plt.scatter(x=data['US GDP'], y=data['S&P 500'], color='red', alpha=0.1)
plt.title("Pearson correlation")
plt.show()

Resulting in the following fit.

Also, let’s investigate something that does not fit well, the US GDP with Oil prices.

import matplotlib.pyplot as plt
import numpy as np


# Pearson fit (default method)
fit = np.polyfit(x=data['US GDP'], y=data['Oil'], deg=1)
line_fit = np.poly1d(fit)
plt.plot(data['US GDP'], line_fit(data['Oil']))
plt.scatter(x=data['US GDP'], y=data['Oil'], color='red', alpha=0.1)
plt.title("Pearson correlation")
plt.show()

As you can see visually, this does not fit as well to a straight line as the above example. The closer the markers are to a fitted straight line, the higher score of the correlation using Pearson. This is independent on the slope of the line, except if the slope is positive (resulting in positive values) or negative (resulting in negative values).

Just some notes to consider about Pearson correlation coefficient. The requirement of the variables being normally distributed is controversial and outside the scope of this tutorial. That said, be careful concluding based on the result. It might be an indicator, but do not conclude any linear correlations or not based on the result.

Step 3: Investigating the Kendall rank correlation coefficients

The Kendall rank correlation coefficient does not assume a normal distribution of the variables and is looking for a monotonic relationship between two variables.

Two variables are monotonic correlated if any greater value of the one variable will result in a greater value of the other variable. If the variables is negatively monotonic correlated, then it is opposite.

The correlation can be calculated as follows.

print(data.corr(method="kendall"))

Resulting in the following output.

           US GDP   S&P 500      Gold       Oil
US GDP   1.000000  0.703141  0.685002  0.249430
S&P 500  0.703141  1.000000  0.426406  0.122434
Gold     0.685002  0.426406  1.000000  0.413298
Oil      0.249430  0.122434  0.413298  1.000000

Which interestingly shows that the Pearson correlation coefficient of US GDP and S&P 500 is higher than the Kendall rank correlation.

As a rule thumb, a correlation less than 0.8 (or greater than -0.8) is considered insignificant and not strongly correlated. This means, that the correlation of US GDP and S&P 500 seems to have a linear correlation but not a strong monotonic correlation.

Remember that these are two different measures and can not be directly compared. As they measure different aspects, it is not surprising. The Pearson method can be thought of how close the points are to a fitted line, while the Kendall method looks if the one variable grows, does the other. As you see on the map, this seems not to be the case. There are many instances where it does not happen.

Step 4: Investigating the Spearman rank correlation

Spearman is closely related to Kendall, and measures whether the variables are monotonically correlated.

The Spearman rank correlation can be computed by the following.

print(data.corr(method="spearman"))

And results in the following output.

           US GDP   S&P 500      Gold       Oil
US GDP   1.000000  0.846197  0.837650  0.317295
S&P 500  0.846197  1.000000  0.609104  0.178937
Gold     0.837650  0.609104  1.000000  0.558569
Oil      0.317295  0.178937  0.558569  1.000000

Which actually is a bit more optimistic about the monotonic correlation between the US GDP and S&P 500.

Can we then conclude that when US GDP goes up, the S&P 500 goes up? Good question. The short answer is no. Example that might make it more understandable. In summer time ice cream sales go up. But also, in summer time sun glass sales goes up. Does that mean that higher ice cream sales implies higher sun glass sales? Not really. It is the factor that there is more sun that affect it.

The same can be true for correlations you find in data. Just think of it as an indicator that they somehow might be connected (or not, if value is close to 0).

Step 5: When to use what?

This is a good question.

  • Pearson correlation coefficient is in general considered stronger as has higher assumptions on data. On the negative, it only considers a full linear dependence (fitting to a straight line) and in (theory) requires the variables to be normally distributed. It is very fragile to outliers (single points far away from the norm).
  • Kendall rank correlation coefficient should be more efficient with smaller sets. It measures the monotonic relationship between two variables, and it is a bit slower to calculate O(n^2). It does not require the variables to be normally distributed.
  • Spearman rank correlation coefficient also measures the monotonic relationship between two variables. The speed is faster O(n log(n)). It often gives a slightly higher value than Kendalls. It also does not require the variables to be normally distributed.

Leave a Reply