What will we cover in this tutorial?
In this tutorial we will use a live example to investigate and understand the differences between the three correlation methods available in the Pandas DataFrame corr() function.
- Pearson correlation coefficient: Measures the linear correlation between two variables.
- Kendall rank correlation coefficient: Measures the ordinal association between two variables.
- Spearman rank correlation coefficient: Measures if the relationship between two variables is monotonic.
The purpose of this tutorial is to get a better understanding of these correlations, while working on real data.
Step 1: Getting some data to play with
The data we want to investigate for correlations is the US GDP, the S&P 500, and Gold and Oil prices. We will focus only on recent years (2000 to 2020), as the prices for Gold and Oil are not available further back on Yahoo! Finance. We will get the US GDP from the World Bank and the rest from Yahoo! Finance.
import datetime as dt

import pandas as pd
import pandas_datareader as pdf
from pandas_datareader import wb

start = dt.datetime(2000, 1, 1)
end = dt.datetime.now()

# Daily adjusted close prices for the S&P 500, Gold and Oil from Yahoo! Finance
tickers = pdf.get_data_yahoo(["^GSPC", "GC=F", "CL=F"], start, end)['Adj Close']

# Yearly US GDP from the World Bank, indexed by year
gdp = wb.download(indicator='NY.GDP.MKTP.CD', country='US', start=2000, end=2019)
gdp = gdp.reset_index(1).set_index('year')
gdp.index = pd.to_datetime(gdp.index, format="%Y")

# Join on the date index and fill the gaps between yearly GDP values linearly
data = gdp.join(tickers, how='outer')
data = data.interpolate(method='linear')
data = data.dropna()

data.columns = ["US GDP", "S&P 500", "Gold", "Oil"]
print(data)
Resulting in the following output.
                  US GDP      S&P 500         Gold        Oil
2000-08-30  1.047113e+13  1502.589966   273.899994  33.400002
2000-08-31  1.047243e+13  1517.680054   278.299988  33.099998
2000-09-01  1.047373e+13  1520.770020   277.000000  33.380001
2000-09-05  1.047503e+13  1507.079956   275.799988  33.799999
2000-09-06  1.047634e+13  1492.250000   274.200012  34.950001
...                  ...          ...          ...        ...
2020-08-05  2.142770e+13  3327.770020  2031.099976  42.189999
2020-08-06  2.142770e+13  3349.159912  2051.500000  41.950001
2020-08-07  2.142770e+13  3351.280029  2046.099976  41.599998
2020-08-09  2.142770e+13  3351.280029  2037.099976  41.590000
2020-08-10  2.142770e+13  3351.280029  2043.900024  41.889999
This is the data we want to investigate for correlations.
Step 2: Investigate Pearson correlation coefficients
The corr() function on a DataFrame calculates the pairwise correlation between columns and returns a correlation matrix.
The default method is the Pearson correlation coefficient. As we will see in this tutorial, correlations can be calculated in different ways. Pearson measures how well the relationship between two variables is described by a straight line.
The best way to understand that is by using an example.
Let’s first calculate the correlation matrix using the Pearson method and then try to visualize it to understand it better. You can get the correlation matrix simply by calling corr() on the DataFrame.
As Pearson is the default method, you do not need to pass method='pearson'. Calling print(data.corr()) gives the following output.
           US GDP   S&P 500      Gold       Oil
US GDP   1.000000  0.897376  0.817294  0.237426
S&P 500  0.897376  1.000000  0.581576 -0.015951
Gold     0.817294  0.581576  1.000000  0.534163
Oil      0.237426 -0.015951  0.534163  1.000000
A few words on the correlation matrix. Each entry is a number from -1 to 1. Some high-level interpretations of the values:
- -1: A full negative correlation. If one variable goes up, the other goes down, and they are fully correlated.
- 0: No correlation. The measure finds no relationship between the two variables. If one goes up, it tells you nothing about what will happen to the other.
- 1: A full positive correlation. If one variable goes up, so does the other.
Values in between indicate how strongly the two variables are correlated.
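To make the three reference values concrete, here is a minimal sketch on made-up data. The column names (x, up, down, noise) and the seed are hypothetical and chosen only for illustration; they are not part of the financial data set used in this tutorial.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the market data from this tutorial.
rng = np.random.default_rng(0)
x = np.arange(1000, dtype=float)

toy = pd.DataFrame({
    "x": x,
    "up": 2 * x + 1,                    # rises exactly linearly with x
    "down": -3 * x + 5,                 # falls exactly linearly with x
    "noise": rng.normal(size=x.size),   # generated independently of x
})

# Pearson is the default method of corr()
toy_corr = toy.corr()
print(toy_corr.round(3))
```

The x/up entry should be 1, the x/down entry -1, and the x/noise entry close to 0, though rarely exactly 0 on a finite sample.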
Looking at the above output, you see that US GDP fully correlates with US GDP. This is obvious, as it is the same variable. Next we have a correlation of 0.897376 between US GDP and the S&P 500 stock market index. This tells us that there is a high correlation.
Now to be a bit more specific. This correlation is linear.
That means it can be fitted well with a straight line. Let’s try to visualize that.
import matplotlib.pyplot as plt
import numpy as np

# Pearson fit (default method)
fit = np.polyfit(x=data['US GDP'], y=data['S&P 500'], deg=1)
line_fit = np.poly1d(fit)

plt.plot(data['US GDP'], line_fit(data['US GDP']))
plt.scatter(x=data['US GDP'], y=data['S&P 500'], color='red', alpha=0.1)
plt.title("Pearson correlation")
plt.show()
Resulting in the following fit.
Also, let’s investigate something that does not fit well, the US GDP with Oil prices.
import matplotlib.pyplot as plt
import numpy as np

# Pearson fit (default method)
fit = np.polyfit(x=data['US GDP'], y=data['Oil'], deg=1)
line_fit = np.poly1d(fit)

# Evaluate the fitted line at the x values (US GDP), not at the Oil prices
plt.plot(data['US GDP'], line_fit(data['US GDP']))
plt.scatter(x=data['US GDP'], y=data['Oil'], color='red', alpha=0.1)
plt.title("Pearson correlation")
plt.show()
As you can see visually, this does not fit a straight line as well as the example above. The closer the markers are to a fitted straight line, the higher the magnitude of the Pearson correlation. The steepness of the line does not matter; only its sign does. A positive slope gives a positive value and a negative slope gives a negative value.
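The point about the slope can be checked with a small sketch. The data below is synthetic (a noisy line with a hypothetical seed), and the names r_base, r_steep and r_flipped are only illustrative.

```python
import numpy as np
import pandas as pd

# Synthetic noisy line; not the tutorial's market data.
rng = np.random.default_rng(1)
x = pd.Series(np.linspace(0, 10, 200))
y = x + rng.normal(scale=1.0, size=x.size)

r_base = x.corr(y)           # Pearson is the default for Series.corr
r_steep = x.corr(100 * y)    # fitted line is 100 times steeper
r_flipped = x.corr(-y)       # fitted line has a negative slope

print(r_base, r_steep, r_flipped)
```

Scaling y changes the slope of the fitted line but leaves the coefficient unchanged, while negating y only flips its sign.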
A few notes on the Pearson correlation coefficient. Whether the variables are required to be normally distributed is controversial and outside the scope of this tutorial. That said, be careful when drawing conclusions from the result. It can serve as an indicator, but it does not by itself prove or rule out a linear relationship.
Step 3: Investigating the Kendall rank correlation coefficients
The Kendall rank correlation coefficient does not assume a normal distribution of the variables and looks for a monotonic relationship between two variables.
Two variables are monotonically correlated if a greater value of one variable comes with a greater value of the other. If the variables are negatively monotonically correlated, it is the opposite: a greater value of one comes with a smaller value of the other.
The correlation can be calculated by passing method='kendall' to corr(), that is, print(data.corr(method='kendall')).
Resulting in the following output.
           US GDP   S&P 500      Gold       Oil
US GDP   1.000000  0.703141  0.685002  0.249430
S&P 500  0.703141  1.000000  0.426406  0.122434
Gold     0.685002  0.426406  1.000000  0.413298
Oil      0.249430  0.122434  0.413298  1.000000
Interestingly, the Pearson correlation coefficient of US GDP and S&P 500 (0.897) is higher than the Kendall rank correlation (0.703).
As a rough rule of thumb, a correlation between -0.8 and 0.8 is not considered strong. By that rule, US GDP and S&P 500 seem to have a strong linear correlation but not a strong monotonic correlation.
Remember that these are two different measures and cannot be directly compared. As they measure different aspects, the difference is not surprising. The Pearson method can be thought of as how close the points are to a fitted line, while the Kendall method checks, for each pair of observations, whether the other variable grows when the first one does. As you can see on the plot above, this is often not the case. There are many instances where it does not happen.
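The pair-by-pair idea behind Kendall can be sketched by counting pairs directly. This is a simplified illustration on hypothetical tie-free values, not how pandas computes it internally. It assumes SciPy is installed, since pandas relies on it for method="kendall", and that the tie-adjusted value returned there coincides with the simple count when there are no ties.

```python
from itertools import combinations

import pandas as pd

# Hypothetical toy values without ties (not the tutorial's market data).
a = [1, 2, 3, 4, 5, 6]
b = [2, 1, 4, 3, 6, 5]

concordant = 0
discordant = 0
for i, j in combinations(range(len(a)), 2):
    # A positive product means both variables move in the same direction.
    product = (a[i] - a[j]) * (b[i] - b[j])
    if product > 0:
        concordant += 1
    elif product < 0:
        discordant += 1

n_pairs = len(a) * (len(a) - 1) // 2
tau_manual = (concordant - discordant) / n_pairs

# Assumption: SciPy is available; without ties both values should agree.
tau_pandas = pd.Series(a).corr(pd.Series(b), method="kendall")

print(concordant, discordant, tau_manual, tau_pandas)
```

Every pair where the two variables disagree in direction pulls the coefficient down, which is why many small reversals in the market data lower the Kendall value even when the overall trend is clearly upward.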
Step 4: Investigating the Spearman rank correlation
Spearman is closely related to Kendall and also measures whether the variables are monotonically correlated.
The Spearman rank correlation can be computed by passing method='spearman' to corr(), that is, print(data.corr(method='spearman')).
And results in the following output.
           US GDP   S&P 500      Gold       Oil
US GDP   1.000000  0.846197  0.837650  0.317295
S&P 500  0.846197  1.000000  0.609104  0.178937
Gold     0.837650  0.609104  1.000000  0.558569
Oil      0.317295  0.178937  0.558569  1.000000
This is a bit more optimistic about the monotonic correlation between US GDP and the S&P 500 (0.846, compared with 0.703 for Kendall).
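One way to see what Spearman measures is that it equals the Pearson correlation of the ranked values. The sketch below checks this on a small hypothetical data set; the column names and values are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data: y mostly grows with x, but not along a straight line.
toy = pd.DataFrame({
    "x": [1, 2, 3, 4, 5, 6],
    "y": [1, 8, 27, 20, 125, 216],
})

spearman = toy.corr(method="spearman").loc["x", "y"]

# Replace every value by its rank within its column, then apply Pearson.
pearson_of_ranks = toy.rank().corr(method="pearson").loc["x", "y"]

# Pearson on the raw values, for comparison.
pearson_raw = toy.corr(method="pearson").loc["x", "y"]

print(spearman, pearson_of_ranks, pearson_raw)
```

The first two numbers should agree, while Pearson on the raw values differs because it reacts to how far apart the values are and not only to their order.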
Can we then conclude that when US GDP goes up, the S&P 500 goes up? Good question. The short answer is no. An example might make this easier to understand. In summer, ice cream sales go up. Sunglasses sales also go up in summer. Does that mean that higher ice cream sales cause higher sunglasses sales? Not really. It is the fact that there is more sun that affects both.
The same can be true for correlations you find in data. Just think of a correlation as an indicator that the variables might somehow be connected (or not, if the value is close to 0).
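The ice cream example can be simulated. In this sketch all numbers (sunshine hours, sales coefficients, noise levels, seed) are invented for illustration: both sales series depend on the sun and on independent noise, and not on each other.

```python
import numpy as np
import pandas as pd

# Invented numbers for illustration only.
rng = np.random.default_rng(42)
sun_hours = rng.uniform(0, 12, size=365)

sales = pd.DataFrame({
    "ice_cream": 20 * sun_hours + rng.normal(scale=30, size=365),
    "sunglasses": 5 * sun_hours + rng.normal(scale=10, size=365),
})

# The two sales series are correlated, although neither drives the other.
r_sales = sales["ice_cream"].corr(sales["sunglasses"])

# Remove the known effect of the sun; only the independent noise remains.
leftover = pd.DataFrame({
    "ice_cream": sales["ice_cream"] - 20 * sun_hours,
    "sunglasses": sales["sunglasses"] - 5 * sun_hours,
})
r_leftover = leftover["ice_cream"].corr(leftover["sunglasses"])

print(r_sales, r_leftover)
```

The first value should be clearly positive and the second close to 0, showing that the correlation between the sales figures comes entirely from the shared factor.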
Step 5: When to use what?
This is a good question.
- Pearson correlation coefficient is in general considered more powerful, as it makes stronger assumptions about the data. On the negative side, it only considers linear dependence (fitting to a straight line) and, in theory, requires the variables to be normally distributed. It is very sensitive to outliers (single points far away from the norm).
- Kendall rank correlation coefficient should be more efficient with smaller data sets. It measures the monotonic relationship between two variables, and it is a bit slower to calculate, O(n^2). It does not require the variables to be normally distributed.
- Spearman rank correlation coefficient also measures the monotonic relationship between two variables. It is faster to calculate, O(n log(n)). It often gives a slightly higher value than Kendall's. It also does not require the variables to be normally distributed.
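The sensitivity to outliers mentioned above can be illustrated with a small sketch. The data is a hypothetical perfect line with one extreme value injected. As before, method="kendall" assumes SciPy is installed.

```python
import numpy as np
import pandas as pd

# Hypothetical data: a perfect line, then the same line with one outlier.
x = np.arange(1.0, 21.0)
clean = pd.DataFrame({"x": x, "y": 2 * x})
dirty = clean.copy()
dirty.loc[19, "y"] = -500.0  # replace the last point with an extreme value

before = {}
after = {}
for method in ("pearson", "kendall", "spearman"):
    before[method] = clean.corr(method=method).loc["x", "y"]
    after[method] = dirty.corr(method=method).loc["x", "y"]
    print(f"{method:8s} before={before[method]:.3f} after={after[method]:.3f}")
```

With these toy values, the single outlier pushes Pearson below zero, while the two rank-based methods stay clearly positive, because a rank only records that the point is the smallest, not how far away it is.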