# Pandas Correlation Methods Explained: Pearson, Kendall, and Spearman

In this tutorial we use a live example to investigate and understand the differences between the three methods for calculating correlation with the Pandas DataFrame corr() function.

- Pearson correlation coefficient: Measures the linear correlation between two variables.
- Kendall rank correlation coefficient: Measures the ordinal association between two variables.
- Spearman rank correlation coefficient: Measures if the relationship between two variables is monotonic.
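To make the difference between the three coefficients concrete before turning to real data, here is a minimal sketch on synthetic data. The frame `df` and its columns are made up for illustration, and the Kendall call assumes SciPy is installed, since pandas relies on it for that method. The relationship `y = x ** 3` is perfectly monotonic but not linear.

```python
import numpy as np
import pandas as pd

# Synthetic data: y always grows with x, but along a curve rather than a line.
x = np.arange(1, 11)
df = pd.DataFrame({"x": x, "y": x ** 3})

pearson = df.corr(method="pearson").loc["x", "y"]
kendall = df.corr(method="kendall").loc["x", "y"]    # requires SciPy
spearman = df.corr(method="spearman").loc["x", "y"]

# Pearson stays below 1 because the points do not lie on a straight line;
# the two rank-based methods give 1 because the ordering is perfectly preserved.
print(f"pearson={pearson:.3f} kendall={kendall:.3f} spearman={spearman:.3f}")
```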

The purpose of this tutorial is to get a better understanding of these correlations, while working on real data.

The data we want to investigate for correlations is the US GDP, the S&P 500, and Gold and Oil prices. We will focus only on the period from 2000 to 2020, as the prices for Gold and Oil are not available further back on Yahoo! Finance. We will get the US GDP from the World Bank and the rest from Yahoo! Finance.

We will be using the Pandas-datareader to retrieve the data. For a more in-depth introduction to how to use it, we refer you to this tutorial.

```
import datetime as dt

import pandas as pd
import pandas_datareader as pdf
from pandas_datareader import wb

start = dt.datetime(2000, 1, 1)
end = dt.datetime.now()

# Daily adjusted close prices for the S&P 500, Gold futures and Oil futures
tickers = pdf.get_data_yahoo(["^GSPC", "GC=F", "CL=F"], start, end)['Adj Close']

# Yearly US GDP (current US$) from the World Bank, indexed by year as a date
gdp = wb.download(indicator='NY.GDP.MKTP.CD', country='US', start=2000, end=2019)
gdp = gdp.reset_index(1).set_index('year')
gdp.index = pd.to_datetime(gdp.index, format="%Y")

# Combine on dates, fill the gaps by linear interpolation, drop incomplete rows
data = gdp.join(tickers, how='outer')
data = data.interpolate(method='linear')
data = data.dropna()
data.columns = ["US GDP", "S&P 500", "Gold", "Oil"]
print(data)
```

Resulting in the following output.

```
                  US GDP      S&P 500         Gold        Oil
2000-08-30  1.047113e+13  1502.589966   273.899994  33.400002
2000-08-31  1.047243e+13  1517.680054   278.299988  33.099998
2000-09-01  1.047373e+13  1520.770020   277.000000  33.380001
2000-09-05  1.047503e+13  1507.079956   275.799988  33.799999
2000-09-06  1.047634e+13  1492.250000   274.200012  34.950001
...                  ...          ...          ...        ...
2020-08-05  2.142770e+13  3327.770020  2031.099976  42.189999
2020-08-06  2.142770e+13  3349.159912  2051.500000  41.950001
2020-08-07  2.142770e+13  3351.280029  2046.099976  41.599998
2020-08-09  2.142770e+13  3351.280029  2037.099976  41.590000
2020-08-10  2.142770e+13  3351.280029  2043.900024  41.889999
```

Where we see the data we want to investigate for correlations.
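The join-and-interpolate step is what turns one GDP value per year into one value per row. A tiny sketch with made-up numbers (the frames `yearly` and `daily` are hypothetical stand-ins for the GDP and price data) shows the mechanics. Note that `method='linear'` treats the rows as equally spaced and ignores the actual dates in the index.

```python
import pandas as pd

# Hypothetical yearly series with one value per year.
yearly = pd.DataFrame(
    {"gdp": [100.0, 200.0]},
    index=pd.to_datetime(["2000-01-01", "2001-01-01"]),
)
# Hypothetical price series with an extra date in between.
daily = pd.DataFrame(
    {"price": [1.0, 2.0, 3.0]},
    index=pd.to_datetime(["2000-01-01", "2000-07-01", "2001-01-01"]),
)

# The outer join keeps every date; gdp is NaN where only a price exists.
joined = yearly.join(daily, how="outer")

# Linear interpolation fills the missing gdp halfway between its neighbours.
filled = joined.interpolate(method="linear")
print(filled["gdp"].tolist())  # [100.0, 150.0, 200.0]
```

If the spacing between dates should be taken into account, `interpolate(method='time')` is the variant that weights by the index.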

The corr() function on a DataFrame calculates the pairwise correlation between columns and returns a correlation matrix.

The default method is the Pearson correlation coefficient. As we will see in this tutorial, correlations can be calculated differently. Pearson measures how well the relationship between two variables is described by a straight line.

The best way to understand that is by using an example.

Let’s first calculate the correlation matrix using the Pearson method and then try to visualize it to understand it better. You can get the correlation matrix simply by calling **corr()** on the **DataFrame**.

```
print(data.corr())
```

As it is the default method, you do not need to set the method to **pearson**. The output will be.

```
           US GDP   S&P 500      Gold       Oil
US GDP   1.000000  0.897376  0.817294  0.237426
S&P 500  0.897376  1.000000  0.581576 -0.015951
Gold     0.817294  0.581576  1.000000  0.534163
Oil      0.237426 -0.015951  0.534163  1.000000
```

A few words on a correlation matrix. Each entry of the matrix is a number from -1 to 1. Some high-level interpretations of the output.

- -1: A full negative correlation. Meaning if one variable goes up, the other variable goes down, and they are fully correlated.
- 0: No linear correlation. Meaning that a straight line through the data tells you nothing about how one variable moves with the other. The variables can still depend on each other in a non-linear way.
- 1: A full positive correlation. Meaning if the one variable goes up, so will the other.

Numbers in between indicate how strongly the variables are correlated.
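The three landmark values can be reproduced on synthetic data. The sketch below uses made-up columns (`up`, `down`, `parabola`) and is only meant as an illustration. It also shows that a Pearson value of 0 rules out a linear relationship only: the parabola is completely determined by `x` and still scores close to 0.

```python
import numpy as np
import pandas as pd

x = np.arange(-5, 6)  # symmetric around 0
df = pd.DataFrame({
    "x": x,
    "up": 2 * x + 1,      # exact straight line with positive slope
    "down": -3 * x,       # exact straight line with negative slope
    "parabola": x ** 2,   # depends on x, but not linearly
})

# Pearson correlation of every column with x.
r = df.corr()["x"]
print(r)
```

Here `r["up"]` comes out as 1, `r["down"]` as -1, and `r["parabola"]` as 0 up to floating-point rounding.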

Looking at the above output, you see that US GDP fully correlates to US GDP. This is obvious, as it is the same variable. Next we have a 0.897376 correlation between US GDP and S&P 500 stock market index. This tells us that there is a high correlation.

Now to be a bit more specific. This correlation is linear.

That means it can be fitted well with a straight line. Let’s try to visualize that.

```
import matplotlib.pyplot as plt
import numpy as np
# Pearson fit (default method)
fit = np.polyfit(x=data['US GDP'], y=data['S&P 500'], deg=1)
line_fit = np.poly1d(fit)
plt.plot(data['US GDP'], line_fit(data['US GDP']))
plt.scatter(x=data['US GDP'], y=data['S&P 500'], color='red', alpha=0.1)
plt.title("Pearson correlation")
plt.show()
```

Resulting in the following fit.

Also, let’s investigate something that does not fit well, the US GDP with Oil prices.

```
import matplotlib.pyplot as plt
import numpy as np
# Pearson fit (default method)
fit = np.polyfit(x=data['US GDP'], y=data['Oil'], deg=1)
line_fit = np.poly1d(fit)
plt.plot(data['US GDP'], line_fit(data['US GDP']))
plt.scatter(x=data['US GDP'], y=data['Oil'], color='red', alpha=0.1)
plt.title("Pearson correlation")
plt.show()
```

As you can see visually, this does not fit a straight line as well as the above example. The closer the markers are to the fitted straight line, the higher the absolute value of the Pearson correlation. The value does not depend on how steep the line is, only on the sign of the slope: a positive slope gives positive values and a negative slope gives negative values.

Just some notes to consider about the Pearson correlation coefficient. Whether the variables are required to be normally distributed is controversial and outside the scope of this tutorial. That said, be careful when drawing conclusions from the result. It might be an indicator, but do not conclude that a linear relationship exists, or does not exist, based on the coefficient alone.
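The point about the slope can be checked numerically. The sketch below is an illustration on random synthetic data, not part of the GDP analysis. The helper `pearson_r` is a hypothetical name and implements the textbook definition: the covariance divided by the product of the standard deviations.

```python
import numpy as np

def pearson_r(x, y):
    """Textbook Pearson coefficient (hypothetical helper for illustration)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()  # centre both variables
    yc = y - y.mean()
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = x + 0.5 * rng.normal(size=200)  # noisy linear relationship

r_base = pearson_r(x, y)
r_steep = pearson_r(x, 100 * y)       # much steeper fitted line, same r
r_flipped = pearson_r(x, -0.01 * y)   # negative slope flips only the sign

print(r_base, r_steep, r_flipped)
```

Rescaling `y` changes the slope of the fitted line by orders of magnitude, but the coefficient only changes sign when the slope does.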

The Kendall rank correlation coefficient does not assume a normal distribution of the variables and looks for a monotonic relationship between two variables.

Two variables are monotonically correlated if a greater value of the one variable always comes with a greater value of the other variable. If the variables are negatively monotonically correlated, it is the opposite.
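One way to picture what Kendall measures is to count pairs of observations. A pair is concordant if both variables move in the same direction between the two observations, and discordant if they move in opposite directions. The sketch below is a simplified version for data without ties; the function name `kendall_tau` and the sample lists are made up for illustration.

```python
from itertools import combinations

def kendall_tau(x, y):
    """Kendall's tau for tie-free data: (concordant - discordant) / number of pairs."""
    concordant = 0
    discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        direction = (x2 - x1) * (y2 - y1)
        if direction > 0:
            concordant += 1   # both variables moved the same way
        elif direction < 0:
            discordant += 1   # the variables moved in opposite directions
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs

x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]  # two neighbouring pairs are swapped

tau = kendall_tau(x, y)
print(tau)  # 8 concordant and 2 discordant pairs out of 10 -> 0.6
```

To my understanding, pandas computes a tie-adjusted variant through SciPy, which should agree with this simple count when the data has no ties.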

The correlation can be calculated as follows.

```
print(data.corr(method="kendall"))
```

Resulting in the following output.

```
           US GDP   S&P 500      Gold       Oil
US GDP   1.000000  0.703141  0.685002  0.249430
S&P 500  0.703141  1.000000  0.426406  0.122434
Gold     0.685002  0.426406  1.000000  0.413298
Oil      0.249430  0.122434  0.413298  1.000000
```

Which interestingly shows that the Pearson correlation coefficient of US GDP and S&P 500 is higher than the Kendall rank correlation.

As a rule of thumb, a correlation between -0.8 and 0.8 is not considered strong. By that rule, US GDP and S&P 500 seem to have a strong linear correlation but not a strong monotonic correlation.

Remember that these are two different measures and cannot be directly compared. As they measure different aspects, the difference is not surprising. The Pearson method can be thought of as how close the points are to a fitted line, while the Kendall method asks whether the other variable grows whenever the one variable grows. As you see on the plot, this seems not to be the case. There are many instances where it does not happen.

Spearman is closely related to Kendall, and measures whether the variables are monotonically correlated.

The Spearman rank correlation can be computed by the following.

```
print(data.corr(method="spearman"))
```

And results in the following output.

```
           US GDP   S&P 500      Gold       Oil
US GDP   1.000000  0.846197  0.837650  0.317295
S&P 500  0.846197  1.000000  0.609104  0.178937
Gold     0.837650  0.609104  1.000000  0.558569
Oil      0.317295  0.178937  0.558569  1.000000
```

Which actually is a bit more optimistic about the monotonic correlation between the US GDP and S&P 500.
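A useful way to think about Spearman is as the Pearson correlation of the ranks of the values rather than of the values themselves. The sketch below checks this on random synthetic data; the frame `df` and its columns are made up for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
x = rng.normal(size=100)
# A curved, noisy, mostly increasing relationship.
df = pd.DataFrame({"x": x, "y": np.exp(x) + rng.normal(scale=0.1, size=100)})

spearman = df.corr(method="spearman").loc["x", "y"]

# Replace every value by its rank within the column, then apply Pearson.
pearson_of_ranks = df.rank().corr(method="pearson").loc["x", "y"]

print(spearman, pearson_of_ranks)  # the two numbers agree
```

Because only the ranks matter, stretching or compressing the values with any increasing function leaves the Spearman coefficient unchanged.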

Can we then conclude that when US GDP goes up, the S&P 500 goes up? Good question. The short answer is no. An example might make it more understandable. In summer time ice cream sales go up. But also, in summer time sunglasses sales go up. Does that mean that higher ice cream sales imply higher sunglasses sales? Not really. It is the fact that there is more sun that affects both.

The same can be true for correlations you find in data. Just think of it as an indicator that they somehow might be connected (or not, if value is close to 0).
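The ice cream example can be simulated. In the sketch below all numbers are invented: a hidden variable `sun` drives both sales figures, and neither sales figure influences the other. The helper `residuals` is a hypothetical name; it removes the straight-line effect of the sun from a variable.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 2000
sun = rng.uniform(0, 10, size=n)                    # hours of sunshine (made up)
ice_cream = 20 * sun + rng.normal(scale=10, size=n)
sunglasses = 5 * sun + rng.normal(scale=5, size=n)

# The two sales figures look strongly correlated.
raw = np.corrcoef(ice_cream, sunglasses)[0, 1]

def residuals(y, x):
    """Part of y that a straight-line fit on x does not explain."""
    slope, intercept = np.polyfit(x, y, deg=1)
    return y - (slope * x + intercept)

# After removing the effect of the sun, almost no correlation is left.
adjusted = np.corrcoef(residuals(ice_cream, sun), residuals(sunglasses, sun))[0, 1]

print(f"raw={raw:.2f} adjusted={adjusted:.2f}")
```

The high raw value comes entirely from the shared driver, which is why a correlation on its own should only be read as a hint that two variables might be connected.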

Which of the three methods should you use? This is a good question.

- **Pearson correlation coefficient** is in general considered stronger, as it makes stronger assumptions about the data. On the negative side, it only considers a linear dependence (fitting to a straight line) and in theory requires the variables to be normally distributed. It is very sensitive to outliers (single points far away from the norm).
- **Kendall rank correlation coefficient** should be more efficient with smaller sets. It measures the monotonic relationship between two variables, and it is a bit slower to calculate, **O(n^2)** with the straightforward pairwise approach. It does not require the variables to be normally distributed.
- **Spearman rank correlation coefficient** also measures the monotonic relationship between two variables. It is faster to calculate, **O(n log(n))**. It often gives a slightly higher value than Kendall's. It also does not require the variables to be normally distributed.
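The sensitivity of Pearson to outliers is easy to demonstrate. The sketch below uses made-up data: twenty points on a perfect line, with the last one replaced by an extreme value.

```python
import numpy as np
import pandas as pd

x = np.arange(20)
y = x.astype(float)
y[-1] = -500.0  # a single extreme outlier

df = pd.DataFrame({"x": x, "y": y})

pearson = df.corr(method="pearson").loc["x", "y"]
spearman = df.corr(method="spearman").loc["x", "y"]

# One point is enough to turn Pearson negative, while the rank-based
# Spearman coefficient only sees one value out of order and stays high.
print(f"pearson={pearson:.3f} spearman={spearman:.3f}")
```

This is one reason to look at a scatter plot, or at a rank-based coefficient, before trusting a single Pearson value.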
