We will continue our exploration of the amazing Pandas-datareader. In this tutorial we will further investigate data from the World Bank and correlate it with the S&P 500 stock index. We will do this both by visualizing the 3 datasets on 2 different y-axes and by computing their correlation.
In this tutorial we will only look at data from the United States. If you are interested in other tutorials on World Bank data you should read this one and this one.
To get data from the World Bank you can use the Pandas-datareader, which has a function to download data given the indicator.
pandas_datareader.wb.download(country=None, indicator=None, start=2003, end=2005, freq=None, errors='warn', **kwargs)
It takes the country, the indicator, and the start and end years as arguments.
You can find indicators on the webpage of World Bank.
In this tutorial we will use SL.UEM.TOTL.ZS, the unemployment, total (% of total labor force), and FR.INR.RINR, the real interest rate (%).
To inspect the data you can use the following code.
from pandas_datareader import wb

# Download both indicators for the United States, 1990-2019
data = wb.download(indicator=['SL.UEM.TOTL.ZS', 'FR.INR.RINR'], country=['US'], start=1990, end=2019)

# Reshape so each indicator becomes its own DataFrame indexed by year
uem_data = data.unstack().T.loc['SL.UEM.TOTL.ZS']
uem_data.columns = ['US-unempl']
int_data = data.unstack().T.loc['FR.INR.RINR']
int_data.columns = ['US-int']

# Join the two indicators into a single DataFrame
data = int_data.join(uem_data)
print(data)
Giving an output similar to this (some lines omitted).
US-int US-unempl
year
1990 6.039744 NaN
1991 4.915352 6.800
1992 3.884240 7.500
1993 3.546689 6.900
1994 4.898356 6.119
1995 6.594069 5.650
1996 6.324008 5.451
1997 6.603407 5.000
1998 7.148192 4.510
1999 6.457135 4.219
For details on the unstacking and transposing, see this tutorial.
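If you want a quick feel for what the unstacking and transposing does, here is a minimal sketch on a toy DataFrame shaped like the one wb.download returns (the values are made up for illustration).
import pandas as pd

# A toy frame shaped like the wb.download result: a (country, year)
# MultiIndex on the rows and one column per indicator (values made up)
idx = pd.MultiIndex.from_product([['United States'], ['1990', '1991']],
                                 names=['country', 'year'])
toy = pd.DataFrame({'SL.UEM.TOTL.ZS': [6.8, 7.5],
                    'FR.INR.RINR': [6.0, 4.9]}, index=idx)

# unstack() moves 'year' into the columns, .T flips the frame so the
# indicators end up on the rows, and .loc picks one indicator out
uem = toy.unstack().T.loc['SL.UEM.TOTL.ZS']
print(uem)  # one row per year, one column per country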
Next, let’s get the data for the S&P 500, which has the ticker ^GSPC.
You can use the Pandas-datareader for that.
import pandas_datareader as pdr
import datetime as dt

start = dt.datetime(1990, 1, 1)
end = dt.datetime(2019, 12, 31)

# Download the daily adjusted close of the S&P 500 from Yahoo Finance
sp500 = pdr.get_data_yahoo("^GSPC", start, end)['Adj Close']
sp500.name = 'S&P 500'
print(sp500)
Resulting in the following output.
Date
1990-01-02 359.690002
1990-01-03 358.760010
1990-01-04 355.670013
1990-01-05 352.200012
1990-01-08 353.790009
...
2019-12-24 3223.379883
2019-12-26 3239.909912
2019-12-27 3240.020020
2019-12-30 3221.290039
2019-12-31 3230.780029
Name: S&P 500, Length: 7559, dtype: float64
Problem! The index of the above Series consists of datetime objects, while the index of the DataFrame with the unemployment and interest rates consists of year strings.
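You can verify the mismatch by inspecting the two indexes (a quick check, assuming the data and sp500 variables from the snippets above are still in scope).
# The S&P 500 Series comes with a proper DatetimeIndex...
print(type(sp500.index))  # DatetimeIndex
# ...while the World Bank frame is indexed by year strings like '1990'
print(type(data.index))   # a plain Index of strings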
To successfully join them, we need to convert them into the same format. The best way is to convert both to datetimes. We can do that by using the pd.to_datetime() function.
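As a quick illustration of what the conversion does (a minimal sketch, the years are just examples).
import pandas as pd

# Year strings are parsed as January 1 of that year
idx = pd.to_datetime(['1990', '1991', '1992'], format='%Y')
print(idx)  # DatetimeIndex(['1990-01-01', '1991-01-01', '1992-01-01'], ...)
Notice that each year becomes January 1 of that year, which is why the yearly values show up on the first of January after the join below.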
import pandas_datareader as pdr
import pandas as pd
import datetime as dt
from pandas_datareader import wb
data = wb.download(indicator=['SL.UEM.TOTL.ZS', 'FR.INR.RINR'], country=['US'], start=1990, end=2019)
uem_data = data.unstack().T.loc['SL.UEM.TOTL.ZS']
uem_data.columns = ['US-unempl']
int_data = data.unstack().T.loc['FR.INR.RINR']
int_data.columns = ['US-int']
data = int_data.join(uem_data)
# Convert the year strings to datetimes (each year becomes January 1)
data.index = pd.to_datetime(data.index, format='%Y')
start = dt.datetime(1990, 1, 1)
end = dt.datetime(2019, 12, 31)
sp500 = pdr.get_data_yahoo("^GSPC", start, end)['Adj Close']
sp500.name='S&P 500'
# An outer join keeps all dates from both the daily and the yearly data
data = sp500.to_frame().join(data, how='outer')
print(data)
Resulting in the following output.
S&P 500 US-int US-unempl
1990-01-01 NaN 6.039744 NaN
1990-01-02 359.690002 NaN NaN
1990-01-03 358.760010 NaN NaN
1990-01-04 355.670013 NaN NaN
1990-01-05 352.200012 NaN NaN
... ... ... ...
2019-12-24 3223.379883 NaN NaN
2019-12-26 3239.909912 NaN NaN
2019-12-27 3240.020020 NaN NaN
2019-12-30 3221.290039 NaN NaN
2019-12-31 3230.780029 NaN NaN
The problem you see here is that US-int and US-unempl only have values on the first of January each year. To fix that, we can fill the gaps with a linear interpolation by applying the following.
data = data.interpolate(method='linear')
Resulting in the following output.
S&P 500 US-int US-unempl
1990-01-01 NaN 6.039744 NaN
1990-01-02 359.690002 6.035318 NaN
1990-01-03 358.760010 6.030891 NaN
1990-01-04 355.670013 6.026464 NaN
1990-01-05 352.200012 6.022037 NaN
... ... ... ...
2019-12-24 3223.379883 3.478200 3.682
2019-12-26 3239.909912 3.478200 3.682
2019-12-27 3240.020020 3.478200 3.682
2019-12-30 3221.290039 3.478200 3.682
2019-12-31 3230.780029 3.478200 3.682
Notice that since there is no unemployment data for 1990 in the US, the column is filled with NaN until the first value is available.
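To see exactly how the linear interpolation fills the gaps, here is a minimal sketch on a toy Series (the values are made up).
import pandas as pd
import numpy as np

# Yearly values with gaps in between, like the joined DataFrame above
s = pd.Series([6.0, np.nan, np.nan, np.nan, 4.0],
              index=pd.date_range('1990-01-01', periods=5, freq='D'))

# Each NaN between two known values is replaced by an evenly spaced step
print(s.interpolate(method='linear'))  # 6.0, 5.5, 5.0, 4.5, 4.0
Leading NaNs, like the 1990 unemployment values, are left untouched, since there is no earlier value to interpolate from.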
Here Pandas is quite strong: by default it can plot on a secondary y-axis. The 3 datasets only need two y-axes, since the unemployment and interest rates can share one.
import pandas_datareader as pdr
import pandas as pd
import matplotlib.pyplot as plt
import datetime as dt
from pandas_datareader import wb
pd.set_option('display.max_rows', 300)
pd.set_option('display.max_columns', 10)
pd.set_option('display.width', 1000)
data = wb.download(indicator=['SL.UEM.TOTL.ZS', 'FR.INR.RINR'], country=['US'], start=1990, end=2019)
uem_data = data.unstack().T.loc['SL.UEM.TOTL.ZS']
uem_data.columns = ['US-unempl']
int_data = data.unstack().T.loc['FR.INR.RINR']
int_data.columns = ['US-int']
data = int_data.join(uem_data)
data.index = pd.to_datetime(data.index, format='%Y')
start = dt.datetime(1990, 1, 1)
end = dt.datetime(2019, 12, 31)
sp500 = pdr.get_data_yahoo("^GSPC", start, end)['Adj Close']
sp500.name='S&P 500'
data = sp500.to_frame().join(data, how='outer')
data = data.interpolate(method='linear')
# Plot the S&P 500 on the primary y-axis...
ax = data['S&P 500'].plot(legend=True)
# ...and the two rates on a shared secondary y-axis
ax = data[['US-int','US-unempl']].plot(ax=ax, secondary_y=True, legend=True)

# Pairwise (Pearson) correlation between the three columns
print(data.corr())
plt.show()
The correlation matrix looks as follows.
S&P 500 US-int US-unempl
S&P 500 1.000000 -0.408429 -0.453315
US-int -0.408429 1.000000 -0.470103
US-unempl -0.453315 -0.470103 1.000000
The correlations are surprisingly low. Visually, you can inspect them in the plot produced by the code above.