What will we cover in this tutorial?
We will continue our exploration of the amazing Pandas-datareader. In this tutorial we will further investigate data from the World Bank and correlate it with the S&P 500 stock index. We will do this both by visualizing the 3 graphs on 2 different y-axes and by computing the correlation.
Step 1: Get World Bank data
In this tutorial we will only look at data from the United States. If you are interested in other tutorials on World Bank data, you should read this one and this one.
To get data from the World Bank you can use Pandas-datareader, which has a function to download the data given an indicator.
pandas_datareader.wb.download(country=None, indicator=None, start=2003, end=2005, freq=None, errors='warn', **kwargs)
It takes the country, the indicator, and the start and end years as arguments.
You can find indicators on the webpage of World Bank.
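You can also search the indicator catalogue programmatically. Below is a minimal sketch using Pandas-datareader's wb.search (the exact columns in the returned DataFrame may vary between versions):

from pandas_datareader import wb

# search indicator names with a regular expression; returns a DataFrame
# describing the matching indicators (id, name, source, ...)
matches = wb.search('unemployment.*total')
print(matches[['id', 'name']].head())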
In this tutorial we will use SL.UEM.TOTL.ZS, the total unemployment (% of total labor force), and FR.INR.RINR, the real interest rate (%).
To inspect the data you can use the following code.
from pandas_datareader import wb

data = wb.download(indicator=['SL.UEM.TOTL.ZS', 'FR.INR.RINR'], country=['US'], start=1990, end=2019)
uem_data = data.unstack().T.loc['SL.UEM.TOTL.ZS']
uem_data.columns = ['US-unempl']
int_data = data.unstack().T.loc['FR.INR.RINR']
int_data.columns = ['US-int']
data = int_data.join(uem_data)
print(data)
Giving output similar to this (some lines excluded).
        US-int  US-unempl
year
1990  6.039744        NaN
1991  4.915352      6.800
1992  3.884240      7.500
1993  3.546689      6.900
1994  4.898356      6.119
1995  6.594069      5.650
1996  6.324008      5.451
1997  6.603407      5.000
1998  7.148192      4.510
1999  6.457135      4.219
For details on the unstacking and transposing, see this tutorial.
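If you want a quick feel for what the unstacking and transposing does, here is a small, self-contained sketch on a toy frame shaped like the World Bank result (the numbers are made up for illustration):

import pandas as pd

# a toy frame shaped like the wb.download result: (country, year) rows, one column per indicator
idx = pd.MultiIndex.from_product([['United States'], ['2019', '2018']], names=['country', 'year'])
toy = pd.DataFrame({'SL.UEM.TOTL.ZS': [3.682, 3.896], 'FR.INR.RINR': [3.478, 2.441]}, index=idx)

# unstack() moves the year level into the columns, .T swaps rows and columns,
# and .loc[...] keeps a single indicator: years as rows, countries as columns
print(toy.unstack().T.loc['SL.UEM.TOTL.ZS'])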
Step 2: Join the data from the S&P 500 index
First, let's get the data for the S&P 500, which has the ticker ^GSPC.
You can use the Pandas-datareader for that.
import pandas_datareader as pdr
import datetime as dt

start = dt.datetime(1990, 1, 1)
end = dt.datetime(2019, 12, 31)
sp500 = pdr.get_data_yahoo("^GSPC", start, end)['Adj Close']
sp500.name = 'S&P 500'
print(sp500)
Resulting in the following output.
Date
1990-01-02     359.690002
1990-01-03     358.760010
1990-01-04     355.670013
1990-01-05     352.200012
1990-01-08     353.790009
                  ...
2019-12-24    3223.379883
2019-12-26    3239.909912
2019-12-27    3240.020020
2019-12-30    3221.290039
2019-12-31    3230.780029
Name: S&P 500, Length: 7559, dtype: float64
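Note: Yahoo's endpoints change from time to time, so pdr.get_data_yahoo may fail in newer environments. A commonly used alternative is the yfinance package. The following is only a sketch, assuming yfinance is installed (pip install yfinance), and is not part of the original code:

import datetime as dt
import yfinance as yf  # assumption: yfinance is installed

start = dt.datetime(1990, 1, 1)
end = dt.datetime(2019, 12, 31)

# auto_adjust=False keeps an 'Adj Close' column, matching the tutorial
sp500 = yf.download("^GSPC", start=start, end=end, auto_adjust=False)['Adj Close']
# depending on the yfinance version the columns may carry a ticker level;
# squeeze() collapses a one-column result into a Series either way
sp500 = sp500.squeeze()
sp500.name = 'S&P 500'
print(sp500)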
Problem! The date is a datetime object in the Series above, while it is a string containing only the year in the DataFrame with the unemployment rate and interest rate.
To join them successfully, we need to convert them to the same format. The best way is to convert both to datetime, which we can do with the pd.to_datetime() function.
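As a small illustration of what that conversion does (the years here are just examples):

import pandas as pd

# the World Bank frame is indexed by year strings such as '1990';
# parsing them with format='%Y' gives timestamps on January 1st of each year
years = pd.Index(['1990', '1991', '1992'], name='year')
print(pd.to_datetime(years, format='%Y'))
# DatetimeIndex(['1990-01-01', '1991-01-01', '1992-01-01'], dtype='datetime64[ns]', name='year', freq=None)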
import pandas_datareader as pdr
import pandas as pd
import datetime as dt
from pandas_datareader import wb

data = wb.download(indicator=['SL.UEM.TOTL.ZS', 'FR.INR.RINR'], country=['US'], start=1990, end=2019)
uem_data = data.unstack().T.loc['SL.UEM.TOTL.ZS']
uem_data.columns = ['US-unempl']
int_data = data.unstack().T.loc['FR.INR.RINR']
int_data.columns = ['US-int']
data = int_data.join(uem_data)
data.index = pd.to_datetime(data.index, format='%Y')

start = dt.datetime(1990, 1, 1)
end = dt.datetime(2019, 12, 31)
sp500 = pdr.get_data_yahoo("^GSPC", start, end)['Adj Close']
sp500.name = 'S&P 500'

data = sp500.to_frame().join(data, how='outer')
print(data)
Resulting in the following output.
                S&P 500    US-int  US-unempl
1990-01-01          NaN  6.039744        NaN
1990-01-02   359.690002       NaN        NaN
1990-01-03   358.760010       NaN        NaN
1990-01-04   355.670013       NaN        NaN
1990-01-05   352.200012       NaN        NaN
...                 ...       ...        ...
2019-12-24  3223.379883       NaN        NaN
2019-12-26  3239.909912       NaN        NaN
2019-12-27  3240.020020       NaN        NaN
2019-12-30  3221.290039       NaN        NaN
2019-12-31  3230.780029       NaN        NaN
The problem you see here is that US-int and US-unempl only have values on the first of January each year. To fix that, we can linearly interpolate the data by applying the following.
data = data.interpolate(method='linear')
Resulting in the following.
                S&P 500    US-int  US-unempl
1990-01-01          NaN  6.039744        NaN
1990-01-02   359.690002  6.035318        NaN
1990-01-03   358.760010  6.030891        NaN
1990-01-04   355.670013  6.026464        NaN
1990-01-05   352.200012  6.022037        NaN
...                 ...       ...        ...
2019-12-24  3223.379883  3.478200      3.682
2019-12-26  3239.909912  3.478200      3.682
2019-12-27  3240.020020  3.478200      3.682
2019-12-30  3221.290039  3.478200      3.682
2019-12-31  3230.780029  3.478200      3.682
Notice that since there is no unemployment data for the US in 1990, the values stay NaN until the first rate is given.
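To see the interpolation behaviour in isolation, here is a tiny example on a made-up Series:

import numpy as np
import pandas as pd

s = pd.Series([np.nan, 1.0, np.nan, np.nan, 4.0])
# gaps between known values are filled linearly,
# but the leading NaN stays NaN (no backfill by default)
print(s.interpolate(method='linear'))
# NaN, 1.0, 2.0, 3.0, 4.0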
Step 3: Visualize all three graphs with two different y-axes
This is where Pandas is quite strong: by default, it can plot on a secondary y-axis. The three datasets only need two y-axes, since the unemployment rate and the interest rate can share one.
import pandas_datareader as pdr
import pandas as pd
import matplotlib.pyplot as plt
import datetime as dt
from pandas_datareader import wb

pd.set_option('display.max_rows', 300)
pd.set_option('display.max_columns', 10)
pd.set_option('display.width', 1000)

data = wb.download(indicator=['SL.UEM.TOTL.ZS', 'FR.INR.RINR'], country=['US'], start=1990, end=2019)
uem_data = data.unstack().T.loc['SL.UEM.TOTL.ZS']
uem_data.columns = ['US-unempl']
int_data = data.unstack().T.loc['FR.INR.RINR']
int_data.columns = ['US-int']
data = int_data.join(uem_data)
data.index = pd.to_datetime(data.index, format='%Y')

start = dt.datetime(1990, 1, 1)
end = dt.datetime(2019, 12, 31)
sp500 = pdr.get_data_yahoo("^GSPC", start, end)['Adj Close']
sp500.name = 'S&P 500'

data = sp500.to_frame().join(data, how='outer')
data = data.interpolate(method='linear')

ax = data['S&P 500'].plot(legend=True)
ax = data[['US-int', 'US-unempl']].plot(ax=ax, secondary_y=True, legend=True)

print(data.corr())
plt.show()
The correlation matrix is given here.
            S&P 500    US-int  US-unempl
S&P 500    1.000000 -0.408429  -0.453315
US-int    -0.408429  1.000000  -0.470103
US-unempl -0.453315 -0.470103   1.000000
The correlations are surprisingly low. Visually, you can see it here.
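Keep in mind that DataFrame.corr() computes pairwise Pearson correlations and simply ignores missing values. If you want the correlation over only the rows where all three columns are present, a quick variation on the script above (the numbers will differ slightly):

# correlation over rows where S&P 500, US-int and US-unempl are all available
print(data.dropna().corr())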
