## What is the Moving Average Convergence Divergence Indicator?

Moving Average Convergence Divergence (MACD) is a trend-following momentum indicator that shows the relationship between two moving averages of a security’s price. The MACD is calculated by subtracting the 26-period Exponential Moving Average (EMA) from the 12-period EMA.

https://www.investopedia.com/terms/m/macd.asp

That is easy to understand, right?

The good news is that it is easy to calculate using Pandas DataFrames.

The MACD is a technical indicator that helps you judge whether the market is bullish or bearish. That is, it can help an investor decide whether to buy or sell a stock.

## Step 1: Get the stock data

A great source to get stock data is by using the Pandas-datareader library to collect it.

```
import pandas_datareader as pdr
import datetime as dt

start = dt.datetime(2020, 1, 1)
end = dt.datetime.now()
ticker = pdr.get_data_yahoo("AAPL", start, end)['Adj Close']
print(ticker)
```

Here we collect it for Apple (ticker AAPL) from the beginning of the year 2020.

```
Date
2020-01-02    298.292145
2020-01-03    295.392120
2020-01-06    297.745880
2020-01-07    296.345581
2020-01-08    301.112640
...
2020-08-05    439.457642
2020-08-06    454.790009
2020-08-07    444.450012
2020-08-10    450.910004
2020-08-11    444.095001
Name: Adj Close, Length: 154, dtype: float64
```

Note that we only keep the Adjusted Close (Adj Close) column to make our calculations.

## Step 2: Make the calculations

As the description says, we need the Exponential Moving Averages over rolling 12-day and 26-day windows.

Luckily, the Pandas DataFrame provides the function ewm(), which together with mean() can calculate the Exponential Moving Averages.
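To see what ewm() computes, here is a tiny sketch with made-up prices (the numbers are purely illustrative). With adjust=False, each EMA value blends the new price with the previous EMA:

```python
import pandas as pd

# Made-up prices, purely for illustration.
prices = pd.Series([10.0, 11.0, 12.0, 11.5, 12.5])

# span=3 gives alpha = 2 / (3 + 1) = 0.5; with adjust=False the EMA is
# ema[t] = alpha * price[t] + (1 - alpha) * ema[t-1]
ema = prices.ewm(span=3, adjust=False).mean()
print(ema.tolist())  # [10.0, 10.5, 11.25, 11.375, 11.9375]
```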

```
exp1 = ticker.ewm(span=12, adjust=False).mean()
exp2 = ticker.ewm(span=26, adjust=False).mean()
macd = exp1 - exp2
```

But more is needed. We also need the signal line, which is defined as follows.

A nine-day EMA of the MACD called the “signal line,” is then plotted on top of the MACD line, which can function as a trigger for buy and sell signals.

https://www.investopedia.com/terms/m/macd.asp

Hence, we end up with the following.

```
exp1 = ticker.ewm(span=12, adjust=False).mean()
exp2 = ticker.ewm(span=26, adjust=False).mean()
macd = exp1 - exp2
exp3 = macd.ewm(span=9, adjust=False).mean()
```

## Step 3: Plot the data

We need two y-scales for the plot: one for the MACD and the 9-day EMA of the MACD, and one for the actual stock price.

Luckily, the Pandas plot function supports a secondary y-axis.

```
macd.plot(label='AAPL MACD', color='g')
ax = exp3.plot(label='Signal Line', color='r')
ticker.plot(ax=ax, secondary_y=True, label='AAPL')
```

As you see, the first two calls to plot use the same axis (the left side), while the final one on ticker uses secondary_y (the right side axis).

Then we need to set up labels, the legend, and the axis names.

```
ax.set_ylabel('MACD')
ax.right_ax.set_ylabel('Price $')
ax.set_xlabel('Date')
lines = ax.get_lines() + ax.right_ax.get_lines()
ax.legend(lines, [l.get_label() for l in lines], loc='upper left')
```

The variable lines collects the lines plotted on both y-axes and is used to build the legend. This is needed, as otherwise only the last legend would be visible.

Altogether it becomes the following.

```
import pandas_datareader as pdr
import datetime as dt
import matplotlib.pyplot as plt

start = dt.datetime(2020, 1, 1)
end = dt.datetime.now()
ticker = pdr.get_data_yahoo("AAPL", start, end)['Adj Close']

exp1 = ticker.ewm(span=12, adjust=False).mean()
exp2 = ticker.ewm(span=26, adjust=False).mean()
macd = exp1 - exp2
exp3 = macd.ewm(span=9, adjust=False).mean()

macd.plot(label='AAPL MACD', color='g')
ax = exp3.plot(label='Signal Line', color='r')
ticker.plot(ax=ax, secondary_y=True, label='AAPL')

ax.set_ylabel('MACD')
ax.right_ax.set_ylabel('Price $')
ax.set_xlabel('Date')
lines = ax.get_lines() + ax.right_ax.get_lines()
ax.legend(lines, [l.get_label() for l in lines], loc='upper left')
plt.show()
```

Resulting in the following graph.

When the MACD (green) line crosses the signal line (red), it is taken as a trading signal: buy when the green line crosses above the red one, and sell when it crosses below.

Notice that this is done on historical data, and there is no guarantee it will work in the future. While the results look pretty promising, it is not wise to base your investments solely on one indicator.
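The crossing rule above can be sketched in code on a made-up price series (the numbers are illustrative, not real AAPL data; the crossover detection is a common interpretation of the rule, not code from this tutorial):

```python
import pandas as pd

# Made-up price series, purely to illustrate crossover detection.
ticker = pd.Series(
    [100, 102, 101, 105, 107, 104, 103, 108, 112, 110, 109, 113, 115, 111, 108],
    index=pd.date_range("2020-01-01", periods=15), dtype=float)

exp1 = ticker.ewm(span=12, adjust=False).mean()
exp2 = ticker.ewm(span=26, adjust=False).mean()
macd = exp1 - exp2
exp3 = macd.ewm(span=9, adjust=False).mean()

# Buy where the MACD crosses above the signal line, sell where it crosses below.
above = macd > exp3
buy = above & ~above.shift(1, fill_value=False)
sell = ~above & above.shift(1, fill_value=False)
print(ticker.index[buy])   # dates with a bullish crossover
print(ticker.index[sell])  # dates with a bearish crossover
```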

## What will we cover in this tutorial?

In this tutorial we will, on a live example, investigate and understand the differences between the three methods of calculating correlation with the Pandas DataFrame corr() function.

The purpose of this tutorial is to get a better understanding of these correlations, while working on real data.

## Step 1: Getting some data to play with

The data we want to investigate for correlations is the US GDP, the S&P 500, and Gold and Oil prices. We will only focus on recent times (2000-2020), as the prices for Gold and Oil are not available further back on Yahoo! Finance. We will get the US GDP from the World Bank and the rest from Yahoo! Finance.

We will be using the Pandas-datareader to retrieve the data. For a more in-depth introduction to how to use it, we refer you to this tutorial.

```
import pandas_datareader as pdf
from pandas_datareader import wb
import pandas as pd
import datetime as dt

start = dt.datetime(2000, 1, 1)
end = dt.datetime.now()
tickers = pdf.get_data_yahoo(["^GSPC", "GC=F", "CL=F"], start, end)['Adj Close']

# US GDP in current US$ from the World Bank.
gdp = wb.download(indicator='NY.GDP.MKTP.CD', country=['US'], start=2000, end=2020)
gdp = gdp.reset_index(1).set_index('year')
gdp.index = pd.to_datetime(gdp.index, format="%Y")

data = gdp.join(tickers, how='outer')
data = data.interpolate(method='linear')
data = data.dropna()

data.columns = ["US GDP", "S&P 500", "Gold", "Oil"]

print(data)
```

Resulting in the following output.

```
                  US GDP      S&P 500         Gold        Oil
2000-08-30  1.047113e+13  1502.589966   273.899994  33.400002
2000-08-31  1.047243e+13  1517.680054   278.299988  33.099998
2000-09-01  1.047373e+13  1520.770020   277.000000  33.380001
2000-09-05  1.047503e+13  1507.079956   275.799988  33.799999
2000-09-06  1.047634e+13  1492.250000   274.200012  34.950001
...                  ...          ...          ...        ...
2020-08-05  2.142770e+13  3327.770020  2031.099976  42.189999
2020-08-06  2.142770e+13  3349.159912  2051.500000  41.950001
2020-08-07  2.142770e+13  3351.280029  2046.099976  41.599998
2020-08-09  2.142770e+13  3351.280029  2037.099976  41.590000
2020-08-10  2.142770e+13  3351.280029  2043.900024  41.889999
```

Where we see the data we want to investigate for correlations.

## Step 2: Investigate Pearson correlation coefficients

Looking at the corr() function on DataFrames, it calculates the pairwise correlation between columns and returns a correlation matrix.

The default method is the Pearson correlation coefficient. As we will see in this tutorial, correlations can be calculated in different ways. Pearson measures how well the relationship between the variables fits a straight line.

The best way to understand that is with an example.

Let's first calculate the correlation matrix using the Pearson method and then try to visualize it to understand it better. You can get the correlation matrix simply by calling corr() on the DataFrame.

```
print(data.corr())
```

As it is the default method, you do not need to set it to pearson. The output will be.

```
           US GDP   S&P 500      Gold       Oil
US GDP   1.000000  0.897376  0.817294  0.237426
S&P 500  0.897376  1.000000  0.581576 -0.015951
Gold     0.817294  0.581576  1.000000  0.534163
Oil      0.237426 -0.015951  0.534163  1.000000
```

A few words on a correlation matrix. The output of the correlation function is a number from -1 to 1. Some high-level interpretations of the output.

• -1: A full negative correlation. If one variable goes up, the other goes down, and they are fully correlated.
• 0: No correlation at all. The two variables are not dependent; if one goes up, you cannot predict with any probability what will happen to the other.
• 1: A full positive correlation. If one variable goes up, so will the other.

Numbers in between indicate how strongly the two variables depend on each other.
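To make the endpoints concrete, here is a tiny sketch with made-up series (the values are chosen only for illustration):

```python
import pandas as pd

x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
df = pd.DataFrame({'x': x, 'up': 2 * x, 'down': -3 * x})

corr = df.corr()
print(corr.loc['x', 'up'])    # close to 1: full positive correlation
print(corr.loc['x', 'down'])  # close to -1: full negative correlation
```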

Looking at the above output, you see that US GDP fully correlates to US GDP. This is obvious, as it is the same variable. Next we have a 0.897376 correlation between US GDP and S&P 500 stock market index. This tells us that there is a high correlation.

Now to be a bit more specific. This correlation is linear.

That means it can be fitted well with a straight line. Let’s try to visualize that.

```
import matplotlib.pyplot as plt
import numpy as np

# Pearson fit (default method)
fit = np.polyfit(x=data['US GDP'], y=data['S&P 500'], deg=1)
line_fit = np.poly1d(fit)
plt.plot(data['US GDP'], line_fit(data['US GDP']))
plt.scatter(x=data['US GDP'], y=data['S&P 500'], color='red', alpha=0.1)
plt.title("Pearson correlation")
plt.show()
```

Resulting in the following fit.

Also, let’s investigate something that does not fit well, the US GDP with Oil prices.

```
import matplotlib.pyplot as plt
import numpy as np

# Pearson fit (default method)
fit = np.polyfit(x=data['US GDP'], y=data['Oil'], deg=1)
line_fit = np.poly1d(fit)
plt.plot(data['US GDP'], line_fit(data['US GDP']))
plt.scatter(x=data['US GDP'], y=data['Oil'], color='red', alpha=0.1)
plt.title("Pearson correlation")
plt.show()
```

As you can see visually, this does not fit a straight line as well as the above example. The closer the markers are to a fitted straight line, the higher the Pearson score. This is independent of the slope of the line, apart from its sign: a positive slope gives positive values, a negative slope negative values.
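A small sketch (made-up data) of that slope independence: two perfect straight lines with very different slopes both score a Pearson correlation of 1:

```python
import numpy as np
import pandas as pd

x = pd.Series(np.arange(10, dtype=float))
steep = 5.0 * x + 1.0   # steep straight line
flat = 0.1 * x + 7.0    # nearly flat straight line

# Both fit a straight line perfectly, so both give (close to) 1.
steep_corr = x.corr(steep)
flat_corr = x.corr(flat)
print(steep_corr, flat_corr)
```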

Just some notes to consider about the Pearson correlation coefficient. The requirement that the variables be normally distributed is controversial and outside the scope of this tutorial. That said, be careful drawing conclusions from the result. It may be an indicator, but do not conclude that a linear correlation exists (or not) based on this number alone.

## Step 3: Investigating the Kendall rank correlation coefficients

The Kendall rank correlation coefficient does not assume a normal distribution of the variables and looks for a monotonic relationship between two variables.

Two variables are monotonically correlated if any greater value of one variable results in a greater value of the other variable. If the variables are negatively monotonically correlated, the opposite holds.

The correlation can be calculated as follows.

```
print(data.corr(method="kendall"))
```

Resulting in the following output.

```
           US GDP   S&P 500      Gold       Oil
US GDP   1.000000  0.703141  0.685002  0.249430
S&P 500  0.703141  1.000000  0.426406  0.122434
Gold     0.685002  0.426406  1.000000  0.413298
Oil      0.249430  0.122434  0.413298  1.000000
```

Which interestingly shows that the Pearson correlation coefficient of US GDP and S&P 500 is higher than the Kendall rank correlation.

As a rule of thumb, a correlation less than 0.8 (or greater than -0.8) is often considered weak and not strongly correlated. This suggests that US GDP and the S&P 500 have a linear correlation but not a strong monotonic correlation.

Remember that these are two different measures and cannot be compared directly. As they measure different aspects, the difference is not surprising. The Pearson method can be thought of as how close the points are to a fitted line, while the Kendall method asks: if one variable grows, does the other? Looking at the plot, this is often not the case; there are many instances where it does not happen.
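A sketch on made-up data of that difference: a strictly increasing but curved relationship gets a perfect Kendall score while Pearson stays below 1 (pandas uses SciPy under the hood for the Kendall method):

```python
import numpy as np
import pandas as pd

x = np.arange(1.0, 11.0)
df = pd.DataFrame({'x': x, 'y': x ** 3})  # strictly increasing, but curved

pearson = df.corr(method='pearson').loc['x', 'y']
kendall = df.corr(method='kendall').loc['x', 'y']
print(pearson)  # below 1: the points bend away from any straight line
print(kendall)  # 1.0: every increase in x increases y
```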

## Step 4: Investigating the Spearman rank correlation

Spearman is closely related to Kendall, and measures whether the variables are monotonically correlated.

The Spearman rank correlation can be computed by the following.

```
print(data.corr(method="spearman"))
```

And results in the following output.

```
           US GDP   S&P 500      Gold       Oil
US GDP   1.000000  0.846197  0.837650  0.317295
S&P 500  0.846197  1.000000  0.609104  0.178937
Gold     0.837650  0.609104  1.000000  0.558569
Oil      0.317295  0.178937  0.558569  1.000000
```

Which actually is a bit more optimistic about the monotonic correlation between the US GDP and S&P 500.

Can we then conclude that when US GDP goes up, the S&P 500 goes up? Good question. The short answer is no. An example might make it more understandable. In summer, ice cream sales go up. But sunglasses sales also go up in summer. Does higher ice cream sales imply higher sunglasses sales? Not really. It is the extra sun that affects both.

The same can be true for correlations you find in data. Just think of them as an indicator that the variables somehow might be connected (or not, if the value is close to 0).

## Step 5: When to use what?

This is a good question.

• The Pearson correlation coefficient is in general considered stronger, as it makes stronger assumptions about the data. On the negative side, it only considers a full linear dependence (a fit to a straight line) and (in theory) requires the variables to be normally distributed. It is also very fragile to outliers (single points far away from the norm).
• The Kendall rank correlation coefficient should be more efficient on smaller sets. It measures the monotonic relationship between two variables, and it is a bit slower to calculate, O(n^2). It does not require the variables to be normally distributed.
• The Spearman rank correlation coefficient also measures the monotonic relationship between two variables. It is faster, O(n log(n)), and often gives a slightly higher value than Kendall's. It also does not require the variables to be normally distributed.
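A sketch (made-up data) of the outlier fragility mentioned above: one extreme point flips the Pearson value negative, while the rank-based Spearman measure stays positive:

```python
import pandas as pd

# A clean increasing relationship with one extreme outlier in the last row.
x = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], dtype=float)
y = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, -50], dtype=float)

p_corr = x.corr(y, method='pearson')   # dragged negative by the single outlier
s_corr = x.corr(y, method='spearman')  # rank-based, so still positive
print(p_corr, s_corr)
```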

## What will we cover in this tutorial?

In this tutorial we will visualize the inflation on a map. This will be done by getting the inflation data directly from World Bank using the Pandas-datareader. This data will be joined with data from GeoPandas, which provides a world map we can use to create a Choropleth map.

## Step 1: Retrieve the inflation data from World Bank

The Pandas-datareader has an interface to get data from the World Bank. To find interesting data from the World Bank, you should explore data.worldbank.org, which contains various interesting indicators.

When you find one, like the Inflation, consumer prices (annual %) we will use, you can see that you can download it as CSV, XML, or Excel. But we are not old-fashioned, hence we will use the direct API to get fresh data every time we run our program.

To use the API, we need the indicator, which you will find in the URL. In this case.

```
https://data.worldbank.org/indicator/FP.CPI.TOTL.ZG
```

Hence we have it: FP.CPI.TOTL.ZG.

Using the Pandas-datareader API you can get the data by running the following piece of code.

```
from pandas_datareader import wb

# Download the 2019 inflation figures for all countries.
data = wb.download(indicator='FP.CPI.TOTL.ZG', country='all', start=2019, end=2019)
print(data)
```

If you inspect the output, you will see it is structured a bit inconveniently.

```
                                                         FP.CPI.TOTL.ZG
country                                            year
Arab World                                         2019        1.336016
Caribbean small states                             2019             NaN
Central Europe and the Baltics                     2019        2.664561
Early-demographic dividend                         2019        3.030587
East Asia & Pacific                                2019        1.773102
East Asia & Pacific (excluding high income)        2019        2.779172
East Asia & Pacific (IDA & IBRD countries)         2019        2.779172
```

It has two indexes.

We want to reset index 1 (the year), which will turn year into a column. Then, for convenience, we rename the columns.

```
from pandas_datareader import wb

data = wb.download(indicator='FP.CPI.TOTL.ZG', country='all', start=2019, end=2019)
data = data.reset_index(1)
data.columns = ['year', 'inflation']
print(data)
```

Resulting in the following.

```
                                                    year  inflation
country
Arab World                                          2019   1.336016
Caribbean small states                              2019        NaN
Central Europe and the Baltics                      2019   2.664561
Early-demographic dividend                          2019   3.030587
East Asia & Pacific                                 2019   1.773102
East Asia & Pacific (excluding high income)         2019   2.779172
East Asia & Pacific (IDA & IBRD countries)          2019   2.779172
```

## Step 2: Retrieve the world map data

The world map data is available from GeoPandas. At first glance everything is easy.

```
import geopandas

map = geopandas.read_file(geopandas.datasets.get_path('naturalearth_lowres'))
map = map[map['name'] != 'Antarctica']
print(map)
```

Where I excluded Antarctica for visual purposes. Inspecting some of the output.

```
        pop_est                continent                      name iso_a3   gdp_md_est                                           geometry
0        920938                  Oceania                      Fiji    FJI      8374.00  MULTIPOLYGON (((180.00000 -16.06713, 180.00000...
1      53950935                   Africa                  Tanzania    TZA    150600.00  POLYGON ((33.90371 -0.95000, 34.07262 -1.05982...
2        603253                   Africa                 W. Sahara    ESH       906.50  POLYGON ((-8.66559 27.65643, -8.66512 27.58948...
3      35623680            North America                    Canada    CAN   1674000.00  MULTIPOLYGON (((-122.84000 49.00000, -122.9742...
4     326625791            North America  United States of America    USA  18560000.00  MULTIPOLYGON (((-122.84000 49.00000, -120.0000...
5      18556698                     Asia                Kazakhstan    KAZ    460700.00  POLYGON ((87.35997 49.21498, 86.59878 48.54918...
6      29748859                     Asia                Uzbekistan    UZB    202300.00  POLYGON ((55.96819 41.30864, 55.92892 44.99586...
```

It seems to be a good match to join the data on the name column.

To make it easy, we can make the name column the index.

```
import geopandas

map = geopandas.read_file(geopandas.datasets.get_path('naturalearth_lowres'))
map = map[map['name'] != 'Antarctica']
map = map.set_index('name')
```

## Step 3: Joining the datasets

This is the fun part of Data Science. Why? I am glad you asked. Well, that was irony. The challenge will be apparent in a moment. There are various ways to deal with it, but in this tutorial we will use a simplistic approach.

Let us do the join.

```
from pandas_datareader import wb
import geopandas
import pandas as pd

pd.set_option('display.width', 3000)
pd.set_option('display.max_columns', 300)
pd.set_option('display.max_rows', 500)

map = geopandas.read_file(geopandas.datasets.get_path('naturalearth_lowres'))
map = map[map['name'] != 'Antarctica']
map = map.set_index('name')

data = wb.download(indicator='FP.CPI.TOTL.ZG', country='all', start=2019, end=2019)
data = data.reset_index(1)
data.columns = ['year', 'inflation']

map = map.join(data, how='outer')
print(map)
```

Where I use an outer join, to get all the “challenges” visible.

```
Russia                                              1.422575e+08                   Europe    RUS   3745000.00  MULTIPOLYGON (((178.72530 71.09880, 180.00000 ...   NaN        NaN
Russian Federation                                           NaN                      NaN    NaN          NaN                                               None  2019   4.470367
...
United States                                                NaN                      NaN    NaN          NaN                                               None  2019   1.812210
United States of America                            3.266258e+08            North America    USA  18560000.00  MULTIPOLYGON (((-122.84000 49.00000, -120.0000...   NaN        NaN
```

Where I only took two snippets. The key thing here is that the data from GeoPandas, containing the map, and the data from the World Bank, containing the inflation rates we want to color the map with, do not join on country names.

Hence, we need to join United States together with United States of America. And Russia with Russian Federation.

One approach would be to use a location service that maps country names to country codes, and then join the two datasets on those codes (note that GeoPandas already has 3-letter country codes, but some are missing, like Norway's). This approach can still leave some missing pieces, as some country names are not known by the mapping.

Another approach is to find all the data not mapped and rename it in one of the datasets. This can take some time, but I did most of them in the following.

```
from pandas_datareader import wb
import geopandas

map = geopandas.read_file(geopandas.datasets.get_path('naturalearth_lowres'))
map = map[map['name'] != 'Antarctica']
map = map.set_index('name')
index_change = {
    'United States of America': 'United States',
    'Yemen': 'Yemen, Rep.',
    'Venezuela': 'Venezuela, RB',
    'Syria': 'Syrian Arab Republic',
    'Solomon Is.': 'Solomon Islands',
    'Russia': 'Russian Federation',
    'Iran': 'Iran, Islamic Rep.',
    'Gambia': 'Gambia, The',
    'Kyrgyzstan': 'Kyrgyz Republic',
    'Mauritania': 'Mauritius',
    'Egypt': 'Egypt, Arab Rep.'
}
map = map.rename(index=index_change)

data = wb.download(indicator='FP.CPI.TOTL.ZG', country='all', start=2019, end=2019)
data = data.reset_index(1)
data.columns = ['year', 'inflation']

map = map.join(data, how='outer')
```

## Step 4: Making a Choropleth map based on our dataset

The simple plot of the data will not be very insightful. But let’s try that first.

```
map.plot('inflation')
plt.title("Inflation 2019")
plt.show()
```

Resulting in the following.

A good way to get inspiration is to check out the documentation with examples.

Where you see a cool color map with scheme='quantiles'. Let's try that.

```
map.plot('inflation', cmap='OrRd', scheme='quantiles')
plt.title("Inflation 2019")
plt.show()
```

Resulting in the following.

Add a grey tone for countries not mapped, a legend, and a figure size, and we are done. The full source code is here.

```
from pandas_datareader import wb
import geopandas
import pandas as pd
import matplotlib.pyplot as plt

map = geopandas.read_file(geopandas.datasets.get_path('naturalearth_lowres'))
map = map[map['name'] != 'Antarctica']
map = map.set_index('name')
index_change = {
    'United States of America': 'United States',
    'Yemen': 'Yemen, Rep.',
    'Venezuela': 'Venezuela, RB',
    'Syria': 'Syrian Arab Republic',
    'Solomon Is.': 'Solomon Islands',
    'Russia': 'Russian Federation',
    'Iran': 'Iran, Islamic Rep.',
    'Gambia': 'Gambia, The',
    'Kyrgyzstan': 'Kyrgyz Republic',
    'Mauritania': 'Mauritius',
    'Egypt': 'Egypt, Arab Rep.'
}
map = map.rename(index=index_change)

data = wb.download(indicator='FP.CPI.TOTL.ZG', country='all', start=2019, end=2019)
data = data.reset_index(1)
data.columns = ['year', 'inflation']

map = map.join(data, how='outer')

map.plot('inflation', cmap='OrRd', scheme='quantiles', missing_kwds={"color": "lightgrey"}, legend=True, figsize=(14,5))
plt.title("Inflation 2019")
plt.show()
```

Resulting in the following output. Inflation data from World Bank mapped on a Choropleth map using GeoPandas and MatPlotLib.

## What will we cover in this tutorial?

We will continue our exploration of the amazing Pandas-datareader. In this tutorial we will further investigate data from the World Bank and correlate it with the S&P 500 stock index. We will do this both by visualizing the three graphs on two different y-axes and by computing the correlation.

## Step 1: Get World Bank data

In this tutorial we will only look at data from the United States. If you are interested in other tutorials on World Bank data, you should read this one and this one.

To get the data from the World Bank you can use the Pandas-datareader, which has a function to download data if you have the indicator.

```
pandas_datareader.wb.download(country=None, indicator=None, start=2003, end=2005, freq=None, errors='warn', **kwargs)
```

That takes the country and indicator, start, and end year as arguments.

You can find indicators on the webpage of World Bank.

In this tutorial we will use SL.UEM.TOTL.ZS, the total unemployment (% of total labor force), and FR.INR.RINR, the real interest rate.

To inspect the data you can use the following code.

```
from pandas_datareader import wb

data = wb.download(indicator=['SL.UEM.TOTL.ZS', 'FR.INR.RINR'], country=['US'], start=1990, end=2019)

uem_data = data.unstack().T.loc['SL.UEM.TOTL.ZS']
uem_data.columns = ['US-unempl']
int_data = data.unstack().T.loc['FR.INR.RINR']
int_data.columns = ['US-int']

data = int_data.join(uem_data)

print(data)
```

Giving an output similar to this (lines excluded).

```
        US-int  US-unempl
year
1990  6.039744        NaN
1991  4.915352      6.800
1992  3.884240      7.500
1993  3.546689      6.900
1994  4.898356      6.119
1995  6.594069      5.650
1996  6.324008      5.451
1997  6.603407      5.000
1998  7.148192      4.510
1999  6.457135      4.219
```

For details on the unstacking and transposing, see this tutorial.

## Step 2: Join the data from the S&P 500 index

First let’s get the data from S&P 500, which has ticker ^GSPC.

You can use the Pandas-datareader for that.

```
import pandas_datareader as pdr
import datetime as dt

start = dt.datetime(1990, 1, 1)
end = dt.datetime(2019, 12, 31)
sp500 = pdr.get_data_yahoo("^GSPC", start, end)['Adj Close']
sp500.name = 'S&P 500'
print(sp500)
```

Resulting in the following output.

```
Date
1990-01-02     359.690002
1990-01-03     358.760010
1990-01-04     355.670013
1990-01-05     352.200012
1990-01-08     353.790009
...
2019-12-24    3223.379883
2019-12-26    3239.909912
2019-12-27    3240.020020
2019-12-30    3221.290039
2019-12-31    3230.780029
Name: S&P 500, Length: 7559, dtype: float64
```

Problem! The date is a datetime object in the above Series, while it is a string with a year in the DataFrame with the unemployment and interest rates above.

To successfully join them, we need to convert them into the same format. The best way is to convert them into datetimes. We can do that with the pd.to_datetime() function.
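The conversion itself can be sketched on a few year strings (the values are illustrative):

```python
import pandas as pd

# Year strings like the ones in the World Bank index.
years = pd.Index(['1990', '1991', '1992'], name='year')
dates = pd.to_datetime(years, format='%Y')
print(dates[0])  # 1990-01-01 00:00:00
```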

```
import pandas_datareader as pdr
from pandas_datareader import wb
import pandas as pd
import datetime as dt

data = wb.download(indicator=['SL.UEM.TOTL.ZS', 'FR.INR.RINR'], country=['US'], start=1990, end=2019)

uem_data = data.unstack().T.loc['SL.UEM.TOTL.ZS']
uem_data.columns = ['US-unempl']
int_data = data.unstack().T.loc['FR.INR.RINR']
int_data.columns = ['US-int']

data = int_data.join(uem_data)
data.index = pd.to_datetime(data.index, format='%Y')

start = dt.datetime(1990, 1, 1)
end = dt.datetime(2019, 12, 31)
sp500 = pdr.get_data_yahoo("^GSPC", start, end)['Adj Close']
sp500.name = 'S&P 500'

data = sp500.to_frame().join(data, how='outer')
print(data)
```

Resulting in the following output.

```
                S&P 500    US-int  US-unempl
1990-01-01          NaN  6.039744        NaN
1990-01-02   359.690002       NaN        NaN
1990-01-03   358.760010       NaN        NaN
1990-01-04   355.670013       NaN        NaN
1990-01-05   352.200012       NaN        NaN
...                 ...       ...        ...
2019-12-24  3223.379883       NaN        NaN
2019-12-26  3239.909912       NaN        NaN
2019-12-27  3240.020020       NaN        NaN
2019-12-30  3221.290039       NaN        NaN
2019-12-31  3230.780029       NaN        NaN
```

The problem you see there is that US-int and US-unempl only have data for the first of January every year. To fix that, we can make a linear interpolation of the data by applying the following.

```
data = data.interpolate(method='linear')
```

Resulting in.

```
                S&P 500    US-int  US-unempl
1990-01-01          NaN  6.039744        NaN
1990-01-02   359.690002  6.035318        NaN
1990-01-03   358.760010  6.030891        NaN
1990-01-04   355.670013  6.026464        NaN
1990-01-05   352.200012  6.022037        NaN
...                 ...       ...        ...
2019-12-24  3223.379883  3.478200      3.682
2019-12-26  3239.909912  3.478200      3.682
2019-12-27  3240.020020  3.478200      3.682
2019-12-30  3221.290039  3.478200      3.682
2019-12-31  3230.780029  3.478200      3.682
```

Notice that since there is no unemployment data for the US in 1990, the values are left as NaN until the first rate is given.
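That leading-NaN behaviour of interpolate() can be sketched on a tiny made-up series:

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, np.nan, 2.0, np.nan, 4.0])
filled = s.interpolate(method='linear')
# Interior gaps are filled, but the leading NaNs stay.
print(filled.tolist())  # [nan, nan, 2.0, 3.0, 4.0]
```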

## Step 3: Visualize all three graphs with 2 different y-axes

Here Pandas is quite strong. By default, you can create a secondary y-axis. The three datasets only need two y-axes, as the unemployment and interest rates can share one.

```
import pandas_datareader as pdr
from pandas_datareader import wb
import pandas as pd
import matplotlib.pyplot as plt
import datetime as dt

pd.set_option('display.max_rows', 300)
pd.set_option('display.max_columns', 10)
pd.set_option('display.width', 1000)

data = wb.download(indicator=['SL.UEM.TOTL.ZS', 'FR.INR.RINR'], country=['US'], start=1990, end=2019)

uem_data = data.unstack().T.loc['SL.UEM.TOTL.ZS']
uem_data.columns = ['US-unempl']
int_data = data.unstack().T.loc['FR.INR.RINR']
int_data.columns = ['US-int']

data = int_data.join(uem_data)
data.index = pd.to_datetime(data.index, format='%Y')

start = dt.datetime(1990, 1, 1)
end = dt.datetime(2019, 12, 31)
sp500 = pdr.get_data_yahoo("^GSPC", start, end)['Adj Close']
sp500.name = 'S&P 500'

data = sp500.to_frame().join(data, how='outer')
data = data.interpolate(method='linear')

ax = data['S&P 500'].plot(legend=True)
ax = data[['US-int','US-unempl']].plot(ax=ax, secondary_y=True, legend=True)

print(data.corr())

plt.show()
```

Where the correlation is given here.

```
            S&P 500    US-int  US-unempl
S&P 500    1.000000 -0.408429  -0.453315
US-int    -0.408429  1.000000  -0.470103
US-unempl -0.453315 -0.470103   1.000000
```

Which is surprisingly low. Visually, you can see it in the resulting graph of the S&P 500 stock index, the US interest rate, and the US unemployment rate.

## What will we cover in this tutorial?

In this tutorial we will explore some aspects of the Pandas-Datareader, which is an invaluable way to get data from many sources, including the World Bank and Yahoo! Finance.

In this tutorial we will investigate if the GDP of a country is correlated to the stock market.

## Step 1: Get GDP data from World Bank

In the previous tutorial we looked at GDP per capita and compared it between countries. GDP per capita is a good way to compare countries' economies with each other.

In this tutorial we will look at the GDP, using the NY.GDP.MKTP.CD indicator of GDP in current US$.

```
from pandas_datareader import wb

gdp = wb.download(indicator='NY.GDP.MKTP.CD', country=['US'], start=1990, end=2019)
print(gdp)
```

Resulting in the following output.

```
                    NY.GDP.MKTP.CD
country       year
United States 2019  21427700000000
              2018  20580223000000
              2017  19485393853000
              2016  18707188235000
              2015  18219297584000
              2014  17521746534000
              2013  16784849190000
              2012  16197007349000
              2011  15542581104000
```

## Step 2: Gathering the stock index

Then we need to gather the data from the stock market. As we look at the US stock market, the S&P 500 index is a good indicator of the market.

The ticker of S&P 500 is ^GSPC (yes, with the ^).

The Yahoo! Finance API is a great place to collect this type of data.

```
import pandas_datareader as pdr
import datetime as dt

start = dt.datetime(1990, 1, 1)
end = dt.datetime(2019, 12, 31)
sp500 = pdr.get_data_yahoo("^GSPC", start, end)['Adj Close']
print(sp500)
```

Resulting in the following output.

```
Date
1990-01-02     359.690002
1990-01-03     358.760010
1990-01-04     355.670013
1990-01-05     352.200012
1990-01-08     353.790009
...
2019-12-24    3223.379883
2019-12-26    3239.909912
2019-12-27    3240.020020
2019-12-30    3221.290039
2019-12-31    3230.780029
```

## Step 3: Visualizing the data on one plot

A good way to see if there is a correlation is simply by visualizing it.

This can be done with a few tweaks.

```
import pandas_datareader as pdr
from pandas_datareader import wb
import pandas as pd
import matplotlib.pyplot as plt
import datetime as dt

gdp = wb.download(indicator='NY.GDP.MKTP.CD', country=['US'], start=1990, end=2019)
gdp = gdp.unstack().T.reset_index(0)
gdp.index = pd.to_datetime(gdp.index, format='%Y')

start = dt.datetime(1990, 1, 1)
end = dt.datetime(2019, 12, 31)
sp500 = pdr.get_data_yahoo("^GSPC", start, end)['Adj Close']

data = sp500.to_frame().join(gdp, how='outer')
data = data.interpolate(method='linear')

ax = data['Adj Close'].plot()
data['United States'].plot(ax=ax, secondary_y=True)

plt.show()
```

The GDP data needs to be formatted differently: by unstacking, transposing, and resetting the index. Then the index is converted from strings of years to an actual time series.

We use an outer join to get all the dates in the time series. Then we interpolate with a linear method to fill out the gaps in the graph.

Finally, we make a plot of the Adj Close of the S&P 500 stock index and one of the GDP of the United States, in the same graph but using the secondary y-axis for the GDP. That means the time series on the x-axis is the same.

The resulting graph is.

It could look like a correlation, which is visible in the aftermath of 2008.

## Step 4: Calculate a correlation

Let’s try to make some correlation calculations.

First, let’s not just rely on how US GDP correlates to the US stock market. Let us try to relate it to other countries GDP and see how they relate to the strongest economy in the world.

```import pandas_datareader as pdr
from pandas_datareader import wb
import pandas as pd
import matplotlib.pyplot as plt
import datetime as dt

gdp = wb.download(indicator='NY.GDP.MKTP.CD', country=['NO', 'FR', 'US', 'GB', 'DK', 'DE', 'SE'], start=1990, end=2019)

gdp = gdp.unstack().T.reset_index(0)
gdp.index = pd.to_datetime(gdp.index, format='%Y')

start = dt.datetime(1990, 1, 1)
end = dt.datetime(2019, 12, 31)
sp500 = pdr.get_data_yahoo("^GSPC", start, end)['Adj Close']

data = sp500.to_frame().join(gdp, how='outer')
data = data.interpolate(method='linear')

print(data.corr())

```

Here we compare it to the GDP of some more countries to verify our hypothesis.

```                Adj Close   Denmark    France   Germany    Norway    Sweden  United Kingdom  United States
Adj Close        1.000000  0.729701  0.674506  0.727289  0.653507  0.718829        0.759239       0.914303
Denmark          0.729701  1.000000  0.996500  0.986769  0.975780  0.978550        0.955674       0.926139
France           0.674506  0.996500  1.000000  0.982225  0.979767  0.974825        0.945877       0.893780
Germany          0.727289  0.986769  0.982225  1.000000  0.953131  0.972542        0.913443       0.916239
Norway           0.653507  0.975780  0.979767  0.953131  1.000000  0.978784        0.933795       0.878704
Sweden           0.718829  0.978550  0.974825  0.972542  0.978784  1.000000        0.930621       0.916530
United Kingdom   0.759239  0.955674  0.945877  0.913443  0.933795  0.930621        1.000000       0.915859
United States    0.914303  0.926139  0.893780  0.916239  0.878704  0.916530        0.915859       1.000000
```

Now that is interesting. The US Stock market (Adj Close) correlates the strongest with the US GDP. Not surprising.

Of the chosen countries, the Danish GDP is the second most correlated to the US stock market. The GDPs of the countries all correlate strongly with the US GDP, with Norway correlating the least.
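The table above is produced by DataFrame.corr(), which computes pairwise Pearson correlations between all numeric columns. A minimal sketch with made-up numbers:

```python
import pandas as pd

# Made-up columns: 'b' is perfectly linearly related to 'a',
# 'c' is perfectly inversely related to 'a'
df = pd.DataFrame({'a': [1.0, 2.0, 3.0, 4.0],
                   'b': [2.0, 4.0, 6.0, 8.0],
                   'c': [4.0, 3.0, 2.0, 1.0]})

# corr() computes pairwise Pearson correlations by default
corr = df.corr()
print(corr)
```

The diagonal is always 1.0 (a column correlates perfectly with itself), which is why the table above has 1.000000 down the diagonal.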

Continue the exploration of World Bank data.

## What will we cover in this tutorial?

Introduction to the Pandas-Datareader, which is an invaluable way to get data from many sources, including the World Bank.

In this tutorial we will cover how to get yearly GDP per capita data for a set of countries and plot it.

## Step 1: Get to know World Bank as a data source

The World Bank was founded in 1944 to make loans to low-income countries, with the purpose to decrease poverty in the world (see wikipedia.org for further history).

What you might not know is that the World Bank has an amazing set of data that you can either browse on their webpage or access directly in Python using the Pandas-Datareader.

We will take a look at how to extract the NY.GDP.PCAP.KD indicator.

The what?

I know. It is the GDP per capita (constant 2010 US\$), as it states on the webpage.

On that page you can get the GDP per capita for each country in the world back to 1960.

That is what we are going to do.

## Step 2: Get the data

Reading the Pandas-datareader’s World Bank documentation, you come across the following function.

`pandas_datareader.wb.download(country=None, indicator=None, start=2003, end=2005, freq=None, errors='warn', **kwargs)`

Where you can set the country (or countries) and the indicator you want:

• country (string or list of strings) – `all` downloads data for all countries; 2 or 3 character ISO country codes select individual countries (e.g. “US”, “CA” or “USA”, “CAN”). The codes can be mixed. The two ISO lists of countries, provided by Wikipedia, are hardcoded into pandas as of 11/10/2014.
• indicator (string or list of strings) – taken from the `id` field in `WDIsearch()`

Luckily we already have our indicator from Step 1 (NY.GDP.PCAP.KD). Then we just need to find some countries of interest.

Let’s take United States, France, Great Britain, Denmark and Norway.

```from pandas_datareader import wb

dat = wb.download(indicator='NY.GDP.PCAP.KD', country=['US', 'FR', 'GB', 'DK', 'NO'], start=1960, end=2019)
print(dat)
```

Resulting in the following output.

```                    NY.GDP.PCAP.KD
country       year
Denmark       2019    65147.427182
2018    63915.468361
2017    62733.019808
2016    61877.976481
2015    60402.129248
...                            ...
United States 1964    19824.587845
1963    18999.888387
1962    18462.935998
1961    17671.150187
1960    17562.592084

[300 rows x 1 columns]
```

## Step 3: Visualize the data on a graph

We need to restructure the data in order to make a nice graph.

This can be done with unstack.

```from pandas_datareader import wb
import matplotlib.pyplot as plt

dat = wb.download(indicator='NY.GDP.PCAP.KD', country=['US', 'FR', 'GB', 'DK', 'NO'], start=1960, end=2019)
print(dat.unstack())
```

Which results in this output.

```               NY.GDP.PCAP.KD                ...
year                     1960          1961  ...          2018          2019
country                                      ...
Denmark          20537.549556  21695.609308  ...  63915.468361  65147.427182
France           12743.925100  13203.320855  ...  43720.026351  44317.392315
Norway           23167.441740  24426.011426  ...  92119.522964  92556.321645
United Kingdom   13934.029831  14198.673562  ...  43324.049759  43688.437455
United States    17562.592084  17671.150187  ...  54795.450086  55809.007792
```

If we transpose this and remove the double index that results, it should be good to make a plot with.

```from pandas_datareader import wb
import matplotlib.pyplot as plt

dat = wb.download(indicator='NY.GDP.PCAP.KD', country=['US', 'FR', 'GB', 'DK', 'NO'], start=1960, end=2019)

print(dat.unstack().T.reset_index(0))

dat.unstack().T.reset_index(0).plot()
plt.title('GDP per capita')
plt.show()
```

Giving this output, where you can see what the Transpose (T) does.

```country         level_0       Denmark  ...  United Kingdom  United States
year                                   ...
1960     NY.GDP.PCAP.KD  20537.549556  ...    13934.029831   17562.592084
1961     NY.GDP.PCAP.KD  21695.609308  ...    14198.673562   17671.150187
1962     NY.GDP.PCAP.KD  22747.292463  ...    14233.959944   18462.935998
1963     NY.GDP.PCAP.KD  22712.577808  ...    14816.480305   18999.888387
1964     NY.GDP.PCAP.KD  24620.461432  ...    15535.026991   19824.587845
1965     NY.GDP.PCAP.KD  25542.173921  ...    15766.195724   20831.299767
1966     NY.GDP.PCAP.KD  26032.378816  ...    15926.169851   21930.591173
1967     NY.GDP.PCAP.KD  27256.322071  ...    16282.026160   22235.415708
```

And the plot gives the following graph.
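To see what each step of the reshaping does, here is a self-contained sketch on a toy two-level frame with made-up values, shaped like the World Bank download:

```python
import pandas as pd

# A toy country/year frame shaped like the wb.download result
# (values are made up)
idx = pd.MultiIndex.from_product([['Denmark', 'Norway'], ['2018', '2019']],
                                 names=['country', 'year'])
dat = pd.DataFrame({'NY.GDP.PCAP.KD': [63915.0, 65147.0, 92119.0, 92556.0]},
                   index=idx)

# unstack() moves 'year' into the columns, .T flips the years back onto
# the rows, and reset_index(0) turns the indicator level into a plain
# column (named 'level_0' because the level itself is unnamed)
wide = dat.unstack().T.reset_index(0)
print(wide)
```

After the chain, the years are the row index and each country is a column, which is exactly the shape plot() needs to draw one line per country.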

Continue the exploration in the following tutorial.

## What will we cover in this tutorial?

We will continue our journey to explore a big dataset of 145,000+ respondents to a RIASEC test. If you want to explore the full journey, we recommend you read this tutorial first as well as the second part of the tutorial, and finally, the third part before continuing.

In this part we will investigate if we can see any correlation between the major of education and the 6 dimensions of the personality types in RIASEC.

## Step 1: Group into major of educations

This is getting tricky, as the majors are typed in by the respondent. We will be missing some connections between them.

But let’s start by exploring them.

```import pandas as pd

data = pd.read_csv('riasec.csv', delimiter='\t')
major = data.loc[:,['major']]

print(major.groupby('major').size().sort_values(ascending=False))
```

The output is given here.

```major
psychology                6861
Psychology                5763
English                   2342
Biology                   1289
...
Sociology, Social work       1
Sociology, Psychology        1
Sociology, Math              1
Sociology, Linguistics       1
Nuerobiology                 1
Length: 15955, dtype: int64
```

Where we identify one problem: some respondents write in lowercase and others in uppercase.

## Step 2: Clean up a few ambiguities

The first step would be to lowercase everything.

```import pandas as pd

data = pd.read_csv('riasec.csv', delimiter='\t')
major = data.loc[:,['major']]
major['major'] = major['major'].str.lower()
print(major.groupby('major').size().sort_values(ascending=False).iloc[:10])
```

Now printing the first 10 lines.

```major
psychology          12766
english              3042
nursing              2142
biology              1961
education            1800
engineering          1353
accounting           1186
computer science     1159
psychology           1098
dtype: int64
```

Where we notice that psychology appears both first and last. Inspecting it further, it seems that the last one has a space after it. Hence, we can try to remove whitespace around all the majors.

```import pandas as pd
import numpy as np

data = pd.read_csv('riasec.csv', delimiter='\t')
major = data.loc[:,['major']]
major['major'] = major['major'].str.lower()
major['major'] = major.apply(lambda row: row['major'].strip() if row['major'] is not np.nan else np.nan, axis=1)

print(major.groupby('major').size().sort_values(ascending=False).iloc[:10])
```

Now the output is as follows.

```major
psychology          13878
english              3240
nursing              2396
biology              2122
education            1954
engineering          1504
accounting           1292
computer science     1240
law                  1111
dtype: int64
```

Introducing law at the bottom of the list.
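As a side note, the apply/lambda with the explicit NaN check can be replaced by the built-in vectorized string methods, which already propagate NaN; a sketch on made-up values:

```python
import pandas as pd
import numpy as np

# Made-up majors with stray case and whitespace, plus a missing answer
major = pd.DataFrame({'major': ['Psychology ', ' english', np.nan, 'BIOLOGY']})

# .str.lower() and .str.strip() skip NaN entries automatically,
# so no explicit NaN check is needed
major['major'] = major['major'].str.lower().str.strip()
print(major['major'].tolist())
```

This produces the same cleaned values, with the missing answer staying NaN.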

This process could continue, but let’s keep the focus on these 10 highest representative educations in the dataset. Obviously, further data points could be added if investigating it further.

## Step 3: See if education correlates to known words

First let’s explore the dataset a bit more. The respondents are asked if they know the definitions of the following words.

• boat
• incoherent
• pallid
• robot
• audible
• cuivocal
• paucity
• epistemology
• florted
• decide
• pastiche
• verdid
• abysmal
• lucid
• betray
• funny

They mark each word they know. Hence, we can count the number of words each respondent knows and calculate an average per major group.

```import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

data = pd.read_csv('riasec.csv', delimiter='\t')

data['VCL'] = data['VCL1'] + data['VCL2'] + data['VCL3'] + data['VCL4'] + data['VCL5'] + data['VCL6'] + data['VCL7'] + data['VCL8'] + data['VCL9'] + data['VCL10'] + data['VCL11'] + data['VCL12'] + data['VCL13'] + data['VCL14'] + data['VCL15'] + data['VCL16']

view = data.loc[:, ['VCL', 'major']]
view['major'] = view['major'].str.lower()
view['major'] = view.apply(lambda row: row['major'].strip() if row['major'] is not np.nan else np.nan, axis=1)

view = view.groupby('major').aggregate(['mean', 'count'])
view = view[view['VCL','count'] > 1110]
view.loc[:,('VCL','mean')].plot(kind='barh', figsize=(14,5))
plt.show()
```

Which results in the following output.

The engineers seem to score lower than the nursing majors. Well, I am actually surprised that Computer Science scores that high.
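The 16-term chain of additions for VCL used above also has a shorter equivalent: select the columns and sum across each row. A sketch with four made-up checkbox columns:

```python
import pandas as pd

# Three made-up respondents, four word-checkbox columns (1 = knows it)
data = pd.DataFrame({'VCL1': [1, 0, 1],
                     'VCL2': [1, 1, 0],
                     'VCL3': [0, 1, 1],
                     'VCL4': [1, 0, 1]})

# Build the column names and sum across each row (axis=1) instead of
# chaining all the additions by hand
cols = [f'VCL{i}' for i in range(1, 5)]
data['VCL'] = data[cols].sum(axis=1)
print(data['VCL'].tolist())  # [3, 2, 3]
```

For the real dataset the range would be `range(1, 17)` to cover VCL1 through VCL16.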

## Step 4: Adding it all up together

Let’s use what we did in previous tutorial and use the calculations from there.

```import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

data = pd.read_csv('riasec.csv', delimiter='\t')

def sum_dimension(data, letter):
    return data[letter + '1'] + data[letter + '2'] + data[letter + '3'] + data[letter + '4'] + data[letter + '5'] + data[letter + '6'] + data[letter + '7'] + data[letter + '8']

data['R'] = sum_dimension(data, 'R')
data['I'] = sum_dimension(data, 'I')
data['A'] = sum_dimension(data, 'A')
data['S'] = sum_dimension(data, 'S')
data['E'] = sum_dimension(data, 'E')
data['C'] = sum_dimension(data, 'C')
data['VCL'] = data['VCL1'] + data['VCL2'] + data['VCL3'] + data['VCL4'] + data['VCL5'] + data['VCL6'] + data['VCL7'] + data['VCL8'] + data['VCL9'] + data['VCL10'] + data['VCL11'] + data['VCL12'] + data['VCL13'] + data['VCL14'] + data['VCL15'] + data['VCL16']

view = data.loc[:, ['R', 'I', 'A', 'S', 'E', 'C', 'VCL', 'major']]
view['major'] = view['major'].str.lower()
view['major'] = view.apply(lambda row: row['major'].strip() if row['major'] is not np.nan else np.nan, axis=1)

view = view.groupby('major').aggregate(['mean', 'count'])
view = view[view['VCL','count'] > 1110]
view.loc[:,[('R','mean'), ('I','mean'), ('A','mean'), ('S','mean'), ('E','mean'), ('C','mean')]].plot(kind='barh', figsize=(14,5))
plt.show()
```

Which results in the following diagram.

Biology has a high I (Investigative; people who prefer to work with data), while the R (Realistic; people who like to work with things) is dominated by engineers and computer scientists.

Hmm… I should have noticed earlier that so many respondents have education as their major.

## What will we cover in this tutorial?

We will continue our journey to explore a big dataset of 145,000+ respondents to a RIASEC test. If you want to explore the full journey, we recommend you read this tutorial first as well as the second part of the tutorial.

In this part we are going to combine some data into the 6 dimensions of personality types of the RIASEC and see if there is any correlation with the educational level.

## Step 1: Understand the dataset better

The dataset is built by letting the respondents rate themselves on statements related to the 6 personality types in RIASEC. The personality types are given as follows (also see Wikipedia for a deeper description).

• Realistic (R): People that like to work with things. They tend to be “assertive and competitive, and are interested in activities requiring motor coordination, skill and strength”. They approach problem solving “by doing something, rather than talking about it, or sitting and thinking about it”. They also prefer “concrete approaches to problem solving, rather than abstract theory”. Finally, their interests tend to focus on “scientific or mechanical rather than cultural and aesthetic areas”.
• Investigative (I): People who prefer to work with “data.” They like to “think and observe rather than act, to organize and understand information rather than to persuade”. They also prefer “individual rather than people oriented activities”.
• Artistic (A): People who like to work with “ideas and things”. They tend to be “creative, open, inventive, original, perceptive, sensitive, independent and emotional”. They rebel against “structure and rules”, but enjoy “tasks involving people or physical skills”. They tend to be more emotional than the other types.
• Social (S): People who like to work with “people” and who “seem to satisfy their needs in teaching or helping situations”. They tend to be “drawn more to seek close relationships with other people and are less apt to want to be really intellectual or physical”.
• Enterprising (E): People who like to work with “people and data”. They tend to be “good talkers, and use this skill to lead or persuade others”. They “also value reputation, power, money and status”.
• Conventional (C): People who prefer to work with “data” and who “like rules and regulations and emphasize self-control … they like structure and order, and dislike unstructured or unclear work and interpersonal situations”. They also “place value on reputation, power, or status”.

In the test they have rated themselves from 1 to 5 (1=Dislike, 3=Neutral, 5=Enjoy) on statements related to these 6 personality types.

That way each respondent can be rated on these 6 dimensions.

## Step 2: Prepare the dataset

We want to score the respondent according to how they have rated themselves on the 8 statements for each of the 6 personality types.

This can be achieved by the following code.

```import pandas as pd

data = pd.read_csv('riasec.csv', delimiter='\t')

def sum_dimension(data, letter):
    return data[letter + '1'] + data[letter + '2'] + data[letter + '3'] + data[letter + '4'] + data[letter + '5'] + data[letter + '6'] + data[letter + '7'] + data[letter + '8']

data['R'] = sum_dimension(data, 'R')
data['I'] = sum_dimension(data, 'I')
data['A'] = sum_dimension(data, 'A')
data['S'] = sum_dimension(data, 'S')
data['E'] = sum_dimension(data, 'E')
data['C'] = sum_dimension(data, 'C')

view = data.loc[:,['education', 'R', 'I', 'A', 'S', 'E', 'C']]
print(view)
```

In the view we make, we keep the education with the dimension ratings we have calculated, because we want to see if there is any correlation between education level and personality type.

We get the following output.

```        education   R   I   A   S   E   C
0               2  20  33  27  37  16  12
1               2  14  35  19  22  10  10
2               2   9  11  11  30  24  16
3               1  15  21  27  20  25  19
4               3  13  36  34  37  20  26
...           ...  ..  ..  ..  ..  ..  ..
145823          3  10  19  28  28  20  13
145824          3  11  18  39  35  24  16
145825          2   8   8   8  36  12  21
145826          3  29  29  29  34  16  19
145827          2  21  33  19  30  27  24
```

Where we see the dimensions ratings and the corresponding education level.

## Step 3: Compute the correlations

The education is given by the following scale.

• 1: Less than high school
• 2: High school
• 3: University degree
• 4: Graduate degree

Hence, we need to remove the no-answer group (0) from the data to not skew the results.

```import pandas as pd

data = pd.read_csv('riasec.csv', delimiter='\t')

def sum_dimension(data, letter):
    return data[letter + '1'] + data[letter + '2'] + data[letter + '3'] + data[letter + '4'] + data[letter + '5'] + data[letter + '6'] + data[letter + '7'] + data[letter + '8']

data['R'] = sum_dimension(data, 'R')
data['I'] = sum_dimension(data, 'I')
data['A'] = sum_dimension(data, 'A')
data['S'] = sum_dimension(data, 'S')
data['E'] = sum_dimension(data, 'E')
data['C'] = sum_dimension(data, 'C')

view = data.loc[:,['education', 'R', 'I', 'A', 'S', 'E', 'C']]

view = view[view['education'] != 0]

print(view.mean())
print(view.groupby('education').mean())
print(view.corr())
```

The output of the mean is given here.

```education     2.394318
R            16.651624
I            23.994637
A            22.887701
S            26.079349
E            20.490080
C            19.105188
dtype: float64
```

Which says that the average educational level of the 145,000+ respondents was 2.394318. You can also see that the respondents on average rated themselves highest on Social, then Investigative. The lowest rated dimension was Realistic.

The output of educational group by mean is given here.

```                   R          I          A          S          E          C
education
1          15.951952  23.103728  21.696007  23.170792  19.897772  17.315641
2          16.775297  23.873645  22.379625  25.936032  20.864591  19.551138
3          16.774487  24.302158  23.634034  27.317784  20.468160  19.606312
4          16.814534  24.769829  24.347250  27.382699  20.038501  18.762395
```

Where you can see that those with less than high school actually rate themselves lower in all dimensions. While the highest educated rate themselves highest on Realistic, Artistic, and Social.

Does that mean the higher education the more social, artistic or realistic you are?

The output of the correlation is given here.

```           education         R         I         A         S         E         C
education   1.000000  0.029008  0.057466  0.105946  0.168640 -0.006115  0.044363
R           0.029008  1.000000  0.303895  0.206085  0.109370  0.340535  0.489504
I           0.057466  0.303895  1.000000  0.334159  0.232608  0.080878  0.126554
A           0.105946  0.206085  0.334159  1.000000  0.350631  0.322099  0.056576
S           0.168640  0.109370  0.232608  0.350631  1.000000  0.411564  0.213413
E          -0.006115  0.340535  0.080878  0.322099  0.411564  1.000000  0.526813
C           0.044363  0.489504  0.126554  0.056576  0.213413  0.526813  1.000000
```

As you can see, you should not conclude that. Take Social: it is only 0.168640 correlated to education, which in other words is a very low correlation. The same holds for Realistic and Artistic: very low correlation.

## Step 4: Visualize our findings

A way to visualize the data is by using the great integration with Matplotlib.

```import pandas as pd
import matplotlib.pyplot as plt

data = pd.read_csv('riasec.csv', delimiter='\t')

def sum_dimension(data, letter):
    return data[letter + '1'] + data[letter + '2'] + data[letter + '3'] + data[letter + '4'] + data[letter + '5'] + data[letter + '6'] + data[letter + '7'] + data[letter + '8']

data['R'] = sum_dimension(data, 'R')
data['I'] = sum_dimension(data, 'I')
data['A'] = sum_dimension(data, 'A')
data['S'] = sum_dimension(data, 'S')
data['E'] = sum_dimension(data, 'E')
data['C'] = sum_dimension(data, 'C')

view = data.loc[:,['education', 'R', 'I', 'A', 'S', 'E', 'C']]

view = view[view['education'] != 0]

edu = view.groupby('education').mean()
edu.index = ['< high school', 'high school', 'university', 'graduate']
edu.plot(kind='barh', figsize=(10,4))
plt.show()
```

Resulting in the following graph.

Finally, the correlation to education can be made similarly. Note that the education itself was kept to have a perspective of full correlation.

Continue to read how to explore the dataset in the next tutorial.

## What will we cover in this tutorial?

We will continue our journey to explore a big dataset of 145,000+ respondents to a RIASEC test. If you want to explore the full journey, we recommend you read this tutorial first.

In this tutorial we will find some data points that are not correct and a potential way to deal with it.

## Step 1: Explore the family sizes from the respondents

In the first tutorial we looked at how the respondents were distributed around the world. Surprisingly, most countries were represented.

In this we will explore the dataset further. The dataset is available here.

```import pandas as pd

# Only to get a broader summary
pd.set_option('display.max_rows', 300)
pd.set_option('display.max_columns', 30)
pd.set_option('display.width', 1000)

data = pd.read_csv('riasec.csv', delimiter='\t')
print(data)
```

Which will output the following.

```        R1  R2  R3  R4  R5  R6  R7  R8  I1  I2  I3  I4  I5  I6  I7  ...  gender  engnat  age  hand  religion  orientation  race  voted  married  familysize  uniqueNetworkLocation  country  source                major  Unnamed: 93
0        3   4   3   1   1   4   1   3   5   5   4   3   4   5   4  ...       1       1   14     1         7            1     1      2        1           1                      1       US       2                  NaN          NaN
1        1   1   2   4   1   2   2   1   5   5   5   4   4   4   4  ...       1       1   29     1         7            3     4      1        2           3                      1       US       1              Nursing          NaN
2        2   1   1   1   1   1   1   1   4   1   1   1   1   1   1  ...       2       1   23     1         7            1     4      2        1           1                      1       US       1                  NaN          NaN
3        3   1   1   2   2   2   2   2   4   1   2   4   3   2   3  ...       2       2   17     1         0            1     1      2        1           1                      1       CN       0                  NaN          NaN
4        4   1   1   2   1   1   1   2   5   5   5   3   5   5   5  ...       2       2   18     1         4            3     1      2        1           4                      1       PH       0            education          NaN
```

Scrolling through the output, I got curious about how family sizes vary around the world. This dataset is obviously not conclusive on that, but it could be interesting to see if there is any connection between where you are located in the world and family size.

## Step 2: Explore the distribution of family sizes

What often happens in a dataset is that it contains inaccurate data points.

To get a feeling of the data in the column familysize, you can explore it by running this.

```import pandas as pd

data = pd.read_csv('riasec.csv', delimiter='\t')

print(data['familysize'].describe())
print(pd.cut(data['familysize'], bins=[0,1,2,3,4,5,6,7,10,100, 1000000000]).value_counts())
```

Resulting in the following from the describe output.

```count    1.458280e+05
mean     1.255801e+05
std      1.612271e+07
min      0.000000e+00
25%      2.000000e+00
50%      3.000000e+00
75%      3.000000e+00
max      2.147484e+09
Name: familysize, dtype: float64
```

Where the mean value of family size is 125,580. Well, maybe we don’t count family size the same way, but something is wrong there.

Grouping the data into bins (by using the cut function combined with value_counts) you get this output.

```(1, 2]               51664
(2, 3]               38653
(3, 4]               18729
(0, 1]               15901
(4, 5]                8265
(5, 6]                3932
(6, 7]                1928
(7, 10]               1904
(10, 100]              520
(100, 1000000000]       23
Name: familysize, dtype: int64
```

Which indicates 23 families of size greater than 100. Let’s just investigate the sizes in that bucket.

```print(data[data['familysize'] > 100]['familysize'])
```

Giving us this output.

```1212      2147483647
3114      2147483647
5770      2147483647
8524             104
9701             103
21255     2147483647
24003            999
26247     2147483647
27782     2147483647
31451           9999
39294           9045
39298          84579
49033            900
54592            232
58773     2147483647
74745      999999999
78643            123
92457            999
95916            908
102680           666
109429           989
111488       9234785
120489          5000
120505     123456789
122580          5000
137141           394
139226          3425
140377           934
142870    2147483647
145686           377
145706           666
Name: familysize, dtype: int64
```

The integer 2147483647 is interesting as it is the maximum 32-bit positive integer. I think it is safe to say that most family sizes given above 100 are not realistic.
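The cut/value_counts pattern used above can be sketched on a tiny made-up series to see the (left, right] bucketing at work:

```python
import pandas as pd

# Made-up family sizes, including two implausible outliers
sizes = pd.Series([1, 2, 2, 3, 5, 12, 250])

# cut() buckets each value into (left, right] intervals;
# value_counts() then counts how many values fall in each bucket
counts = pd.cut(sizes, bins=[0, 2, 5, 10, 1000]).value_counts()
print(counts)
```

Note that the intervals are open on the left and closed on the right, so 2 lands in (0, 2] and 5 lands in (2, 5].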

## Step 3: Clean the data

You need to make a decision about these data points, which seem to skew the data in a misleading way.

Say you decide to visualize it without any adjustment: it would give a misrepresentative picture.

It seems like Iceland has a tradition for big families.

Let’s investigate that.

```print(data[data['country'] == 'IS']['familysize'])
```

Interestingly, it gives only one line that does not seem correct.

```74745     999999999
```

But as there are only a few respondents from Iceland, this single outlier makes its average the highest.

To clean the data fully, we can decide that family sizes above 10 are not correct. I know, that might be set a bit low, and you can choose to do something different.

Cleaning the data is simple.

```data = data[data['familysize'] < 10]
```

Magic, right? You simply write a condition, which is evaluated in a vectorized way, and keep only those rows of data that fulfill it.
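A sketch of that vectorized filtering on made-up values:

```python
import pandas as pd

# Made-up family sizes, two of them clearly bogus
data = pd.DataFrame({'familysize': [2, 3, 999, 4, 2147483647]})

# The comparison yields a boolean Series; indexing with it keeps
# only the rows where the condition holds
data = data[data['familysize'] < 10]
print(data['familysize'].tolist())  # [2, 3, 4]
```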

## Step 4: Visualize the data

We will use geopandas, matplotlib and pycountry to visualize it. The process is similar to the one in the previous tutorial, where you can find more details.

```import geopandas
import pandas as pd
import matplotlib.pyplot as plt
import pycountry

data = pd.read_csv('riasec.csv', delimiter='\t')
world = geopandas.read_file(geopandas.datasets.get_path("naturalearth_lowres"))

# Helper function to map country names to alpha_3 representation - though some are not known by the library
def lookup_country_code(country):
    try:
        return pycountry.countries.lookup(country).alpha_3
    except LookupError:
        return country

data['alpha3'] = data.apply(lambda row: lookup_country_code(row['country']), axis=1)
data = data[data['familysize'] < 10]

country_mean = data.groupby(['alpha3']).mean()

map = world.merge(country_mean, how='left', left_on=['iso_a3'], right_on=['alpha3'])
map.plot('familysize', figsize=(12,4), legend=True)
plt.show()
```

Resulting in the following output.

Looks like there is a one-child policy in China? Again, do not draw any conclusions from this data, as it is very narrow in this aspect.
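The left-join semantics of the merge above can be sketched with plain pandas frames (made-up values; here merging on the right frame's index instead of a column name):

```python
import pandas as pd

# Stand-ins for the map table and the grouped means (made-up values;
# the real 'world' frame comes from geopandas)
world = pd.DataFrame({'iso_a3': ['USA', 'CHN', 'ISL'],
                      'name': ['United States', 'China', 'Iceland']})
country_mean = pd.DataFrame({'familysize': [2.5, 1.8]},
                            index=pd.Index(['USA', 'CHN'], name='alpha3'))

# how='left' keeps every row of the map; countries without data
# simply get NaN in 'familysize'
merged = world.merge(country_mean, how='left',
                     left_on='iso_a3', right_index=True)
print(merged)
```

The left join is important for the map: a country with no respondents must stay on the map (with a NaN mean) rather than disappear from the plot.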

## What will we cover in this tutorial

We will explore a dataset with the Holland Code (RIASEC) Test, which is a test that should predict careers and vocational choices by rating questions.

In this part of the exploration, we first focus on loading the data and visualizing where the respondents come from. The dataset contains more than 145,000 responses.

## Step 1: First glance at the data

Let us first try to see what the data contains.

Reading the codebook (the file that comes with the dataset), you see it contains ratings for questions in the 6 categories of RIASEC. Then there are 3 elapsed times for the test.

There are ratings of the Ten Item Personality Inventory, then a self-assessment of whether they know 16 words. Finally, there is a list of metadata about the respondents, like where the respondent’s network was located (which is an indicator of where the respondent was located in most cases).

Other metadata can be seen explained here.

```education			"How much education have you completed?", 1=Less than high school, 2=High school, 3=University degree, 4=Graduate degree
urban				"What type of area did you live when you were a child?", 1=Rural (country side), 2=Suburban, 3=Urban (town, city)
gender				"What is your gender?", 1=Male, 2=Female, 3=Other
engnat				"Is English your native language?", 1=Yes, 2=No
age					"How many years old are you?"
hand				"What hand do you use to write with?", 1=Right, 2=Left, 3=Both
religion			"What is your religion?", 1=Agnostic, 2=Atheist, 3=Buddhist, 4=Christian (Catholic), 5=Christian (Mormon), 6=Christian (Protestant), 7=Christian (Other), 8=Hindu, 9=Jewish, 10=Muslim, 11=Sikh, 12=Other
orientation			"What is your sexual orientation?", 1=Heterosexual, 2=Bisexual, 3=Homosexual, 4=Asexual, 5=Other
race				"What is your race?", 1=Asian, 2=Arab, 3=Black, 4=Indigenous Australian / Native American / White, 5=Other (There was a coding error in the survey, and three different options were given the same value)
voted				"Have you voted in a national election in the past year?", 1=Yes, 2=No
married				"What is your marital status?", 1=Never married, 2=Currently married, 3=Previously married
familysize			"Including you, how many children did your mother have?"
major				"If you attended a university, what was your major (e.g. "psychology", "English", "civil engineering")?"

These values were also calculated for technical information:

uniqueNetworkLocation	1 if the record is the only one from its network location in the dataset, 2 if there are more than one record. There can be more than one record from the same network if for example that network is shared by a school etc, or it may be because of test retakes
country	The country of the network the user connected from
source	1=from Google, 2=from an internal link on the website, 0=from any other website or could not be determined
```

The first step would be to load the data into a DataFrame. If you are new to the Pandas DataFrame, we can recommend this tutorial.

```import pandas as pd

pd.set_option('display.max_rows', 300)
pd.set_option('display.max_columns', 10)
pd.set_option('display.width', 150)

data = pd.read_csv('riasec.csv', delimiter='\t')
print(data)
```

The pd.set_option calls are only there to help get a richer output, compared to a very small and narrow summary. The actual loading of the data is done by pd.read_csv(…).

Notice that we have renamed the csv file to riasec.csv. As it is a tab-separated csv, we need to pass the delimiter as an argument, since it is not using the default comma.

The output from the above code is.

```        R1  R2  R3  R4  R5  ...  uniqueNetworkLocation  country  source                major  Unnamed: 93
0        3   4   3   1   1  ...                      1       US       2                  NaN          NaN
1        1   1   2   4   1  ...                      1       US       1              Nursing          NaN
2        2   1   1   1   1  ...                      1       US       1                  NaN          NaN
3        3   1   1   2   2  ...                      1       CN       0                  NaN          NaN
4        4   1   1   2   1  ...                      1       PH       0            education          NaN
...     ..  ..  ..  ..  ..  ...                    ...      ...     ...                  ...          ...
145823   2   1   1   1   1  ...                      1       US       1        Communication          NaN
145824   1   1   1   1   1  ...                      1       US       1              Biology          NaN
145825   1   1   1   1   1  ...                      1       US       2                  NaN          NaN
145826   3   4   4   5   2  ...                      2       US       0                  yes          NaN
145827   2   4   1   4   2  ...                      1       US       1  Information systems          NaN
```

Interestingly, the dataset contains an unnamed last column with no data. That is because each line ends with a tab (\t) before the newline (\n).

We could clean that up, but as we are only interested in the country counts, we will ignore it in this tutorial.
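Both the tab delimiter and the trailing-tab artifact can be reproduced on an inline string (a toy stand-in for the real file):

```python
import pandas as pd
from io import StringIO

# A tiny inline stand-in for the tab-separated file; note the trailing
# tab on each line, which produces the extra unnamed column
text = "R1\tR2\tcountry\t\n3\t4\tUS\t\n1\t1\tCN\t\n"

data = pd.read_csv(StringIO(text), delimiter='\t')
print(data.columns.tolist())  # ['R1', 'R2', 'country', 'Unnamed: 3']
```

The empty field created by the trailing tab becomes an all-NaN column that pandas names after its position, just like the Unnamed: 93 column in the real dataset.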

## Step 3: Count the occurrences of each country

As said, we are only interested in this first tutorial on this dataset to get an idea of where the respondents come from in the world.

The data is located in the ‘country’ column of the DataFrame data.

To group the data, you can use groupby(), which returns a DataFrameGroupBy object. If you apply size() on that object, it returns a Series with the size of each group.

```print(data.groupby(['country']).size())
```

Where the first few lines are.

```country
AE        507
AF          8
AG          7
AL        116
AM         10
```

Hence, for each country we will have a count of how many respondents came from that country.
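As a side note, value_counts() gives you the same counts, just sorted by frequency instead of by country code. A small sketch with made-up data:

```python
import pandas as pd

# Toy stand-in for the 'country' column of the survey data.
data = pd.DataFrame({'country': ['US', 'US', 'CN', 'US', 'CN', 'PH']})

# groupby(...).size() counts occurrences, sorted by country code.
print(data.groupby(['country']).size())

# value_counts() gives the same counts, sorted by frequency instead.
print(data['country'].value_counts())
```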

## Step 4: Understand the map data we want to merge it with

To visualize the data, we need some way to have a map.

Here GeoPandas comes in handy. It contains a nice low-res map of the world you can use.

Let’s just explore that.

```import geopandas
import matplotlib.pyplot as plt

world = geopandas.read_file(geopandas.datasets.get_path("naturalearth_lowres"))
world.plot()
plt.show()
```

Which will make the following map.

This is too easy to be true. No, not really. This is the reality of Python.

We want to merge the data from our world map above with the counts for each country.

We need to see how to merge it. To do that, let us look at the data in world.

```world = geopandas.read_file(geopandas.datasets.get_path("naturalearth_lowres"))
print(world)
```

Where the first few lines are.

```        pop_est                continent                      name iso_a3   gdp_md_est                                           geometry
0        920938                  Oceania                      Fiji    FJI      8374.00  MULTIPOLYGON (((180.00000 -16.06713, 180.00000...
1      53950935                   Africa                  Tanzania    TZA    150600.00  POLYGON ((33.90371 -0.95000, 34.07262 -1.05982...
2        603253                   Africa                 W. Sahara    ESH       906.50  POLYGON ((-8.66559 27.65643, -8.66512 27.58948...
3      35623680            North America                    Canada    CAN   1674000.00  MULTIPOLYGON (((-122.84000 49.00000, -122.9742...
4     326625791            North America  United States of America    USA  18560000.00  MULTIPOLYGON (((-122.84000 49.00000, -120.0000...
```

The first problem arises here. In the other dataset we have 2-letter country codes, while this one uses 3-letter country codes.

## Step 5: Solving the merging problem

Luckily we can use a library called PyCountry.

Let’s add this 3-letter country code to our first dataset by using a lambda function. New to lambda functions? We recommend you read this tutorial.

```import pandas as pd
import pycountry

# Helper function to map country names to alpha_3 representation - though some are not known by the library
def lookup_country_code(country):
    try:
        return pycountry.countries.lookup(country).alpha_3
    except LookupError:
        return country

data['alpha3'] = data.apply(lambda row: lookup_country_code(row['country']), axis=1)
```

Basically, we add a new column called ‘alpha3’ to the dataset, containing the three-letter country code. We use apply, which takes a lambda function that calls our helper, which in turn calls the library.

The reason for this is that the pycountry.countries lookup sometimes raises a LookupError. We want our program to be robust to that.
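The try/except fallback is a general pattern, not specific to pycountry. Here is the same idea with a toy dict standing in for the library (a dict raises KeyError on a miss, which is a subclass of the LookupError we catch):

```python
# Toy stand-in for pycountry: a dict lookup raises KeyError,
# which is a subclass of LookupError.
CODES = {'US': 'USA', 'CN': 'CHN', 'PH': 'PHL'}

def lookup_country_code(country):
    try:
        return CODES[country]
    except LookupError:
        # Unknown values pass through unchanged, so one bad
        # row cannot crash the whole apply(...) call.
        return country

print(lookup_country_code('US'))   # 'USA'
print(lookup_country_code('XX'))   # falls back to 'XX'
```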

Now the data contains a column with the countries in 3 letters, like world.

We can now merge the data together. Remember that the data we want to merge needs to be counted on ‘alpha3’, and we also want to convert it to a DataFrame (as size() returns a Series).

```import geopandas
import pandas as pd
import pycountry

# Helper function to map country names to alpha_3 representation - though some are not known by the library
def lookup_country_code(country):
    try:
        return pycountry.countries.lookup(country).alpha_3
    except LookupError:
        return country

world = geopandas.read_file(geopandas.datasets.get_path("naturalearth_lowres"))

data['alpha3'] = data.apply(lambda row: lookup_country_code(row['country']), axis=1)

country_count = data.groupby(['alpha3']).size().to_frame()
country_count.columns = ['count']

map = world.merge(country_count, how='left', left_on=['iso_a3'], right_on=['alpha3'])
print(map)
```

The first few lines are given below.

```        pop_est                continent                      name iso_a3   gdp_md_est                                           geometry    count  \
0        920938                  Oceania                      Fiji    FJI      8374.00  MULTIPOLYGON (((180.00000 -16.06713, 180.00000...     12.0
1      53950935                   Africa                  Tanzania    TZA    150600.00  POLYGON ((33.90371 -0.95000, 34.07262 -1.05982...      9.0
2        603253                   Africa                 W. Sahara    ESH       906.50  POLYGON ((-8.66559 27.65643, -8.66512 27.58948...      NaN
3      35623680            North America                    Canada    CAN   1674000.00  MULTIPOLYGON (((-122.84000 49.00000, -122.9742...   7256.0
4     326625791            North America  United States of America    USA  18560000.00  MULTIPOLYGON (((-122.84000 49.00000, -120.0000...  80579.0
5      18556698                     Asia                Kazakhstan    KAZ    460700.00  POLYGON ((87.35997 49.21498, 86.59878 48.54918...     46.0
```

Notice that some countries do not have a count. Those are countries with no respondents.
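That is exactly what how='left' buys us: every row of world survives the merge, and count is simply NaN where no respondents matched. A minimal sketch of that behaviour with toy frames (the names and numbers below just mirror the sample output):

```python
import pandas as pd

world = pd.DataFrame({'iso_a3': ['FJI', 'ESH', 'USA'],
                      'name': ['Fiji', 'W. Sahara', 'United States']})
# groupby(...).size().to_frame() leaves 'alpha3' as the index,
# which merge can match on by its index level name.
country_count = pd.DataFrame({'count': [12, 80579]},
                             index=pd.Index(['FJI', 'USA'], name='alpha3'))

merged = world.merge(country_count, how='left',
                     left_on=['iso_a3'], right_on=['alpha3'])
print(merged)  # W. Sahara keeps its row, with count = NaN
```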

## Step 6: Ready to plot a world map

Now to the hard part, right?

Making a colorful map indicating the number of respondents in a given country.

```import geopandas
import pandas as pd
import matplotlib.pyplot as plt
import pycountry

# Helper function to map country names to alpha_3 representation - though some are not known by the library
def lookup_country_code(country):
    try:
        return pycountry.countries.lookup(country).alpha_3
    except LookupError:
        return country

world = geopandas.read_file(geopandas.datasets.get_path("naturalearth_lowres"))

data['alpha3'] = data.apply(lambda row: lookup_country_code(row['country']), axis=1)

country_count = data.groupby(['alpha3']).size().to_frame()
country_count.columns = ['count']

map = world.merge(country_count, how='left', left_on=['iso_a3'], right_on=['alpha3'])
map.plot('count', figsize=(10,3), legend=True)
plt.show()
```

It is easy. Just call plot(…) with the first argument being the column to use. I also changed the default figsize; you can play around with that. Finally, I added the legend.

Not really satisfying. The problem is that all countries but the USA have almost identical colors. Looking at the data, you will see that this is because there are so many respondents from the USA that the other countries end up at the bottom of the scale.

What to do? Use a log-scale.

You can actually do that directly in your DataFrame. Using the NumPy library, we can calculate a logarithmic scale.
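To see why this helps, compare a few raw counts from the sample output with their logarithms: on a linear scale the USA's 80579 dwarfs Fiji's 12, but on a log scale the same values span only roughly 2.5 to 11.3, so the colormap has room to differentiate the smaller countries.

```python
import numpy as np
import pandas as pd

# A few counts taken from the merged sample output above.
counts = pd.Series([12, 46, 7256, 80579],
                   index=['FJI', 'KAZ', 'CAN', 'USA'], name='count')

print(counts)
print(np.log(counts))  # the log scale compresses the huge spread
```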

See the magic.

```import geopandas
import pandas as pd
import matplotlib.pyplot as plt
import pycountry
import numpy as np

# Helper function to map country names to alpha_3 representation - though some are not known by the library
def lookup_country_code(country):
    try:
        return pycountry.countries.lookup(country).alpha_3
    except LookupError:
        return country

world = geopandas.read_file(geopandas.datasets.get_path("naturalearth_lowres"))

data['alpha3'] = data.apply(lambda row: lookup_country_code(row['country']), axis=1)

country_count = data.groupby(['alpha3']).size().to_frame()
country_count.columns = ['count']
country_count['log_count'] = np.log(country_count['count'])

map = world.merge(country_count, how='left', left_on=['iso_a3'], right_on=['alpha3'])
map.plot('log_count', figsize=(10,3), legend=True)
plt.show()
```

Where the new magic is adding the log_count column using np.log(country_count['count']).

Also notice that the plot is now done on ‘log_count’.