Pandas: Determine Correlation Between GDP and Stock Market

What will we cover in this tutorial?

In this tutorial we will explore some aspects of the Pandas-Datareader, which is an invaluable way to get data from many sources, including the World Bank and Yahoo! Finance.

In this tutorial we will investigate if the GDP of a country is correlated to the stock market.

Step 1: Get GDP data from World Bank

In the previous tutorial we looked at GDP per capita and compared it between countries. GDP per capita is a good way to compare the economies of different countries.

In this tutorial we will look at the GDP itself, using the NY.GDP.MKTP.CD indicator, which gives GDP in current US$.

We can extract the data by using the download function from the Pandas-datareader library.

from pandas_datareader import wb


gdp = wb.download(indicator='NY.GDP.MKTP.CD', country='US', start=1990, end=2019)

print(gdp)

Resulting in the following output.

                    NY.GDP.MKTP.CD
country       year                
United States 2019  21427700000000
              2018  20580223000000
              2017  19485393853000
              2016  18707188235000
              2015  18219297584000
              2014  17521746534000
              2013  16784849190000
              2012  16197007349000
              2011  15542581104000

Step 2: Gathering the stock index

Then we need to gather the data from the stock market. As we look at the US stock market, the S&P 500 index is a good indicator of the market.

The ticker of S&P 500 is ^GSPC (yes, with the ^).

The Yahoo! Finance API is a great place to collect this type of data.

import pandas_datareader as pdr
import datetime as dt


start = dt.datetime(1990, 1, 1)
end = dt.datetime(2019, 12, 31)
sp500 = pdr.get_data_yahoo("^GSPC", start, end)['Adj Close']
print(sp500)

Resulting in the following output.

Date
1990-01-02     359.690002
1990-01-03     358.760010
1990-01-04     355.670013
1990-01-05     352.200012
1990-01-08     353.790009
                 ...     
2019-12-24    3223.379883
2019-12-26    3239.909912
2019-12-27    3240.020020
2019-12-30    3221.290039
2019-12-31    3230.780029

Step 3: Visualizing the data on one plot

A good way to see if there is a correlation is simply by visualizing it.

This can be done with a few tweaks.

import pandas_datareader as pdr
import pandas as pd
import matplotlib.pyplot as plt
import datetime as dt
from pandas_datareader import wb


gdp = wb.download(indicator='NY.GDP.MKTP.CD', country='US', start=1990, end=2019)

gdp = gdp.unstack().T.reset_index(0)
gdp.index = pd.to_datetime(gdp.index, format='%Y')


start = dt.datetime(1990, 1, 1)
end = dt.datetime(2019, 12, 31)
sp500 = pdr.get_data_yahoo("^GSPC", start, end)['Adj Close']


data = sp500.to_frame().join(gdp, how='outer')
data = data.interpolate(method='linear')

ax = data['Adj Close'].plot()
ax = data['United States'].plot(ax=ax, secondary_y=True)

plt.show()

The GDP data needs to be reshaped first: we unstack it, transpose it, and reset the index. Then the index is converted from strings of years to an actual datetime index.

We use an outer join to get all the dates in the time series. Then we interpolate with a linear method to fill the gaps in the graph.

Finally, we plot the Adj Close of the S&P 500 index and the GDP of the United States on the same graph, with the GDP on a secondary y-axis. That means the two time series share the same x-axis.
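To see what the outer join and the linear interpolation actually do, here is a minimal sketch on toy data (the series and column names are made up for illustration):

import pandas as pd

# A daily series and a sparser series that only has values on two of the dates
daily = pd.Series([100.0, 101.0, 102.0],
                  index=pd.date_range('2000-01-01', periods=3), name='close')
sparse = pd.Series([7.0, 9.0],
                   index=pd.to_datetime(['2000-01-01', '2000-01-03']), name='gdp')

joined = daily.to_frame().join(sparse, how='outer')
print(joined.interpolate(method='linear'))
#             close  gdp
# 2000-01-01  100.0  7.0
# 2000-01-02  101.0  8.0
# 2000-01-03  102.0  9.0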

The resulting graph is.

US GDP with S&P 500 index

It does look like there could be a correlation; notice, for example, how both series dip in the aftermath of 2008.

Step 4: Calculate a correlation

Let’s try to make some correlation calculations.

First, let’s not rely only on how US GDP correlates with the US stock market. Let us bring in the GDP of other countries as well and see how they relate to the strongest economy in the world.

import pandas_datareader as pdr
import pandas as pd
import matplotlib.pyplot as plt
import datetime as dt
from pandas_datareader import wb


gdp = wb.download(indicator='NY.GDP.MKTP.CD', country=['NO', 'FR', 'US', 'GB', 'DK', 'DE', 'SE'], start=1990, end=2019)

gdp = gdp.unstack().T.reset_index(0)
gdp.index = pd.to_datetime(gdp.index, format='%Y')


start = dt.datetime(1990, 1, 1)
end = dt.datetime(2019, 12, 31)
sp500 = pdr.get_data_yahoo("^GSPC", start, end)['Adj Close']

data = sp500.to_frame().join(gdp, how='outer')
data = data.interpolate(method='linear')

print(data.corr())

Here we compare it with the GDP of a few more countries to test our hypothesis.

                Adj Close   Denmark    France   Germany    Norway    Sweden  United Kingdom  United States
Adj Close        1.000000  0.729701  0.674506  0.727289  0.653507  0.718829        0.759239       0.914303
Denmark          0.729701  1.000000  0.996500  0.986769  0.975780  0.978550        0.955674       0.926139
France           0.674506  0.996500  1.000000  0.982225  0.979767  0.974825        0.945877       0.893780
Germany          0.727289  0.986769  0.982225  1.000000  0.953131  0.972542        0.913443       0.916239
Norway           0.653507  0.975780  0.979767  0.953131  1.000000  0.978784        0.933795       0.878704
Sweden           0.718829  0.978550  0.974825  0.972542  0.978784  1.000000        0.930621       0.916530
United Kingdom   0.759239  0.955674  0.945877  0.913443  0.933795  0.930621        1.000000       0.915859
United States    0.914303  0.926139  0.893780  0.916239  0.878704  0.916530        0.915859       1.000000

Now that is interesting. The US stock market (Adj Close) correlates most strongly with the US GDP. Not surprising.

Of the chosen countries, the Danish GDP is the second most correlated with the US stock market. The GDPs of all the countries correlate strongly with the US GDP, with Norway correlating the least.

Continue the exploration of World Bank data.

Pandas: Read GDP per Capita From World Bank

What will we cover in this tutorial?

Introduction to the Pandas-Datareader, which is an invaluable way to get data from many sources, including the World Bank.

In this tutorial we will cover how to get yearly GDP per capita data for a set of countries and plot it.

Step 1: Get to know World Bank as a data source

The World Bank was founded in 1944 to make loans to low-income countries, with the purpose of decreasing poverty in the world (see wikipedia.org for further history).

What you might not know is that the World Bank has an amazing set of data that you can either browse on their webpage or access directly in Python using the Pandas-Datareader.

We will take a look at how to extract the NY.GDP.PCAP.KD indicator.

The what?

I know. The GDP per capita (constant 2010 US$), as it states on the webpage.

From World Bank.

On that page you can get the GDP per capita for each country in the world back to 1960.

That is what we are going to do.

Step 2: Get the data

Reading the Pandas-datareader World Bank documentation you come across the following function.

pandas_datareader.wb.download(country=None, indicator=None, start=2003, end=2005, freq=None, errors='warn', **kwargs)

Here you can set the country (or countries) and the indicator you want:

  • country (string or list of strings) – "all" downloads data for all countries; 2 or 3 character ISO country codes select individual countries (e.g. "US", "CA" or "USA", "CAN"). The codes can be mixed. The two ISO lists of countries, provided by wikipedia, are hardcoded into pandas as of 11/10/2014.
  • indicator (string or list of strings) – taken from the id field in WDIsearch()

Luckily we already have our indicator from Step 1 (NY.GDP.PCAP.KD). Then we just need to find some countries of interest.

Let’s take United States, France, Great Britain, Denmark and Norway.

from pandas_datareader import wb


dat = wb.download(indicator='NY.GDP.PCAP.KD', country=['US', 'FR', 'GB', 'DK', 'NO'], start=1960, end=2019)

print(dat)

Resulting in the following output.

                    NY.GDP.PCAP.KD
country       year                
Denmark       2019    65147.427182
              2018    63915.468361
              2017    62733.019808
              2016    61877.976481
              2015    60402.129248
...                            ...
United States 1964    19824.587845
              1963    18999.888387
              1962    18462.935998
              1961    17671.150187
              1960    17562.592084

[300 rows x 1 columns]

Step 3: Visualize the data on a graph

We need to restructure the data in order to make a nice graph.

This can be done with unstack.

from pandas_datareader import wb
import matplotlib.pyplot as plt


dat = wb.download(indicator='NY.GDP.PCAP.KD', country=['US', 'FR', 'GB', 'DK', 'NO'], start=1960, end=2019)

print(dat.unstack())

Which results in this output.

               NY.GDP.PCAP.KD                ...                            
year                     1960          1961  ...          2018          2019
country                                      ...                            
Denmark          20537.549556  21695.609308  ...  63915.468361  65147.427182
France           12743.925100  13203.320855  ...  43720.026351  44317.392315
Norway           23167.441740  24426.011426  ...  92119.522964  92556.321645
United Kingdom   13934.029831  14198.673562  ...  43324.049759  43688.437455
United States    17562.592084  17671.150187  ...  54795.450086  55809.007792

If we transpose this and reset the first level of the double index that results, then it should be in good shape to make a plot with.

from pandas_datareader import wb
import matplotlib.pyplot as plt


dat = wb.download(indicator='NY.GDP.PCAP.KD', country=['US', 'FR', 'GB', 'DK', 'NO'], start=1960, end=2019)

print(dat.unstack().T.reset_index(0))

dat.unstack().T.reset_index(0).plot()
plt.title('GDP per capita')
plt.show()

Giving this output, where you can see what the Transpose (T) does.

country         level_0       Denmark  ...  United Kingdom  United States
year                                   ...                               
1960     NY.GDP.PCAP.KD  20537.549556  ...    13934.029831   17562.592084
1961     NY.GDP.PCAP.KD  21695.609308  ...    14198.673562   17671.150187
1962     NY.GDP.PCAP.KD  22747.292463  ...    14233.959944   18462.935998
1963     NY.GDP.PCAP.KD  22712.577808  ...    14816.480305   18999.888387
1964     NY.GDP.PCAP.KD  24620.461432  ...    15535.026991   19824.587845
1965     NY.GDP.PCAP.KD  25542.173921  ...    15766.195724   20831.299767
1966     NY.GDP.PCAP.KD  26032.378816  ...    15926.169851   21930.591173
1967     NY.GDP.PCAP.KD  27256.322071  ...    16282.026160   22235.415708

And the plot gives the following output.

GDP per capita for 1960-2019.

Continue the exploration in the following tutorial.

Pandas: Explore Datasets by Visualization – Exploring the Holland Code (RIASEC) Test – Part IV

What will we cover in this tutorial?

We will continue our journey to explore a big dataset of 145,000+ respondents to a RIASEC test. If you want to explore the full journey, we recommend you read this tutorial first as well as the second part of the tutorial, and finally, the third part before continuing.

In this part we will investigate if we can see any correlation between the major of education and the 6 dimensions of the personality types in RIASEC.

Step 1: Group into major of educations

This is getting tricky, as the majors are typed in freely by the respondents, so we will miss some connections between them.

But let’s start by exploring them.

import pandas as pd


data = pd.read_csv('riasec.csv', delimiter='\t', low_memory=False)
major = data.loc[:,['major']]

print(major.groupby('major').size().sort_values(ascending=False))

The output is given here.

major
psychology                6861
Psychology                5763
English                   2342
Business                  2290
Biology                   1289
                          ... 
Sociology, Social work       1
Sociology, Psychology        1
Sociology, Math              1
Sociology, Linguistics       1
Nuerobiology                 1
Length: 15955, dtype: int64

Here we identify one problem: some respondents write in lowercase and others in uppercase.

Step 2: Clean up a few ambiguities

The first step would be to lowercase everything.

import pandas as pd


data = pd.read_csv('riasec.csv', delimiter='\t', low_memory=False)
major = data.loc[:,['major']]
major['major'] = major['major'].str.lower()
print(major.groupby('major').size().sort_values(ascending=False).iloc[:10])

Now we print the first 10 lines.

major
psychology          12766
business             3496
english              3042
nursing              2142
biology              1961
education            1800
engineering          1353
accounting           1186
computer science     1159
psychology           1098
dtype: int64

Here we notice that psychology appears both first and last. Inspecting it further, it seems the last one has a trailing space. Hence, we can try to strip the whitespace around all majors.

import pandas as pd
import numpy as np


data = pd.read_csv('riasec.csv', delimiter='\t', low_memory=False)
major = data.loc[:,['major']]
major['major'] = major['major'].str.lower()
major['major'] = major.apply(lambda row: row['major'].strip() if row['major'] is not np.nan else np.nan, axis=1)

print(major.groupby('major').size().sort_values(ascending=False).iloc[:10])

Now the output is as follows.

major
psychology          13878
business             3848
english              3240
nursing              2396
biology              2122
education            1954
engineering          1504
accounting           1292
computer science     1240
law                  1111
dtype: int64

This introduces law at the bottom of the list.

This process could continue, but let's keep the focus on the 10 most common majors in the dataset. Obviously, further cleanup could be done if we investigated it further.
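As a side note, pandas' vectorized string methods could do the same cleanup more compactly, since they skip NaN values automatically; a sketch:

# Equivalent lowercasing and stripping in one line (NaN values pass through)
major['major'] = major['major'].str.lower().str.strip()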

Step 3: See if education correlates to known words

First let’s explore the dataset a bit more. The respondents are asked if they know the definitions of the following words.

  • boat
  • incoherent
  • pallid
  • robot
  • audible
  • cuivocal
  • paucity
  • epistemology
  • florted
  • decide
  • pastiche
  • verdid
  • abysmal
  • lucid
  • betray
  • funny

The respondents mark each word they know. Hence, we can count the number of words each respondent knows and calculate an average per major group.

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np


data = pd.read_csv('riasec.csv', delimiter='\t', low_memory=False)

# Sum the 16 word-checklist columns VCL1..VCL16 per respondent
vcl_columns = ['VCL' + str(i) for i in range(1, 17)]
data['VCL'] = data[vcl_columns].sum(axis=1)

view = data.loc[:, ['VCL', 'major']]
view['major'] = view['major'].str.lower()
view['major'] = view.apply(lambda row: row['major'].strip() if row['major'] is not np.nan else np.nan, axis=1)


view = view.groupby('major').aggregate(['mean', 'count'])
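# Keep only majors with more than 1110 respondents (the ten largest groups found above)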
view = view[view['VCL','count'] > 1110]
view.loc[:,('VCL','mean')].plot(kind='barh', figsize=(14,5))
plt.show()

Which results in the following output.

Average number of the 16 words that each major knows.

The engineering majors seem to score lower than nursing. Well, I am actually surprised that computer science scores that high.

Step 4: Adding it all up together

Let’s use what we did in the previous tutorial and reuse the calculations from there.

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np


def sum_dimension(data, letter):
    # Sum the 8 question ratings of one RIASEC dimension, e.g. R1..R8
    return data[[letter + str(i) for i in range(1, 9)]].sum(axis=1)


data = pd.read_csv('riasec.csv', delimiter='\t', low_memory=False)
data['R'] = sum_dimension(data, 'R')
data['I'] = sum_dimension(data, 'I')
data['A'] = sum_dimension(data, 'A')
data['S'] = sum_dimension(data, 'S')
data['E'] = sum_dimension(data, 'E')
data['C'] = sum_dimension(data, 'C')
# Sum the 16 word-checklist columns VCL1..VCL16 per respondent
vcl_columns = ['VCL' + str(i) for i in range(1, 17)]
data['VCL'] = data[vcl_columns].sum(axis=1)

view = data.loc[:, ['R', 'I', 'A', 'S', 'E', 'C', 'VCL', 'major']]
view['major'] = view['major'].str.lower()
view['major'] = view.apply(lambda row: row['major'].strip() if row['major'] is not np.nan else np.nan, axis=1)


view = view.groupby('major').aggregate(['mean', 'count'])
view = view[view['VCL','count'] > 1110]
view.loc[:,[('R','mean'), ('I','mean'), ('A','mean'), ('S','mean'), ('E','mean'), ('C','mean')]].plot(kind='barh', figsize=(14,5))
plt.show()

Which results in the following diagram.

Correlation between major and RIASEC personality traits

Biology scores high on I (Investigative, people who prefer to work with data), while R (Realistic, people who like to work with things) is dominated by engineers and computer scientists.

Hmm… I should have noticed earlier that so many respondents have education as their major.

Pandas: Explore Datasets by Visualization – Exploring the Holland Code (RIASEC) Test – Part II

What will we cover in this tutorial?

We will continue our journey to explore a big dataset of 145,000+ respondents to a RIASEC test. If you want to explore the full journey, we recommend you read this tutorial first.

In this tutorial we will find some data points that are not correct and a potential way to deal with it.

Step 1: Explore the family sizes from the respondents

In the first tutorial we looked at how the respondents were distributed around the world. Surprisingly, most countries were represented.

From previous tutorial.

In this one we will explore the dataset further. The dataset is available here.

import pandas as pd

# Only to get a broader summary
pd.set_option('display.max_rows', 300)
pd.set_option('display.max_columns', 30)
pd.set_option('display.width', 1000)


data = pd.read_csv('riasec.csv', delimiter='\t', low_memory=False)
print(data)

Which will output the following.

        R1  R2  R3  R4  R5  R6  R7  R8  I1  I2  I3  I4  I5  I6  I7  ...  gender  engnat  age  hand  religion  orientation  race  voted  married  familysize  uniqueNetworkLocation  country  source                major  Unnamed: 93
0        3   4   3   1   1   4   1   3   5   5   4   3   4   5   4  ...       1       1   14     1         7            1     1      2        1           1                      1       US       2                  NaN          NaN
1        1   1   2   4   1   2   2   1   5   5   5   4   4   4   4  ...       1       1   29     1         7            3     4      1        2           3                      1       US       1              Nursing          NaN
2        2   1   1   1   1   1   1   1   4   1   1   1   1   1   1  ...       2       1   23     1         7            1     4      2        1           1                      1       US       1                  NaN          NaN
3        3   1   1   2   2   2   2   2   4   1   2   4   3   2   3  ...       2       2   17     1         0            1     1      2        1           1                      1       CN       0                  NaN          NaN
4        4   1   1   2   1   1   1   2   5   5   5   3   5   5   5  ...       2       2   18     1         4            3     1      2        1           4                      1       PH       0            education          NaN

Scrolling through the columns, I got curious about how family sizes vary around the world. This dataset obviously does not represent conclusive data on that, but it could be interesting to see if there is any connection between where you are located in the world and family size.

Step 2: Explore the distribution of family sizes

What often happens in a dataset is that it contains inaccurate data.

To get a feeling of the data in the column familysize, you can explore it by running this.

import pandas as pd


data = pd.read_csv('riasec.csv', delimiter='\t', low_memory=False)

print(data['familysize'].describe())
print(pd.cut(data['familysize'], bins=[0,1,2,3,4,5,6,7,10,100, 1000000000]).value_counts())

The describe output gives the following.

count    1.458280e+05
mean     1.255801e+05
std      1.612271e+07
min      0.000000e+00
25%      2.000000e+00
50%      3.000000e+00
75%      3.000000e+00
max      2.147484e+09
Name: familysize, dtype: float64

The mean family size is 125,580. Well, maybe we don't all count family size the same way, but something is clearly wrong there.

Grouping the data into bins (by using the cut function combined with value_counts) gives this output.

(1, 2]               51664
(2, 3]               38653
(3, 4]               18729
(0, 1]               15901
(4, 5]                8265
(5, 6]                3932
(6, 7]                1928
(7, 10]               1904
(10, 100]              520
(100, 1000000000]       23
Name: familysize, dtype: int64

This indicates at least 23 families of size greater than 100 (the handful of values above 1,000,000,000 fall outside the last bin). Let's investigate all the sizes above 100.

print(data[data['familysize'] > 100]['familysize'])

Giving us this output.

1212      2147483647
3114      2147483647
5770      2147483647
8524             104
9701             103
21255     2147483647
24003            999
26247     2147483647
27782     2147483647
31451           9999
39294           9045
39298          84579
49033            900
54592            232
58773     2147483647
74745      999999999
78643            123
92457            999
95916            908
102680           666
109429           989
111488       9234785
120489          5000
120505     123456789
122580          5000
137141           394
139226          3425
140377           934
142870    2147483647
145686           377
145706           666
Name: familysize, dtype: int64

The integer 2147483647 is interesting, as it is the maximum signed 32-bit integer. I think it is safe to say that the family sizes given above 100 are not realistic.
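A quick check confirms that value:

import numpy as np

# 2147483647 equals 2**31 - 1, the largest signed 32-bit integer
print(2**31 - 1)               # 2147483647
print(np.iinfo(np.int32).max)  # 2147483647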

Step 3: Clean the data

You need to make a decision about these data points, which skew your data badly.

Say you just decide to visualize it without any adjustment; it would give a misrepresentative picture.

Iceland? What’s up?

It seems like Iceland has a tradition of big families.

Let’s investigate that.

print(data[data['country'] == 'IS']['familysize'])

Interestingly, it gives only one line that does not seem correct.

74745     999999999

But as there are only a few respondents from Iceland, that single extreme value makes its average the highest.

To clean the data, we can decide that family sizes of 10 or more are not to be trusted. I know, that cutoff might be a bit low, and you can choose to do something different.

Cleaning the data is simple.

data = data[data['familysize'] < 10]

Magic, right? You simply write a condition, which is evaluated as a vectorized boolean mask, and keep only those rows of data that fulfill it.
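To make the mask explicit, here is a minimal sketch on toy data:

import pandas as pd

df = pd.DataFrame({'familysize': [2, 3, 250, 4]})
mask = df['familysize'] < 10   # a boolean Series: [True, True, False, True]
print(df[mask])                # keeps only the rows where the mask is True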

Step 4: Visualize the data

We will use geopandas, matplotlib and pycountry to visualize it. The process is similar to the one in the previous tutorial, where you can find more details.

import geopandas
import pandas as pd
import matplotlib.pyplot as plt
import pycountry

# Helper function to map country names to alpha_3 representation - though some are not known by library
def lookup_country_code(country):
    try:
        return pycountry.countries.lookup(country).alpha_3
    except LookupError:
        return country


data = pd.read_csv('riasec.csv', delimiter='\t', low_memory=False)


data['alpha3'] = data.apply(lambda row: lookup_country_code(row['country']), axis=1)
data = data[data['familysize'] < 10]

country_mean = data.groupby(['alpha3']).mean(numeric_only=True)  # average only the numeric columns

world = geopandas.read_file(geopandas.datasets.get_path("naturalearth_lowres"))
map = world.merge(country_mean, how='left', left_on=['iso_a3'], right_on=['alpha3'])
map.plot('familysize', figsize=(12,4), legend=True)
plt.show()

Resulting in the following output.

Family sizes of the respondents

Looks like there is a one-child policy in China? Again, do not draw any conclusions from this data, as it is very narrow in this respect.

Read the next part here:

Pandas: Explore Datasets by Visualization – Exploring the Holland Code (RIASEC) Test

What will we cover in this tutorial

We will explore a dataset from the Holland Code (RIASEC) Test, which is a test meant to predict career and vocational choices from rated questions.

In this part of the exploration, we first focus on loading the data and visualizing where the respondents come from. The dataset contains more than 145,000 responses.

You can download the dataset here.

Step 1: First glance at the data

Let us first try to see what the data contains.

Reading the codebook (the file that comes with the dataset) you see it contains ratings of questions in the 6 RIASEC categories. Then there are 3 elapsed times for the test.

There are ratings for the Ten Item Personality Inventory, then a self-assessment of whether they know 16 words, and finally some metadata, like where the respondent's network was located (which in most cases is an indicator of where the respondent was located).

The other metadata is explained here.

education			"How much education have you completed?", 1=Less than high school, 2=High school, 3=University degree, 4=Graduate degree
urban				"What type of area did you live when you were a child?", 1=Rural (country side), 2=Suburban, 3=Urban (town, city)
gender				"What is your gender?", 1=Male, 2=Female, 3=Other
engnat				"Is English your native language?", 1=Yes, 2=No
age					"How many years old are you?"
hand				"What hand do you use to write with?", 1=Right, 2=Left, 3=Both
religion			"What is your religion?", 1=Agnostic, 2=Atheist, 3=Buddhist, 4=Christian (Catholic), 5=Christian (Mormon), 6=Christian (Protestant), 7=Christian (Other), 8=Hindu, 9=Jewish, 10=Muslim, 11=Sikh, 12=Other
orientation			"What is your sexual orientation?", 1=Heterosexual, 2=Bisexual, 3=Homosexual, 4=Asexual, 5=Other
race				"What is your race?", 1=Asian, 2=Arab, 3=Black, 4=Indigenous Australian / Native American / White, 5=Other (There was a coding error in the survey, and three different options were given the same value)
voted				"Have you voted in a national election in the past year?", 1=Yes, 2=No
married				"What is your marital status?", 1=Never married, 2=Currently married, 3=Previously married
familysize			"Including you, how many children did your mother have?"		
major				"If you attended a university, what was your major (e.g. "psychology", "English", "civil engineering")?"


These values were also calculated for technical information:

uniqueNetworkLocation	1 if the record is the only one from its network location in the dataset, 2 if there are more than one record. There can be more than one record from the same network if for example that network is shared by a school etc, or it may be because of test retakes
country	The country of the network the user connected from
source	1=from Google, 2=from an internal link on the website, 0=from any other website or could not be determined

Step 2: Loading the data into a DataFrame (Pandas)

First step would be to load the data into a DataFrame. If you are new to Pandas DataFrame, we can recommend this tutorial.

import pandas as pd


pd.set_option('display.max_rows', 300)
pd.set_option('display.max_columns', 10)
pd.set_option('display.width', 150)

data = pd.read_csv('riasec.csv', delimiter='\t', low_memory=False)

print(data)

The pd.set_option calls are only there to get a richer output, compared to a very small and narrow summary. The actual loading of the data is done by pd.read_csv(…).

Notice that we have renamed the csv file to riasec.csv. As it is tab-separated, we need to pass the delimiter as an argument, since it is not the default comma.

The output from the above code is.

        R1  R2  R3  R4  R5  ...  uniqueNetworkLocation  country  source                major  Unnamed: 93
0        3   4   3   1   1  ...                      1       US       2                  NaN          NaN
1        1   1   2   4   1  ...                      1       US       1              Nursing          NaN
2        2   1   1   1   1  ...                      1       US       1                  NaN          NaN
3        3   1   1   2   2  ...                      1       CN       0                  NaN          NaN
4        4   1   1   2   1  ...                      1       PH       0            education          NaN
...     ..  ..  ..  ..  ..  ...                    ...      ...     ...                  ...          ...
145823   2   1   1   1   1  ...                      1       US       1        Communication          NaN
145824   1   1   1   1   1  ...                      1       US       1              Biology          NaN
145825   1   1   1   1   1  ...                      1       US       2                  NaN          NaN
145826   3   4   4   5   2  ...                      2       US       0                  yes          NaN
145827   2   4   1   4   2  ...                      1       US       1  Information systems          NaN

Interestingly, the dataset contains an unnamed last column with no data. That is because each line ends with a tab (\t) before the newline (\n).

We could clean that up, but as we are only interested in the country counts, we will ignore it in this tutorial.
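If you did want to clean it up, dropping the column (named Unnamed: 93 in the output above) would be a one-liner:

data = data.drop(columns=['Unnamed: 93'])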

Step 3: Count the occurrences of each country

As said, in this first tutorial on the dataset we are only interested in getting an idea of where in the world the respondents come from.

The data is located in the ‘country’ column of the DataFrame data.

To group the data, you can use groupby(), which will return a DataFrameGroupBy object. If you apply size() on that object, it will return a Series with the size of each group.

print(data.groupby(['country']).size())

Where the first few lines are.

country
AD          2
AE        507
AF          8
AG          7
AL        116
AM         10

Hence, for each country we will have a count of how many respondents came from that country.

Step 4: Understand the map data we want to merge it with

To visualize the data, we need some kind of map.

Here GeoPandas comes in handy. It ships with a nice low-res map of the world you can use.

Let’s just explore that.

import geopandas
import matplotlib.pyplot as plt

world = geopandas.read_file(geopandas.datasets.get_path("naturalearth_lowres"))
world.plot()
plt.show()

Which will make the following map.

World map using GeoPandas and Matplotlib

This is too easy to be true. No, not really. This is the reality of Python.

We want to merge the data from our world map above with the counts for each country.

To see how to merge them, let us look at the data in world.

world = geopandas.read_file(geopandas.datasets.get_path("naturalearth_lowres"))
print(world)

Where the first few lines are.

        pop_est                continent                      name iso_a3   gdp_md_est                                           geometry
0        920938                  Oceania                      Fiji    FJI      8374.00  MULTIPOLYGON (((180.00000 -16.06713, 180.00000...
1      53950935                   Africa                  Tanzania    TZA    150600.00  POLYGON ((33.90371 -0.95000, 34.07262 -1.05982...
2        603253                   Africa                 W. Sahara    ESH       906.50  POLYGON ((-8.66559 27.65643, -8.66512 27.58948...
3      35623680            North America                    Canada    CAN   1674000.00  MULTIPOLYGON (((-122.84000 49.00000, -122.9742...
4     326625791            North America  United States of America    USA  18560000.00  MULTIPOLYGON (((-122.84000 49.00000, -120.0000...

The first problem arises here. In the other dataset we have 2-letter country codes; this one uses 3-letter country codes.

Step 5: Solving the merging problem

Luckily we can use a library called PyCountry.

Let’s add the 3-letter country code to our first dataset by using a lambda function. A lambda? If you are new to lambda functions, we recommend you read this tutorial.

import pandas as pd
import pycountry


# Helper function to map country names to alpha_3 representation - though some are not known by library
def lookup_country_code(country):
    try:
        return pycountry.countries.lookup(country).alpha_3
    except LookupError:
        return country

data = pd.read_csv('riasec.csv', delimiter='\t', low_memory=False)

data['alpha3'] = data.apply(lambda row: lookup_country_code(row['country']), axis=1)

Basically, we add a new column called 'alpha3' to the dataset, containing the three-letter country code. We use the function apply, which takes the lambda function; the lambda in turn calls our helper function, which calls the library.

The reason for the helper is that the pycountry.countries lookup sometimes raises a LookupError. We want our program to be robust to that.

Now the data contains a column with the countries in 3-letter codes, like world.

We can now merge the data together. Remember that the counts we want to merge need to be grouped on 'alpha3', and we also want to convert the result to a DataFrame (as size() returns a Series).

import geopandas
import pandas as pd
import pycountry


# Helper function to map country names to alpha_3 representation - though some are not known by library
def lookup_country_code(country):
    try:
        return pycountry.countries.lookup(country).alpha_3
    except LookupError:
        return country


data = pd.read_csv('riasec.csv', delimiter='\t', low_memory=False)
data['alpha3'] = data.apply(lambda row: lookup_country_code(row['country']), axis=1)

country_count = data.groupby(['alpha3']).size().to_frame()
country_count.columns = ['count']

world = geopandas.read_file(geopandas.datasets.get_path("naturalearth_lowres"))
map = world.merge(country_count, how='left', left_on=['iso_a3'], right_on=['alpha3'])
print(map)

The first few lines are given below.

        pop_est                continent                      name iso_a3   gdp_md_est                                           geometry    count  \
0        920938                  Oceania                      Fiji    FJI      8374.00  MULTIPOLYGON (((180.00000 -16.06713, 180.00000...     12.0   
1      53950935                   Africa                  Tanzania    TZA    150600.00  POLYGON ((33.90371 -0.95000, 34.07262 -1.05982...      9.0   
2        603253                   Africa                 W. Sahara    ESH       906.50  POLYGON ((-8.66559 27.65643, -8.66512 27.58948...      NaN   
3      35623680            North America                    Canada    CAN   1674000.00  MULTIPOLYGON (((-122.84000 49.00000, -122.9742...   7256.0   
4     326625791            North America  United States of America    USA  18560000.00  MULTIPOLYGON (((-122.84000 49.00000, -120.0000...  80579.0   
5      18556698                     Asia                Kazakhstan    KAZ    460700.00  POLYGON ((87.35997 49.21498, 86.59878 48.54918...     46.0   

Notice that some countries do not have a count. Those are countries with no respondents.

Step 6: Ready to plot a world map

Now to the hard part, right?

Making a colorful map indicating the number of respondents in a given country.

import geopandas
import pandas as pd
import matplotlib.pyplot as plt
import pycountry
import numpy as np


# Helper function to map country names to alpha_3 representation - though some are not known by library
def lookup_country_code(country):
    try:
        return pycountry.countries.lookup(country).alpha_3
    except LookupError:
        return country


data = pd.read_csv('riasec.csv', delimiter='\t', low_memory=False)
data['alpha3'] = data.apply(lambda row: lookup_country_code(row['country']), axis=1)

country_count = data.groupby(['alpha3']).size().to_frame()
country_count.columns = ['count']

world = geopandas.read_file(geopandas.datasets.get_path("naturalearth_lowres"))
map = world.merge(country_count, how='left', left_on=['iso_a3'], right_on=['alpha3'])
map.plot('count', figsize=(10,3), legend=True)
plt.show()

It is easy. Just call plot(…) with the column to use as the first argument. I also changed the default figsize; you can play around with that. Finally, I added the legend.

The output

Not really satisfying. The problem is that all countries but the USA have almost identical colors. Looking at the data, you will see that this is because there are so many respondents in the USA that all other countries end up at the bottom of the scale.

What to do? Use a log-scale.

You can actually do that directly in your DataFrame. By using NumPy we can calculate the logarithm of the counts.

See the magic.

import geopandas
import pandas as pd
import matplotlib.pyplot as plt
import pycountry
import numpy as np


# Helper function to map country names to alpha_3 representation - though some are not known by library
def lookup_country_code(country):
    try:
        return pycountry.countries.lookup(country).alpha_3
    except LookupError:
        return country


data = pd.read_csv('riasec.csv', delimiter='\t', low_memory=False)
data['alpha3'] = data.apply(lambda row: lookup_country_code(row['country']), axis=1)

country_count = data.groupby(['alpha3']).size().to_frame()
country_count.columns = ['count']
country_count['log_count'] = np.log(country_count['count'])

world = geopandas.read_file(geopandas.datasets.get_path("naturalearth_lowres"))
map = world.merge(country_count, how='left', left_on=['iso_a3'], right_on=['alpha3'])
map.plot('log_count', figsize=(10,3), legend=True)
plt.show()

The new magic is to add the log_count column using np.log(country_count['count']).

Also notice that the plot is now done on ‘log_count’.

The final output.

Now you see more variety across the countries. Note that the "white" countries did not have any respondents.

Read the next exploration of the dataset here.


NumPy: Calculate the Julia Set with Vectorization

What will we cover in this tutorial?

In this tutorial you will learn what the Julia set is and understand how it is calculated. Also, how it translates into colorful images. In the process, we will learn how to utilize vectorization with NumPy arrays to achieve it.

Step 1: Understand the Julia set

Julia sets are closely connected to the Mandelbrot set. If you are new to the Mandelbrot set, we recommend you read this tutorial before you proceed, as it will make it easier to understand.


A Julia set can be calculated for a function f. Here we consider the function f_c(z) = z^2 + c for a complex number c, the same function that is used in the Mandelbrot set.

Recall that the Mandelbrot set is calculated by identifying, for a point c, whether the sequence f_c(0), f_c(f_c(0)), f_c(f_c(f_c(0))), …, of the function f_c(z) = z^2 + c diverges.

Said differently, for each point c on the complex plane, if the sequence does not diverge, then that point is in the Mandelbrot set.

The Julia set instead keeps c fixed and calculates the same kind of sequence for each z in the complex plane. That is, for each point z in the complex plane, if the sequence f_c(z), f_c(f_c(z)), f_c(f_c(f_c(z))), …, does not diverge, then z is part of the Julia set.

Step 2: Pseudo code for the non-vectorized computation of the Julia set

The best way to understand is often to see the non-vectorized method of computing the Julia set first.

As we consider the function f_c(z) = z^2 + c for our Julia set, we need to choose a complex number c. Note that choosing a different c gives a different Julia set.

Then we iterate over each point z in (a region of) the complex plane.

c = -0.8 + i*0.34
for x in [-1, 1] do:
  for y in [-1, 1] do:
    z = x + i*y
    N = 0
    while absolute(z) < 2 and N < MAX_ITERATIONS:
      z = z^2 + c
      N = N + 1
    set color for x,y to N

This provides beautiful color images of the Julia set.

Julia set generated from the implementation below.

Step 3: The vectorized computation using NumPy arrays

How does that translate into code using NumPy?

import numpy as np
import matplotlib.pyplot as plt


def julia_set(c=-0.4 + 0.6j, height=800, width=1000, x=0, y=0, zoom=1, max_iterations=100):
    # To make navigation easier we calculate these values
    x_width = 1.5
    y_height = 1.5*height/width
    x_from = x - x_width/zoom
    x_to = x + x_width/zoom
    y_from = y - y_height/zoom
    y_to = y + y_height/zoom

    # Here the actual algorithm starts
    x = np.linspace(x_from, x_to, width).reshape((1, width))
    y = np.linspace(y_from, y_to, height).reshape((height, 1))
    z = x + 1j * y

    # Broadcast the constant c to an array matching the shape of z
    c = np.full(z.shape, c)

    # To keep track in which iteration the point diverged
    div_time = np.zeros(z.shape, dtype=int)
    # To keep track on which points did not converge so far
    m = np.full(c.shape, True, dtype=bool)

    for i in range(max_iterations):
        z[m] = z[m]**2 + c[m]

        m[np.abs(z) > 2] = False

        div_time[m] = i
    return div_time


plt.imshow(julia_set(), cmap='magma')
# plt.imshow(julia_set(x=0.125, y=0.125, zoom=10), cmap='magma')
# plt.imshow(julia_set(c=-0.8j), cmap='magma')
# plt.imshow(julia_set(c=-0.8+0.156j, max_iterations=512), cmap='magma')
# plt.imshow(julia_set(c=-0.7269 + 0.1889j, max_iterations=256), cmap='magma')

plt.show()
Generated from the code above.
Generated from the code above.

NumPy: Compute Mandelbrot set by Vectorization

What will we cover in this tutorial?

  • Understand what the Mandelbrot set is and why it is so fascinating.
  • Master how to make images in multiple colors of the Mandelbrot set.
  • How to implement it using NumPy vectorization.

Step 1: What is Mandelbrot?

The Mandelbrot set is the set of complex numbers c for which the function f_c(z) = z^2 + c does not diverge when iterated from z=0 (from wikipedia).

Take a complex number, c, then you calculate the sequence for N iterations:

z_(n+1) = z_n^2 + c for n = 0, 1, …, N-1, starting from z_0 = 0

If absolute(z_N) < 2, then the sequence is said not to diverge and c is part of the Mandelbrot set.
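To make the definition concrete, here is a small sketch iterating the sequence by hand for two real values of c:

def iterate(c, n=5):
    # Iterate z -> z**2 + c starting from z = 0 and return the sequence
    z = 0
    seq = []
    for _ in range(n):
        z = z * z + c
        seq.append(z)
    return seq

print(iterate(1))   # [1, 2, 5, 26, 677] - diverges, so c=1 is not in the set
print(iterate(-1))  # [-1, 0, -1, 0, -1] - stays bounded, so c=-1 is in the set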

The Mandelbrot set lives in the complex plane, which can be colored according to whether each point is part of the set or not.

Mandelbrot set.

This only gives a black and white image of the complex plane, so the images are often made more colorful by coloring each point by the iteration number at which it diverged. That is, if z_4 diverged for a point in the complex plane, then that point is given color 4. That is how you end up with colorful maps like this.

Mandelbrot set (made by program from this tutorial).

Step 2: Understand the code of the non-vectorized approach to compute the Mandelbrot set

To better understand the images from the Mandelbrot set, think of the complex numbers as a diagram, where the real part of the complex number is on the x-axis and the imaginary part is on the y-axis (also called the Argand diagram).

Argand diagram

Then each point is a complex number c. That complex number will be given a color depending on the iteration at which it diverges (if it is not part of the Mandelbrot set).

Now the pseudocode for that should be easy to digest.

for x in [-2, 2] do:
  for y in [-1.5, 1.5] do:
    c = x + i*y
    z = 0
    N = 0
    while absolute(z) < 2 and N < MAX_ITERATIONS:
      z = z^2 + c
      N = N + 1
    set color for x,y to N

Simple enough to understand. That is some of the beauty of it. The simplicity.

Step 3: Make a vectorized version of the computations

Now that we understand the concepts behind it, we can translate them into a vectorized version. If you are new to vectorization we can recommend you read this tutorial first.

What do we achieve with vectorization? We compute on all the complex numbers simultaneously. To understand that, inspect the initialization of all the points here.

import numpy as np

def mandelbrot(height, width, x_from=-2, x_to=1, y_from=-1.5, y_to=1.5, max_iterations=100):
    x = np.linspace(x_from, x_to, width).reshape((1, width))
    y = np.linspace(y_from, y_to, height).reshape((height, 1))
    c = x + 1j * y

You see that we initialize all the x-coordinates at once using linspace. It creates an array of width evenly spaced numbers from x_from to x_to. The reshape is to fit the plane.

The same happens for y.

Then all the complex numbers are created in c = x + 1j*y, where 1j is Python's notation for the imaginary unit.
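Here is the same broadcasting idea on a tiny 2x3 grid, so you can see the complex plane being built:

import numpy as np

x = np.linspace(-1, 1, 3).reshape((1, 3))  # (1, 3) row of real parts
y = np.linspace(-1, 1, 2).reshape((2, 1))  # (2, 1) column of imaginary parts
print(x + 1j * y)                          # broadcasts to a (2, 3) complex grid
# [[-1.-1.j  0.-1.j  1.-1.j]
#  [-1.+1.j  0.+1.j  1.+1.j]]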

This leads us to the full implementation.

There are two things we need to keep track of in order to make a colorful Mandelbrot set: the iteration at which each point diverged, and which points have already diverged, so we stop updating them.

import numpy as np
import matplotlib.pyplot as plt


def mandelbrot(height, width, x=-0.5, y=0, zoom=1, max_iterations=100):
    # To make navigation easier we calculate these values
    x_width = 1.5
    y_height = 1.5*height/width
    x_from = x - x_width/zoom
    x_to = x + x_width/zoom
    y_from = y - y_height/zoom
    y_to = y + y_height/zoom

    # Here the actual algorithm starts
    x = np.linspace(x_from, x_to, width).reshape((1, width))
    y = np.linspace(y_from, y_to, height).reshape((height, 1))
    c = x + 1j * y

    # Initialize z to all zero
    z = np.zeros(c.shape, dtype=np.complex128)
    # To keep track in which iteration the point diverged
    div_time = np.zeros(z.shape, dtype=int)
    # To keep track on which points did not converge so far
    m = np.full(c.shape, True, dtype=bool)

    for i in range(max_iterations):
        z[m] = z[m]**2 + c[m]

        diverged = np.greater(np.abs(z), 2, out=np.full(c.shape, False), where=m) # Find diverging

        div_time[diverged] = i      # set the value of the diverged iteration number
        m[np.abs(z) > 2] = False    # to remember which have diverged
    return div_time


# Default image of Mandelbrot set
plt.imshow(mandelbrot(800, 1000), cmap='magma')
# The image below of Mandelbrot set
# plt.imshow(mandelbrot(800, 1000, -0.75, 0.0, 2, 200), cmap='magma')
# The second image below of the Mandelbrot set
# plt.imshow(mandelbrot(800, 1000, -1, 0.3, 20, 500), cmap='magma')
plt.show()

Notice that z[m] = z[m]**2 + c[m] only computes updates for the points that have not yet diverged.
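A tiny illustration of that masked update: only the entries where the mask is True change.

import numpy as np

z = np.array([1.0, 2.0, 3.0])
m = np.array([True, False, True])
z[m] = z[m] ** 2
print(z)  # [1. 2. 9.]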

I have added the following two images from the commented-out calls above (the call that is not commented out produces the image in the previous step).

Mandelbrot set from the program above.
Mandelbrot set from the code above.
Also check out the tutorial on Julia sets.

NumPy: How does Sexual Compulsivity Scale Correlate with Men, Women, or Age?

Background

According to wikipedia, the Sexual Compulsivity Scale (SCS) is a psychometric measure of high libido, hypersexuality, and sexual addiction. While it does not say anything about the score itself, it is based on people rating 10 questions from 1 to 4.

The questions are the following.

Q1. My sexual appetite has gotten in the way of my relationships.				
Q2. My sexual thoughts and behaviors are causing problems in my life.				
Q3. My desires to have sex have disrupted my daily life.				
Q4. I sometimes fail to meet my commitments and responsibilities because of my sexual behaviors.				
Q5. I sometimes get so horny I could lose control.				
Q6. I find myself thinking about sex while at work.				
Q7. I feel that sexual thoughts and feelings are stronger than I am.				
Q8. I have to struggle to control my sexual thoughts and behavior.				
Q9. I think about sex more than I would like to.				
Q10. It has been difficult for me to find sex partners who desire having sex as much as I want to.

The questions are rated as follows (1=Not at all like me, 2=Slightly like me, 3=Mainly like me, 4=Very much like me).

A dataset of more than 3,300 responses can be found here; it includes the individual rating of each question, the total score (the sum of the ratings), age, and gender.

Step 1: First inspection of the data.

Inspection of the data (CSV file)

The first question that pops into my mind is how men and women rate themselves differently. How can we efficiently figure that out?

Welcome to NumPy. It has a built-in csv reader that does all the hard work in the genfromtxt function.

import numpy as np

data = np.genfromtxt('scs.csv', delimiter=',', dtype='int')

# Skip first row as it has description
data = data[1:]

men = data[data[:,11] == 1]
women = data[data[:,11] == 2]

print("Men average", men.mean(axis=0))
print("Women average", women.mean(axis=0))

Dividing into men and women is easy with NumPy, as you can use a vectorized condition on the dataset. Men are coded with 1 and women with 2 in column 11 (the 12th column). Finally, a call to mean will do the rest.

Men average [ 2.30544662  2.2453159   2.23485839  1.92636166  2.17124183  3.06448802
  2.19346405  2.28496732  2.43660131  2.54204793 23.40479303  1.
 32.54074074]
Women average [ 2.30959164  2.18993352  2.19088319  1.95916429  2.38746439  3.13010446
  2.18518519  2.2991453   2.4985755   2.43969611 23.58974359  2.
 27.52611586]

Interestingly, according to this dataset (whose accuracy should be taken with a grain of salt, as 21% of the answers were not used) women score slightly higher on the SCS than men.

Men rate highest on the following question:

Q6. I find myself thinking about sex while at work.

While women rate highest on this question.

Q6. I find myself thinking about sex while at work.

The same. Also the lowest is the same for both genders.

Q4. I sometimes fail to meet my commitments and responsibilities because of my sexual behaviors.

Step 2: Visualize age vs score

I would guess that the SCS score decreases with age. Let’s see if that is the case.

Again, NumPy can do the magic easily, that is, prepare the data. To visualize it we use matplotlib, which is a comprehensive library for creating static, animated, and interactive visualizations in Python.

import numpy as np
import matplotlib.pyplot as plt

data = np.genfromtxt('scs.csv', delimiter=',', dtype='int')

# Skip first row as it has description
data = data[1:]

score = data[:,10]
age = data[:,12]
age[age > 100] = 0

plt.scatter(age, score, alpha=0.05)
plt.show()

Resulting in this plot.

Age vs SCS score.

It actually does not look like there is any correlation. Remember, more young people responded to the survey.

Let’s ask NumPy what it thinks about correlation here. Luckily, we can do that by calling the corrcoef function, which calculates the Pearson product-moment correlation coefficients.

print("Correlation of age and SCS score:", np.corrcoef(age, score))

Resulting in this output.

Correlation of age and SCS score:
[[1.         0.01046882]
 [0.01046882 1.        ]]

This says no correlation: values between 0.0 and 0.3 indicate a small correlation, and 0.01046882 is close to none. Does that mean that the SCS score does not correlate with age? That our SCS score is static through life?

I do not think we can conclude that based on this small dataset.

Step 3: Bar plot the distribution of scores

In the graph we plotted, it also looked like there was a close to even distribution of scores.

Let’s try to see. Here we need to count participants per score. NumPy falls a bit short here, but let's keep the good mood and use plain old Python lists.

import numpy as np
import matplotlib.pyplot as plt

data = np.genfromtxt('scs.csv', delimiter=',', dtype='int')

# Skip first row as it has description
data = data[1:]

scores = []
numbers = []
for i in range(10, 41):
    numbers.append(i)
    scores.append(data[data[:, 10] == i].shape[0])

plt.bar(numbers, scores)
plt.show()

Resulting in this bar plot.

Count participants by score.

We knew that the average score was around 23, which could suggest an even distribution. But the counts thin out at the far high end of the SCS score.
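As a side note, NumPy can produce the same counts without the Python loop by using np.unique; a small sketch reusing the data array from above:

import numpy as np
import matplotlib.pyplot as plt

# Count how many respondents got each total score (column 10)
values, counts = np.unique(data[:, 10], return_counts=True)
plt.bar(values, counts)
plt.show()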

For another great tutorial on NumPy check this one out, or learn some differences between NumPy and Pandas.

NumPy: Analyse Narcissistic Personality Indicator Numerical Dataset

What is the Narcissistic Personality Indicator and how does it connect to NumPy?

NumPy is an amazing library that makes analyzing data easy, especially numerical data.

In this tutorial we are going to analyze a survey with 11,000+ respondents from an interactive Narcissistic Personality Indicator (NPI) test.

Narcissism is a personality trait generally conceived of as excessive self-love. In Greek mythology, Narcissus was a man who fell in love with his own reflection in a pool of water.

https://openpsychometrics.org/tests/NPI/

The only connection between the NPI and NumPy is that we want to analyze the 11,000+ answers.

The dataset can be downloaded here; it consists of a comma-separated values file, or CSV file for short, and a description.

Step 1: Import the dataset and explore it

NumPy has us covered; loading the dataset (from the link above) is as simple as magic.

import numpy as np

# This magic line loads the 11,000+ lines of data into an ndarray
data = np.genfromtxt('data.csv', delimiter=',', dtype='int')

# Skip first row
data = data[1:]
print(data)

And we print out a summary.

[[ 18   2   2 ... 211   1  50]
 [  6   2   2 ... 149   1  40]
 [ 27   1   2 ... 168   1  28]
 ...
 [  6   1   2 ... 447   2  33]
 [ 12   2   2 ... 167   1  24]
 [ 18   1   2 ... 291   1  36]]

A good idea is to inspect it in a spreadsheet as well.

Spreadsheet

And the far end.

Spreadsheet

Oh, that end.

Then investigate the description that comes with the dataset (here is some of it).

For questions 1-40 which choice they chose was recorded per the following key.
... [The questions Q1 ... Q40]
...
gender. Chosen from a drop down list (1=male, 2=female, 3=other; 0=none was chosen).
age. Entered as a free response. Ages below 14 have been ommited from the dataset.

-- CALCULATED VALUES --
elapse. (time submitted)-(time loaded) of the questions page in seconds.
score. = ((int) $_POST['Q1'] == 1)
... [How it is calculated]

That means we have the score, the answers to the questions, the elapsed time to answer, gender, and age.

Reading a bit more, it says that a high score is an indicator of narcissistic traits, but one should not conclude from it alone that someone is a narcissist.

Step 2: Do men or women score the highest NPI?

I’m glad you asked.

import numpy as np

data = np.genfromtxt('data.csv', delimiter=',', dtype='int')

# Skip first row
data = data[1:]
# Extract all the NPI scores (first column)
npi_score = data[:,0]

print("Average score", npi_score.mean())
print("Men average", npi_score[data[:,42] == 1].mean())
print("Women average", npi_score[data[:,42] == 2].mean())
print("None average", npi_score[data[:,42] == 0].mean())
print("Other average", npi_score[data[:,42] == 3].mean())

Before looking at the result, see how nicely the first column is sliced out into the view npi_score. Then notice how easily you can calculate the mean, using a conditional to narrow the view.

Average score 13.29965311749533
Men average 14.195953307392996
Women average 12.081829626521191
None average 11.916666666666666
Other average 14.85

I guess you guessed it. Men score higher.

Step 3: Is there a correlation between age and NPI score?

I wonder about that too.

How can we figure that out? Wait, let’s ask our new friend NumPy.

import numpy as np
import matplotlib.pyplot as plt

data = np.genfromtxt('data.csv', delimiter=',', dtype='int')

# Skip first row
data = data[1:]
# Extract all the NPI scores (first column)
npi_score = data[:,0]
age = data[:,43]
# Some age values are not real, so we adjust them to 0
age[age>100] = 0

# Scatter plot them all with alpha=0.05
plt.scatter(age, npi_score, color='r', alpha=0.05)
plt.show()

Resulting in.

Plotting age vs NPI

That looks promising. But can we just conclude that younger people score higher NPI?

What if most respondents are young? That would make the picture denser in the younger end (15-30). The danger with judging by eye is jumping to conclusions.

Luckily, NumPy can help us there as well.

print(np.corrcoef(npi_score, age))

Resulting in.

Correlation of NPI score and age:
[[ 1.         -0.23414633]
 [-0.23414633  1.        ]]

What does that mean? Well, looking at the documentation of np.corrcoef():

Return Pearson product-moment correlation coefficients.

https://numpy.org/doc/stable/reference/generated/numpy.corrcoef.html

It shows a negative correlation, which means that the younger the respondent, the higher the NPI score. Values between 0.0 and -0.3 are considered a low correlation.

Is the Pearson product-moment correlation the correct one to use?

Step 4: (Optional) Let’s try to see if there is a correlation between NPI score and time elapsed

Same code, different column.

import numpy as np
import matplotlib.pyplot as plt


data = np.genfromtxt('data.csv', delimiter=',', dtype='int')

# Skip first row
data = data[1:]
# Extract all the NPI scores (first column)
npi_score = data[:,0]
elapse = data[:,41]
elapse[elapse > 2000] = 2000

# Scatter plot them all with alpha=0.05
plt.scatter(elapse, npi_score, color='r', alpha=0.05)
plt.show()

Resulting in.

Time elapsed in seconds and NPI score

Again, it is tempting to conclude something here. We need to remember that the mean value is around 13, hence, most data will be around there.

If we use the same calculation.

print("Correlation of NPI score and time elapse:")
print(np.corrcoef(npi_score, elapse))

Output.

Correlation of NPI score and time elapse:
[[1.        0.0147711]
 [0.0147711 1.       ]]

Hence, here there is close to no correlation.

Conclusion

Use the scientific tools to conclude. Do not rely on your eyes to determine whether there is a correlation.

The above gives an idea of how easy it is to work with numerical data in NumPy.

Deleting Elements of a Python List while Iterating

What will we cover in this tutorial?

  • Understand the challenge with deleting elements while iterating over a Python list.
  • How to delete elements from a Python list while iterating over it.

Step 1: What happens when you just delete elements from a Python list while iterating over it?

Let’s first try this simple example to understand the challenge of deleting elements from a Python list while iterating over it.

a = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
for e in a:
  a.remove(e)
print(a)

Looking at this piece of code, it seems intended to delete all elements. But that is not what happens. See, the output is.

[1, 3, 5, 7, 9]

Seems like every second element is deleted. Right?

Let’s try to understand that. When we enter the loop we see the following view.

for e (= 0, first element) in a (= [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]):
  a.remove(e)

Then the first element is removed on the second line, and the view becomes.

for e (= 0, first element) in a (= [1, 2, 3, 4, 5, 6, 7, 8, 9]):
  a.remove(e) (a = [1, 2, 3, 4, 5, 6, 7, 8, 9])

Going into the second iteration it looks like this.

for e (= 2, second element) in a (= [1, 2, 3, 4, 5, 6, 7, 8, 9]):
  a.remove(e)

Hence, we see that the iterator takes the second element, which now is the number 2.

This explains why every second number is deleted from the list.
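One quick fix, which we will come back to in Step 3, is to iterate over a copy of the list, so the removals do not affect the iterator:

a = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
for e in list(a):  # list(a) creates a shallow copy to iterate over
  a.remove(e)
print(a)  # []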

Step 2: What if we use index instead

Good idea. Let’s see what happens.

a = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
for i, e in enumerate(a):
  a.pop(i)
print(a)

Which results in the same.

[1, 3, 5, 7, 9]

What if we iterate directly over the indexes by using the length of the list?

a = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
for i in range(len(a)):
  a.pop(i)
print(a)

Oh, no.

Traceback (most recent call last):
  File "main.py", line 3, in <module>
    a.pop(i)
IndexError: pop index out of range

I get it. It is because len(a) is evaluated once, before the first iteration, and gives 10. Then when we reach i = 5, we have already popped 5 elements and have only 5 elements left. Hence, out of bounds.

Not convinced?

a = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
for i in range(len(a)):
  print(i, len(a), a)
  a.pop(i)
print(a)

Resulting in.

0 10 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
1 9 [1, 2, 3, 4, 5, 6, 7, 8, 9]
2 8 [1, 3, 4, 5, 6, 7, 8, 9]
3 7 [1, 3, 5, 6, 7, 8, 9]
4 6 [1, 3, 5, 7, 8, 9]
5 5 [1, 3, 5, 7, 9]
Traceback (most recent call last):
  File "main.py", line 4, in <module>
    a.pop(i)
IndexError: pop index out of range

But what to do?

Step 3: How to delete elements while iterating over a list

The problem we usually want to solve is not deleting all the elements. It is deleting entries based on their values or some condition, where we need to inspect the value of each element.

How can we do that?

By using a list comprehension or by making a copy. Or is that the same thing, since a list comprehension creates a new copy anyway?

Okay, one step at a time. Just see the following example.

a = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
a = [i for i in a if i % 2 == 0]
print(a)

Resulting in a copy of the original list with only the even elements.

[0, 2, 4, 6, 8]

To see it is a copy you can evaluate the following code.

a = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
b = a
a = [i for i in a if i % 2 == 0]
print(a)
print(b)

Resulting in the following, where you see that the variable a gets the new copy and the variable b still refers to the original (and unmodified) list.

[0, 2, 4, 6, 8]
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

Hence, the effect of the list comprehension construction above is as the following code shows.

a = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

# a = [i for i in a if i % 2 == 0]
c = []
for i in a:
    if i % 2 == 0:
        c.append(i)
a = c
print(a)

Giving you what you want.

[0, 2, 4, 6, 8]

Next steps

You can make the criteria more advanced by putting it in a function call.

def criteria(v):
  # some advanced code that returns True or False
  return v % 2 == 0  # placeholder example: keep the even numbers

a = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
a = [i for i in a if criteria(i)]

And if you want to keep state across all previous calls to the criteria, you can even use an object to store it.

class State:
  def __init__(self):
    self.seen = set()
  def criteria(self, v):
    # example state-based criteria: keep only the first occurrence of each value
    first = v not in self.seen
    self.seen.add(v)
    return first

s = State()
a = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
a = [i for i in a if s.criteria(i)]

Also, check out this tutorial that makes some observations on the performance of list comprehensions.