Pandas: Explore Datasets by Visualization – Exploring the Holland Code (RIASEC) Test

What will we cover in this tutorial?

We will explore a dataset from the Holland Code (RIASEC) Test, a test intended to predict career and vocational choices from how respondents rate a set of questions.

In this part of the exploration, we first focus on loading the data and visualizing where the respondents come from. The dataset contains more than 145,000 responses.

You can download the dataset here.

Step 1: First glance at the data

Let us first try to see what the data contains.

Reading the codebook (the file that comes with the dataset), you see it contains ratings of questions in the 6 RIASEC categories. Then there are 3 elapsed times for the test.

Next come the ratings of the Ten Item Personality Inventory, then a self-assessment of whether the respondent knows 16 words. Finally, there is a list of metadata on the respondents, like where the respondent's network was located (which in most cases is an indicator of where the respondent was located).

The other metadata is explained here.

education			"How much education have you completed?", 1=Less than high school, 2=High school, 3=University degree, 4=Graduate degree
urban				"What type of area did you live when you were a child?", 1=Rural (country side), 2=Suburban, 3=Urban (town, city)
gender				"What is your gender?", 1=Male, 2=Female, 3=Other
engnat				"Is English your native language?", 1=Yes, 2=No
age					"How many years old are you?"
hand				"What hand do you use to write with?", 1=Right, 2=Left, 3=Both
religion			"What is your religion?", 1=Agnostic, 2=Atheist, 3=Buddhist, 4=Christian (Catholic), 5=Christian (Mormon), 6=Christian (Protestant), 7=Christian (Other), 8=Hindu, 9=Jewish, 10=Muslim, 11=Sikh, 12=Other
orientation			"What is your sexual orientation?", 1=Heterosexual, 2=Bisexual, 3=Homosexual, 4=Asexual, 5=Other
race				"What is your race?", 1=Asian, 2=Arab, 3=Black, 4=Indigenous Australian / Native American / White, 5=Other (There was a coding error in the survey, and three different options were given the same value)
voted				"Have you voted in a national election in the past year?", 1=Yes, 2=No
married				"What is your marital status?", 1=Never married, 2=Currently married, 3=Previously married
familysize			"Including you, how many children did your mother have?"		
major				"If you attended a university, what was your major (e.g. "psychology", "English", "civil engineering")?"

These values were also calculated for technical information:
uniqueNetworkLocation	1 if the record is the only one from its network location in the dataset, 2 if there are more than one record. There can be more than one record from the same network if for example that network is shared by a school etc, or it may be because of test retakes
country	The country of the network the user connected from
source	1=from Google, 2=from an internal link on the website, 0=from any other website or could not be determined

Step 2: Loading the data into a DataFrame (Pandas)

First step would be to load the data into a DataFrame. If you are new to Pandas DataFrame, we can recommend this tutorial.

import pandas as pd

pd.set_option('display.max_rows', 300)
pd.set_option('display.max_columns', 10)
pd.set_option('display.width', 150)
data = pd.read_csv('riasec.csv', delimiter='\t', low_memory=False)
print(data)

The pd.set_option calls are only there to get a richer output than the default very small and narrow summary. The actual loading of the data is done by pd.read_csv(…).

Notice that we have renamed the csv file to riasec.csv. As it is a tab-separated file rather than one using the default comma, we need to pass the delimiter as an argument.

The output from the above code is.

        R1  R2  R3  R4  R5  ...  uniqueNetworkLocation  country  source                major  Unnamed: 93
0        3   4   3   1   1  ...                      1       US       2                  NaN          NaN
1        1   1   2   4   1  ...                      1       US       1              Nursing          NaN
2        2   1   1   1   1  ...                      1       US       1                  NaN          NaN
3        3   1   1   2   2  ...                      1       CN       0                  NaN          NaN
4        4   1   1   2   1  ...                      1       PH       0            education          NaN
...     ..  ..  ..  ..  ..  ...                    ...      ...     ...                  ...          ...
145823   2   1   1   1   1  ...                      1       US       1        Communication          NaN
145824   1   1   1   1   1  ...                      1       US       1              Biology          NaN
145825   1   1   1   1   1  ...                      1       US       2                  NaN          NaN
145826   3   4   4   5   2  ...                      2       US       0                  yes          NaN
145827   2   4   1   4   2  ...                      1       US       1  Information systems          NaN

Interestingly, the dataset contains an unnamed last column with no data. That is because it ends each line with a tab (\t) before new line (\n).

We could clean that up, but as we are only interested in the country counts, we will ignore it in this tutorial.
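
If you do want to clean it up, one possible approach is sketched below. It does not use the real file: the two rows are made up, and only the layout (tab-separated, with a trailing tab before each newline) mirrors what is described above.

```python
import io
import pandas as pd

# Two made-up rows in the same layout as the dataset:
# tab-separated, with a trailing tab before each newline.
raw = "R1\tR2\tcountry\t\n3\t4\tUS\t\n1\t1\tCN\t\n"

sample = pd.read_csv(io.StringIO(raw), sep="\t")
print(sample.columns.tolist())  # includes an extra, automatically named column

# Keep only the columns whose name does not start with "Unnamed"
cleaned = sample.loc[:, ~sample.columns.str.startswith("Unnamed")]
print(cleaned.columns.tolist())
```

The same column filter should work on the full DataFrame loaded from riasec.csv, assuming no real column name starts with "Unnamed".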

Step 3: Count the occurrences of each country

As said, in this first tutorial on the dataset we are only interested in getting an idea of where in the world the respondents come from.

The data is located in the ‘country’ column of the DataFrame data.

To group the data, you can use groupby(), which will return a DataFrameGroupBy object. If you apply size() to that object, it will return a Series with the size of each group.

print(data.groupby(['country']).size())

Where the first few lines are.

country
AD          2
AE        507
AF          8
AG          7
AL        116
AM         10

Hence, for each country we will have a count of how many respondents came from that country.
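
To see the counting in isolation, here is a minimal sketch on a made-up column of country codes (not the real dataset). It also shows value_counts(), which gives the same counts but sorted by frequency.

```python
import pandas as pd

# Made-up sample of country codes
sample = pd.DataFrame({"country": ["US", "US", "CN", "PH", "US", "CN"]})

# Group by country and count the rows in each group (sorted by country code)
by_group = sample.groupby("country").size()
print(by_group)

# Same counts, sorted with the most frequent country first
by_value = sample["country"].value_counts()
print(by_value)
```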

Step 4: Understand the map data we want to merge it with

To visualize the data, we need some way to have a map.

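Here GeoPandas comes in handy. It contains a nice low-res map of the world you can use. (Note that geopandas.datasets was deprecated in GeoPandas 0.14 and removed in 1.0, so the code below requires an older GeoPandas version.)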

Let’s just explore that.

import geopandas
import matplotlib.pyplot as plt
world = geopandas.read_file(geopandas.datasets.get_path("naturalearth_lowres"))
world.plot()
plt.show()

Which will make the following map.

World map using GeoPandas and Matplotlib

This is too easy to be true. No, not really. This is the reality of Python.

We want to merge the data from our world map above with the data of counts for each country.

We need to see how to merge it. To do that let us look at the data from world.

world = geopandas.read_file(geopandas.datasets.get_path("naturalearth_lowres"))
print(world)

Where the first few lines are.

        pop_est                continent                      name iso_a3   gdp_md_est                                           geometry
0        920938                  Oceania                      Fiji    FJI      8374.00  MULTIPOLYGON (((180.00000 -16.06713, 180.00000...
1      53950935                   Africa                  Tanzania    TZA    150600.00  POLYGON ((33.90371 -0.95000, 34.07262 -1.05982...
2        603253                   Africa                 W. Sahara    ESH       906.50  POLYGON ((-8.66559 27.65643, -8.66512 27.58948...
3      35623680            North America                    Canada    CAN   1674000.00  MULTIPOLYGON (((-122.84000 49.00000, -122.9742...
4     326625791            North America  United States of America    USA  18560000.00  MULTIPOLYGON (((-122.84000 49.00000, -120.0000...

The first problem arises here. In the other dataset we have 2 letter country codes, while this one uses 3 letter country codes.

Step 5: Solving the merging problem

Luckily we can use a library called PyCountry.

Let’s add this 3 letter country code to our first dataset by using a lambda function. A lambda? If you are new to lambda functions, we recommend you read this tutorial.

import pandas as pd
import pycountry

# Helper function to map country names to alpha_3 representation - though some are not known by library
def lookup_country_code(country):
    try:
        return pycountry.countries.lookup(country).alpha_3
    except LookupError:
        return country
data = pd.read_csv('riasec.csv', delimiter='\t', low_memory=False)
data['alpha3'] = data.apply(lambda row: lookup_country_code(row['country']), axis=1)

Basically, we add a new column called ‘alpha3’ to the dataset with the three letter country code. We use the function apply with a lambda function, which calls our helper function, which in turn calls the library.

The reason to do so is that the pycountry.countries lookup sometimes raises a lookup exception. We want our program to be robust to that.

Now the data contains a column with the countries in 3 letters, like world.
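
The fallback pattern can be tried without pycountry. The sketch below swaps the library for a small hand-written dictionary (ALPHA2_TO_ALPHA3 and to_alpha3 are hypothetical names, and the sample rows are made up). It also translates each distinct code only once and maps the result back, which may be quicker than a row-wise apply on a large dataset.

```python
import pandas as pd

# Hypothetical, hand-written excerpt of a 2-letter to 3-letter mapping,
# standing in for the pycountry lookup
ALPHA2_TO_ALPHA3 = {"US": "USA", "CN": "CHN", "PH": "PHL"}

def to_alpha3(code):
    """Return the 3-letter code, or the input unchanged if it is unknown."""
    try:
        return ALPHA2_TO_ALPHA3[code]
    except KeyError:
        return code

# Made-up sample; "XX" is deliberately not in the mapping
sample = pd.DataFrame({"country": ["US", "CN", "XX", "PH", "US"]})

# Translate each distinct code once, then map the result back to every row
lookup = {code: to_alpha3(code) for code in sample["country"].unique()}
sample["alpha3"] = sample["country"].map(lookup)
print(sample)
```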

We can now merge the data together. Remember that the data we want to merge needs to be counted per ‘alpha3’, and we also want to convert it to a DataFrame (as size() returns a Series).

import geopandas
import pandas as pd
import pycountry

# Helper function to map country names to alpha_3 representation - though some are not known by library
def lookup_country_code(country):
    try:
        return pycountry.countries.lookup(country).alpha_3
    except LookupError:
        return country

data = pd.read_csv('riasec.csv', delimiter='\t', low_memory=False)
data['alpha3'] = data.apply(lambda row: lookup_country_code(row['country']), axis=1)
country_count = data.groupby(['alpha3']).size().to_frame()
country_count.columns = ['count']
world = geopandas.read_file(geopandas.datasets.get_path("naturalearth_lowres"))
map = world.merge(country_count, how='left', left_on=['iso_a3'], right_on=['alpha3'])
print(map)

The first few lines are given below.

        pop_est                continent                      name iso_a3   gdp_md_est                                           geometry    count  \
0        920938                  Oceania                      Fiji    FJI      8374.00  MULTIPOLYGON (((180.00000 -16.06713, 180.00000...     12.0   
1      53950935                   Africa                  Tanzania    TZA    150600.00  POLYGON ((33.90371 -0.95000, 34.07262 -1.05982...      9.0   
2        603253                   Africa                 W. Sahara    ESH       906.50  POLYGON ((-8.66559 27.65643, -8.66512 27.58948...      NaN   
3      35623680            North America                    Canada    CAN   1674000.00  MULTIPOLYGON (((-122.84000 49.00000, -122.9742...   7256.0   
4     326625791            North America  United States of America    USA  18560000.00  MULTIPOLYGON (((-122.84000 49.00000, -120.0000...  80579.0   
5      18556698                     Asia                Kazakhstan    KAZ    460700.00  POLYGON ((87.35997 49.21498, 86.59878 48.54918...     46.0   

Notice that some countries do not have a count. Those are countries with no respondents (or whose country code was not matched in the merge).
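
The behaviour of the left merge can be seen on a tiny example. This is a sketch with plain DataFrames instead of the GeoDataFrame; the three map rows and the counts are copied from the output above, the frame names are made up.

```python
import pandas as pd

# Three rows standing in for the world map (no geometry column here)
world_sample = pd.DataFrame({
    "name": ["Fiji", "W. Sahara", "Canada"],
    "iso_a3": ["FJI", "ESH", "CAN"],
})

# Counts per 3-letter code; ESH is missing, USA is not on the small map
counts = pd.DataFrame({
    "alpha3": ["FJI", "CAN", "USA"],
    "count": [12, 7256, 80579],
})

# how='left' keeps every row of the map and fills missing counts with NaN
merged = world_sample.merge(counts, how="left", left_on="iso_a3", right_on="alpha3")
print(merged[["name", "iso_a3", "count"]])
```

Because NaN is a floating point value, the count column becomes a float column, which is why the counts above are printed as 12.0, 9.0 and so on.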

Step 6: Ready to plot a world map

Now to the hard part, right?

Making a colorful map indicating the number of respondents in a given country.

import geopandas
import pandas as pd
import matplotlib.pyplot as plt
import pycountry
import numpy as np

# Helper function to map country names to alpha_3 representation - though some are not known by library
def lookup_country_code(country):
    try:
        return pycountry.countries.lookup(country).alpha_3
    except LookupError:
        return country

data = pd.read_csv('riasec.csv', delimiter='\t', low_memory=False)
data['alpha3'] = data.apply(lambda row: lookup_country_code(row['country']), axis=1)
country_count = data.groupby(['alpha3']).size().to_frame()
country_count.columns = ['count']
world = geopandas.read_file(geopandas.datasets.get_path("naturalearth_lowres"))
map = world.merge(country_count, how='left', left_on=['iso_a3'], right_on=['alpha3'])
map.plot('count', figsize=(10,3), legend=True)
plt.show()

It is easy. Just call plot(…) with the first argument to be the column to use. I also change the default figsize, you can play around with that. Finally I add the legend.

The output

Not really satisfying. The problem is that all countries but the USA have almost identical colors. Looking at the data, you will see that this is because there are so many respondents in the USA that all other countries end up at the bottom of the scale.

What to do? Use a log-scale.

You can actually do that directly in your DataFrame. Using the NumPy library we can calculate a logarithmic scale.

See the magic.

import geopandas
import pandas as pd
import matplotlib.pyplot as plt
import pycountry
import numpy as np

# Helper function to map country names to alpha_3 representation - though some are not known by library
def lookup_country_code(country):
    try:
        return pycountry.countries.lookup(country).alpha_3
    except LookupError:
        return country

data = pd.read_csv('riasec.csv', delimiter='\t', low_memory=False)
data['alpha3'] = data.apply(lambda row: lookup_country_code(row['country']), axis=1)
country_count = data.groupby(['alpha3']).size().to_frame()
country_count.columns = ['count']
country_count['log_count'] = np.log(country_count['count'])
world = geopandas.read_file(geopandas.datasets.get_path("naturalearth_lowres"))
map = world.merge(country_count, how='left', left_on=['iso_a3'], right_on=['alpha3'])
map.plot('log_count', figsize=(10,3), legend=True)
plt.show()

The new magic is to add the log_count column using np.log(country_count[‘count’]).

Also notice that the plot is now done on ‘log_count’.
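
To get a feeling for why the logarithm helps, the sketch below compares where four of the counts from the merged output land on a linear and on a logarithmic scale, each expressed as a fraction of the largest value. The variable names are made up for this illustration.

```python
import numpy as np
import pandas as pd

# Counts taken from the merged output shown earlier
counts = pd.Series({"USA": 80579, "CAN": 7256, "KAZ": 46, "FJI": 12})

# Position on the color scale as a fraction of the maximum
linear_share = counts / counts.max()
log_share = np.log(counts) / np.log(counts.max())

print(pd.DataFrame({
    "count": counts,
    "linear": linear_share.round(3),
    "log": log_share.round(3),
}))
```

On the linear scale everything except the USA is squeezed towards zero, while the log scale spreads the countries out. Since every group produced by groupby has at least one row, np.log never receives a zero here.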

The final output.

Now you see more variety among the countries' respondents. Note that the “white” countries did not have any respondents.

Read the next exploration of the dataset here.

NumPy: Calculate the Julia Set with Vectorization

What will we cover in this tutorial?

In this tutorial you will learn what the Julia set is and understand how it is calculated. Also, how it translates into colorful images. In the process, we will learn how to utilize vectorization with NumPy arrays to achieve it.

Step 1: Understand the Julia set

Julia sets are closely connected to the Mandelbrot set. If you are new to the Mandelbrot set, we recommend you read this tutorial before you proceed, as it will make it easier to understand.

Julia sets can be calculated for a function f. If we consider the function f_c(z) = z^2 + c, for a complex number c, then this function is used in the Mandelbrot set.

Recall that the Mandelbrot set is calculated by checking, for each point c, whether the sequence f_c(0), f_c(f_c(0)), f_c(f_c(f_c(0))), … does not diverge, where f_c(z) = z^2 + c.

Said differently, for each point c on the complex plane, if the sequence does not diverge, then that point is in the Mandelbrot set.

The Julia set has c fixed and calculates the same kind of sequence for each starting point z in the complex plane. That is, for each point z in the complex plane, if the sequence f_c(z), f_c(f_c(z)), f_c(f_c(f_c(z))), … does not diverge, the point is part of the Julia set (strictly speaking, the filled Julia set).

Step 2: Pseudo code for Julia set of non-vectorization computation

The best way to understand it is often to first see the non-vectorized method to compute the Julia set.

As we consider the function f_c(z) = z^2 + c for our Julia set, we need to choose a complex number for c. Note that c can be set differently to get another Julia set.

Then we can iterate over each point z in the complex plane.

c = -0.8 + i*0.34
for x in [-1, 1] do:
  for y in [-1, 1] do:
    z = x + i*y
    N = 0
    while absolute(z) < 2 and N < MAX_ITERATIONS:
      z = z^2 + c
      N = N + 1
    set color for x,y to N
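
As a runnable counterpart to the pseudocode, here is a minimal sketch in plain Python. The function names, the 5x5 grid and the printed layout are choices made for this illustration only; a real image would use far more points.

```python
def julia_escape_time(z, c, max_iterations=100):
    """Count how many iterations z = z^2 + c takes before |z| reaches 2."""
    n = 0
    while abs(z) < 2 and n < max_iterations:
        z = z * z + c
        n += 1
    return n

def julia_grid(c, size=5, extent=1.0, max_iterations=100):
    """Escape times on a size x size grid covering [-extent, extent] on both axes."""
    step = 2 * extent / (size - 1)
    return [
        [julia_escape_time(complex(-extent + col * step, -extent + row * step),
                           c, max_iterations)
         for col in range(size)]
        for row in range(size)
    ]

# Same constant c as in the pseudocode
for row in julia_grid(c=-0.8 + 0.34j):
    print(row)
```

Each printed number plays the role of the color for that point. A value equal to max_iterations means the point did not escape within the iteration budget.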

This provides beautiful color images of the Julia set.

Julia set generated from the implementation below.

Step 3: The vectorization computation using NumPy arrays

How does that translate into code using NumPy?

import numpy as np
import matplotlib.pyplot as plt

def julia_set(c=-0.4 + 0.6j, height=800, width=1000, x=0, y=0, zoom=1, max_iterations=100):
    # To make navigation easier we calculate these values
    x_width = 1.5
    y_height = 1.5*height/width
    x_from = x - x_width/zoom
    x_to = x + x_width/zoom
    y_from = y - y_height/zoom
    y_to = y + y_height/zoom
    # Here the actual algorithm starts
    x = np.linspace(x_from, x_to, width).reshape((1, width))
    y = np.linspace(y_from, y_to, height).reshape((height, 1))
    z = x + 1j * y
    # Broadcast the constant c to the same shape as z
    c = np.full(z.shape, c)
    # To keep track in which iteration the point diverged
    div_time = np.zeros(z.shape, dtype=int)
    # To keep track of which points have not diverged so far
    m = np.full(c.shape, True, dtype=bool)
    for i in range(max_iterations):
        z[m] = z[m]**2 + c[m]
        m[np.abs(z) > 2] = False
        div_time[m] = i
    return div_time

plt.imshow(julia_set(), cmap='magma')
# plt.imshow(julia_set(x=0.125, y=0.125, zoom=10), cmap='magma')
# plt.imshow(julia_set(c=-0.8j), cmap='magma')
# plt.imshow(julia_set(c=-0.8+0.156j, max_iterations=512), cmap='magma')
# plt.imshow(julia_set(c=-0.7269 + 0.1889j, max_iterations=256), cmap='magma')
plt.show()
Generated from the code above.
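
The core of the vectorized version is the boolean mask that restricts each update to the points that have not yet diverged. The sketch below runs the same three statements on just three hand-picked starting points, so the arrays are small enough to print and inspect. The names active and escape_iteration are made up here and play the roles of m and div_time in the function above.

```python
import numpy as np

c = -0.4 + 0.6j
# Three hand-picked starting points; the second one starts outside radius 2
z = np.array([0.0 + 0.0j, 1.5 + 1.5j, 0.3 - 0.2j])

active = np.ones(z.shape, dtype=bool)          # points that have not diverged yet
escape_iteration = np.zeros(z.shape, dtype=int)

for i in range(20):
    z[active] = z[active] ** 2 + c             # update only the active points
    active[np.abs(z) > 2] = False              # switch off points that diverged
    escape_iteration[active] = i               # survivors record the latest iteration

print(active)
print(escape_iteration)
```

Once a point is switched off, its z value and its recorded iteration are no longer touched, so no further work is spent on it.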

NumPy: Compute Mandelbrot set by Vectorization

What will we cover in this tutorial?

  • Understand what the Mandelbrot set is and why it is so fascinating.
  • Master how to make images in multiple colors of the Mandelbrot set.
  • How to implement it using NumPy vectorization.

Step 1: What is Mandelbrot?

The Mandelbrot set is the set of complex numbers c for which the function f(z) = z^2 + c does not diverge when iterated from z=0 (from Wikipedia).

Take a complex number, c, then you calculate the sequence for N iterations:

z_0 = 0, z_(n+1) = z_n^2 + c for n = 0, 1, …, N-1

If absolute(z_n) stays below 2 for all of these iterations, the point is said not to diverge and is treated as part of the Mandelbrot set.

The Mandelbrot set is drawn on the complex plane by coloring each point according to whether or not it is part of the set.

Mandelbrot set.

This only gives a black and white image of the complex plane, so the images are often made more colorful by coloring each point by the iteration number at which it diverged. That is, if z_4 diverged for a point in the complex plane, then it is given the color 4. That is how you end up with colorful maps like this.

Mandelbrot set (made by program from this tutorial).

Step 2: Understand the code of the non-vectorized approach to compute the Mandelbrot set

To better understand the images from the Mandelbrot set, think of the complex numbers as a diagram, where the real part of the complex number is x-axis and the imaginary part is y-axis (also called the Argand diagram).

Argand diagram

Then each point is a complex number c. That complex number will be given a color depending on which iteration it diverges (if it is not part of the Mandelbrot set).

Now the pseudocode for that should be easy to digest.

for x in [-2, 2] do:
  for y in [-1.5, 1.5] do:
    c = x + i*y
    z = 0
    N = 0
    while absolute(z) < 2 and N < MAX_ITERATIONS:
      z = z^2 + c
      N = N + 1
    set color for x,y to N
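
The inner loop of the pseudocode translates almost line by line into Python. The following is a minimal sketch for a single point c; the function name is made up for this illustration.

```python
def mandelbrot_escape_time(c, max_iterations=100):
    """Iterate z = z^2 + c from z = 0 and count the steps until |z| reaches 2."""
    z = 0j
    n = 0
    while abs(z) < 2 and n < max_iterations:
        z = z * z + c
        n += 1
    return n

print(mandelbrot_escape_time(0j))       # 100: never leaves, part of the set
print(mandelbrot_escape_time(-1 + 0j))  # 100: cycles between 0 and -1
print(mandelbrot_escape_time(1 + 0j))   # 2: the sequence 0, 1, 2 reaches radius 2
print(mandelbrot_escape_time(2 + 0j))   # 1: reaches radius 2 after one step
```

Points that return max_iterations are treated as part of the set. Every other value is the iteration number used as the color.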

Simple enough to understand. That is some of the beauty of it. The simplicity.

Step 3: Make a vectorized version of the computations

Now that we understand the concepts behind it, we should translate that into a vectorized version. If you are new to vectorization we can recommend you read this tutorial first.

What do we achieve with vectorization? That we compute all the complex numbers simultaneously. To understand that inspect the initialization of all the points here.

import numpy as np
def mandelbrot(height, width, x_from=-2, x_to=1, y_from=-1.5, y_to=1.5, max_iterations=100):
    x = np.linspace(x_from, x_to, width).reshape((1, width))
    y = np.linspace(y_from, y_to, height).reshape((height, 1))
    c = x + 1j * y

You see that we initialize all the x-coordinates at once using linspace. It will create an array with width evenly spaced numbers from x_from to x_to. The reshape is to fit the plane.

The same happens for y.

Then all the complex numbers are created in c = x + 1j*y, where 1j is the imaginary unit.
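
A very small grid makes the reshaping and the broadcasting visible. This sketch uses a made-up width of 3 and height of 2 so the whole array fits on screen.

```python
import numpy as np

width, height = 3, 2

# A row vector of x-coordinates and a column vector of y-coordinates
x = np.linspace(-2, 1, width).reshape((1, width))    # shape (1, 3)
y = np.linspace(-1, 1, height).reshape((height, 1))  # shape (2, 1)

# Broadcasting combines every x with every y into a (height, width) grid
c = x + 1j * y
print(c.shape)
print(c)
```

Each row of c shares one imaginary part and each column shares one real part, so c holds one complex number per pixel of the final image.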

This leads us to the full implementation.

There are two things we need to keep track of in order to make a colorful Mandelbrot set. First, in which iteration each point diverged. Second, to achieve that, we need to remember which points have already diverged.

import numpy as np
import matplotlib.pyplot as plt

def mandelbrot(height, width, x=-0.5, y=0, zoom=1, max_iterations=100):
    # To make navigation easier we calculate these values
    x_width = 1.5
    y_height = 1.5*height/width
    x_from = x - x_width/zoom
    x_to = x + x_width/zoom
    y_from = y - y_height/zoom
    y_to = y + y_height/zoom
    # Here the actual algorithm starts
    x = np.linspace(x_from, x_to, width).reshape((1, width))
    y = np.linspace(y_from, y_to, height).reshape((height, 1))
    c = x + 1j * y
    # Initialize z to all zero
    z = np.zeros(c.shape, dtype=np.complex128)
    # To keep track in which iteration the point diverged
    div_time = np.zeros(z.shape, dtype=int)
    # To keep track of which points have not diverged so far
    m = np.full(c.shape, True, dtype=bool)
    for i in range(max_iterations):
        z[m] = z[m]**2 + c[m]
        diverged = np.greater(np.abs(z), 2, out=np.full(c.shape, False), where=m) # Find diverging
        div_time[diverged] = i      # set the value of the diverged iteration number
        m[np.abs(z) > 2] = False    # to remember which have diverged
    return div_time

# Default image of Mandelbrot set
plt.imshow(mandelbrot(800, 1000), cmap='magma')
# The image below of Mandelbrot set
# plt.imshow(mandelbrot(800, 1000, -0.75, 0.0, 2, 200), cmap='magma')
# The second image below of Mandelbrot set
# plt.imshow(mandelbrot(800, 1000, -1, 0.3, 20, 500), cmap='magma')
plt.show()

Notice that z[m] = z[m]**2 + c[m] only computes updates on values that are still not diverged.

I have added the two images produced by the commented-out calls above (the image from the call that is not commented out is shown in the previous step).

Mandelbrot set from the program above.
Mandelbrot set from the code above.
Also check out the tutorial on Julia sets.

NumPy: How does Sexual Compulsivity Scale Correlate with Men, Women, or Age?

Background

According to Wikipedia, the Sexual Compulsivity Scale (SCS) is a psychometric measure of high libido, hypersexuality, and sexual addiction. While it does not say anything about how to interpret the score itself, the score is based on people rating 10 statements from 1 to 4.

The questions are the following.

Q1. My sexual appetite has gotten in the way of my relationships.				
Q2. My sexual thoughts and behaviors are causing problems in my life.				
Q3. My desires to have sex have disrupted my daily life.				
Q4. I sometimes fail to meet my commitments and responsibilities because of my sexual behaviors.				
Q5. I sometimes get so horny I could lose control.				
Q6. I find myself thinking about sex while at work.				
Q7. I feel that sexual thoughts and feelings are stronger than I am.				
Q8. I have to struggle to control my sexual thoughts and behavior.				
Q9. I think about sex more than I would like to.				
Q10. It has been difficult for me to find sex partners who desire having sex as much as I want to.

The questions are rated as follows (1=Not at all like me, 2=Slightly like me, 3=Mainly like me, 4=Very much like me).

A dataset of more than 3,300 responses can be found here, which includes the individual rating of each question, the total score (the sum of ratings), age and gender.

Step 1: First inspection of the data.

Inspection of the data (CSV file)

The first question that pops into my mind is how men and women rate themselves differently. How can we efficiently figure that out?

Welcome to NumPy. It has a built-in csv reader that does all the hard work in the genfromtxt function.

import numpy as np
data = np.genfromtxt('scs.csv', delimiter=',', dtype='int')
# Skip first row as it has description
data = data[1:]
men = data[data[:,11] == 1]
women = data[data[:,11] == 2]
print("Men average", men.mean(axis=0))
print("Women average", women.mean(axis=0))

Dividing into men and women is easy with NumPy, as you can make a vectorized conditional inside the dataset. Men are coded with 1 and women with 2 in column 11 (the 12th column). Finally, a call to mean will do the rest.

Men average [ 2.30544662  2.2453159   2.23485839  1.92636166  2.17124183  3.06448802
  2.19346405  2.28496732  2.43660131  2.54204793 23.40479303  1.
 32.54074074]
Women average [ 2.30959164  2.18993352  2.19088319  1.95916429  2.38746439  3.13010446
  2.18518519  2.2991453   2.4985755   2.43969611 23.58974359  2.
 27.52611586]
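
The row filtering can be tried on a tiny made-up array. In this sketch the columns are reduced to score, gender and age, so the gender column is index 1 instead of 11.

```python
import numpy as np

# Made-up rows: [score, gender, age], with gender coded 1=male, 2=female
sample = np.array([
    [20, 1, 30],
    [30, 2, 25],
    [24, 1, 40],
    [26, 2, 35],
])

# The comparison gives one boolean per row; indexing with it keeps the True rows
men = sample[sample[:, 1] == 1]
women = sample[sample[:, 1] == 2]

print("Men average", men.mean(axis=0))      # [22.  1. 35.]
print("Women average", women.mean(axis=0))  # [28.  2. 30.]
```

With axis=0 the mean is taken down each column, which is why the gender column averages to exactly 1 and 2 in the output above as well.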

Interestingly, according to this dataset (whose accuracy should be taken with care, as 21% of the answers were not used), women score slightly higher on the SCS than men.

Men rate highest on the following question:

Q6. I find myself thinking about sex while at work.

While women rate highest on this question.

Q6. I find myself thinking about sex while at work.

The same. Also the lowest is the same for both genders.

Q4. I sometimes fail to meet my commitments and responsibilities because of my sexual behaviors.

Step 2: Visualize age vs score

I would guess that the SCS score decreases with age. Let’s see if that is the case.

Again, NumPy can do the magic easily, that is, prepare the data. To visualize it we use matplotlib, which is a comprehensive library for creating static, animated, and interactive visualizations in Python.

import numpy as np
import matplotlib.pyplot as plt
data = np.genfromtxt('scs.csv', delimiter=',', dtype='int')
# Skip first row as it has description
data = data[1:]
score = data[:,10]
age = data[:,12]
age[age > 100] = 0
plt.scatter(age, score, alpha=0.05)
plt.show()

Resulting in this plot.

Age vs SCS score.

It actually does not look like there is any correlation. Remember, there are more young people responding to the survey.

Let’s ask NumPy what it thinks about correlation here? Luckily we can do that by calling the corrcoef function, which calculates the Pearson product-moment correlation coefficients.

print("Correlation of age and SCS score:", np.corrcoef(age, score))

Resulting in this output.

Correlation of age and SCS score:
[[1.         0.01046882]
 [0.01046882 1.        ]]

This says there is no correlation: 0.0 – 0.3 is considered a small correlation, so 0.01046882 is close to none. Does that mean that the SCS score does not correlate with age? That our SCS score is static through life?

I do not think we can conclude that based on this small dataset.
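
To help read the 2x2 matrix that corrcoef returns, here is a small sketch with made-up numbers (not taken from the survey). The diagonal is always 1, as each variable correlates perfectly with itself, and the coefficient of interest sits at position [0, 1].

```python
import numpy as np

# Made-up ages and two made-up score series
age = np.array([18, 22, 30, 41, 55, 63])
score_linear = 2 * age + 5                        # rises exactly linearly with age
score_other = np.array([25, 19, 28, 21, 27, 20])  # no obvious pattern

r_linear = np.corrcoef(age, score_linear)[0, 1]
r_other = np.corrcoef(age, score_other)[0, 1]

print("Perfectly linear:", r_linear)
print("No obvious pattern:", r_other)
```

Keep in mind that the coefficient only measures linear association, so a value close to 0 does not rule out other kinds of relationships.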

Step 3: Bar plot the distribution of scores

In the graph we plotted, it also looked like the scores were close to evenly distributed.

Let’s try to see that. Here we need to sum participants by group. NumPy falls a bit short here. But let’s keep the good mood and use plain old Python lists.

import numpy as np
import matplotlib.pyplot as plt
data = np.genfromtxt('scs.csv', delimiter=',', dtype='int')
# Skip first row as it has description
data = data[1:]
scores = []
numbers = []
for i in range(10, 41):
    numbers.append(i)
    scores.append(data[data[:, 10] == i].shape[0])
plt.bar(numbers, scores)
plt.show()
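
As an alternative to the loop, NumPy can also do the counting itself with np.unique and return_counts=True. The sketch below uses a handful of made-up total scores rather than the real column.

```python
import numpy as np

# Made-up total scores
total_scores = np.array([10, 23, 23, 40, 17, 23, 17])

# Sorted distinct scores and how often each one occurs
values, counts = np.unique(total_scores, return_counts=True)

print(values)  # [10 17 23 40]
print(counts)  # [1 2 3 1]
```

The two arrays could be passed to plt.bar(values, counts) directly. Unlike the loop above, scores that never occur are simply absent instead of being listed with a count of 0.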

Resulting in this bar plot.

Count participants by score.

We knew that the average score was around 23, which would be consistent with a fairly even distribution. But the counts seem to be a little lower at the far high end of the SCS score.

For another great tutorial on NumPy check this one out, or learn some differences between NumPy and Pandas.