Pandas: Explore Datasets by Visualization – Exploring the Holland Code (RIASEC) Test – Part IV

What will we cover in this tutorial?

We will continue our journey to explore a big dataset of 145,000+ respondents to a RIASEC test. If you want to explore the full journey, we recommend you read this tutorial first as well as the second part of the tutorial, and finally, the third part before continuing.

In this part we will investigate whether there is any correlation between the respondents' major of education and the six personality dimensions of the RIASEC model.

Step 1: Group by major of education

This is getting tricky, as the majors are free text typed in by the respondents, so we will miss some connections between them.

But let’s start by exploring them.

import pandas as pd

# The dataset is tab-separated, hence the '\t' delimiter.
data = pd.read_csv('riasec.csv', delimiter='\t', low_memory=False)
major = data.loc[:, ['major']]
print(major.groupby('major').size().sort_values(ascending=False))

The output is given here.

major
psychology                6861
Psychology                5763
English                   2342
Business                  2290
Biology                   1289
                          ... 
Sociology, Social work       1
Sociology, Psychology        1
Sociology, Math              1
Sociology, Linguistics       1
Nuerobiology                 1
Length: 15955, dtype: int64

Here we identify one problem: some respondents write their major in lowercase and others in uppercase.

Step 2: Clean up a few ambiguities

The first step would be to lowercase everything.

import pandas as pd

data = pd.read_csv('riasec.csv', delimiter='\t', low_memory=False)
major = data.loc[:,['major']]
major['major'] = major['major'].str.lower()
print(major.groupby('major').size().sort_values(ascending=False).iloc[:10])

Now we print the first 10 lines.

major
psychology          12766
business             3496
english              3042
nursing              2142
biology              1961
education            1800
engineering          1353
accounting           1186
computer science     1159
psychology           1098
dtype: int64

Here we notice that psychology appears both first and last. Inspecting it further, it seems the last one has a trailing space. Hence, we can try to strip the whitespace around all the majors.

import pandas as pd

data = pd.read_csv('riasec.csv', delimiter='\t', low_memory=False)
major = data.loc[:, ['major']]
major['major'] = major['major'].str.lower()
# .str.strip() removes surrounding whitespace and leaves missing values as NaN.
major['major'] = major['major'].str.strip()
print(major.groupby('major').size().sort_values(ascending=False).iloc[:10])

Now the output is as follows.

major
psychology          13878
business             3848
english              3240
nursing              2396
biology              2122
education            1954
engineering          1504
accounting           1292
computer science     1240
law                  1111
dtype: int64

This introduces law at the bottom of the list.

This process could continue, but let's keep the focus on the 10 most represented majors in the dataset. Obviously, more majors could be merged by cleaning the data further, as sketched below.
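As a sketch of how that cleanup could continue, you could map known misspellings and synonyms onto one canonical name, continuing from the major DataFrame above. In the mapping below, 'nuerobiology' appears in the data, while the other entries are hypothetical examples; the list is by no means exhaustive.

# Hypothetical mapping from variations to a canonical major name.
synonyms = {
    'nuerobiology': 'neurobiology',
    'business administration': 'business',
    'comp sci': 'computer science',
}
major['major'] = major['major'].replace(synonyms)
print(major.groupby('major').size().sort_values(ascending=False).iloc[:10])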

Step 3: See if education correlates to known words

First, let's explore the dataset a bit more. The respondents are asked whether they know the definitions of the following words.

  • boat
  • incoherent
  • pallid
  • robot
  • audible
  • cuivocal
  • paucity
  • epistemology
  • florted
  • decide
  • pastiche
  • verdid
  • abysmal
  • lucid
  • betray
  • funny

Respondents mark each word they know. Note that a few of the entries (cuivocal, florted, and verdid) are made-up words that serve as validity checks. Hence, we can count the number of words each respondent marks and calculate an average per major group.

import pandas as pd
import matplotlib.pyplot as plt

data = pd.read_csv('riasec.csv', delimiter='\t', low_memory=False)
# Sum the 16 vocabulary items (VCL1-VCL16) per respondent.
data['VCL'] = data[['VCL' + str(i) for i in range(1, 17)]].sum(axis=1)
view = data.loc[:, ['VCL', 'major']].copy()
view['major'] = view['major'].str.lower()
view['major'] = view['major'].str.strip()

view = view.groupby('major').aggregate(['mean', 'count'])
# Keep only the 10 most common majors; law, the 10th, has 1,111 respondents.
view = view[view['VCL', 'count'] > 1110]
view.loc[:, ('VCL', 'mean')].plot(kind='barh', figsize=(14, 5))
plt.show()

Which results in the following output.

Average number of the 16 words that each major knows.

The engineering majors seem to score lower than the nursing majors. Well, I am actually surprised that computer science scores that high.

Step 4: Putting it all together

Let's reuse the dimension calculations from the previous tutorial.

import pandas as pd
import matplotlib.pyplot as plt

def sum_dimension(data, letter):
    # Sum the 8 answers belonging to one RIASEC dimension, e.g. R1-R8.
    return data[[letter + str(i) for i in range(1, 9)]].sum(axis=1)

data = pd.read_csv('riasec.csv', delimiter='\t', low_memory=False)
data['R'] = sum_dimension(data, 'R')
data['I'] = sum_dimension(data, 'I')
data['A'] = sum_dimension(data, 'A')
data['S'] = sum_dimension(data, 'S')
data['E'] = sum_dimension(data, 'E')
data['C'] = sum_dimension(data, 'C')
# Sum the 16 vocabulary items (VCL1-VCL16) per respondent.
data['VCL'] = data[['VCL' + str(i) for i in range(1, 17)]].sum(axis=1)
view = data.loc[:, ['R', 'I', 'A', 'S', 'E', 'C', 'VCL', 'major']].copy()
view['major'] = view['major'].str.lower()
view['major'] = view['major'].str.strip()

view = view.groupby('major').aggregate(['mean', 'count'])
view = view[view['VCL', 'count'] > 1110]
# Plot the mean of each of the six dimensions per major.
view.loc[:, [('R', 'mean'), ('I', 'mean'), ('A', 'mean'), ('S', 'mean'), ('E', 'mean'), ('C', 'mean')]].plot(kind='barh', figsize=(14, 5))
plt.show()

Which results in the following diagram.

Correlation between major and RIASEC personality traits

Biology has a high I (Investigative: people who prefer to work with data), while the R (Realistic: people who like to work with things) is dominated by the engineers and computer scientists.
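If you would rather read these numbers off directly than eyeball the chart, here is a minimal sketch, continuing from the aggregated view DataFrame above, that prints the top-scoring major for each dimension.

# For each RIASEC dimension, print the major with the highest mean score.
for dim in ['R', 'I', 'A', 'S', 'E', 'C']:
    ranked = view[(dim, 'mean')].sort_values(ascending=False)
    print(dim, ranked.index[0], round(ranked.iloc[0], 2))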

Hmm… I should have noticed earlier how many respondents have education as their major.

3 Steps to Plot Shooting Incident in NY on a Map Using Python

What will you learn in this tutorial?

  • Where to find interesting data contained in CSV files.
  • How to extract a map to plot the data on.
  • Use Python to easily plot the data from the CSV file on the map.

Step 1: Collect the data in CSV format

You can find various interesting data in CSV format on data.world that you can play around with in Python.

In this tutorial we will focus on NYPD shooting incidents from the last year. You can find the data on data.world.

data.world with NYPD Shooting Incident Data (Year To Date)

You can download the CSV file containing all the data by pressing the download link.

To download the CSV file, press the download button.

Looking at the data you see that each incident has latitude and longitude coordinates.

{'INCIDENT_KEY': '184659172', 'OCCUR_DATE': '06/30/2018 12:00:00 AM', 'OCCUR_TIME': '23:41:00', 'BORO': 'BROOKLYN', 'PRECINCT': '75', 'JURISDICTION_CODE': '0', 'LOCATION_DESC': 'PVT HOUSE                     ', 'STATISTICAL_MURDER_FLAG': 'false', 'PERP_AGE_GROUP': '', 'PERP_SEX': '', 'PERP_RACE': '', 'VIC_AGE_GROUP': '25-44', 'VIC_SEX': 'M', 'VIC_RACE': 'BLACK', 'X_COORD_CD': '1020263', 'Y_COORD_CD': '184219', 'Latitude': '40.672250312', 'Longitude': '-73.870176252'}
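If you want to reproduce that first row yourself, here is a minimal sketch using csv.DictReader from the standard library (assuming the file name used in Step 3 below).

import csv

# Print the first data row of the CSV file as a dictionary.
with open('nypd-shooting-incident-data-year-to-date-1.csv') as csv_file:
    reader = csv.DictReader(csv_file)
    print(dict(next(reader)))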

That means we can plot the incidents on a map. Let us try to do that.

Step 2: Export a map to plot the data

We want to plot all the shooting incidents on a map. You can use OpenStreetMap to get an image of a map.

We want a map of New York, which you can find by locating it on OpenStreetMap or pressing the link.

OpenStreetMap (sorry for the Danish language)

You should press the blue Download button in the lower-right corner of the picture.

Also, remember to get the coordinates of the image from the left sidebar; we will need them for the plot.

# Map boundaries: [min longitude, max longitude, min latitude, max latitude]
map_box = [-74.4461, -73.5123, 40.4166, 41.0359]

Step 3: Writing the Python code that adds data to the map

Importing data from a CSV file is easy and can be done with the standard library module csv. Plotting can be done with matplotlib. If you do not have it installed already, you can do that by typing the following on a command line (or see here).

pip install matplotlib

First you need to convert the longitude and latitude values from the CSV data to floats.

import csv

# The name of the input file might need to be adjusted, or its location
# added, if it is not in the same folder as this file.
with open('nypd-shooting-incident-data-year-to-date-1.csv') as csv_file:
    csv_reader = csv.DictReader(csv_file)
    longitude = []
    latitude = []
    for row in csv_reader:
        # The CSV values are strings; convert the coordinates to floats.
        longitude.append(float(row['Longitude']))
        latitude.append(float(row['Latitude']))

Now you have two lists (longitude and latitude), which contain the coordinates to plot.
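As a quick sanity check, you can verify that the two lists have the same length and that the values look like New York coordinates; the exact count depends on the version of the CSV file you downloaded.

# Sanity check: equal lengths and plausible New York coordinates.
print(len(longitude), len(latitude))
print(longitude[:3], latitude[:3])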

Then comes the actual plotting on top of the map image.

import matplotlib.pyplot as plt

# The boundaries of the image map
map_box = [-74.4461, -73.5123, 40.4166, 41.0359]
# The name of the image of the New York map might be different.
map_img = plt.imread('map.png')
fig, ax = plt.subplots()
ax.scatter(longitude, latitude)
ax.set_ylim(map_box[2], map_box[3])
ax.set_xlim(map_box[0], map_box[1])
ax.imshow(map_img, extent=map_box, alpha=0.9)

plt.savefig("mad_mod.png")
plt.show()

This will result in the following beautiful map of New York, which highlights where the shootings of the last year have occurred.

Shootings in New York in the last year. Plot by Python using matplotlib.

Now that is awesome. If you want to learn more, this and more is covered in my online course. Check it out.

You can also read about how to plot the mood of tweets on a leaflet map.