What will we cover in this tutorial?
We will continue our journey to explore a big dataset of 145,000+ respondents to a RIASEC test. If you want to follow the full journey, we recommend reading the first, second, and third parts of this tutorial before continuing.
In this part we will investigate whether there is any correlation between the respondents’ major and the 6 dimensions of the RIASEC personality types.
Step 1: Group by major
This gets tricky, because the majors are typed in by the respondents as free text. Different spellings of the same major end up in separate groups, so we will miss some connections between them.
But let’s start by exploring them.
import pandas as pd
data = pd.read_csv('riasec.csv', delimiter='\t', low_memory=False)
major = data.loc[:,['major']]
print(major.groupby('major').size().sort_values(ascending=False))
The output is given here.
major
psychology 6861
Psychology 5763
English 2342
Business 2290
Biology 1289
...
Sociology, Social work 1
Sociology, Psychology 1
Sociology, Math 1
Sociology, Linguistics 1
Nuerobiology 1
Length: 15955, dtype: int64
Here we identify one problem: some respondents write their major in lowercase and others with a capital letter.
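To see how much the casing matters, one option is to compare the number of distinct values before and after lowercasing. The following is a minimal sketch on made-up values (the `majors` Series here is an illustration, not the real dataset).

```python
import pandas as pd

# Made-up sample values, not taken from the real dataset.
majors = pd.Series(["psychology", "Psychology", "English", "english", "Biology"])

# Counting the raw values keeps differently cased spellings apart.
raw_counts = majors.value_counts()

# Comparing the number of distinct values before and after lowercasing
# shows how many groups differ only by case.
distinct_raw = majors.nunique()
distinct_lower = majors.str.lower().nunique()

print(raw_counts)
print(distinct_raw, distinct_lower)
```

If the second number is clearly smaller than the first, lowercasing will merge a meaningful number of groups.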
Step 2: Clean up a few ambiguities
The first step is to lowercase everything.
import pandas as pd
data = pd.read_csv('riasec.csv', delimiter='\t', low_memory=False)
major = data.loc[:,['major']]
major['major'] = major['major'].str.lower()
print(major.groupby('major').size().sort_values(ascending=False).iloc[:10])
Now we print the first 10 lines.
major
psychology 12766
business 3496
english 3042
nursing 2142
biology 1961
education 1800
engineering 1353
accounting 1186
computer science 1159
psychology 1098
dtype: int64
We notice that psychology appears both first and last. Inspecting it further, it seems that the last one has a space after it. Hence, we can try to remove the whitespace around all majors.
import pandas as pd
import numpy as np
data = pd.read_csv('riasec.csv', delimiter='\t', low_memory=False)
major = data.loc[:,['major']]
major['major'] = major['major'].str.lower()
major['major'] = major['major'].str.strip()  # missing values stay NaN
print(major.groupby('major').size().sort_values(ascending=False).iloc[:10])
Now the output is as follows.
major
psychology 13878
business 3848
english 3240
nursing 2396
biology 2122
education 1954
engineering 1504
accounting 1292
computer science 1240
law 1111
dtype: int64
Now law enters at the bottom of the list.
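A minimal sketch on made-up values shows how lowercasing and stripping merge the duplicate groups while leaving missing entries untouched. The sample Series is an assumption for illustration only.

```python
import numpy as np
import pandas as pd

# Made-up sample values, including a missing entry and stray whitespace.
majors = pd.Series(["Psychology", "psychology ", " Law", np.nan])

# The .str accessor skips missing values, so NaN stays NaN.
cleaned = majors.str.lower().str.strip()

# value_counts ignores NaN by default.
print(cleaned.value_counts())
```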
This process could continue, but let’s keep the focus on the 10 most represented majors in the dataset. Further majors could of course be added in a more thorough investigation.
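If the cleanup were continued, one possible next step would be to map known misspellings and abbreviations to a canonical name. The sketch below assumes a small hand-written `aliases` dictionary; both the dictionary and the sample values are hypothetical and not derived from the dataset.

```python
import pandas as pd

# Made-up, already lowercased and stripped sample values.
majors = pd.Series(["psychology", "psych", "nuerobiology", "neurobiology", "law"])

# Hypothetical hand-written mapping from variants to a canonical spelling.
aliases = {"psych": "psychology", "nuerobiology": "neurobiology"}

# Series.replace swaps exact matches and leaves every other value alone.
canonical = majors.replace(aliases)

print(canonical.value_counts())
```

A mapping like this only scales to the most frequent variants, which is one reason to stop at the largest groups here.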
Step 3: See if the major correlates with the number of known words
First let’s explore the dataset a bit more. The respondents are asked if they know the definitions of the following words.
- boat
- incoherent
- pallid
- robot
- audible
- cuivocal
- paucity
- epistemology
- florted
- decide
- pastiche
- verdid
- abysmal
- lucid
- betray
- funny
The respondents mark each word they know. Hence, we can count the number of words each respondent knows and calculate an average per major group.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
data = pd.read_csv('riasec.csv', delimiter='\t', low_memory=False)
data['VCL'] = data['VCL1'] + data['VCL2'] + data['VCL3'] + data['VCL4'] + data['VCL5'] + data['VCL6'] + data['VCL7'] + data['VCL8'] + data['VCL9'] + data['VCL10'] + data['VCL11'] + data['VCL12'] + data['VCL13'] + data['VCL14'] + data['VCL15'] + data['VCL16']
view = data.loc[:, ['VCL', 'major']]
view['major'] = view['major'].str.lower()
view['major'] = view['major'].str.strip()  # missing values stay NaN
view = view.groupby('major').aggregate(['mean', 'count'])
view = view[view['VCL','count'] > 1110]
view.loc[:,('VCL','mean')].plot(kind='barh', figsize=(14,5))
plt.show()
Which results in the following output.

The engineers seem to score lower than the nursing majors. I am actually surprised that computer science scores that high.
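The long chain of additions used to build the VCL column can also be written more compactly by generating the column names. This is a sketch on a made-up frame with only three VCL columns; for the full dataset the range would presumably be `range(1, 17)`.

```python
import pandas as pd

# Made-up answers for three respondents; 1 means the word was marked as known.
# Only three of the sixteen VCL columns are used to keep the sketch short.
data = pd.DataFrame({
    "VCL1": [1, 1, 0],
    "VCL2": [1, 0, 0],
    "VCL3": [1, 1, 1],
    "major": ["law", "law", "biology"],
})

# Build the column names instead of typing each one out.
vcl_columns = [f"VCL{i}" for i in range(1, 4)]

# Row-wise sum; skipna=False mirrors chained "+", which propagates missing values.
data["VCL"] = data[vcl_columns].sum(axis=1, skipna=False)

# Mean score and group size per major.
summary = data.groupby("major")["VCL"].agg(["mean", "count"])
print(summary)
```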
Step 4: Adding it all up together
Let’s reuse the calculations from the previous tutorial.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
def sum_dimension(data, letter):
    return data[letter + '1'] + data[letter + '2'] + data[letter + '3'] + data[letter + '4'] + data[letter + '5'] + data[letter + '6'] + data[letter + '7'] + data[letter + '8']
data = pd.read_csv('riasec.csv', delimiter='\t', low_memory=False)
data['R'] = sum_dimension(data, 'R')
data['I'] = sum_dimension(data, 'I')
data['A'] = sum_dimension(data, 'A')
data['S'] = sum_dimension(data, 'S')
data['E'] = sum_dimension(data, 'E')
data['C'] = sum_dimension(data, 'C')
data['VCL'] = data['VCL1'] + data['VCL2'] + data['VCL3'] + data['VCL4'] + data['VCL5'] + data['VCL6'] + data['VCL7'] + data['VCL8'] + data['VCL9'] + data['VCL10'] + data['VCL11'] + data['VCL12'] + data['VCL13'] + data['VCL14'] + data['VCL15'] + data['VCL16']
view = data.loc[:, ['R', 'I', 'A', 'S', 'E', 'C', 'VCL', 'major']]
view['major'] = view['major'].str.lower()
view['major'] = view['major'].str.strip()  # missing values stay NaN
view = view.groupby('major').aggregate(['mean', 'count'])
view = view[view['VCL','count'] > 1110]
view.loc[:,[('R','mean'), ('I','mean'),('A','mean'), ('S','mean'),('E','mean'), ('C','mean')]].plot(kind='barh', figsize=(14,5))
plt.show()
Which results in the following diagram.

Biology scores high on I (Investigative, people who prefer to work with data), while R (Realistic, people who like to work with things) is dominated by engineering and computer science.
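Instead of reading the leaders off the chart, the major with the highest average per dimension can also be looked up directly. The following is a minimal sketch on made-up scores with only two dimensions; the numbers are illustrative and not results from the dataset.

```python
import pandas as pd

# Made-up per-respondent dimension scores, not taken from the real dataset.
view = pd.DataFrame({
    "major": ["biology", "biology", "engineering", "engineering"],
    "R": [10, 12, 20, 22],
    "I": [30, 28, 18, 20],
})

# Average each dimension within a major.
means = view.groupby("major")[["R", "I"]].mean()

# idxmax returns, for each column, the index label with the largest value.
top_major = means.idxmax()

print(means)
print(top_major)
```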
Hmm… I should have noticed that many respondents list education as their major.