What will we cover in this tutorial?
We will continue our journey to explore a big dataset of 145,000+ respondents to a RIASEC test. If you want to follow the full journey, we recommend that you first read the first and second parts of this tutorial.
In this part we are going to combine the data into the 6 personality-type dimensions of RIASEC and see if there is any correlation with educational level.
Step 1: Understand the dataset better
The dataset was collected by letting the respondents rate themselves on statements related to the 6 personality types in RIASEC. The personality types are as follows (see Wikipedia for a deeper description).
- Realistic (R): People that like to work with things. They tend to be “assertive and competitive, and are interested in activities requiring motor coordination, skill and strength”. They approach problem solving “by doing something, rather than talking about it, or sitting and thinking about it”. They also prefer “concrete approaches to problem solving, rather than abstract theory”. Finally, their interests tend to focus on “scientific or mechanical rather than cultural and aesthetic areas”.
- Investigative (I): People who prefer to work with “data.” They like to “think and observe rather than act, to organize and understand information rather than to persuade”. They also prefer “individual rather than people oriented activities”.
- Artistic (A): People who like to work with “ideas and things”. They tend to be “creative, open, inventive, original, perceptive, sensitive, independent and emotional”. They rebel against “structure and rules”, but enjoy “tasks involving people or physical skills”. They tend to be more emotional than the other types.
- Social (S): People who like to work with “people” and who “seem to satisfy their needs in teaching or helping situations”. They tend to be “drawn more to seek close relationships with other people and are less apt to want to be really intellectual or physical”.
- Enterprising (E): People who like to work with “people and data”. They tend to be “good talkers, and use this skill to lead or persuade others”. They “also value reputation, power, money and status”.
- Conventional (C): People who prefer to work with “data” and who “like rules and regulations and emphasize self-control … they like structure and order, and dislike unstructured or unclear work and interpersonal situations”. They also “place value on reputation, power, or status”.
In the test they have rated themselves from 1 to 5 (1=Dislike, 3=Neutral, 5=Enjoy) on statements related to these 6 personality types.
That way each respondent can be rated on these 6 dimensions.
Step 2: Prepare the dataset
We want to score each respondent according to how they have rated themselves on the 8 statements for each of the 6 personality types.
This can be achieved by the following code.
import pandas as pd


def sum_dimension(data, letter):
    # Add up the ratings of the 8 statements (e.g. R1 to R8) for one personality type.
    return (data[letter + '1'] + data[letter + '2'] + data[letter + '3'] + data[letter + '4']
            + data[letter + '5'] + data[letter + '6'] + data[letter + '7'] + data[letter + '8'])


# The file is tab-separated.
data = pd.read_csv('riasec.csv', delimiter='\t', low_memory=False)

# One total score per personality type.
for letter in 'RIASEC':
    data[letter] = sum_dimension(data, letter)

# Keep only the education level and the six scores.
view = data.loc[:, ['education', 'R', 'I', 'A', 'S', 'E', 'C']]
print(view)
In the view we make, we keep the education with the dimension ratings we have calculated, because we want to see if there is any correlation between education level and personality type.
We get the following output.
        education   R   I   A   S   E   C
0               2  20  33  27  37  16  12
1               2  14  35  19  22  10  10
2               2   9  11  11  30  24  16
3               1  15  21  27  20  25  19
4               3  13  36  34  37  20  26
...           ...  ..  ..  ..  ..  ..  ..
145823          3  10  19  28  28  20  13
145824          3  11  18  39  35  24  16
145825          2   8   8   8  36  12  21
145826          3  29  29  29  34  16  19
145827          2  21  33  19  30  27  24
Here we see the dimension ratings and the corresponding education level.
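The column-by-column sum in sum_dimension can also be written more compactly. The following is a minimal sketch, assuming the statement columns are named R1 to R8, I1 to I8, and so on, as in the code above. The tiny DataFrame with two made-up respondents and the helper name sum_dimension_compact are only for illustration and are not part of the dataset.

```python
import pandas as pd

DIMENSIONS = ['R', 'I', 'A', 'S', 'E', 'C']


def sum_dimension_compact(data, letter):
    # Build the column names letter1..letter8 and sum them row by row.
    columns = [f'{letter}{i}' for i in range(1, 9)]
    return data[columns].sum(axis=1)


# Made-up ratings for two respondents (illustration only, not real data):
# the first answers 1 to every statement, the second answers 5.
toy = pd.DataFrame({
    f'{letter}{i}': [1, 5] for letter in DIMENSIONS for i in range(1, 9)
})

for letter in DIMENSIONS:
    toy[letter] = sum_dimension_compact(toy, letter)

print(toy[DIMENSIONS])
```

With 8 statements rated from 1 to 5, each score falls between 8 and 40, which is what the two made-up respondents show. One difference to keep in mind: sum(axis=1) skips missing values by default, whereas adding the columns with + gives NaN for a row with any missing rating.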
Step 3: Compute the correlations
The education is given by the following scale.
- 1: Less than high school
- 2: High school
- 3: University degree
- 4: Graduate degree
- 0: No answer
Hence, we need to remove the no-answer group (0) from the data so that it does not skew the results.
import pandas as pd


def sum_dimension(data, letter):
    # Add up the ratings of the 8 statements (e.g. R1 to R8) for one personality type.
    return (data[letter + '1'] + data[letter + '2'] + data[letter + '3'] + data[letter + '4']
            + data[letter + '5'] + data[letter + '6'] + data[letter + '7'] + data[letter + '8'])


data = pd.read_csv('riasec.csv', delimiter='\t', low_memory=False)

# One total score per personality type.
for letter in 'RIASEC':
    data[letter] = sum_dimension(data, letter)

view = data.loc[:, ['education', 'R', 'I', 'A', 'S', 'E', 'C']]
# Drop respondents who did not state their education (0 = no answer).
view = view[view['education'] != 0]

print(view.mean())                       # overall averages
print(view.groupby('education').mean())  # averages per education level
print(view.corr())                       # pairwise correlations
The output of the mean is given here.
education     2.394318
R            16.651624
I            23.994637
A            22.887701
S            26.079349
E            20.490080
C            19.105188
dtype: float64
This says that the average educational level of the respondents who answered the education question was 2.394318. You can also see that, on average, the respondents rated themselves highest on Social, followed by Investigative. The lowest-rated dimension was Realistic.
The output of the mean grouped by education level is given here.
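To see why the no-answer code 0 has to be removed before computing these averages, here is a minimal sketch with made-up education values (not taken from the dataset). It compares the mean with the zeros kept and with the zeros filtered out, using the same filter as the code above.

```python
import pandas as pd

# Made-up education codes: 0 means "no answer" (illustration only).
toy = pd.DataFrame({'education': [0, 0, 2, 3, 4, 3]})

mean_with_zeros = toy['education'].mean()

# Same filter as in the tutorial code: keep only rows with an actual answer.
answered = toy[toy['education'] != 0]
mean_without_zeros = answered['education'].mean()

print(mean_with_zeros)     # 2.0
print(mean_without_zeros)  # 3.0
```

The zeros are not a real education level, so keeping them pulls the average down and would distort the correlations in the same way.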
                   R          I          A          S          E          C
education
1          15.951952  23.103728  21.696007  23.170792  19.897772  17.315641
2          16.775297  23.873645  22.379625  25.936032  20.864591  19.551138
3          16.774487  24.302158  23.634034  27.317784  20.468160  19.606312
4          16.814534  24.769829  24.347250  27.382699  20.038501  18.762395
Here you can see that those with less than high school actually rate themselves lowest in all dimensions, while the highest educated rate themselves highest on Realistic, Investigative, Artistic, and Social.
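One way to read such a table without scanning it by eye is to ask pandas which education group has the highest (or lowest) mean in each column. The sketch below re-enters the group means printed above, rounded to two decimals, so that it runs on its own. On the real data you would call the same methods on view.groupby('education').mean().

```python
import pandas as pd

# Group means copied from the output above, rounded to two decimals.
edu = pd.DataFrame(
    {
        'R': [15.95, 16.78, 16.77, 16.81],
        'I': [23.10, 23.87, 24.30, 24.77],
        'A': [21.70, 22.38, 23.63, 24.35],
        'S': [23.17, 25.94, 27.32, 27.38],
        'E': [19.90, 20.86, 20.47, 20.04],
        'C': [17.32, 19.55, 19.61, 18.76],
    },
    index=pd.Index([1, 2, 3, 4], name='education'),
)

# For each dimension, the education code with the highest / lowest mean.
print(edu.idxmax())
print(edu.idxmin())
```

Note how small the gaps are: for Realistic, groups 2, 3, and 4 differ only in the second decimal, so "highest" here does not mean "much higher".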
Does that mean that the higher your education, the more social, artistic, or realistic you are?
The output of the correlation is given here.
           education         R         I         A         S          E         C
education   1.000000  0.029008  0.057466  0.105946  0.168640  -0.006115  0.044363
R           0.029008  1.000000  0.303895  0.206085  0.109370   0.340535  0.489504
I           0.057466  0.303895  1.000000  0.334159  0.232608   0.080878  0.126554
A           0.105946  0.206085  0.334159  1.000000  0.350631   0.322099  0.056576
S           0.168640  0.109370  0.232608  0.350631  1.000000   0.411564  0.213413
E          -0.006115  0.340535  0.080878  0.322099  0.411564   1.000000  0.526813
C           0.044363  0.489504  0.126554  0.056576  0.213413   0.526813  1.000000
As you can see, you should not conclude that. Take Social: its correlation with education is only 0.168640, which is very low. The same holds for Realistic and Artistic, which also have very low correlations with education.
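To put a number on "very low", one common rule of thumb is to square the Pearson correlation coefficient, which gives the share of variance that a straight-line relationship would account for. Here is a small sketch using the correlations with education reported above (the values are copied from the table; the variable names are only for illustration).

```python
import pandas as pd

# Correlation of each dimension with education, copied from the table above.
corr_with_education = pd.Series({
    'R': 0.029008, 'I': 0.057466, 'A': 0.105946,
    'S': 0.168640, 'E': -0.006115, 'C': 0.044363,
})

# r squared: share of variance accounted for by a linear relationship.
r_squared = corr_with_education ** 2

print(r_squared.round(4))
```

For Social, the strongest of the six, this comes to roughly 0.028, that is, under 3% of the variance. Since education is an ordinal scale, a rank correlation such as view.corr(method='spearman') may be a reasonable alternative to the default Pearson coefficient.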
Step 4: Visualize our findings
One way to visualize the data is to use the pandas integration with Matplotlib.
import pandas as pd
import matplotlib.pyplot as plt


def sum_dimension(data, letter):
    # Add up the ratings of the 8 statements (e.g. R1 to R8) for one personality type.
    return (data[letter + '1'] + data[letter + '2'] + data[letter + '3'] + data[letter + '4']
            + data[letter + '5'] + data[letter + '6'] + data[letter + '7'] + data[letter + '8'])


data = pd.read_csv('riasec.csv', delimiter='\t', low_memory=False)

# One total score per personality type.
for letter in 'RIASEC':
    data[letter] = sum_dimension(data, letter)

view = data.loc[:, ['education', 'R', 'I', 'A', 'S', 'E', 'C']]
# Drop respondents who did not state their education (0 = no answer).
view = view[view['education'] != 0]

# Mean score per education level, with readable labels for codes 1 to 4.
edu = view.groupby('education').mean()
edu.index = ['< high school', 'high school', 'university', 'graduate']
edu.plot(kind='barh', figsize=(10, 4))
plt.show()
Resulting in the following graph.

Finally, the correlations with education can be plotted in a similar way.
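A minimal sketch of how such a correlation plot could be produced, assuming the same view DataFrame as in the code above. A few made-up rows stand in for the real data here so that the example runs on its own, and the function name plot_education_correlation is only for illustration.

```python
import pandas as pd
import matplotlib.pyplot as plt


def plot_education_correlation(view):
    # Correlation of every column with education, minus education itself.
    corr = view.corr()['education'].drop('education')
    fig, ax = plt.subplots(figsize=(10, 4))
    corr.plot(kind='barh', ax=ax)
    ax.set_xlabel('Correlation with education')
    return corr, ax


# Made-up rows (illustration only); use the real `view` in practice.
toy_view = pd.DataFrame({
    'education': [1, 2, 3, 4, 2, 3],
    'R': [10, 12, 15, 18, 14, 13],
    'I': [22, 25, 24, 30, 23, 27],
    'A': [18, 20, 26, 28, 21, 24],
    'S': [20, 25, 30, 35, 25, 30],
    'E': [21, 19, 22, 20, 23, 18],
    'C': [17, 20, 19, 21, 18, 22],
})

corr, ax = plot_education_correlation(toy_view)
print(corr)
```

Call plt.show() afterwards to display the figure. On the real data, pass view (with the no-answer rows already removed) instead of toy_view.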

Continue to read how to explore the dataset in the next tutorial.