Pandas: Explore Datasets by Visualization – Exploring the Holland Code (RIASEC) Test – Part III

What will we cover in this tutorial?

We will continue our journey to explore a big dataset of 145,000+ respondents to a RIASEC test. If you want to explore the full journey, we recommend you read this tutorial first as well as the second part of the tutorial.

In this part we are going to combine the answers into scores for the 6 RIASEC personality dimensions and see if there is any correlation with education level.

Step 1: Understand the dataset better

The dataset consists of respondents rating themselves on statements related to the 6 personality types in RIASEC. The personality types are as follows (see Wikipedia for a deeper description).

  • Realistic (R): People who like to work with things. They tend to be “assertive and competitive, and are interested in activities requiring motor coordination, skill and strength”. They approach problem solving “by doing something, rather than talking about it, or sitting and thinking about it”. They also prefer “concrete approaches to problem solving, rather than abstract theory”. Finally, their interests tend to focus on “scientific or mechanical rather than cultural and aesthetic areas”.
  • Investigative (I): People who prefer to work with “data.” They like to “think and observe rather than act, to organize and understand information rather than to persuade”. They also prefer “individual rather than people oriented activities”.
  • Artistic (A): People who like to work with “ideas and things”. They tend to be “creative, open, inventive, original, perceptive, sensitive, independent and emotional”. They rebel against “structure and rules”, but enjoy “tasks involving people or physical skills”. They tend to be more emotional than the other types.
  • Social (S): People who like to work with “people” and who “seem to satisfy their needs in teaching or helping situations”. They tend to be “drawn more to seek close relationships with other people and are less apt to want to be really intellectual or physical”.
  • Enterprising (E): People who like to work with “people and data”. They tend to be “good talkers, and use this skill to lead or persuade others”. They “also value reputation, power, money and status”.
  • Conventional (C): People who prefer to work with “data” and who “like rules and regulations and emphasize self-control … they like structure and order, and dislike unstructured or unclear work and interpersonal situations”. They also “place value on reputation, power, or status”.

In the test, respondents rated themselves from 1 to 5 (1=Dislike, 3=Neutral, 5=Enjoy) on statements related to these 6 personality types.

That way, each respondent can be scored on each of the 6 dimensions.
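To make the scoring concrete, here is a minimal sketch with one made-up respondent; the rating values are invented, but the column names R1 through R8 match the dataset.

import pandas as pd

# One hypothetical respondent rating the eight Realistic statements
respondent = pd.DataFrame({'R' + str(i): [4] for i in range(1, 9)})

# A dimension score is the sum of eight 1-5 ratings, so it ranges from
# 8 (all Dislike) to 40 (all Enjoy); here it is 8 * 4 = 32
print(sum(respondent['R' + str(i)] for i in range(1, 9)))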

Step 2: Prepare the dataset

We want to score each respondent according to how they rated themselves on the 8 statements for each of the 6 personality types.

This can be achieved by the following code.

import pandas as pd


def sum_dimension(data, letter):
    # Sum the eight statement ratings (e.g. R1 to R8) for one dimension
    return sum(data[letter + str(i)] for i in range(1, 9))


data = pd.read_csv('riasec.csv', delimiter='\t', low_memory=False)
data['R'] = sum_dimension(data, 'R')
data['I'] = sum_dimension(data, 'I')
data['A'] = sum_dimension(data, 'A')
data['S'] = sum_dimension(data, 'S')
data['E'] = sum_dimension(data, 'E')
data['C'] = sum_dimension(data, 'C')

view = data.loc[:, ['education', 'R', 'I', 'A', 'S', 'E', 'C']]
print(view)

In the view, we keep the education column together with the dimension scores we calculated, because we want to see whether there is any correlation between education level and personality type.

We get the following output.

        education   R   I   A   S   E   C
0               2  20  33  27  37  16  12
1               2  14  35  19  22  10  10
2               2   9  11  11  30  24  16
3               1  15  21  27  20  25  19
4               3  13  36  34  37  20  26
...           ...  ..  ..  ..  ..  ..  ..
145823          3  10  19  28  28  20  13
145824          3  11  18  39  35  24  16
145825          2   8   8   8  36  12  21
145826          3  29  29  29  34  16  19
145827          2  21  33  19  30  27  24

Here we see the dimension scores together with the corresponding education level.

Step 3: Compute the correlations

Education level is encoded on the following scale.

  • 1: Less than high school
  • 2: High school
  • 3: University degree
  • 4: Graduate degree
  • 0: No answer
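If you want readable labels later (for example on plot axes), you can map these codes to names; a small sketch, where the label strings are my own shorthand:

# Shorthand labels for the education codes above (my own naming)
edu_labels = {0: 'no answer', 1: '< high school', 2: 'high school',
              3: 'university', 4: 'graduate'}

# Example usage: view['education'].map(edu_labels)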

Hence, we need to remove the no-answer group (0) from the data so it does not skew the results.

import pandas as pd


def sum_dimension(data, letter):
    # Sum the eight statement ratings (e.g. R1 to R8) for one dimension
    return sum(data[letter + str(i)] for i in range(1, 9))


data = pd.read_csv('riasec.csv', delimiter='\t', low_memory=False)
data['R'] = sum_dimension(data, 'R')
data['I'] = sum_dimension(data, 'I')
data['A'] = sum_dimension(data, 'A')
data['S'] = sum_dimension(data, 'S')
data['E'] = sum_dimension(data, 'E')
data['C'] = sum_dimension(data, 'C')

view = data.loc[:, ['education', 'R', 'I', 'A', 'S', 'E', 'C']]

# Remove the no-answer group (education == 0)
view = view[view['education'] != 0]

print(view.mean())
print(view.groupby('education').mean())
print(view.corr())

The output of the mean is given here.

education     2.394318
R            16.651624
I            23.994637
A            22.887701
S            26.079349
E            20.490080
C            19.105188
dtype: float64

This says that the average education level of the 145,000+ respondents was 2.39. On average, respondents rated themselves highest on Social, followed by Investigative, while Realistic received the lowest ratings.
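You can rank the dimensions directly by sorting the means; a minimal sketch, assuming the view from the code above:

# Mean score per dimension, highest first (education is dropped
# because it is on a different scale)
print(view.mean().drop('education').sort_values(ascending=False))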

The output of the mean per education group is given here.

                   R          I          A          S          E          C
education                                                                  
1          15.951952  23.103728  21.696007  23.170792  19.897772  17.315641
2          16.775297  23.873645  22.379625  25.936032  20.864591  19.551138
3          16.774487  24.302158  23.634034  27.317784  20.468160  19.606312
4          16.814534  24.769829  24.347250  27.382699  20.038501  18.762395

Here you can see that those with less than high school rate themselves lower in all dimensions, while the most educated group rates itself highest on Realistic, Investigative, Artistic, and Social.

Does that mean the higher your education, the more social, artistic, or realistic you are?

The output of the correlation is given here.

           education         R         I         A         S         E         C
education   1.000000  0.029008  0.057466  0.105946  0.168640 -0.006115  0.044363
R           0.029008  1.000000  0.303895  0.206085  0.109370  0.340535  0.489504
I           0.057466  0.303895  1.000000  0.334159  0.232608  0.080878  0.126554
A           0.105946  0.206085  0.334159  1.000000  0.350631  0.322099  0.056576
S           0.168640  0.109370  0.232608  0.350631  1.000000  0.411564  0.213413
E          -0.006115  0.340535  0.080878  0.322099  0.411564  1.000000  0.526813
C           0.044363  0.489504  0.126554  0.056576  0.213413  0.526813  1.000000

As you can see, you should not conclude that. Take Social: its correlation with education is only 0.168640, which is a very weak correlation. The same holds for Realistic and Artistic: the correlations are very low.
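To read those numbers off more directly, you can pull out the education column of the correlation matrix; a minimal sketch, again assuming the view from above:

# Correlation of each dimension with education, sorted; education's
# trivial self-correlation of 1.0 is dropped
print(view.corr()['education'].drop('education').sort_values())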

Step 4: Visualize our findings

One way to visualize the data is to use pandas' integration with Matplotlib.

import pandas as pd
import matplotlib.pyplot as plt


def sum_dimension(data, letter):
    # Sum the eight statement ratings (e.g. R1 to R8) for one dimension
    return sum(data[letter + str(i)] for i in range(1, 9))


data = pd.read_csv('riasec.csv', delimiter='\t', low_memory=False)
data['R'] = sum_dimension(data, 'R')
data['I'] = sum_dimension(data, 'I')
data['A'] = sum_dimension(data, 'A')
data['S'] = sum_dimension(data, 'S')
data['E'] = sum_dimension(data, 'E')
data['C'] = sum_dimension(data, 'C')

view = data.loc[:, ['education', 'R', 'I', 'A', 'S', 'E', 'C']]

# Remove the no-answer group (education == 0)
view = view[view['education'] != 0]

edu = view.groupby('education').mean()
# Label the education codes 1-4 in ascending order
edu.index = ['< high school', 'high school', 'university', 'graduate']
edu.plot(kind='barh', figsize=(10, 4))
plt.show()

The result is a horizontal bar chart showing, for each education level, the mean score on each of the six dimensions.

Finally, the correlation to education can be plotted similarly.

Note that education itself is kept in the plot to provide a reference bar of full (1.0) correlation.
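A minimal sketch of that plot, assuming the view and imports from the code above; education is kept so its bar of 1.0 serves as a reference:

# Plot each column's correlation with education as horizontal bars
view.corr()['education'].plot(kind='barh', figsize=(10, 4))
plt.show()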

Continue to read how to explore the dataset in the next tutorial.

Pandas: Explore Datasets by Visualization – Exploring the Holland Code (RIASEC) Test – Part II

What will we cover in this tutorial?

We will continue our journey to explore a big dataset of 145,000+ respondents to a RIASEC test. If you want to explore the full journey, we recommend you read this tutorial first.

In this tutorial we will find some data points that are not correct and look at a potential way to deal with them.

Step 1: Explore the family sizes from the respondents

In the first tutorial we looked at how the respondents were distributed around the world. Surprisingly, most countries were represented.


In this tutorial we will explore the dataset further. The dataset is available here.

import pandas as pd

# Widen pandas' display limits to get a broader summary
pd.set_option('display.max_rows', 300)
pd.set_option('display.max_columns', 30)
pd.set_option('display.width', 1000)


data = pd.read_csv('riasec.csv', delimiter='\t', low_memory=False)
print(data)

Which will output the following.

        R1  R2  R3  R4  R5  R6  R7  R8  I1  I2  I3  I4  I5  I6  I7  ...  gender  engnat  age  hand  religion  orientation  race  voted  married  familysize  uniqueNetworkLocation  country  source                major  Unnamed: 93
0        3   4   3   1   1   4   1   3   5   5   4   3   4   5   4  ...       1       1   14     1         7            1     1      2        1           1                      1       US       2                  NaN          NaN
1        1   1   2   4   1   2   2   1   5   5   5   4   4   4   4  ...       1       1   29     1         7            3     4      1        2           3                      1       US       1              Nursing          NaN
2        2   1   1   1   1   1   1   1   4   1   1   1   1   1   1  ...       2       1   23     1         7            1     4      2        1           1                      1       US       1                  NaN          NaN
3        3   1   1   2   2   2   2   2   4   1   2   4   3   2   3  ...       2       2   17     1         0            1     1      2        1           1                      1       CN       0                  NaN          NaN
4        4   1   1   2   1   1   1   2   5   5   5   3   5   5   5  ...       2       2   18     1         4            3     1      2        1           4                      1       PH       0            education          NaN

Scrolling through the columns, I got curious about how family sizes vary around the world. This dataset obviously does not provide conclusive data on that, but it could be interesting to see if there is any connection between where you are located in the world and family size.

Step 2: Explore the distribution of family sizes

What often happens in a dataset is that some of the data is inaccurate.

To get a feel for the data in the familysize column, you can explore it by running this.

import pandas as pd


data = pd.read_csv('riasec.csv', delimiter='\t', low_memory=False)

print(data['familysize'].describe())
print(pd.cut(data['familysize'], bins=[0, 1, 2, 3, 4, 5, 6, 7, 10, 100, 1000000000]).value_counts())

Resulting in the following from the describe output.

count    1.458280e+05
mean     1.255801e+05
std      1.612271e+07
min      0.000000e+00
25%      2.000000e+00
50%      3.000000e+00
75%      3.000000e+00
max      2.147484e+09
Name: familysize, dtype: float64

Where the mean value of family size is 125,580. Well, maybe we don’t count family size the same way, but something is wrong there.
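The median is more robust against such outliers. From the describe output above, the 50% quantile is 3, which looks far more plausible; a quick check:

# The median is unaffected by the extreme outliers that inflate the mean
print(data['familysize'].median())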

Grouping the data into bins (using the cut function combined with value_counts) gives this output.

(1, 2]               51664
(2, 3]               38653
(3, 4]               18729
(0, 1]               15901
(4, 5]                8265
(5, 6]                3932
(6, 7]                1928
(7, 10]               1904
(10, 100]              520
(100, 1000000000]       23
Name: familysize, dtype: int64

This indicates 23 families of size greater than 100. Note that the rows with the value 2147483647 exceed the last bin edge (1000000000) and therefore do not fall into any bin, which is why the listing below contains more than 23 rows. Let's investigate everything above 100.

print(data[data['familysize'] > 100]['familysize'])

Giving us this output.

1212      2147483647
3114      2147483647
5770      2147483647
8524             104
9701             103
21255     2147483647
24003            999
26247     2147483647
27782     2147483647
31451           9999
39294           9045
39298          84579
49033            900
54592            232
58773     2147483647
74745      999999999
78643            123
92457            999
95916            908
102680           666
109429           989
111488       9234785
120489          5000
120505     123456789
122580          5000
137141           394
139226          3425
140377           934
142870    2147483647
145686           377
145706           666
Name: familysize, dtype: int64

The integer 2147483647 is interesting as it is the maximum 32-bit positive integer. I think it is safe to say that most family sizes given above 100 are not realistic.
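A quick sanity check confirms that the suspicious value is exactly 2**31 - 1, and shows how often it occurs:

# 2147483647 is the largest value a signed 32-bit integer can hold
print(2**31 - 1)

# Count how many respondents reported exactly this value
print((data['familysize'] == 2**31 - 1).sum())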

Step 3: Clean the data

You need to make a decision about these data points, which would otherwise skew your results.

Say you just decide to visualize it without any adjustment; it would give a misrepresentative picture. In the resulting world map, Iceland stands out: it seems like Iceland has a tradition for big families.
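A quick way to see the problem without a map is to rank the per-country averages; a minimal sketch:

# Average reported family size per country, largest first
print(data.groupby('country')['familysize'].mean().sort_values(ascending=False).head())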

Let’s investigate that.

print(data[data['country'] == 'IS']['familysize'])

Interestingly, it gives only one row that does not seem correct.

74745     999999999

But as there are only a few respondents from Iceland, that single value makes its average the highest.

To clean the data fully, we can decide that family sizes of 10 or more are not correct. I know that threshold might be a bit low, and you can choose a different one.

Cleaning the data is simple.

data = data[data['familysize'] < 10]

Magic, right? You simply write a condition, which is evaluated as a vectorized boolean mask, and only the rows that fulfill the condition are kept.
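Spelled out, the condition produces a boolean Series that is then used to index the DataFrame:

mask = data['familysize'] < 10  # one True/False per row
data = data[mask]               # keep only the rows where mask is True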

Step 4: Visualize the data

We will use geopandas, matplotlib, and pycountry to visualize it. The process is similar to the one in the previous tutorial, where you can find more details.

import geopandas
import pandas as pd
import matplotlib.pyplot as plt
import pycountry

# Helper function to map country names to alpha_3 representation - though some are not known by library
def lookup_country_code(country):
    try:
        return pycountry.countries.lookup(country).alpha_3
    except LookupError:
        return country


data = pd.read_csv('riasec.csv', delimiter='\t', low_memory=False)

# Map each country to its ISO alpha-3 code and drop the unrealistic family sizes
data['alpha3'] = data['country'].apply(lookup_country_code)
data = data[data['familysize'] < 10]

# numeric_only avoids errors on non-numeric columns in newer pandas versions
country_mean = data.groupby('alpha3').mean(numeric_only=True)

world = geopandas.read_file(geopandas.datasets.get_path("naturalearth_lowres"))
world_map = world.merge(country_mean, how='left', left_on='iso_a3', right_on='alpha3')
world_map.plot('familysize', figsize=(12,4), legend=True)
plt.show()

The result is a world map colored by the average family size of the respondents in each country.

Looks like there is a one-child policy in China? Again, do not draw any conclusions from this data, as it is a very narrow sample for this aspect.

