    Pandas: Explore Datasets by Visualization – Exploring the Holland Code (RIASEC) Test – Part IV

    What will we cover in this tutorial?

    We will continue exploring a big dataset of 145,000+ respondents to a RIASEC test. If you want to follow the full journey, we recommend reading the first, second, and third parts of this tutorial series before continuing.

    In this part we will investigate whether we can see any correlation between a respondent's major and the 6 dimensions of the personality types in RIASEC.

    Step 1: Group by major

    This is getting tricky, as the majors are typed in by the respondent. We will be missing some connections between them.

    But let’s start by exploring them.

    import pandas as pd
    
    data = pd.read_csv('riasec.csv', delimiter='\t', low_memory=False)
    major = data.loc[:,['major']]
    print(major.groupby('major').size().sort_values(ascending=False))
    

    The output is given here.

    major
    psychology                6861
    Psychology                5763
    English                   2342
    Business                  2290
    Biology                   1289
                              ... 
    Sociology, Social work       1
    Sociology, Psychology        1
    Sociology, Math              1
    Sociology, Linguistics       1
    Nuerobiology                 1
    Length: 15955, dtype: int64
    

    Here we identify one problem: some respondents write their major in lowercase and others capitalize it, so the same major is counted as two different groups.
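    As a minimal sketch of how to measure that problem, the snippet below counts distinct spellings before and after lowercasing. The sample values are made up for illustration and are not taken from the dataset; `value_counts()` is used here as a shorter alternative to `groupby('major').size().sort_values(ascending=False)`.

```python
import pandas as pd

# Made-up sample standing in for the free-text 'major' column.
sample = pd.Series(
    ['psychology', 'Psychology', 'English', 'psychology', 'Psychology ', None],
    name='major',
)

# value_counts() counts each distinct string, most frequent first;
# missing values are dropped by default, as they are by groupby().
raw_counts = sample.value_counts()

# Comparing the number of distinct values before and after lowercasing
# shows how many spellings differ only by case.
distinct_raw = sample.nunique()
distinct_lower = sample.str.lower().nunique()

print(raw_counts)
print(distinct_raw, distinct_lower)
```

    Note that lowercasing alone still leaves the entry with a trailing space as a separate group.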

    Step 2: Clean up a few ambiguities

    The first step would be to lowercase everything.

    import pandas as pd
    
    data = pd.read_csv('riasec.csv', delimiter='\t', low_memory=False)
    major = data.loc[:,['major']]
    major['major'] = major['major'].str.lower()
    print(major.groupby('major').size().sort_values(ascending=False).iloc[:10])
    

    This prints the first 10 lines.

    major
    psychology          12766
    business             3496
    english              3042
    nursing              2142
    biology              1961
    education            1800
    engineering          1353
    accounting           1186
    computer science     1159
    psychology           1098
    dtype: int64
    

    Here we notice that psychology appears both first and last. Inspecting it further, it seems that the last one has a space after it. Hence, we can try to remove the whitespace around all majors.

    import pandas as pd
    
    data = pd.read_csv('riasec.csv', delimiter='\t', low_memory=False)
    major = data.loc[:,['major']]
    # Lowercase, then strip surrounding whitespace; the .str methods leave NaN as NaN.
    major['major'] = major['major'].str.lower().str.strip()
    print(major.groupby('major').size().sort_values(ascending=False).iloc[:10])
    

    Now the output is as follows.

    major
    psychology          13878
    business             3848
    english              3240
    nursing              2396
    biology              2122
    education            1954
    engineering          1504
    accounting           1292
    computer science     1240
    law                  1111
    dtype: int64
    

    This brings law in at the bottom of the list.

    This cleanup process could continue, but let’s keep the focus on the 10 most represented majors in the dataset. Obviously, more majors could be added in a deeper investigation.
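    If the cleanup were to continue, one possible next step is to collapse repeated whitespace and map known misspellings or abbreviations to one spelling. The sketch below assumes a hand-written alias table; `ALIASES` and `normalize_major` are hypothetical names, and the alias entries are illustrative guesses rather than something derived from the dataset.

```python
import pandas as pd

# Hypothetical alias table: illustrative entries only.
ALIASES = {
    'psych': 'psychology',
    'nuerobiology': 'neurobiology',
    'comp sci': 'computer science',
}

def normalize_major(majors: pd.Series) -> pd.Series:
    """Lowercase, trim, collapse inner whitespace, then apply the aliases."""
    cleaned = (
        majors.str.lower()
        .str.strip()
        .str.replace(r'\s+', ' ', regex=True)
    )
    # replace() with a dict only changes exact matches; missing values stay missing.
    return cleaned.replace(ALIASES)

sample = pd.Series(['Psychology ', 'psych', 'Nuerobiology', 'Comp  Sci', None])
normalized = normalize_major(sample)
print(normalized.value_counts())
```

    An explicit table like this is easy to audit, but it only fixes the variants someone has listed; entries with several majors, such as "Sociology, Math", would still need a separate decision.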

    Step 3: See if education correlates to known words

    First let’s explore the dataset a bit more. The respondents are asked if they know the definitions of the following words.

    • boat
    • incoherent
    • pallid
    • robot
    • audible
    • cuivocal
    • paucity
    • epistemology
    • florted
    • decide
    • pastiche
    • verdid
    • abysmal
    • lucid
    • betray
    • funny

    Respondents mark each word they know. Hence, we can count the number of words each respondent knows and calculate an average per major.

    import pandas as pd
    import matplotlib.pyplot as plt
    
    data = pd.read_csv('riasec.csv', delimiter='\t', low_memory=False)
    # Number of words each respondent marked as known (VCL1 to VCL16).
    data['VCL'] = data[[f'VCL{i}' for i in range(1, 17)]].sum(axis=1, skipna=False)
    view = data.loc[:, ['VCL', 'major']]
    # Normalize the free-text majors: lowercase and strip surrounding whitespace.
    view['major'] = view['major'].str.lower().str.strip()
    
    # Mean score and number of respondents per major.
    view = view.groupby('major').aggregate(['mean', 'count'])
    # Keep only majors with more than 1110 respondents, i.e. the top 10 from Step 2.
    view = view[view['VCL','count'] > 1110]
    view.loc[:,('VCL','mean')].plot(kind='barh', figsize=(14,5))
    plt.show()
    

    This results in the following chart.

    Average number of the 16 words that each major knows.

    Engineering seems to score lower than nursing. I am actually surprised that computer science scores that high.
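    To make the counting and averaging step easier to check by hand, here is a small sketch on a toy frame. The values are made up, only three of the sixteen VCL columns are included, and the output column names `mean_known` and `respondents` are chosen for this example.

```python
import pandas as pd

# Toy frame with made-up answers: 1 means the word was marked as known.
toy = pd.DataFrame({
    'major': ['nursing', 'nursing', 'engineering', 'engineering'],
    'VCL1': [1, 1, 1, 1],
    'VCL2': [1, 0, 0, 1],
    'VCL3': [0, 0, 1, 1],
})

# Build the column names instead of writing out every term of the sum;
# for the full word list this would be range(1, 17).
vcl_columns = [f'VCL{i}' for i in range(1, 4)]

# skipna=False mirrors chained '+': a missing answer makes the total NaN.
toy['VCL'] = toy[vcl_columns].sum(axis=1, skipna=False)

# Named aggregation gives flat column names instead of a MultiIndex.
summary = toy.groupby('major').agg(
    mean_known=('VCL', 'mean'),
    respondents=('VCL', 'count'),
)
print(summary)
```

    In this toy data the nursing rows sum to 2 and 1, and the engineering rows to 2 and 3, which is what the per-major means are computed from.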

    Step 4: Adding it all up together

    Let’s reuse the calculations from the previous tutorial.

    import pandas as pd
    import matplotlib.pyplot as plt
    
    def sum_dimension(data, letter):
        # Each dimension has 8 questions, e.g. R1 to R8.
        return data[[f'{letter}{i}' for i in range(1, 9)]].sum(axis=1, skipna=False)
    
    data = pd.read_csv('riasec.csv', delimiter='\t', low_memory=False)
    data['R'] = sum_dimension(data, 'R')
    data['I'] = sum_dimension(data, 'I')
    data['A'] = sum_dimension(data, 'A')
    data['S'] = sum_dimension(data, 'S')
    data['E'] = sum_dimension(data, 'E')
    data['C'] = sum_dimension(data, 'C')
    # Number of words each respondent marked as known (VCL1 to VCL16).
    data['VCL'] = data[[f'VCL{i}' for i in range(1, 17)]].sum(axis=1, skipna=False)
    view = data.loc[:, ['R', 'I', 'A', 'S', 'E', 'C', 'VCL', 'major']]
    # Normalize the free-text majors: lowercase and strip surrounding whitespace.
    view['major'] = view['major'].str.lower().str.strip()
    
    view = view.groupby('major').aggregate(['mean', 'count'])
    # Keep only majors with more than 1110 respondents, i.e. the top 10 from Step 2.
    view = view[view['VCL','count'] > 1110]
    # Plot the mean of all six dimensions (the original listed C twice and left out E).
    view.loc[:,[('R','mean'), ('I','mean'), ('A','mean'), ('S','mean'), ('E','mean'), ('C','mean')]].plot(kind='barh', figsize=(14,5))
    plt.show()
    

    This results in the following diagram.

    Correlation between major and RIASEC personality traits

    Biology scores high on I (Investigative: people who prefer to work with data), while R (Realistic: people who like to work with things) is dominated by engineering and computer science.

    Hmm… I should have noticed earlier that many respondents list education as their major.
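    Reading such patterns off a bar chart can be backed up with a quick lookup. The sketch below assumes a table of per-major means like the one plotted; the numbers are made up for illustration, and only three of the six dimensions are included.

```python
import pandas as pd

# Made-up per-major means, for illustration only.
means = pd.DataFrame(
    {
        'R': [14.0, 19.5, 18.0],
        'I': [27.0, 24.0, 23.5],
        'A': [22.0, 17.0, 18.5],
    },
    index=['biology', 'engineering', 'computer science'],
)

# idxmax() returns, for each dimension, the major with the highest mean.
top_major = means.idxmax()

# idxmax(axis=1) returns, for each major, its strongest dimension.
top_dimension = means.idxmax(axis=1)

print(top_major)
print(top_dimension)
```

    The first lookup answers "which major dominates this dimension", the second "which dimension characterizes this major"; the two questions can have different answers, so it helps to check both.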
