    Pandas: Explore Datasets by Visualization – Exploring the Holland Code (RIASEC) Test

    What will we cover in this tutorial

    We will explore a dataset from the Holland Code (RIASEC) Test, a test intended to predict careers and vocational choices based on how respondents rate a series of questions.

    In this part of the exploration, we first focus on loading the data and visualizing where the respondents come from. The dataset contains more than 145,000 responses.

    You can download the dataset here.

    Step 1: First glance at the data

    Let us first try to see what the data contains.

    Reading the codebook (the file that comes with the dataset), you can see it contains ratings of questions in the 6 RIASEC categories, followed by 3 elapsed times for the test.

    There are ratings from The Ten Item Personality Inventory, then a self-assessment of whether the respondents know 16 words. Finally, there is a list of metadata on them, such as where the respondent's network was located (which in most cases is an indicator of where the respondent was located).

    The other metadata fields are explained here.

    education			"How much education have you completed?", 1=Less than high school, 2=High school, 3=University degree, 4=Graduate degree
    urban				"What type of area did you live when you were a child?", 1=Rural (country side), 2=Suburban, 3=Urban (town, city)
    gender				"What is your gender?", 1=Male, 2=Female, 3=Other
    engnat				"Is English your native language?", 1=Yes, 2=No
    age					"How many years old are you?"
    hand				"What hand do you use to write with?", 1=Right, 2=Left, 3=Both
    religion			"What is your religion?", 1=Agnostic, 2=Atheist, 3=Buddhist, 4=Christian (Catholic), 5=Christian (Mormon), 6=Christian (Protestant), 7=Christian (Other), 8=Hindu, 9=Jewish, 10=Muslim, 11=Sikh, 12=Other
    orientation			"What is your sexual orientation?", 1=Heterosexual, 2=Bisexual, 3=Homosexual, 4=Asexual, 5=Other
    race				"What is your race?", 1=Asian, 2=Arab, 3=Black, 4=Indigenous Australian / Native American / White, 5=Other (There was a coding error in the survey, and three different options were given the same value)
    voted				"Have you voted in a national election in the past year?", 1=Yes, 2=No
    married				"What is your marital status?", 1=Never married, 2=Currently married, 3=Previously married
    familysize			"Including you, how many children did your mother have?"		
    major				"If you attended a university, what was your major (e.g. "psychology", "English", "civil engineering")?"
    
    These values were also calculated for technical information:
    uniqueNetworkLocation	1 if the record is the only one from its network location in the dataset, 2 if there are more than one record. There can be more than one record from the same network if for example that network is shared by a school etc, or it may be because of test retakes
    country	The country of the network the user connected from
    source	1=from Google, 2=from an internal link on the website, 0=from any other website or could not be determined
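    To make these coded columns easier to read later, a small sketch (using the same loading call as in Step 2 below, and assuming the column names from the codebook above) is to map the numeric codes to labels with Pandas:

    import pandas as pd

    data = pd.read_csv('riasec.csv', delimiter='\t', low_memory=False)

    # Label mappings taken from the codebook above
    gender_labels = {1: 'Male', 2: 'Female', 3: 'Other'}
    education_labels = {1: 'Less than high school', 2: 'High school',
                        3: 'University degree', 4: 'Graduate degree'}

    data['gender_label'] = data['gender'].map(gender_labels)
    data['education_label'] = data['education'].map(education_labels)
    print(data[['gender', 'gender_label', 'education', 'education_label']].head())

    This is not needed for the country counts below; it is only meant to show how the codebook translates into readable values.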
    

    Step 2: Loading the data into a DataFrame (Pandas)

    The first step is to load the data into a DataFrame. If you are new to the Pandas DataFrame, we can recommend this tutorial.

    import pandas as pd
    
    # Widen the default output so the printed DataFrame is easier to read
    pd.set_option('display.max_rows', 300)
    pd.set_option('display.max_columns', 10)
    pd.set_option('display.width', 150)
    
    # The dataset is tab-separated, hence the delimiter argument
    data = pd.read_csv('riasec.csv', delimiter='\t', low_memory=False)
    print(data)
    

    The pd.set_option calls are only there to get a richer output, compared to a very small and narrow summary. The actual loading of the data is done by pd.read_csv(…).

    Notice that we have renamed the CSV file to riasec.csv. As it is tab-separated, we need to pass the delimiter as an argument, since it does not use the default comma.

    The output from the above code is.

            R1  R2  R3  R4  R5  ...  uniqueNetworkLocation  country  source                major  Unnamed: 93
    0        3   4   3   1   1  ...                      1       US       2                  NaN          NaN
    1        1   1   2   4   1  ...                      1       US       1              Nursing          NaN
    2        2   1   1   1   1  ...                      1       US       1                  NaN          NaN
    3        3   1   1   2   2  ...                      1       CN       0                  NaN          NaN
    4        4   1   1   2   1  ...                      1       PH       0            education          NaN
    ...     ..  ..  ..  ..  ..  ...                    ...      ...     ...                  ...          ...
    145823   2   1   1   1   1  ...                      1       US       1        Communication          NaN
    145824   1   1   1   1   1  ...                      1       US       1              Biology          NaN
    145825   1   1   1   1   1  ...                      1       US       2                  NaN          NaN
    145826   3   4   4   5   2  ...                      2       US       0                  yes          NaN
    145827   2   4   1   4   2  ...                      1       US       1  Information systems          NaN
    

    Interestingly, the dataset contains an unnamed last column with no data. That is because each line ends with a tab (\t) before the newline (\n).

    We could clean that up, but as we are only interested in the country counts, we will ignore it in this tutorial.
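    If you do want to clean it up, a minimal sketch (assuming the empty column is named Unnamed: 93 as in the output above) could be:

    # Drop the empty trailing column created by the tab before the newline
    data = data.drop(columns=['Unnamed: 93'])

    # Alternatively, drop every column that is completely empty
    # data = data.dropna(axis=1, how='all')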

    Step 3: Count the occurrences of each country

    As said, in this first tutorial on the dataset we are only interested in getting an idea of where in the world the respondents come from.

    The data is located in the ‘country’ column of the DataFrame data.

    To group the data, you can use groupby(), which will return a DataFrameGroupBy object. If you apply size() on that object, it will return a Series with the size of each group.

    print(data.groupby(['country']).size())
    

    Where the first few lines are.

    country
    AD          2
    AE        507
    AF          8
    AG          7
    AL        116
    AM         10
    

    Hence, for each country we will have a count of how many respondents came from that country.
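    If you are curious which countries have the most respondents, a small follow-up sketch is to sort the Series before printing it:

    # Sort the per-country counts to see the largest groups first
    counts = data.groupby(['country']).size().sort_values(ascending=False)
    print(counts.head(10))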

    Step 4: Understand the map data we want to merge it with

    To visualize the data, we need some way to have a map.

    Here GeoPandas comes in handy. It contains a nice low-res map of the world you can use.

    Let’s just explore that.

    import geopandas
    import matplotlib.pyplot as plt
    world = geopandas.read_file(geopandas.datasets.get_path("naturalearth_lowres"))
    world.plot()
    plt.show()
    

    Which will make the following map.

    World map using GeoPandas and Matplotlib
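    Note that in newer GeoPandas releases the bundled geopandas.datasets module has been deprecated and removed, so the call above may fail. In that case, one option (a sketch, assuming you have downloaded the Natural Earth 110m countries file yourself) is to read the file directly:

    import geopandas
    import matplotlib.pyplot as plt

    # Assumed local path to the downloaded Natural Earth "admin 0 countries" shapefile
    world = geopandas.read_file("ne_110m_admin_0_countries.shp")
    world.plot()
    plt.show()

    Be aware that the column names in the raw Natural Earth file may differ from those in the bundled dataset (for example ISO_A3 instead of iso_a3).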

    This seems too easy to be true. But no, this is the reality of Python.

    We want to merge the data from our world map above with the country counts from our dataset.

    We need to see how to merge them. To do that, let us look at the data in world.

    world = geopandas.read_file(geopandas.datasets.get_path("naturalearth_lowres"))
    print(world)
    

    Where the first few lines are.

            pop_est                continent                      name iso_a3   gdp_md_est                                           geometry
    0        920938                  Oceania                      Fiji    FJI      8374.00  MULTIPOLYGON (((180.00000 -16.06713, 180.00000...
    1      53950935                   Africa                  Tanzania    TZA    150600.00  POLYGON ((33.90371 -0.95000, 34.07262 -1.05982...
    2        603253                   Africa                 W. Sahara    ESH       906.50  POLYGON ((-8.66559 27.65643, -8.66512 27.58948...
    3      35623680            North America                    Canada    CAN   1674000.00  MULTIPOLYGON (((-122.84000 49.00000, -122.9742...
    4     326625791            North America  United States of America    USA  18560000.00  MULTIPOLYGON (((-122.84000 49.00000, -120.0000...
    

    The first problem arises here. Our dataset uses 2-letter country codes, while this one uses 3-letter country codes.

    Step 5: Solving the merging problem

    Luckily we can use a library called PyCountry.
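    As a quick illustration of what PyCountry can do (a minimal sketch, assuming pycountry is installed), it can look a country up by its 2-letter code, 3-letter code, or name, and return the 3-letter code we need:

    import pycountry

    # Look up countries by their 2-letter codes and get the 3-letter codes
    print(pycountry.countries.lookup('US').alpha_3)  # USA
    print(pycountry.countries.lookup('DK').alpha_3)  # DNK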

    Let’s add this 3-letter country code to our first dataset by using a lambda function. A lambda? If you are new to lambda functions, we recommend you read this tutorial.

    import pandas as pd
    import pycountry
    
    # Helper function to map country names to alpha_3 representation - though some are not known by library
    def lookup_country_code(country):
        try:
            return pycountry.countries.lookup(country).alpha_3
        except LookupError:
            return country
    data = pd.read_csv('riasec.csv', delimiter='\t', low_memory=False)
    data['alpha3'] = data.apply(lambda row: lookup_country_code(row['country']), axis=1)
    

    Basically, we add a new column to the dataset, called ‘alpha3’, containing the 3-letter country code. We use apply with a lambda function, which in turn calls our helper function that wraps the library call.

    The reason for this is that the pycountry.countries lookup sometimes raises a LookupError. We want our program to be robust against that.

    Now the data contains a column with the countries as 3-letter codes, just like world.

    We can now merge the data together. Remember that the counts we want to merge need to be grouped on ‘alpha3’ instead, and we also want to convert the result to a DataFrame (as size() returns a Series).

    import geopandas
    import pandas as pd
    import pycountry
    
    # Helper function to map country names to alpha_3 representation - though some are not known by library
    def lookup_country_code(country):
        try:
            return pycountry.countries.lookup(country).alpha_3
        except LookupError:
            return country
    
    data = pd.read_csv('riasec.csv', delimiter='\t', low_memory=False)
    data['alpha3'] = data.apply(lambda row: lookup_country_code(row['country']), axis=1)
    country_count = data.groupby(['alpha3']).size().to_frame()
    country_count.columns = ['count']
    world = geopandas.read_file(geopandas.datasets.get_path("naturalearth_lowres"))
    map = world.merge(country_count, how='left', left_on=['iso_a3'], right_on=['alpha3'])
    print(map)
    

    The first few lines are given below.

            pop_est                continent                      name iso_a3   gdp_md_est                                           geometry    count  \
    0        920938                  Oceania                      Fiji    FJI      8374.00  MULTIPOLYGON (((180.00000 -16.06713, 180.00000...     12.0   
    1      53950935                   Africa                  Tanzania    TZA    150600.00  POLYGON ((33.90371 -0.95000, 34.07262 -1.05982...      9.0   
    2        603253                   Africa                 W. Sahara    ESH       906.50  POLYGON ((-8.66559 27.65643, -8.66512 27.58948...      NaN   
    3      35623680            North America                    Canada    CAN   1674000.00  MULTIPOLYGON (((-122.84000 49.00000, -122.9742...   7256.0   
    4     326625791            North America  United States of America    USA  18560000.00  MULTIPOLYGON (((-122.84000 49.00000, -120.0000...  80579.0   
    5      18556698                     Asia                Kazakhstan    KAZ    460700.00  POLYGON ((87.35997 49.21498, 86.59878 48.54918...     46.0   
    

    Notice that some countries do not have a count. Those are countries with no respondents.
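    If you want to list exactly which countries have no respondents, a small sketch is:

    # Show the names of countries without a count (no respondents in the dataset)
    print(map[map['count'].isna()]['name'])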

    Step 6: Ready to plot a world map

    Now to the hard part, right?

    Making a colorful map indicating the number of respondents in a given country.

    import geopandas
    import pandas as pd
    import matplotlib.pyplot as plt
    import pycountry
    import numpy as np
    
    # Helper function to map country names to alpha_3 representation - though some are not known by library
    def lookup_country_code(country):
        try:
            return pycountry.countries.lookup(country).alpha_3
        except LookupError:
            return country
    
    data = pd.read_csv('riasec.csv', delimiter='\t', low_memory=False)
    data['alpha3'] = data.apply(lambda row: lookup_country_code(row['country']), axis=1)
    country_count = data.groupby(['alpha3']).size().to_frame()
    country_count.columns = ['count']
    world = geopandas.read_file(geopandas.datasets.get_path("naturalearth_lowres"))
    map = world.merge(country_count, how='left', left_on=['iso_a3'], right_on=['alpha3'])
    map.plot('count', figsize=(10,3), legend=True)
    plt.show()
    

    It is easy. Just call plot(…) with the first argument being the column to plot. I also change the default figsize (you can play around with that), and finally I add the legend.

    The output

    Not really satisfying. The problem is that all countries except the USA have almost identical colors. Looking at the data, you will see that this is because there are so many respondents from the USA that all other countries end up at the bottom of the scale.

    What to do? Use a log-scale.

    You can actually do that directly in your DataFrame. By using the NumPy library, we can calculate the counts on a logarithmic scale.

    See the magic.

    import geopandas
    import pandas as pd
    import matplotlib.pyplot as plt
    import pycountry
    import numpy as np
    
    # Helper function to map country names to alpha_3 representation - though some are not known by library
    def lookup_country_code(country):
        try:
            return pycountry.countries.lookup(country).alpha_3
        except LookupError:
            return country
    
    data = pd.read_csv('riasec.csv', delimiter='\t', low_memory=False)
    data['alpha3'] = data.apply(lambda row: lookup_country_code(row['country']), axis=1)
    country_count = data.groupby(['alpha3']).size().to_frame()
    country_count.columns = ['count']
    country_count['log_count'] = np.log(country_count['count'])
    world = geopandas.read_file(geopandas.datasets.get_path("naturalearth_lowres"))
    map = world.merge(country_count, how='left', left_on=['iso_a3'], right_on=['alpha3'])
    map.plot('log_count', figsize=(10,3), legend=True)
    plt.show()
    

    The new magic is the added log_count column, computed with np.log(country_count[‘count’]).

    Also notice that the plot is now done on ‘log_count’.

    The final output.

    Now you see more variety across the countries’ respondents. Note that the “white” countries did not have any respondents.
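    If you want the countries without respondents to stand out instead of being left blank, GeoPandas can style missing values; a hedged sketch (assuming a reasonably recent GeoPandas version) is:

    # Color countries with no data light grey instead of leaving them blank
    map.plot('log_count', figsize=(10, 3), legend=True,
             missing_kwds={'color': 'lightgrey'})
    plt.show()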

    Read the next exploration of the dataset here.

