
    Pandas: Explore Datasets by Visualization – Exploring the Holland Code (RIASEC) Test – Part II

    What will we cover in this tutorial?

    We will continue our journey to explore a big dataset of 145,000+ respondents to a RIASEC test. If you want to follow the full journey, we recommend you read Part I first.

    In this tutorial we will find some data points that are not correct and look at a potential way to deal with them.

    Step 1: Explore the family sizes from the respondents

    In the first tutorial we looked at how the respondents were distributed around the world. Surprisingly, most countries were represented.

    From previous tutorial.

    In this tutorial we will explore the dataset further. The dataset is available here.

    import pandas as pd
    # Only to get a broader summary
    pd.set_option('display.max_rows', 300)
    pd.set_option('display.max_columns', 30)
    pd.set_option('display.width', 1000)
    
    data = pd.read_csv('riasec.csv', delimiter='\t', low_memory=False)
    print(data)
    

    Which will output the following.

            R1  R2  R3  R4  R5  R6  R7  R8  I1  I2  I3  I4  I5  I6  I7  ...  gender  engnat  age  hand  religion  orientation  race  voted  married  familysize  uniqueNetworkLocation  country  source                major  Unnamed: 93
    0        3   4   3   1   1   4   1   3   5   5   4   3   4   5   4  ...       1       1   14     1         7            1     1      2        1           1                      1       US       2                  NaN          NaN
    1        1   1   2   4   1   2   2   1   5   5   5   4   4   4   4  ...       1       1   29     1         7            3     4      1        2           3                      1       US       1              Nursing          NaN
    2        2   1   1   1   1   1   1   1   4   1   1   1   1   1   1  ...       2       1   23     1         7            1     4      2        1           1                      1       US       1                  NaN          NaN
    3        3   1   1   2   2   2   2   2   4   1   2   4   3   2   3  ...       2       2   17     1         0            1     1      2        1           1                      1       CN       0                  NaN          NaN
    4        4   1   1   2   1   1   1   2   5   5   5   3   5   5   5  ...       2       2   18     1         4            3     1      2        1           4                      1       PH       0            education          NaN
    

    If you use the slider to scroll through the output, you will find a familysize column. I got curious about how family sizes vary around the world. This dataset obviously cannot give any conclusive answer, but it could be interesting to see if there is any connection between where you are located in the world and family size.

    Step 2: Explore the distribution of family sizes

    Datasets often contain inaccurate data.

    To get a feeling of the data in the column familysize, you can explore it by running this.

    import pandas as pd
    
    data = pd.read_csv('riasec.csv', delimiter='\t', low_memory=False)
    print(data['familysize'].describe())
    print(pd.cut(data['familysize'], bins=[0,1,2,3,4,5,6,7,10,100, 1000000000]).value_counts())
    

    Resulting in the following from the describe output.

    count    1.458280e+05
    mean     1.255801e+05
    std      1.612271e+07
    min      0.000000e+00
    25%      2.000000e+00
    50%      3.000000e+00
    75%      3.000000e+00
    max      2.147484e+09
    Name: familysize, dtype: float64
    

    The mean family size is 125,580. Well, maybe we don’t count family size the same way, but something is wrong there: the median is 3, while the maximum is above 2 billion.

    Grouping the data into bins (by using the cut function combined with value_counts) you get this output.

    (1, 2]               51664
    (2, 3]               38653
    (3, 4]               18729
    (0, 1]               15901
    (4, 5]                8265
    (5, 6]                3932
    (6, 7]                1928
    (7, 10]               1904
    (10, 100]              520
    (100, 1000000000]       23
    Name: familysize, dtype: int64
    

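    Which indicates 23 family sizes between 100 and 1,000,000,000. Values above the last bin edge, such as the maximum from describe, fall into no bin and are not counted here. Let’s investigate all sizes above 100.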

    print(data[data['familysize'] > 100]['familysize'])
    

    Giving us this output.

    1212      2147483647
    3114      2147483647
    5770      2147483647
    8524             104
    9701             103
    21255     2147483647
    24003            999
    26247     2147483647
    27782     2147483647
    31451           9999
    39294           9045
    39298          84579
    49033            900
    54592            232
    58773     2147483647
    74745      999999999
    78643            123
    92457            999
    95916            908
    102680           666
    109429           989
    111488       9234785
    120489          5000
    120505     123456789
    122580          5000
    137141           394
    139226          3425
    140377           934
    142870    2147483647
    145686           377
    145706           666
    Name: familysize, dtype: int64
    

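    The integer 2147483647 is interesting, as it is the largest value a signed 32-bit integer can hold (2^31 - 1), so it is probably an artifact of how the answer was stored rather than a real family size. I think it is safe to say that most family sizes given above 100 are not realistic.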
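To double-check the two observations above, here is a minimal sketch on a handful of made-up values (the `sizes` Series is hypothetical, not taken from the dataset). It confirms the 32-bit limit with NumPy and shows that `pd.cut` assigns no bin to values above the last bin edge.

```python
import numpy as np
import pandas as pd

# The largest value a signed 32-bit integer can hold.
INT32_MAX = np.iinfo(np.int32).max

# Hypothetical sample values, not taken from the dataset.
sizes = pd.Series([2, 3, 150, INT32_MAX])

# Same bin edges as in the tutorial; the last edge is 1,000,000,000.
bins = [0, 1, 2, 3, 4, 5, 6, 7, 10, 100, 1000000000]
binned = pd.cut(sizes, bins=bins)

# Values above the last edge fall into no bin and become NaN,
# so value_counts() leaves them out without any warning.
outside = int(binned.isna().sum())
above_100 = int((sizes > 100).sum())
print(INT32_MAX, outside, above_100)
```

This is why the last bin reports 23 while the listing of sizes above 100 has 31 rows: the 8 rows equal to 2147483647 lie beyond the last bin edge.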

    Step 3: Clean the data

    You need to decide what to do with these data points, as they skew your data.

    If you just visualize the data without any adjustment, you get a misleading picture.

    Iceland? What’s up?

    It seems like Iceland has a tradition for big families.

    Let’s investigate that.

    print(data[data['country'] == 'IS']['familysize'])
    

    Interestingly, it gives only one line, and that value does not seem correct.

    74745     999999999
    

    But as there are only a few respondents from Iceland, this single value makes its average the highest.
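A small sketch of the effect, using made-up numbers (the `toy` frame below is hypothetical and not taken from the dataset): with few respondents, one extreme value dominates the mean of a group, while the median barely moves.

```python
import pandas as pd

# Hypothetical toy data, not taken from the dataset.
toy = pd.DataFrame({
    'country': ['IS', 'IS', 'IS', 'US', 'US', 'US', 'US'],
    'familysize': [2, 3, 999999999, 2, 3, 3, 4],
})

grouped = toy.groupby('country')['familysize']

# The mean is pulled far away by the single extreme value in 'IS'.
means = grouped.mean()
# The median is a more robust summary and stays at a plausible size.
medians = grouped.median()

print(means)
print(medians)
```

Comparing the mean and the median per country is a quick way to spot groups that are dominated by a few bad values.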

    To clean the data fully, we can make the decision that family sizes of 10 or more are not correct. I know, that might be set a bit low and you can choose to do something different.

    Cleaning the data is simple.

    data = data[data['familysize'] < 10]
    

    Magic, right? You simply write a condition, which pandas evaluates element-wise on the whole column. The result is a boolean mask, and indexing with it keeps only the rows of data that fulfill the condition.
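To see what happens under the hood, here is a minimal sketch on a tiny made-up frame (the `toy` values are hypothetical). It splits the one-liner into its two steps: building the mask and indexing with it.

```python
import pandas as pd

# Hypothetical toy data, not taken from the dataset.
toy = pd.DataFrame({'familysize': [1, 3, 9, 10, 2147483647]})

# The comparison runs element-wise and yields a boolean Series (the mask).
mask = toy['familysize'] < 10
print(mask.tolist())

# Indexing with the mask keeps only the rows where the mask is True.
cleaned = toy[mask]
print(cleaned['familysize'].tolist())
```

Note that the strict comparison drops a family size of exactly 10 as well.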

    Step 4: Visualize the data

    We will use geopandas, matplotlib and pycountry to visualize it. The process is similar to the one in the previous tutorial, where you can find more details.

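    import geopandas
    import pandas as pd
    import matplotlib.pyplot as plt
    import pycountry
    
    # Helper function to map a country code or name to its alpha_3 code.
    # Values the library does not know are returned unchanged.
    def lookup_country_code(country):
        try:
            return pycountry.countries.lookup(country).alpha_3
        except LookupError:
            return country
    
    data = pd.read_csv('riasec.csv', delimiter='\t', low_memory=False)
    
    data['alpha3'] = data['country'].apply(lookup_country_code)
    # Drop the unrealistic family sizes before averaging
    data = data[data['familysize'] < 10]
    # Average only the column we plot; averaging text columns such as 'major' fails in recent pandas versions
    country_mean = data.groupby('alpha3', as_index=False)['familysize'].mean()
    # Note: geopandas.datasets was removed in GeoPandas 1.0. On newer versions,
    # pass the path of a downloaded Natural Earth countries file to read_file instead.
    world = geopandas.read_file(geopandas.datasets.get_path("naturalearth_lowres"))
    # Named world_map to avoid shadowing the built-in map
    world_map = world.merge(country_mean, how='left', left_on='iso_a3', right_on='alpha3')
    world_map.plot('familysize', figsize=(12,4), legend=True)
    plt.show()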
    

    Resulting in the following output.

    Family sizes of the respondents

    Looks like there is a one-child policy in China? Again, do not draw any conclusions from this data, as it covers this aspect only narrowly.
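Countries without a color on such a map either have no respondents left after cleaning or carry a code that does not match iso_a3. Here is a minimal sketch of how to list both cases with the indicator option of merge. The two frames are hypothetical stand-ins for world and country_mean, with made-up codes and values, and plain pandas is used so it runs without geopandas.

```python
import pandas as pd

# Hypothetical stand-ins for the world map and the per-country means.
world = pd.DataFrame({'iso_a3': ['USA', 'CHN', 'ISL', 'GRL']})
country_mean = pd.DataFrame({
    'alpha3': ['USA', 'CHN', 'ISL', 'ZZ'],  # 'ZZ' stands for an unresolved code
    'familysize': [2.6, 1.9, 2.8, 3.0],     # made-up values
})

# indicator=True adds a '_merge' column telling where each row came from.
merged = world.merge(country_mean, how='left',
                     left_on='iso_a3', right_on='alpha3', indicator=True)

# 'left_only' marks map countries that received no family size.
no_data = merged.loc[merged['_merge'] == 'left_only', 'iso_a3'].tolist()

# Codes in the survey data that match no country on the map.
unmatched = sorted(set(country_mean['alpha3']) - set(world['iso_a3']))
print(no_data, unmatched)
```

The unmatched codes are the ones that the lookup helper returned unchanged, so they are good candidates for a manual mapping.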

