What will we cover in this tutorial?
We will continue our journey to explore a big dataset of 145,000+ respondents to a RIASEC test. If you want to explore the full journey, we recommend you read this tutorial first.
In this tutorial we will find some data points that are not correct and look at one potential way to deal with them.
Step 1: Explore the family sizes from the respondents
In the first tutorial we looked at how the respondents were distributed around the world. Surprisingly, most countries were represented.

In this tutorial we will explore the dataset further. The dataset is available here.
import pandas as pd
# Only to get a broader summary
pd.set_option('display.max_rows', 300)
pd.set_option('display.max_columns', 30)
pd.set_option('display.width', 1000)
data = pd.read_csv('riasec.csv', delimiter='\t', low_memory=False)
print(data)
Which will output the following.
R1 R2 R3 R4 R5 R6 R7 R8 I1 I2 I3 I4 I5 I6 I7 ... gender engnat age hand religion orientation race voted married familysize uniqueNetworkLocation country source major Unnamed: 93
0 3 4 3 1 1 4 1 3 5 5 4 3 4 5 4 ... 1 1 14 1 7 1 1 2 1 1 1 US 2 NaN NaN
1 1 1 2 4 1 2 2 1 5 5 5 4 4 4 4 ... 1 1 29 1 7 3 4 1 2 3 1 US 1 Nursing NaN
2 2 1 1 1 1 1 1 1 4 1 1 1 1 1 1 ... 2 1 23 1 7 1 4 2 1 1 1 US 1 NaN NaN
3 3 1 1 2 2 2 2 2 4 1 2 4 3 2 3 ... 2 2 17 1 0 1 1 2 1 1 1 CN 0 NaN NaN
4 4 1 1 2 1 1 1 2 5 5 5 3 5 5 5 ... 2 2 18 1 4 3 1 2 1 4 1 PH 0 education NaN
If you scroll through the columns, you will find one called familysize. I got curious about how family sizes vary around the world. This dataset is obviously not conclusive on the matter, but it could be interesting to see whether there is any connection between where you are located in the world and family size.
Step 2: Explore the distribution of family sizes
Datasets often contain inaccurate data.
To get a feel for the data in the column familysize, you can explore it by running this.
import pandas as pd
data = pd.read_csv('riasec.csv', delimiter='\t', low_memory=False)
print(data['familysize'].describe())
print(pd.cut(data['familysize'], bins=[0,1,2,3,4,5,6,7,10,100, 1000000000]).value_counts())
Resulting in the following from the describe output.
count 1.458280e+05
mean 1.255801e+05
std 1.612271e+07
min 0.000000e+00
25% 2.000000e+00
50% 3.000000e+00
75% 3.000000e+00
max 2.147484e+09
Name: familysize, dtype: float64
Here the mean family size is 125,580, while the median (the 50% row) is 3. Well, maybe we don’t count family size the same way, but something is wrong there.
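To see why a mean can end up so far from the typical value, here is a minimal sketch with made-up numbers (the toy Series below is an assumption for illustration, not taken from the dataset). A single extreme entry is enough to pull the mean far away, while the median barely moves.

```python
import pandas as pd

# Hypothetical toy data: five plausible family sizes plus one extreme value
sizes = pd.Series([2, 3, 3, 4, 5, 2147483647], name="familysize")

mean_size = sizes.mean()      # pulled far upward by the single extreme value
median_size = sizes.median()  # middle of the sorted values, barely affected

print(f"mean:   {mean_size:,.0f}")
print(f"median: {median_size}")
```

Comparing the two numbers is a quick first check: when they disagree by orders of magnitude, a handful of extreme entries is a likely cause.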
Grouping the data into bins (by using the cut function combined with value_counts) you get this output.
(1, 2] 51664
(2, 3] 38653
(3, 4] 18729
(0, 1] 15901
(4, 5] 8265
(5, 6] 3932
(6, 7] 1928
(7, 10] 1904
(10, 100] 520
(100, 1000000000] 23
Name: familysize, dtype: int64
Which indicates 23 family sizes in the range (100, 1000000000]. Let’s list every family size greater than 100.
print(data[data['familysize'] > 100]['familysize'])
Giving us this output.
1212 2147483647
3114 2147483647
5770 2147483647
8524 104
9701 103
21255 2147483647
24003 999
26247 2147483647
27782 2147483647
31451 9999
39294 9045
39298 84579
49033 900
54592 232
58773 2147483647
74745 999999999
78643 123
92457 999
95916 908
102680 666
109429 989
111488 9234785
120489 5000
120505 123456789
122580 5000
137141 394
139226 3425
140377 934
142870 2147483647
145686 377
145706 666
Name: familysize, dtype: int64
The integer 2147483647 is interesting as it is the maximum value of a signed 32-bit integer. I think it is safe to say that most of the family sizes given above 100 are not realistic.
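The listing above has 31 rows, while the top bin only counted 23. The difference is the eight entries equal to 2147483647, which lie above the last bin edge of 1000000000. The sketch below uses a made-up Series (an assumption for illustration) to show that pd.cut turns values outside all bins into NaN, and value_counts skips them.

```python
import numpy as np
import pandas as pd

INT32_MAX = np.iinfo(np.int32).max  # largest signed 32-bit integer

# Hypothetical toy values: two ordinary sizes, one large entry, one at INT32_MAX
sizes = pd.Series([2, 3, 999, INT32_MAX])

binned = pd.cut(sizes, bins=[0, 100, 1_000_000_000])

# Values outside every bin become NaN and are skipped by value_counts()
outside_bins = int(binned.isna().sum())
above_100 = int((sizes > 100).sum())

print(binned.value_counts())
print("outside all bins:", outside_bins)
print("greater than 100:", above_100)
```

The same applies at the lower end: with a first edge of 0, a family size of exactly 0 falls outside the interval (0, 1] and is not counted either.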
Step 3: Clean the data
You need to make a decision on these data points that seem to skew your data in a wrong way.
If you visualize the data without any adjustment, it gives a misleading picture.

It seems like Iceland has a tradition for big families.
Let’s investigate that.
print(data[data['country'] == 'IS']['familysize'])
Interestingly, it gives only one line that does not seem correct.
74745 999999999
But as there are only a few respondents from Iceland, that single value is enough to make its average the highest.
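One way to spot this kind of effect is to look at the number of respondents next to the mean for each country. The following is a minimal sketch on made-up data (the rows in the toy frame are assumptions, not values from the dataset).

```python
import pandas as pd

# Hypothetical toy data; the sizes are made up for illustration
toy = pd.DataFrame({
    "country": ["IS", "IS", "IS", "US", "US", "US", "US", "US"],
    "familysize": [2, 3, 999999999, 2, 3, 3, 4, 2],
})

# Count, mean and median per country, sorted by the mean
summary = (
    toy.groupby("country")["familysize"]
       .agg(["count", "mean", "median"])
       .sort_values("mean", ascending=False)
)
print(summary)
```

A country with a small count and a mean far above its median is a good candidate for a closer look.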
To clean the data fully, we can make the decision that family sizes of 10 or more are not correct. I know, that threshold might be a bit low, and you can choose a different one.
Cleaning the data is simple.
data = data[data['familysize'] < 10]
Magic, right? The comparison is evaluated for every row at once (it is vectorized) and produces a series of True and False values. Indexing the DataFrame with that series keeps only the rows that fulfill the condition.
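To make the filtering step more concrete, here is a small sketch on a made-up frame (the values are assumptions for illustration) that separates the comparison from the row selection.

```python
import pandas as pd

# Hypothetical toy data standing in for the real survey frame
toy = pd.DataFrame({"familysize": [0, 1, 3, 9, 10, 999, 2147483647]})

mask = toy["familysize"] < 10   # element-wise comparison -> boolean Series
cleaned = toy[mask]             # keeps only rows where the mask is True

print(mask.tolist())
print(cleaned["familysize"].tolist())
print(f"dropped {len(toy) - len(cleaned)} of {len(toy)} rows")
```

Note that a reported size of 0 passes this filter. If you decide that zero is not plausible either, `toy["familysize"].between(1, 9)` would be one way to build a mask that excludes it.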
Step 4: Visualize the data
We will use geopandas, matplotlib and pycountry to visualize it. The process is similar to the one in the previous tutorial, where you can find more details.
import geopandas
import pandas as pd
import matplotlib.pyplot as plt
import pycountry
# Helper function to map country names to alpha_3 representation - though some are not known by library
def lookup_country_code(country):
    try:
        return pycountry.countries.lookup(country).alpha_3
    except LookupError:
        return country
data = pd.read_csv('riasec.csv', delimiter='\t', low_memory=False)
data['alpha3'] = data.apply(lambda row: lookup_country_code(row['country']), axis=1)
data = data[data['familysize'] < 10]
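# numeric_only=True skips text columns such as country and major, which newer pandas versions refuse to average
country_mean = data.groupby(['alpha3']).mean(numeric_only=True)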
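# Note: geopandas.datasets was removed in GeoPandas 1.0; on newer versions, download the Natural Earth data yourself and pass its path to read_file
world = geopandas.read_file(geopandas.datasets.get_path("naturalearth_lowres"))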
# Named world_map to avoid shadowing Python's built-in map
world_map = world.merge(country_mean, how='left', left_on=['iso_a3'], right_on=['alpha3'])
world_map.plot('familysize', figsize=(12,4), legend=True)
plt.show()
Resulting in the following output.

Looks like there is a one-child policy in China? Again, do not draw any conclusions from this data, as it is far too narrow in this respect.
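Because the helper function returns the original value for countries that pycountry does not know, some rows may not match any country on the map, and some map countries may have no survey data. The sketch below shows one way to check this with the indicator option of merge. It uses small made-up pandas frames in place of the real map and survey tables, an outer merge instead of the left merge used above, and a made-up code 'A1' standing in for an unknown country; all of these are assumptions for illustration.

```python
import pandas as pd

# Hypothetical stand-ins: 'world' mimics the map table, 'country_mean' the survey means
world = pd.DataFrame({
    "iso_a3": ["USA", "CHN", "ISL"],
    "name": ["United States", "China", "Iceland"],
})
country_mean = pd.DataFrame({
    "alpha3": ["USA", "CHN", "A1"],   # 'A1' is a made-up code with no match on the map
    "familysize": [2.6, 1.9, 3.0],
})

# indicator=True adds a _merge column telling where each row came from
merged = world.merge(
    country_mean, how="outer", left_on="iso_a3", right_on="alpha3", indicator=True
)

no_survey_data = merged.loc[merged["_merge"] == "left_only", "iso_a3"].tolist()
not_on_map = merged.loc[merged["_merge"] == "right_only", "alpha3"].tolist()

print("on the map but without survey data:", no_survey_data)
print("in the survey but not on the map:", not_on_map)
```

Countries in the first list would show up without a color on the plot, and rows in the second list are silently left out of it.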
