## What will we cover?

The goal is to learn about Supervised Learning and explore how to use it for classification.

This includes learning

• What is Supervised Learning
• Understand the classification problem
• What is the Perceptron classifier
• How to use the Perceptron classifier as a linear classifier

## Step 1: What is Supervised Learning?

Supervised learning (SL) is the machine learning task of learning a function that maps an input to an output based on example input-output pairs

wikipedia.org

Said differently, if you have some items you need to classify, it could be books you want to put in categories, say fiction, non-fiction, etc.

Then if you were given a pile of books with the right categories given to them, how can you make a function (the machine learning model), which on other books without labels can guess the right category.

Supervised learning simply means, that in the learning phase, the algorithm (the one creating the model) is given examples with correct labels.

Notice, that supervised learning does not only restrict to classification problems, but it could predict anything.

## Step 2: What is the classification problem?

The classification problem is a supervised learning task of getting a function mapping an input point to a discrete category.

There is binary classification and multiclass classification, where the binary maps into two classes, and the multi classmaps into 3 or more classes.

I find it easiest to understand with examples.

Assume we want to predict if will rain or not rain tomorrow. This is a binary classification problem, because we map into two classes: rain or no rain.

To train the model we need already labelled historic data.

Hence, the task is given rows of historic data with correct labels, train a machine learning model (a Linear Classifier in this case) with this data. Then after that, see how good it can predict future data (without the right class label).

## Step 3: Linear Classification explained mathematically and visually

Some like the math behind an algorithm. If you are not one of them, focus on the visual part – it will give you the understanding you need.

The task of Supervised Learning mathematically can be explained simply with the example data above to find a function f(humidity, pressure) to predict rain or no rain.

Examples

• f(93, 000.7) = rain
• f(49, 1015.5) = no rain
• f(79, 1031.1) = no rain

The goal of Supervised Learning is to approximate the function f – the approximation function is often denoted h.

Why not identify f precisely? Well, because it is not ideal, as this would be an overfitted function, that would predict the historic data 100% accurate, but would fail to predict future values very well.

As we work with Linear Classifiers, we want the function to be linear.

That is, we want the approximation function h, to be on the form.

• x_1: Humidity
• x_2: Pressure
• h(x_1, x_2) = w_0 + w_1*x_1 + w_2*x_2

Hence, the goal is to optimize values w_0, w_1, w_2, to find the best classifier.

What does all this math mean?

Well, that it is a linear classifier that makes decisions based on the value of a linear combination of the characteristics.

The above diagram shows how it would classify with a line whether it will predict rain or not. On the left side, this is the data classified from historic data, and the line shows an optimized line done by the machine learning algorithm.

On the right side, we have a new input data (without label), then with this line, it would classify it as rain (assuming blue means rain).

## Step 4: What is the Perceptron Classifier?

The Perceptron Classifier is a linear algorithm that can be applied to binary classification.

It learns iteratively by adding new knowledge to an already existing line.

The learning rate is given by alpha, and the learning rule is as follows (don’t worry if you don’t understand it – it is not important).

• Given data point x and y update each weight according to this.
• w_i = w_i + alpha*(y – h_w(x)) X x_i

The rule can also be stated as follows.

• w_i = w_i + alpha(actual value – estimated value) X x_i

Said in words, it adjusted the values according to the actual values. Every time a new values comes, it adjusts the weights to fit better accordingly.

Given the line after it has been adjusted to all the training data – then it is ready to predict.

Let’s try this on real data.

## Step 5: Get the Weather data we will use to train a Perceptron model with

You can get all the code in a Jupyter Notebook with the csv file here.

This can be downloaded from the GitHub in a zip file by clicking here.

First let’s just import all the libraries used.

```import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.linear_model import Perceptron
import matplotlib.pyplot as plt
```

Notice that in the Notebook we have an added line %matplotlib inline, which you should add if you run in a Notebook. The code here will be aligned with PyCharm or a similar IDE.

```data = pd.read_csv('files/weather.csv', parse_dates=True, index_col=0)
```

If you want to read the data directly from GitHub and not download the weather.csv file, you can do that as follows.

```data = pd.read_csv('https://raw.githubusercontent.com/LearnPythonWithRune/MachineLearningWithPython/main/files/weather.csv', parse_dates=True, index_col=0)
```

This will result in an output similar to this.

```            MinTemp  MaxTemp  Rainfall  ...  RainToday  RISK_MM RainTomorrow
Date                                    ...
2008-02-01     19.5     22.4      15.6  ...        Yes      6.0          Yes
2008-02-02     19.5     25.6       6.0  ...        Yes      6.6          Yes
2008-02-03     21.6     24.5       6.6  ...        Yes     18.8          Yes
2008-02-04     20.2     22.8      18.8  ...        Yes     77.4          Yes
2008-02-05     19.7     25.7      77.4  ...        Yes      1.6          Yes
```

## Step 6: Select features and Clean the Weather data

We want to investigate the data and figure out how much missing data there.

A great way to do that is to use isnull().

```print(data.isnull().sum())
```

This results in the following output.

```MinTemp             3
MaxTemp             2
Rainfall            6
Evaporation        51
Sunshine           16
WindGustDir      1036
WindGustSpeed    1036
WindDir9am         56
WindDir3pm         33
WindSpeed9am       26
WindSpeed3pm       25
Humidity9am        14
Humidity3pm        13
Pressure9am        20
Pressure3pm        19
Cloud9am          566
Cloud3pm          561
Temp9am             4
Temp3pm             4
RainToday           6
RISK_MM             0
RainTomorrow        0
dtype: int64
```

This shows how many rows in each column has null value (missing values). We want to work only with a two features (columns), to keep our classification simple. Obviously, we need to keep RainTomorrow, as that is carrying the label of the class.

We select the features we want and drop the rows with null-values as follows.

```dataset = data[['Humidity3pm', 'Pressure3pm', 'RainTomorrow']].dropna()
```

## Step 7: Split into trading and test data

The next step we need to do is to split the dataset into a features and labels.

But we also want to rename the labels from No and Yes to be numeric.

```X = dataset[['Humidity3pm', 'Pressure3pm']]
y = dataset['RainTomorrow']
y = np.array([0 if value == 'No' else 1 for value in y])
```

Then we do the splitting as follows, where we but a random_state in order to be able to reproduce. This is often a great idea, if you randomness and encounter a problem, then you can reproduce it.

```X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
```

This has divided the features into a train and test set (X_train, X_test), and the labels into a train and test (y_train, y_test) dataset.

## Step 8: Train the Perceptron model and measure accuracy

Finally we want to create the model, fit it (train it), predict on the training data, and print the accuracy score.

```clf = Perceptron(random_state=0)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print(accuracy_score(y_test, y_pred))
```

This gives an accuracy of 0.773 or 77,3% accuracy.

Is that good?

Well what if it rains 22.7% of the time? And the model always predicts No rain?

Well, then it is correct 77.3% of the time.

Let’s just check for that.

Well, it is not raining in 74.1% of the time.

```print(sum(y == 0)/len(y))
```

Is that a good model? Well, I find the binary classifiers a bit tricky because of this problem. The best way to get an idea is to visualize it.

## Step 9: Visualize the model predictions

To visualize the data we can do the following.

```fig, ax = plt.subplots()
X_data = X.to_numpy()
y_all = clf.predict(X_data)
ax.scatter(x=X_data[:,0], y=X_data[:,1], c=y_all, alpha=.25)
plt.show()
```

This results in the following output.

Finally, let’s visualize the actual data to compare.

```ax.scatter(x=X_data[:,0], y=X_data[:,1], c=y, alpha=.25)
plt.show()
```

Resulting in.

Here is the full code.

```import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.linear_model import Perceptron
import matplotlib.pyplot as plt
print(data.isnull().sum())
dataset = data[['Humidity3pm', 'Pressure3pm', 'RainTomorrow']].dropna()
X = dataset[['Humidity3pm', 'Pressure3pm']]
y = dataset['RainTomorrow']
y = np.array([0 if value == 'No' else 1 for value in y])
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = Perceptron(random_state=0)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print(accuracy_score(y_test, y_pred))
print(sum(y == 0)/len(y))
fig, ax = plt.subplots()
X_data = X.to_numpy()
y_all = clf.predict(X_data)
ax.scatter(x=X_data[:,0], y=X_data[:,1], c=y_all, alpha=.25)
plt.show()
fig, ax = plt.subplots()
ax.scatter(x=X_data[:,0], y=X_data[:,1], c=y, alpha=.25)
plt.show()
```

This is part of a FREE 10h Machine Learning course with Python.

• 15 video lessons – which explain Machine Learning concepts, demonstrate models on real data, introduce projects and show a solution (YouTube playlist).
• 30 JuPyter Notebooks – with the full code and explanation from the lectures and projects (GitHub).
• 15 projects – with step guides to help you structure your solutions and solution explained in the end of video lessons (GitHub).

## What will we cover?

We will demonstrate how to read CSV data from a GitHub. How to group the data by unique values in a column and sum it. Then how to group and sum data on a monthly basis. Finally, how to export this into a multiple sheet Excel document with chart.

## Step 1: Get and inspect the data

We can use pandas to read the CSV data (see more about CSV files here).

```import pandas as pd

url = 'https://raw.githubusercontent.com/LearnPythonWithRune/LearnPython/main/files/SalesData.csv'
data = pd.read_csv(url, delimiter=';', parse_dates=True, index_col='Date')

```

This will read our data directly from GitHub and show the first few lines.

```            Sales rep        Item  Price  Quantity  Sale
Date
2020-05-31        Mia     Markers      4         1     4
2020-02-01        Mia  Desk chair    199         2   398
2020-09-21     Oliver       Table   1099         2  2198
2020-07-15  Charlotte    Desk pad      9         2    18
2020-05-27       Emma        Book     12         1    12
```

This data shows different sales represents and a list over their sales in 2020.

## Step 2: Use GroupBy to get sales of each represent and monthly sales

It is easy to group data by columns. The below code will first group all the Sales reps and sum their sales. Second, it will group the data in months and sum it.

```repr_sales = data.groupby("Sales rep").sum()['Sale']
print(repr_sales)

monthly_sale = data.groupby(pd.Grouper(freq='M')).sum()['Sale']
monthly_sale.index = monthly_sale.index.month_name()
print(monthly_sale)
```

This gives.

```Sales rep
Charlotte     74599
Emma          65867
Ethan         40970
Liam          66989
Mia           88199
Noah          78575
Oliver        89355
Sophia       103480
William       80400
Name: Sale, dtype: int64
```
```Date
January      69990
February     51847
March        67500
April        58401
May          40319
June         59397
July         64251
August       51571
September    55666
October      50093
November     57458
December     61941
Name: Sale, dtype: int64
```

## Step 3: Create a multiple sheet Excel document with charts

Now for the export magic.

```workbook = pd.ExcelWriter("SalesReport.xlsx")
repr_sales.to_excel(workbook, sheet_name='Sales per rep')
monthly_sale.to_excel(workbook, sheet_name='Monthly')

# Configure the first series.
'name':       'Sales per rep',
'categories': '=\'Sales per rep\'!\$A\$2:\$A\$10',
'values':     '=\'Sales per rep\'!\$B\$2:\$B\$10',
})

workbook.sheets['Sales per rep'].insert_chart('D2', chart1)

# Configure the first series.
'name':       'Monthly sales',
'categories': '=Monthly!\$A\$2:\$A\$13',
'values':     '=Monthly!\$B\$2:\$B\$13',
})

workbook.sheets['Monthly'].insert_chart('D2', chart1)

workbook.close()
```

This will create an Excel document called SalesReport.xlsx in your working directory.

To get a detailed explanation see the video in the top of the post.

Want to learn more Python, then this is part of a 8 hours FREE video course with full explanations, projects on each levels, and guided solutions.

The course is structured with the following resources to improve your learning experience.

• 17 video lessons teaching you everything you need to know to get started with Python.
• 34 Jupyter Notebooks with lesson code and projects.
• A FREE 70+ pages eBook with all the learnings from the lessons.

See the full FREE course page here.

Then check out my Machine Learning with Python course.

• 15 video lessons teaching you all aspects of Machine Learning
• 30 JuPyter Notebooks with lesson code and projects
• 10 hours FREE video content to support your learning journey.

Go to the course page for details.

## What will we cover in this tutorial?

We will continue our journey to explore a big dataset of 145,000+ respondents to a RIASEC test. If you want to explore the full journey, we recommend you read this tutorial first as well as the second part of the tutorial, and finally, the third part before continuing.

In this part we will investigate if we can see any correlation between the major of education and the 6 dimensions of the personality types in RIASEC.

## Step 1: Group into major of educations

This is getting tricky, as the majors are typed in by the respondent. We will be missing some connections between them.

But let’s start by exploring them.

```import pandas as pd

major = data.loc[:,['major']]

print(major.groupby('major').size().sort_values(ascending=False))
```

The output is given here.

```major
psychology                6861
Psychology                5763
English                   2342
Biology                   1289
...
Sociology, Social work       1
Sociology, Psychology        1
Sociology, Math              1
Sociology, Linguistics       1
Nuerobiology                 1
Length: 15955, dtype: int64
```

Where we identify one problem, that some write with lowercase and others with uppercase.

## Step 2: Clean up a few ambiguities

The first step would be to lowercase everything.

```import pandas as pd

major = data.loc[:,['major']]
major['major'] = major['major'].str.lower()
print(major.groupby('major').size().sort_values(ascending=False).iloc[:10])
```

Now printing the 10 first lines.

```major
psychology          12766
english              3042
nursing              2142
biology              1961
education            1800
engineering          1353
accounting           1186
computer science     1159
psychology           1098
dtype: int64
```

Where we notice that psychology is the first and last. Inspecting it further, it seems the the last one has a space after it. Hence, we can try to remove whitespaces around all educations.

```import pandas as pd
import numpy as np

major = data.loc[:,['major']]
major['major'] = major['major'].str.lower()
major['major'] = major.apply(lambda row: row['major'].strip() if row['major'] is not np.nan else np.nan, axis=1)

print(major.groupby('major').size().sort_values(ascending=False).iloc[:10])
```

Now the output is as follows.

```major
psychology          13878
english              3240
nursing              2396
biology              2122
education            1954
engineering          1504
accounting           1292
computer science     1240
law                  1111
dtype: int64
```

Introducing law at the bottom of the list.

This process could continue, but let’s keep the focus on these 10 highest representative educations in the dataset. Obviously, further data points could be added if investigating it further.

## Step 3: See if education correlates to known words

First let’s explore the dataset a bit more. The respondents are asked if they know the definitions of the following words.

• boat
• incoherent
• pallid
• robot
• audible
• cuivocal
• paucity
• epistemology
• florted
• decide
• pastiche
• verdid
• abysmal
• lucid
• betray
• funny

Each word they know they mark. Hence, we can count the number of words each respondent knows and calculate an average per major group.

```import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

data['VCL'] = data['VCL1'] + data['VCL2'] + data['VCL3'] + data['VCL4'] + data['VCL5'] + data['VCL6'] + data['VCL7'] + data['VCL8'] + data['VCL9'] + data['VCL10'] + data['VCL11'] + data['VCL12'] + data['VCL13'] + data['VCL14'] + data['VCL15'] + data['VCL16']

view = data.loc[:, ['VCL', 'major']]
view['major'] = view['major'].str.lower()
view['major'] = view.apply(lambda row: row['major'].strip() if row['major'] is not np.nan else np.nan, axis=1)

view = view.groupby('major').aggregate(['mean', 'count'])
view = view[view['VCL','count'] > 1110]
view.loc[:,('VCL','mean')].plot(kind='barh', figsize=(14,5))
plt.show()
```

Which results in the following output.

The Engineers seem to score lower than nursing. Well, I am actually surprised that Computer Science scores that high.

## Step 4: Adding it all up together

Let’s use what we did in previous tutorial and use the calculations from there.

```import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

def sum_dimension(data, letter):
return data[letter + '1'] + data[letter + '2'] + data[letter + '3'] + data[letter + '4'] + data[letter + '5'] + data[letter + '6'] + data[letter + '7'] + data[letter + '8']

data['R'] = sum_dimension(data, 'R')
data['I'] = sum_dimension(data, 'I')
data['A'] = sum_dimension(data, 'A')
data['S'] = sum_dimension(data, 'S')
data['E'] = sum_dimension(data, 'E')
data['C'] = sum_dimension(data, 'C')
data['VCL'] = data['VCL1'] + data['VCL2'] + data['VCL3'] + data['VCL4'] + data['VCL5'] + data['VCL6'] + data['VCL7'] + data['VCL8'] + data['VCL9'] + data['VCL10'] + data['VCL11'] + data['VCL12'] + data['VCL13'] + data['VCL14'] + data['VCL15'] + data['VCL16']

view = data.loc[:, ['R', 'I', 'A', 'S', 'E', 'C', 'VCL', 'major']]
view['major'] = view['major'].str.lower()
view['major'] = view.apply(lambda row: row['major'].strip() if row['major'] is not np.nan else np.nan, axis=1)

view = view.groupby('major').aggregate(['mean', 'count'])
view = view[view['VCL','count'] > 1110]
view.loc[:,[('R','mean'), ('I','mean'),('A','mean'), ('S','mean'),('C','mean'), ('C','mean')]].plot(kind='barh', figsize=(14,5))
plt.show()
```

Which results in the following diagram.

Biology has high I (Investigative, people that prefer to work with data). While the R (Realistic, People who like to work with things) is dominated by Engineers and Computer Scientist.

Hmm… I should have noticed that many have major education.

## What will we cover in this tutorial?

We will continue our journey to explore a big dataset of 145,000+ respondents to a RIASEC test. If you want to explore the full journey, we recommend you read this tutorial first as well as the second part of the tutorial.

In this part we are going to combine some data into 6 dimensions of personality types of the RIASEC and see it there is any correlation with the educational level.

## Step 1: Understand the dataset better

The dataset is combined in letting the respondents rate themselves on statements related to the 6 personality types in RIASEC. The personality types are given as follows (also see wikipedia for deeper description).

• Realistic (R): People that like to work with things. They tend to be “assertive and competitive, and are interested in activities requiring motor coordination, skill and strength”. They approach problem solving “by doing something, rather than talking about it, or sitting and thinking about it”. They also prefer “concrete approaches to problem solving, rather than abstract theory”. Finally, their interests tend to focus on “scientific or mechanical rather than cultural and aesthetic areas”.
• Investigative (I): People who prefer to work with “data.” They like to “think and observe rather than act, to organize and understand information rather than to persuade”. They also prefer “individual rather than people oriented activities”.
• Artistic (A): People who like to work with “ideas and things”. They tend to be “creative, open, inventive, original, perceptive, sensitive, independent and emotional”. They rebel against “structure and rules”, but enjoy “tasks involving people or physical skills”. They tend to be more emotional than the other types.
• Social (S): People who like to work with “people” and who “seem to satisfy their needs in teaching or helping situations”. They tend to be “drawn more to seek close relationships with other people and are less apt to want to be really intellectual or physical”.
• Enterprising (E): People who like to work with “people and data”. They tend to be “good talkers, and use this skill to lead or persuade others”. They “also value reputation, power, money and status”.
• Conventional (C): People who prefer to work with “data” and who “like rules and regulations and emphasize self-control … they like structure and order, and dislike unstructured or unclear work and interpersonal situations”. They also “place value on reputation, power, or status”.

In the test they have rated themselves from 1 to 5 (1=Dislike, 3=Neutral, 5=Enjoy) on statements related to these 6 personality types.

That way each respondent can be rated on these 6 dimensions.

## Step 2: Prepare the dataset

We want to score the respondent according to how they have rated themselves on the 8 statements for each of the 6 personality types.

This can be achieved by the following code.

```import pandas as pd

def sum_dimension(data, letter):
return data[letter + '1'] + data[letter + '2'] + data[letter + '3'] + data[letter + '4'] + data[letter + '5'] + data[letter + '6'] + data[letter + '7'] + data[letter + '8']

data['R'] = sum_dimension(data, 'R')
data['I'] = sum_dimension(data, 'I')
data['A'] = sum_dimension(data, 'A')
data['S'] = sum_dimension(data, 'S')
data['E'] = sum_dimension(data, 'E')
data['C'] = sum_dimension(data, 'C')

view = data.loc[:,['education', 'R', 'I', 'A', 'S', 'E', 'C']]
print(view)
```

In the view we make, we keep the education with the dimension ratings we have calculated, because we want to see if there is any correlation between education level and personality type.

We get the following output.

```        education   R   I   A   S   E   C
0               2  20  33  27  37  16  12
1               2  14  35  19  22  10  10
2               2   9  11  11  30  24  16
3               1  15  21  27  20  25  19
4               3  13  36  34  37  20  26
...           ...  ..  ..  ..  ..  ..  ..
145823          3  10  19  28  28  20  13
145824          3  11  18  39  35  24  16
145825          2   8   8   8  36  12  21
145826          3  29  29  29  34  16  19
145827          2  21  33  19  30  27  24
```

Where we see the dimensions ratings and the corresponding education level.

## Step 3: Compute the correlations

The education is given by the following scale.

• 1: Less than high school
• 2: High school
• 3: University degree

Hence, we need to remove the no-answer group (0) from the data to not skew the results.

```import pandas as pd

def sum_dimension(data, letter):
return data[letter + '1'] + data[letter + '2'] + data[letter + '3'] + data[letter + '4'] + data[letter + '5'] + data[letter + '6'] + data[letter + '7'] + data[letter + '8']

data['R'] = sum_dimension(data, 'R')
data['I'] = sum_dimension(data, 'I')
data['A'] = sum_dimension(data, 'A')
data['S'] = sum_dimension(data, 'S')
data['E'] = sum_dimension(data, 'E')
data['C'] = sum_dimension(data, 'C')

view = data.loc[:,['education', 'R', 'I', 'A', 'S', 'E', 'C']]

view = view[view['education'] != 0]

print(view.mean())
print(view.groupby('education').mean())
print(view.corr())
```

The output of the mean is given here.

```education     2.394318
R            16.651624
I            23.994637
A            22.887701
S            26.079349
E            20.490080
C            19.105188
dtype: float64
```

Which says that the average educational level of the 145,000+ respondents was 2.394318. Then you can see the respondent related on average mostly as Social, then Investigative. The lowest rated group was Realistic.

The output of educational group by mean is given here.

```                   R          I          A          S          E          C
education
1          15.951952  23.103728  21.696007  23.170792  19.897772  17.315641
2          16.775297  23.873645  22.379625  25.936032  20.864591  19.551138
3          16.774487  24.302158  23.634034  27.317784  20.468160  19.606312
4          16.814534  24.769829  24.347250  27.382699  20.038501  18.762395
```

Where you can see that those with less than high school actually rate themselves lower in all dimensions. While the highest educated rate themselves highest on Realistic, Artistic, and Social.

Does that mean the higher education the more social, artistic or realistic you are?

The output of the correlation is given here.

```           education         R         I         A         S         E         C
education   1.000000  0.029008  0.057466  0.105946  0.168640 -0.006115  0.044363
R           0.029008  1.000000  0.303895  0.206085  0.109370  0.340535  0.489504
I           0.057466  0.303895  1.000000  0.334159  0.232608  0.080878  0.126554
A           0.105946  0.206085  0.334159  1.000000  0.350631  0.322099  0.056576
S           0.168640  0.109370  0.232608  0.350631  1.000000  0.411564  0.213413
E          -0.006115  0.340535  0.080878  0.322099  0.411564  1.000000  0.526813
C           0.044363  0.489504  0.126554  0.056576  0.213413  0.526813  1.000000
```

As you see. You should conclude that. Take Social it is only 0.168640 correlated to education, which in other words means very low correlated. The same holds for Realistic and Artistic, very low correlation.

## Step 4: Visualize our findings

A way to visualize the data is by using the great integration with Matplotlib.

```import pandas as pd
import matplotlib.pyplot as plt

def sum_dimension(data, letter):
return data[letter + '1'] + data[letter + '2'] + data[letter + '3'] + data[letter + '4'] + data[letter + '5'] + data[letter + '6'] + data[letter + '7'] + data[letter + '8']

data['R'] = sum_dimension(data, 'R')
data['I'] = sum_dimension(data, 'I')
data['A'] = sum_dimension(data, 'A')
data['S'] = sum_dimension(data, 'S')
data['E'] = sum_dimension(data, 'E')
data['C'] = sum_dimension(data, 'C')

view = data.loc[:,['education', 'R', 'I', 'A', 'S', 'E', 'C']]

view = view[view['education'] != 0]

edu = view.groupby('education').mean()
edu.index = ['> high school', 'high school', 'university', 'graduate']
edu.plot(kind='barh', figsize=(10,4))
plt.show()
```

Resulting in the following graph.

Finally, the correlation to education can be made similarly. Note that the education itself was kept to have a perspective of full correlation.

Continue to read how to explore the dataset in the next tutorial.

## What will we cover in this tutorial?

We will continue our journey to explore a big dataset of 145,000+ respondents to a RIASEC test. If you want to explore the full journey, we recommend you read this tutorial first.

In this tutorial we will find some data points that are not correct and a potential way to deal with it.

## Step 1: Explore the family sizes from the respondents

In the first tutorial we looked at how the respondent were distributed around the world. Surprisingly, most countries were represented.

In this we will explore the dataset further. The dataset is available here.

```import pandas as pd

# Only to get a broader summary
pd.set_option('display.max_rows', 300)
pd.set_option('display.max_columns', 30)
pd.set_option('display.width', 1000)

print(data)
```

Which will output the following.

```        R1  R2  R3  R4  R5  R6  R7  R8  I1  I2  I3  I4  I5  I6  I7  ...  gender  engnat  age  hand  religion  orientation  race  voted  married  familysize  uniqueNetworkLocation  country  source                major  Unnamed: 93
0        3   4   3   1   1   4   1   3   5   5   4   3   4   5   4  ...       1       1   14     1         7            1     1      2        1           1                      1       US       2                  NaN          NaN
1        1   1   2   4   1   2   2   1   5   5   5   4   4   4   4  ...       1       1   29     1         7            3     4      1        2           3                      1       US       1              Nursing          NaN
2        2   1   1   1   1   1   1   1   4   1   1   1   1   1   1  ...       2       1   23     1         7            1     4      2        1           1                      1       US       1                  NaN          NaN
3        3   1   1   2   2   2   2   2   4   1   2   4   3   2   3  ...       2       2   17     1         0            1     1      2        1           1                      1       CN       0                  NaN          NaN
4        4   1   1   2   1   1   1   2   5   5   5   3   5   5   5  ...       2       2   18     1         4            3     1      2        1           4                      1       PH       0            education          NaN
```

If you use the slider, I got curious about how family sizes vary around the world. This dataset is obviously not representing any conclusive data on it, but it could be interesting to see if there is any connection to where you are located in the world and family size.

## Step 2: Explore the distribution of family sizes

What often happens in dataset is there might be inaccurate data.

To get a feeling of the data in the column familysize, you can explore it by running this.

```import pandas as pd

print(data['familysize'].describe())
print(pd.cut(data['familysize'], bins=[0,1,2,3,4,5,6,7,10,100, 1000000000]).value_counts())
```

Resulting in the following from the describe output.

```count    1.458280e+05
mean     1.255801e+05
std      1.612271e+07
min      0.000000e+00
25%      2.000000e+00
50%      3.000000e+00
75%      3.000000e+00
max      2.147484e+09
Name: familysize, dtype: float64
```

Where the mean value of family size is 125,580. Well, maybe we don’t count family size the same way, but something is wrong there.

Grouping the data into bins (by using the cut function combined with value_count) you get this output.

```(1, 2]               51664
(2, 3]               38653
(3, 4]               18729
(0, 1]               15901
(4, 5]                8265
(5, 6]                3932
(6, 7]                1928
(7, 10]               1904
(10, 100]              520
(100, 1000000000]       23
Name: familysize, dtype: int64
```

Which indicates 23 families of size greater than 100. Let’s just investigate the sizes in that bucket.

```print(data[data['familysize'] > 100]['familysize'])
```

Giving us this output.

```1212      2147483647
3114      2147483647
5770      2147483647
8524             104
9701             103
21255     2147483647
24003            999
26247     2147483647
27782     2147483647
31451           9999
39294           9045
39298          84579
49033            900
54592            232
58773     2147483647
74745      999999999
78643            123
92457            999
95916            908
102680           666
109429           989
111488       9234785
120489          5000
120505     123456789
122580          5000
137141           394
139226          3425
140377           934
142870    2147483647
145686           377
145706           666
Name: familysize, dtype: int64
```

The integer 2147483647 is interesting as it is the maximum 32-bit positive integer. I think it is safe to say that most family sizes given above 100 are not realistic.

## Step 3: Clean the data

You need to make a decision on these data points that seem to skew your data in a wrong way.

Say, you just decide to visualize it without any adjustment, it would give a misrepresentative picture.

It seems like Iceland has a tradition for big families.

Let’s investigate that.

```print(data[data['country'] == 'IS']['familysize'])
```

Interestingly it give only one line that does not seem correct.

```74745     999999999
```

But as there are only a few respondents the average is the highest.

To clean the data fully, we can make the decision that family sizes above 10 are not correct. I know, that might be set a bit low and you can choose to do something different.

Cleaning the data is simple.

```data = data[data['familysize'] < 10]
```

Magic right? You simply write a conditional that will be vectorized down and only keep those rows of data that fulfill this condition.

## Step 4: Visualize the data

We will use geopandas, matplotlib and pycountry to visualize it. The process is similar to the one in previous tutorial where you can find more details.

```import geopandas
import pandas as pd
import matplotlib.pyplot as plt
import pycountry

# Helper function to map country names to alpha_3 representation - though some are not known by library
def lookup_country_code(country):
try:
return pycountry.countries.lookup(country).alpha_3
except LookupError:
return country

data['alpha3'] = data.apply(lambda row: lookup_country_code(row['country']), axis=1)
data = data[data['familysize'] < 10]

country_mean = data.groupby(['alpha3']).mean()

map = world.merge(country_mean, how='left', left_on=['iso_a3'], right_on=['alpha3'])
map.plot('familysize', figsize=(12,4), legend=True)
plt.show()
```

Resulting in the following output.

Looks like there is a one-child policy in China? Again, do not make any conclusions on this data as it is very narrow of this aspect.

## What will we cover in this tutorial

We will explore a dataset with the Holland Code (RIASEC) Test, which is a test that should predict careers and vocational choices by rating questions.

In this part of the exploration, we first focus on loading the data and visualizing where the respondents come from. The dataset contains more than 145,000 responses.

## Step 1: First glance at the data

Let us first try to see what the data contains.

Reading the codebook (the file with the dataset) you see it contains ratings of questions of the 6 categories RIASEC. Then there are 3 elapsed times for the test.

There is a ratings of The Ten Item Personality Inventory. Then a self assessment whether they know 16 words. Finally, a list if metadata on them, like where the respondent network was located (which is a indicator on where the respondent was located in most cases).

Other metadata can be seen explained here.

```education			"How much education have you completed?", 1=Less than high school, 2=High school, 3=University degree, 4=Graduate degree
urban				"What type of area did you live when you were a child?", 1=Rural (country side), 2=Suburban, 3=Urban (town, city)
gender				"What is your gender?", 1=Male, 2=Female, 3=Other
engnat				"Is English your native language?", 1=Yes, 2=No
age					"How many years old are you?"
hand				"What hand do you use to write with?", 1=Right, 2=Left, 3=Both
religion			"What is your religion?", 1=Agnostic, 2=Atheist, 3=Buddhist, 4=Christian (Catholic), 5=Christian (Mormon), 6=Christian (Protestant), 7=Christian (Other), 8=Hindu, 9=Jewish, 10=Muslim, 11=Sikh, 12=Other
orientation			"What is your sexual orientation?", 1=Heterosexual, 2=Bisexual, 3=Homosexual, 4=Asexual, 5=Other
race				"What is your race?", 1=Asian, 2=Arab, 3=Black, 4=Indigenous Australian / Native American / White, 5=Other (There was a coding error in the survey, and three different options were given the same value)
voted				"Have you voted in a national election in the past year?", 1=Yes, 2=No
married				"What is your marital status?", 1=Never married, 2=Currently married, 3=Previously married
familysize			"Including you, how many children did your mother have?"
major				"If you attended a university, what was your major (e.g. "psychology", "English", "civil engineering")?"

These values were also calculated for technical information:

uniqueNetworkLocation	1 if the record is the only one from its network location in the dataset, 2 if there are more than one record. There can be more than one record from the same network if for example that network is shared by a school etc, or it may be because of test retakes
country	The country of the network the user connected from
source	1=from Google, 2=from an internal link on the website, 0=from any other website or could not be determined
```

First step would be to load the data into a DataFrame. If you are new to Pandas DataFrame, we can recommend this tutorial.

```import pandas as pd

pd.set_option('display.max_rows', 300)
pd.set_option('display.max_columns', 10)
pd.set_option('display.width', 150)

print(data)
```

The pd.set_option are only to help get are more rich output, compared to a very small and narrow summary. The actual loading of the data is done by pd.read_csv(…).

Notice that we have renamed the csv file to riasec.csv. As it is a tab-spaced csv, we need to parse that as an argument if it is not using the default comma.

The output from the above code is.

```        R1  R2  R3  R4  R5  ...  uniqueNetworkLocation  country  source                major  Unnamed: 93
0        3   4   3   1   1  ...                      1       US       2                  NaN          NaN
1        1   1   2   4   1  ...                      1       US       1              Nursing          NaN
2        2   1   1   1   1  ...                      1       US       1                  NaN          NaN
3        3   1   1   2   2  ...                      1       CN       0                  NaN          NaN
4        4   1   1   2   1  ...                      1       PH       0            education          NaN
...     ..  ..  ..  ..  ..  ...                    ...      ...     ...                  ...          ...
145823   2   1   1   1   1  ...                      1       US       1        Communication          NaN
145824   1   1   1   1   1  ...                      1       US       1              Biology          NaN
145825   1   1   1   1   1  ...                      1       US       2                  NaN          NaN
145826   3   4   4   5   2  ...                      2       US       0                  yes          NaN
145827   2   4   1   4   2  ...                      1       US       1  Information systems          NaN
```

Interestingly, the dataset contains an unnamed last column with no data. That is because it ends each line with a tab (\t) before new line (\n).

We could clean that up, but as we are only interested in the country counts, we will ignore it in this tutorial.

## Step 3: Count the occurrences of each country

As said, we are only interested in this first tutorial on this dataset to get an idea of where the respondents come from in the world.

The data is located in the ‘country’ column of the DataFrame data.

To group the data, you can use groupby(), which will return af DataFrameGroupBy object. If you apply a size() on that object, it will return a Series with sizes of each group.

```print(data.groupby(['country']).size())
```

Where the first few lines are.

```country
AE        507
AF          8
AG          7
AL        116
AM         10
```

Hence, for each country we will have a count of how many respondents came from that country.

## Step 4: Understand the map data we want to merge it with

To visualize the data, we need some way to have a map.

Here the GeoPandas comes in handy. It contains a nice low-res map of the world you can use.

Let’s just explore that.

```import geopandas
import matplotlib.pyplot as plt

world.plot()
plt.show()
```

Which will make the following map.

This is too easy to be true. No, not really. This is the reality of Python.

We want to merge the data from out world map above with the data of counts for each country.

We need to see how to merge it. To do that let us look at the data from world.

```world = geopandas.read_file(geopandas.datasets.get_path("naturalearth_lowres"))
print(world)
```

Where the first few lines are.

```        pop_est                continent                      name iso_a3   gdp_md_est                                           geometry
0        920938                  Oceania                      Fiji    FJI      8374.00  MULTIPOLYGON (((180.00000 -16.06713, 180.00000...
1      53950935                   Africa                  Tanzania    TZA    150600.00  POLYGON ((33.90371 -0.95000, 34.07262 -1.05982...
2        603253                   Africa                 W. Sahara    ESH       906.50  POLYGON ((-8.66559 27.65643, -8.66512 27.58948...
3      35623680            North America                    Canada    CAN   1674000.00  MULTIPOLYGON (((-122.84000 49.00000, -122.9742...
4     326625791            North America  United States of America    USA  18560000.00  MULTIPOLYGON (((-122.84000 49.00000, -120.0000...
```

First problem arises here. In the other dataset we have 2 letter country codes, in this one they use 3 letter country codes.

## Step 5: Solving the merging problem

Luckily we can use a library called PyCountry.

Let’s add this 3 letter country code to our first dataset by using a lambda function. A lambda? New to lambda function, we recommend you read the this tutorial.

```import pandas as pd
import pycountry

# Helper function to map country names to alpha_3 representation - though some are not known by library
def lookup_country_code(country):
try:
return pycountry.countries.lookup(country).alpha_3
except LookupError:
return country

data['alpha3'] = data.apply(lambda row: lookup_country_code(row['country']), axis=1)
```

Basically, we add a new column to the dataset and call it ‘alpha3’ with the three letter country code. We use the function apply, which takes the lambda function that actually calls the function outside, which calls the library.

The reason to so, is that sometimes the pycountry.contries calls makes a lookup exception. We want our program to be robust to that.

Now the data contains a row with the countries in 3 letters like world.

We can now merge the data together. Remember that the data we want to merge needs to be adjusted to be counting on ‘alpha3’ and also we want to convert it to a DataFrame (as size() returns a Series).

```import geopandas
import pandas as pd
import pycountry

# Helper function to map country names to alpha_3 representation - though some are not known by library
def lookup_country_code(country):
try:
return pycountry.countries.lookup(country).alpha_3
except LookupError:
return country

data['alpha3'] = data.apply(lambda row: lookup_country_code(row['country']), axis=1)

country_count = data.groupby(['alpha3']).size().to_frame()
country_count.columns = ['count']

map = world.merge(country_count, how='left', left_on=['iso_a3'], right_on=['alpha3'])
print(map)
```

The first few lines are given below.

```        pop_est                continent                      name iso_a3   gdp_md_est                                           geometry    count  \
0        920938                  Oceania                      Fiji    FJI      8374.00  MULTIPOLYGON (((180.00000 -16.06713, 180.00000...     12.0
1      53950935                   Africa                  Tanzania    TZA    150600.00  POLYGON ((33.90371 -0.95000, 34.07262 -1.05982...      9.0
2        603253                   Africa                 W. Sahara    ESH       906.50  POLYGON ((-8.66559 27.65643, -8.66512 27.58948...      NaN
3      35623680            North America                    Canada    CAN   1674000.00  MULTIPOLYGON (((-122.84000 49.00000, -122.9742...   7256.0
4     326625791            North America  United States of America    USA  18560000.00  MULTIPOLYGON (((-122.84000 49.00000, -120.0000...  80579.0
5      18556698                     Asia                Kazakhstan    KAZ    460700.00  POLYGON ((87.35997 49.21498, 86.59878 48.54918...     46.0
```

Notice, that some countries do not have a count. Those a countries with no respondent.

## Step 6: Ready to plot a world map

Now to the hard part, right?

Making a colorful map indicating the number of respondents in a given country.

```import geopandas
import pandas as pd
import matplotlib.pyplot as plt
import pycountry
import numpy as np

# Helper function to map country names to alpha_3 representation - though some are not known by library
def lookup_country_code(country):
try:
return pycountry.countries.lookup(country).alpha_3
except LookupError:
return country

data['alpha3'] = data.apply(lambda row: lookup_country_code(row['country']), axis=1)

country_count = data.groupby(['alpha3']).size().to_frame()
country_count.columns = ['count']

map = world.merge(country_count, how='left', left_on=['iso_a3'], right_on=['alpha3'])
map.plot('count', figsize=(10,3), legend=True)
plt.show()
```

It is easy. Just call plot(…) with the first argument to be the column to use. I also change the default figsize, you can play around with that. Finally I add the legend.

Not really satisfying. The problem is that all counties, but USA, have almost identical colors. Looking at the data, you will see that it is because that there are so many respondents in USA that the countries are in the bottom of the scale.

What to do? Use a log-scale.

You can actually do that directly in your DataFrame. By using a NumPy library we can calculate a logarithmic scale.

See the magic.

```import geopandas
import pandas as pd
import matplotlib.pyplot as plt
import pycountry
import numpy as np

# Helper function to map country names to alpha_3 representation - though some are not known by library
def lookup_country_code(country):
try:
return pycountry.countries.lookup(country).alpha_3
except LookupError:
return country

data['alpha3'] = data.apply(lambda row: lookup_country_code(row['country']), axis=1)

country_count = data.groupby(['alpha3']).size().to_frame()
country_count.columns = ['count']
country_count['log_count'] = np.log(country_count['count'])

map = world.merge(country_count, how='left', left_on=['iso_a3'], right_on=['alpha3'])
map.plot('log_count', figsize=(10,3), legend=True)
plt.show()
```

Where the new magic is to add the log_count and using np.log(country_count[‘count’]).

Also notice that the plot is now done on ‘log_count’.

Now you see more of a variety in the countries respondents. Note that the “white” countries did not have any respondent.

Read the next exploration of the dataset here.

## What will you learn in this tutorial?

• Where to find interesting data contained in CSV files.
• How to extract a map to plot the data on.
• Use Python to easily plot the data from the CSV file no the map.

## Step 1: Collect the data in CSV format

You can find various interesting data in CSV format on data.world that you can play around with in Python.

In this tutorial we will focus on Shooting Incidents in NYPD from the last year. You can find the data on data.world.

Looking at the data you see that each incident has latitude and longitude coordinates.

```{'INCIDENT_KEY': '184659172', 'OCCUR_DATE': '06/30/2018 12:00:00 AM', 'OCCUR_TIME': '23:41:00', 'BORO': 'BROOKLYN', 'PRECINCT': '75', 'JURISDICTION_CODE': '0', 'LOCATION_DESC': 'PVT HOUSE                     ', 'STATISTICAL_MURDER_FLAG': 'false', 'PERP_AGE_GROUP': '', 'PERP_SEX': '', 'PERP_RACE': '', 'VIC_AGE_GROUP': '25-44', 'VIC_SEX': 'M', 'VIC_RACE': 'BLACK', 'X_COORD_CD': '1020263', 'Y_COORD_CD': '184219', 'Latitude': '40.672250312', 'Longitude': '-73.870176252'}
```

That means we can plot on a map. Let us try to do that.

## Step 2: Export a map to plot the data

We want to plot all the shooting incidents on a map. You can use OpenStreetMap to get an image of a map.

We want a map of New York, which you can find by locating it on OpenStreetMap or pressing the link.

You should press the blue Download in the low right corner of the picture.

Also, remember to get the coordinates of the image in the left side bar, we will need them for the plot.

```map_box = [-74.4461, -73.5123, 40.4166, 41.0359]
```

## Step 3: Writing the Python code that adds data to the map

Importing data from a CVS file is easy and can be done through the standard library csv. Making plot on a graph can be done in matplotlib. If you do not have it installed already, you can do that by typing the following in a command line (or see here).

```pip install matplotlib
```

First you need to transform the CVS data of the longitude and latitude to floats.

```import csv

# The name of the input file might need to be adjusted, or the location needs to be added if it is not located in the same folder as this file.
csv_file = open('nypd-shooting-incident-data-year-to-date-1.csv')
longitude = []
latitude = []
longitude.append(float(row['Longitude']))
latitude.append(float(row['Latitude']))
```

Now you have two lists (longitude and latitude), which contains the coordinates to plot.

Then for the actual plotting into the image.

```import matplotlib.pyplot as plt

# The boundaries of the image map
map_box = [-74.4461, -73.5123, 40.4166, 41.0359]

# The name of the image of the New York map might be different.

fig, ax = plt.subplots()
ax.scatter(longitude, latitude)
ax.set_ylim(map_box, map_box)
ax.set_xlim(map_box, map_box)
ax.imshow(map_img, extent=map_box, alpha=0.9)