Linear Classifier From Scratch Explained on Real Project

What will we cover?

The goal is to learn about Supervised Learning and explore how to use it for classification.

This includes learning

  • What is Supervised Learning
  • Understand the classification problem
  • What is the Perceptron classifier
  • How to use the Perceptron classifier as a linear classifier

Step 1: What is Supervised Learning?

Supervised learning (SL) is the machine learning task of learning a function that maps an input to an output based on example input-output pairs.

wikipedia.org

Said differently, suppose you have some items you need to classify. It could be books you want to put into categories, say fiction, non-fiction, etc.

If you were given a pile of books that already have the right categories assigned, how can you make a function (the machine learning model) that can guess the right category for other books without labels?

Supervised learning simply means that in the learning phase, the algorithm (the one creating the model) is given examples with correct labels.

Notice that supervised learning is not restricted to classification problems; it can predict any kind of output.

If you are new to Machine Learning, I advise you to start with this tutorial.

Step 2: What is the classification problem?

The classification problem is a supervised learning task of getting a function mapping an input point to a discrete category.

There is binary classification and multiclass classification: binary maps into two classes, while multiclass maps into 3 or more classes.

I find it easiest to understand with examples.

Assume we want to predict whether it will rain tomorrow or not. This is a binary classification problem, because we map into two classes: rain or no rain.

To train the model we need already labelled historic data.

Hence, the task is: given rows of historic data with correct labels, train a machine learning model (a Linear Classifier in this case) on this data. After that, see how well it can predict future data (without the right class label).

Step 3: Linear Classification explained mathematically and visually

Some like the math behind an algorithm. If you are not one of them, focus on the visual part – it will give you the understanding you need.

Mathematically, the task of Supervised Learning on the example data above is to find a function f(humidity, pressure) that predicts rain or no rain.

Examples

  • f(93, 1000.7) = rain
  • f(49, 1015.5) = no rain
  • f(79, 1031.1) = no rain

The goal of Supervised Learning is to approximate the function f – the approximation function is often denoted h.

Why not identify f precisely? Because that would give an overfitted function: it would predict the historic data with 100% accuracy, but would fail to predict future values well.

As we work with Linear Classifiers, we want the function to be linear.

That is, we want the approximation function h to be of the form:

  • x_1: Humidity
  • x_2: Pressure
  • h(x_1, x_2) = w_0 + w_1*x_1 + w_2*x_2

Hence, the goal is to optimize values w_0, w_1, w_2, to find the best classifier.

What does all this math mean?

It means the classifier makes its decision based on the value of a linear combination of the features (the characteristics of the input).

The diagram above shows how the classifier separates rain from no rain with a line. On the left side is the historic data with its correct classes, and the line is the one optimized by the machine learning algorithm.

On the right side we have a new input data point (without a label); with this line, it would be classified as rain (assuming blue means rain).
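
To make the decision rule concrete, here is a minimal sketch in Python. The weights are hand-picked so that they separate the three example points above; they are purely illustrative, and finding good weights is exactly the learning algorithm's job.

# A minimal sketch of a linear decision rule. The weights are hand-picked
# for illustration only; a learning algorithm would find them from data.
def h(x_1, x_2, w_0=41.0, w_1=0.1, w_2=-0.05):
    return w_0 + w_1 * x_1 + w_2 * x_2

def predict(humidity, pressure):
    # Classify by which side of the line h(x_1, x_2) = 0 the point lies on.
    return 'rain' if h(humidity, pressure) >= 0 else 'no rain'

print(predict(93, 1000.7))   # rain
print(predict(49, 1015.5))   # no rain
print(predict(79, 1031.1))   # no rain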

Step 4: What is the Perceptron Classifier?

The Perceptron Classifier is a linear algorithm that can be applied to binary classification.

It learns iteratively by adding new knowledge to an already existing line.

The learning rate is given by alpha, and the learning rule is as follows (don’t worry if you don’t understand it – it is not important).

  • Given a data point with features x and label y, update each weight according to this rule.
    • w_i = w_i + alpha * (y - h_w(x)) * x_i

The rule can also be stated as follows.

  • w_i = w_i + alpha * (actual value - estimated value) * x_i

Said in words, the rule adjusts the weights according to the actual values. Every time a new value comes in, the weights are adjusted to fit it better.

Once the line has been adjusted to all the training data, it is ready to predict.
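
To make the rule concrete, here is a from-scratch sketch of the training loop in NumPy. The learning rate, the number of epochs, and the 0/1 threshold prediction are assumptions for illustration; the scikit-learn Perceptron we use below handles all of this for us.

import numpy as np

# A sketch of the Perceptron learning rule described above.
def perceptron_train(X, y, alpha=0.01, epochs=10):
    # Prepend a constant 1 to each row so that w_0 acts as the bias weight.
    X = np.hstack([np.ones((len(X), 1)), np.asarray(X, dtype=float)])
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            h = 1 if w @ x_i >= 0 else 0       # current prediction h_w(x)
            w = w + alpha * (y_i - h) * x_i    # w_i = w_i + alpha*(y - h_w(x))*x_i
    return w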

Let’s try this on real data.

Step 5: Get the Weather data we will use to train a Perceptron model

You can get all the code in a Jupyter Notebook with the csv file here.

This can be downloaded from GitHub as a zip file by clicking here.

First let’s just import all the libraries used.

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.linear_model import Perceptron
import matplotlib.pyplot as plt

Notice that in the Notebook we have an added line %matplotlib inline, which you should add if you run the code in a Notebook. The code here is aligned with PyCharm or a similar IDE.

Then let’s read the data.

data = pd.read_csv('files/weather.csv', parse_dates=True, index_col=0)
print(data.head())

If you want to read the data directly from GitHub and not download the weather.csv file, you can do that as follows.

data = pd.read_csv('https://raw.githubusercontent.com/LearnPythonWithRune/MachineLearningWithPython/main/files/weather.csv', parse_dates=True, index_col=0)
print(data.head())

This will result in an output similar to this.

            MinTemp  MaxTemp  Rainfall  ...  RainToday  RISK_MM RainTomorrow
Date                                    ...                                 
2008-02-01     19.5     22.4      15.6  ...        Yes      6.0          Yes
2008-02-02     19.5     25.6       6.0  ...        Yes      6.6          Yes
2008-02-03     21.6     24.5       6.6  ...        Yes     18.8          Yes
2008-02-04     20.2     22.8      18.8  ...        Yes     77.4          Yes
2008-02-05     19.7     25.7      77.4  ...        Yes      1.6          Yes

Step 6: Select features and Clean the Weather data

We want to investigate the data and figure out how much missing data there is.

A great way to do that is to use isnull().

print(data.isnull().sum())

This results in the following output.

MinTemp             3
MaxTemp             2
Rainfall            6
Evaporation        51
Sunshine           16
WindGustDir      1036
WindGustSpeed    1036
WindDir9am         56
WindDir3pm         33
WindSpeed9am       26
WindSpeed3pm       25
Humidity9am        14
Humidity3pm        13
Pressure9am        20
Pressure3pm        19
Cloud9am          566
Cloud3pm          561
Temp9am             4
Temp3pm             4
RainToday           6
RISK_MM             0
RainTomorrow        0
dtype: int64

This shows how many rows in each column have null values (missing values). We want to work with only two features (columns) to keep our classification simple. Obviously, we need to keep RainTomorrow, as that carries the class label.

We select the features we want and drop the rows with null-values as follows.

dataset = data[['Humidity3pm', 'Pressure3pm', 'RainTomorrow']].dropna()

Step 7: Split into training and test data

The next step is to split the dataset into features and labels.

But we also want to rename the labels from No and Yes to be numeric.

X = dataset[['Humidity3pm', 'Pressure3pm']]
y = dataset['RainTomorrow']
y = np.array([0 if value == 'No' else 1 for value in y])
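
An equivalent, shorter way to encode the labels (assuming the column only contains 'No' and 'Yes') is the following sketch.

y = (dataset['RainTomorrow'] == 'Yes').astype(int).to_numpy()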

Then we do the splitting as follows, where we set a random_state in order to be able to reproduce the result. This is often a great idea: if you use randomness and encounter a problem, you can reproduce it.

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

This has divided the features into a train and test set (X_train, X_test), and the labels into a train and test (y_train, y_test) dataset.
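
If you want to inspect the split, you can print the shapes; by default, train_test_split holds out 25% of the rows for the test set.

print(X_train.shape, X_test.shape)
print(y_train.shape, y_test.shape)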

Step 8: Train the Perceptron model and measure accuracy

Finally we want to create the model, fit it (train it), predict on the training data, and print the accuracy score.

clf = Perceptron(random_state=0)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print(accuracy_score(y_test, y_pred))

This gives an accuracy of 0.773, or 77.3%.

Is that good?

Well, what if it rains 22.7% of the time, and the model always predicts no rain?

Then it would be correct 77.3% of the time.

Let's check how often it actually does not rain.

print(sum(y == 0)/len(y))

It turns out it does not rain 74.1% of the time, so the model is only slightly better than always predicting no rain.

Is that a good model? Well, I find binary classifiers a bit tricky because of this problem. The best way to get an idea is to visualize it.

Step 9: Visualize the model predictions

To visualize the data we can do the following.

fig, ax = plt.subplots()
X_data = X.to_numpy()
y_all = clf.predict(X_data)
ax.scatter(x=X_data[:,0], y=X_data[:,1], c=y_all, alpha=.25)
plt.show()

This results in the following output.
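
If you also want to draw the learned decision line itself, here is a sketch that reuses clf, X_data, and y_all from above. It reads the fitted weights from clf.coef_ and clf.intercept_ and plots the line where h(x_1, x_2) = 0.

# Optional sketch: draw the learned decision boundary on the scatter plot.
# The boundary is where w_0 + w_1*x_1 + w_2*x_2 = 0, i.e. x_2 = -(w_0 + w_1*x_1)/w_2.
w_0 = clf.intercept_[0]
w_1, w_2 = clf.coef_[0]
x_1 = np.linspace(X_data[:, 0].min(), X_data[:, 0].max(), 100)
fig, ax = plt.subplots()
ax.scatter(x=X_data[:, 0], y=X_data[:, 1], c=y_all, alpha=.25)
ax.plot(x_1, -(w_0 + w_1 * x_1) / w_2, 'k--')
plt.show()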

Finally, let’s visualize the actual data to compare.

# Create a new figure, since the previous one has already been shown.
fig, ax = plt.subplots()
ax.scatter(x=X_data[:,0], y=X_data[:,1], c=y, alpha=.25)
plt.show()

Resulting in.

Here is the full code.

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.linear_model import Perceptron
import matplotlib.pyplot as plt
data = pd.read_csv('https://raw.githubusercontent.com/LearnPythonWithRune/MachineLearningWithPython/main/files/weather.csv', parse_dates=True, index_col=0)
print(data.head())
print(data.isnull().sum())
dataset = data[['Humidity3pm', 'Pressure3pm', 'RainTomorrow']].dropna()
X = dataset[['Humidity3pm', 'Pressure3pm']]
y = dataset['RainTomorrow']
y = np.array([0 if value == 'No' else 1 for value in y])
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = Perceptron(random_state=0)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print(accuracy_score(y_test, y_pred))
print(sum(y == 0)/len(y))
fig, ax = plt.subplots()
X_data = X.to_numpy()
y_all = clf.predict(X_data)
ax.scatter(x=X_data[:,0], y=X_data[:,1], c=y_all, alpha=.25)
plt.show()
fig, ax = plt.subplots()
ax.scatter(x=X_data[:,0], y=X_data[:,1], c=y, alpha=.25)
plt.show()

Want to learn more?

This is part of a FREE 10h Machine Learning course with Python.

  • 15 video lessons – which explain Machine Learning concepts, demonstrate models on real data, introduce projects and show a solution (YouTube playlist).
  • 30 Jupyter Notebooks – with the full code and explanations from the lectures and projects (GitHub).
  • 15 projects – with step guides to help you structure your solutions, and solutions explained at the end of the video lessons (GitHub).

CSV GroupBy Processing to Excel with Charts using Pandas (Python)

What will we cover?

We will demonstrate how to read CSV data from GitHub, how to group the data by unique values in a column and sum it, then how to group and sum the data on a monthly basis, and finally how to export this into a multi-sheet Excel document with charts.

Step 1: Get and inspect the data

We can use pandas to read the CSV data (see more about CSV files here).

import pandas as pd
url = 'https://raw.githubusercontent.com/LearnPythonWithRune/LearnPython/main/files/SalesData.csv'
data = pd.read_csv(url, delimiter=';', parse_dates=True, index_col='Date')
print(data.head())

This will read our data directly from GitHub and show the first few lines.

            Sales rep        Item  Price  Quantity  Sale
Date                                                    
2020-05-31        Mia     Markers      4         1     4
2020-02-01        Mia  Desk chair    199         2   398
2020-09-21     Oliver       Table   1099         2  2198
2020-07-15  Charlotte    Desk pad      9         2    18
2020-05-27       Emma        Book     12         1    12

This data shows different sales representatives and a list of their sales in 2020.

Step 2: Use GroupBy to get sales of each representative and monthly sales

It is easy to group data by columns. The code below first groups by Sales rep and sums each rep's sales. Second, it groups the data by month and sums it.

repr_sales = data.groupby("Sales rep").sum()['Sale']
print(repr_sales)
monthly_sale = data.groupby(pd.Grouper(freq='M')).sum()['Sale']
monthly_sale.index = monthly_sale.index.month_name()
print(monthly_sale)

This gives the following output.

Sales rep
Charlotte     74599
Emma          65867
Ethan         40970
Liam          66989
Mia           88199
Noah          78575
Oliver        89355
Sophia       103480
William       80400
Name: Sale, dtype: int64
Date
January      69990
February     51847
March        67500
April        58401
May          40319
June         59397
July         64251
August       51571
September    55666
October      50093
November     57458
December     61941
Name: Sale, dtype: int64

Step 3: Create a multiple sheet Excel document with charts

Now for the export magic.

# Use the xlsxwriter engine explicitly, since the chart API below is xlsxwriter's.
workbook = pd.ExcelWriter("SalesReport.xlsx", engine='xlsxwriter')
repr_sales.to_excel(workbook, sheet_name='Sales per rep')
monthly_sale.to_excel(workbook, sheet_name='Monthly')
# Configure the chart for the sales per rep sheet.
chart1 = workbook.book.add_chart({'type': 'column'})
chart1.add_series({
    'name':       'Sales per rep',
    'categories': '=\'Sales per rep\'!$A$2:$A$10',
    'values':     '=\'Sales per rep\'!$B$2:$B$10',
})
workbook.sheets['Sales per rep'].insert_chart('D2', chart1)
# Configure the chart for the monthly sales sheet.
chart2 = workbook.book.add_chart({'type': 'column'})
chart2.add_series({
    'name':       'Monthly sales',
    'categories': '=Monthly!$A$2:$A$13',
    'values':     '=Monthly!$B$2:$B$13',
})
workbook.sheets['Monthly'].insert_chart('D2', chart2)
workbook.close()

This will create an Excel document called SalesReport.xlsx in your working directory.
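
To verify the result without opening Excel, you can read the document back with pandas; sheet_name=None returns a dict with all the sheets.

print(pd.read_excel('SalesReport.xlsx', sheet_name=None))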

To get a detailed explanation, see the video at the top of the post.

Want to learn more?

Want to learn more Python? Then this is part of an 8-hour FREE video course with full explanations, projects at each level, and guided solutions.

The course is structured with the following resources to improve your learning experience.

  • 17 video lessons teaching you everything you need to know to get started with Python.
  • 34 Jupyter Notebooks with lesson code and projects.
  • A FREE 70+ pages eBook with all the learnings from the lessons.

See the full FREE course page here.

If you instead want to learn more about Machine Learning. Do not worry.

Then check out my Machine Learning with Python course.

  • 15 video lessons teaching you all aspects of Machine Learning
  • 30 Jupyter Notebooks with lesson code and projects
  • 10 hours FREE video content to support your learning journey.

Go to the course page for details.

Python Like a Pro?

If you’re serious about learning Python, there’s nothing better than a strong commitment. At your request, we have created an improved version of this popular free online course.

This version has the following benefits to enhance your learning journey.

  1. Tracking your progress in the course.
  2. Questionnaires to ensure you understand concepts between important lessons.
  3. Downloadable cheat sheets for fast lookup of what you learned.
  4. Direct Q&A with the instructor to help you understand the material better.
  5. Added material for better explanations and insider knowledge.
  6. Extra videos with more explanations and stories.
  7. A certificate at completion.

Start the change in your life and commit to doing something amazing that you have always dreamed of.

Sign up and become part of the exclusive Python Like a Pro elite.

Pandas: Explore Datasets by Visualization – Exploring the Holland Code (RIASEC) Test – Part IV

What will we cover in this tutorial?

We will continue our journey to explore a big dataset of 145,000+ respondents to a RIASEC test. If you want to explore the full journey, we recommend you read this tutorial first as well as the second part of the tutorial, and finally, the third part before continuing.

In this part we will investigate if we can see any correlation between the major of education and the 6 dimensions of the personality types in RIASEC.

Step 1: Group by major of education

This is getting tricky, as the majors are typed in by the respondents, so we will miss some connections between them.

But let’s start by exploring them.

import pandas as pd

data = pd.read_csv('riasec.csv', delimiter='\t', low_memory=False)
major = data.loc[:,['major']]
print(major.groupby('major').size().sort_values(ascending=False))

The output is given here.

major
psychology                6861
Psychology                5763
English                   2342
Business                  2290
Biology                   1289
                          ... 
Sociology, Social work       1
Sociology, Psychology        1
Sociology, Math              1
Sociology, Linguistics       1
Nuerobiology                 1
Length: 15955, dtype: int64

Here we identify one problem: some respondents write in lowercase and others in uppercase.

Step 2: Clean up a few ambiguities

The first step would be to lowercase everything.

import pandas as pd

data = pd.read_csv('riasec.csv', delimiter='\t', low_memory=False)
major = data.loc[:,['major']]
major['major'] = major['major'].str.lower()
print(major.groupby('major').size().sort_values(ascending=False).iloc[:10])

Now we print the first 10 lines.

major
psychology          12766
business             3496
english              3042
nursing              2142
biology              1961
education            1800
engineering          1353
accounting           1186
computer science     1159
psychology           1098
dtype: int64

Here we notice that psychology appears both first and last. Inspecting it further, it seems the last one has a space after it. Hence, we can try to strip the whitespace around all majors.

import pandas as pd
import numpy as np

data = pd.read_csv('riasec.csv', delimiter='\t', low_memory=False)
major = data.loc[:,['major']]
major['major'] = major['major'].str.lower()
major['major'] = major.apply(lambda row: row['major'].strip() if row['major'] is not np.nan else np.nan, axis=1)
print(major.groupby('major').size().sort_values(ascending=False).iloc[:10])

Now the output is as follows.

major
psychology          13878
business             3848
english              3240
nursing              2396
biology              2122
education            1954
engineering          1504
accounting           1292
computer science     1240
law                  1111
dtype: int64

This introduces law at the bottom of the list.

This process could continue, but let's keep the focus on these 10 most represented educations in the dataset. Obviously, further data points could be recovered by cleaning the majors further, as sketched below.
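
As a sketch of how that cleanup could continue, one could merge spelling variants with a hand-made alias map. The mappings below are purely illustrative; a real map would have to be built by inspecting the data.

# Illustrative only: merge a few plausible spelling variants into one label.
aliases = {
    'nuerobiology': 'neurobiology',
    'comp sci': 'computer science',
    'computer sci': 'computer science',
}
major['major'] = major['major'].replace(aliases)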

Step 3: See if education correlates to known words

First let’s explore the dataset a bit more. The respondents are asked if they know the definitions of the following words.

  • boat
  • incoherent
  • pallid
  • robot
  • audible
  • cuivocal
  • paucity
  • epistemology
  • florted
  • decide
  • pastiche
  • verdid
  • abysmal
  • lucid
  • betray
  • funny

Respondents mark each word they know. Hence, we can count the number of words each respondent knows and calculate an average per major group.

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

data = pd.read_csv('riasec.csv', delimiter='\t', low_memory=False)
data['VCL'] = sum(data[f'VCL{i}'] for i in range(1, 17))
view = data.loc[:, ['VCL', 'major']]
view['major'] = view['major'].str.lower()
view['major'] = view.apply(lambda row: row['major'].strip() if row['major'] is not np.nan else np.nan, axis=1)

view = view.groupby('major').aggregate(['mean', 'count'])
view = view[view['VCL','count'] > 1110]
view.loc[:,('VCL','mean')].plot(kind='barh', figsize=(14,5))
plt.show()

Which results in the following output.

Average number of the 16 words that each major knows.

The Engineers seem to score lower than nursing. Well, I am actually surprised that Computer Science scores that high.

Step 4: Adding it all up together

Let’s use what we did in the previous tutorial and reuse the calculations from there.

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

def sum_dimension(data, letter):
    return data[letter + '1'] + data[letter + '2'] + data[letter + '3'] + data[letter + '4'] + data[letter + '5'] + data[letter + '6'] + data[letter + '7'] + data[letter + '8']

data = pd.read_csv('riasec.csv', delimiter='\t', low_memory=False)
data['R'] = sum_dimension(data, 'R')
data['I'] = sum_dimension(data, 'I')
data['A'] = sum_dimension(data, 'A')
data['S'] = sum_dimension(data, 'S')
data['E'] = sum_dimension(data, 'E')
data['C'] = sum_dimension(data, 'C')
data['VCL'] = sum(data[f'VCL{i}'] for i in range(1, 17))
view = data.loc[:, ['R', 'I', 'A', 'S', 'E', 'C', 'VCL', 'major']]
view['major'] = view['major'].str.lower()
view['major'] = view.apply(lambda row: row['major'].strip() if row['major'] is not np.nan else np.nan, axis=1)

view = view.groupby('major').aggregate(['mean', 'count'])
view = view[view['VCL','count'] > 1110]
view.loc[:,[('R','mean'), ('I','mean'), ('A','mean'), ('S','mean'), ('E','mean'), ('C','mean')]].plot(kind='barh', figsize=(14,5))
plt.show()

Which results in the following diagram.

Correlation between major and RIASEC personality traits

Biology has a high I (Investigative, people who prefer to work with data), while the R (Realistic, people who like to work with things) is dominated by Engineering and Computer Science.

Hmm… I should have noticed that many respondents have education as their major.