What is Classification – an Introduction to Machine Learning with pandas

What will we cover?

An introduction to what Machine Learning is and what Classification is. This will be demonstrated with examples using pandas and scikit-learn (sklearn).

Classification is a Machine Learning task where a model learns to assign rows of data to categories.

Step 1: What is Machine Learning?

  • In the classical computing model, everything is programmed into the algorithms. 
    • This has the limitation that all decision logic needs to be understood before usage. 
    • And if things change, we need to modify the program.
  • With the modern computing model (Machine Learning) this paradigm changes. 
    • We feed the algorithms (models) with data.
    • Based on that data, the algorithms (models) make decisions in the program (see the sketch below).
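
To make the contrast concrete, here is a minimal sketch (not from the course – the temperature data and threshold are made up for illustration) of a hard-coded rule versus a model that learns the rule from data.

# Classical model: the decision logic is hard-coded by the programmer
def is_hot_classical(temperature):
    return temperature > 25  # the threshold is fixed in the program

# Machine Learning model: the decision logic is learned from data
from sklearn.tree import DecisionTreeClassifier
temperatures = [[10], [15], [20], [28], [31], [35]]  # illustrative training data
labels = [0, 0, 0, 1, 1, 1]                          # 0 = not hot, 1 = hot
model = DecisionTreeClassifier()
model.fit(temperatures, labels)  # the model infers the threshold itself
print(model.predict([[27]]))     # decision based on what was learned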

Machine Learning with Python – for Beginners

Machine Learning with Python is a 10+ hour FREE course – a journey from zero to mastery.

  • The course consists of the following content.
    • 15 video lessons – which explain Machine Learning concepts, demonstrate models on real data, introduce projects and show a solution.
    • 30 Jupyter Notebooks – with the full code and explanation from the lectures and projects (GitHub).
    • 15 projects – with step guides to help you structure your solutions, and the solution explained at the end of the video lessons.

Step 2: How Machine Learning works

Machine learning is divided into two phases.

Phase 1: Learning

  • Get Data: Identify relevant data for the problem you want to solve. This data set should represent the type of data that the Machine Learning model will use to predict from in Phase 2 (prediction).
  • Pre-processing: This step is about cleaning up the data. While Machine Learning is powerful, it cannot figure out what good data looks like. You need to do the cleaning as well as transform the data into a desired format.
  • Train model: This is where the magic happens, the learning step (Train model). There are three main paradigms in machine learning.
    • Supervised: where you tell the algorithm what category each data item is in. Each data item from the training set is tagged with the right answer.
    • Unsupervised: where the algorithm is not given the answers and must find the structure in the data itself.
    • Reinforcement: teaches the machine to think for itself based on rewards for past actions.
  • Test model: Finally, testing is done to see if the model is good. The data was divided into a training set and a test set. The test set is used to check whether the model can predict well on data it has not seen. If not, a new model might be necessary. A minimal sketch of this phase follows below.
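
The following is a minimal sketch of Phase 1 on a tiny made-up DataFrame (the column names, values and cleaning step are illustrative assumptions, not part of the lesson): get the data, clean it, train a model and test it.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Get Data: a tiny illustrative DataFrame with one missing value
df = pd.DataFrame({
    'feature_a': [1.0, 2.0, None, 4.0, 5.0, 6.0, 7.0, 8.0],
    'feature_b': [10.0, 9.0, 8.0, 7.0, 6.0, 5.0, 4.0, 3.0],
    'label': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'],
})

# Pre-processing: remove rows the model cannot use
df = df.dropna()

# Supervised learning: features (X) and the tagged answers (y)
X = df[['feature_a', 'feature_b']]
y = df['label']

# Train model and Test model: split the data, fit, then check accuracy on unseen data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)
model = SVC()
model.fit(X_train, y_train)
print(accuracy_score(y_test, model.predict(X_test)))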

Phase 2: Prediction

Once the model is trained, it is used to predict the category of new, unseen data.
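
A minimal sketch of the prediction phase, using a model trained on the Iris data bundled with sklearn (the flower measurements are just illustrative values):

from sklearn.datasets import load_iris
from sklearn.svm import SVC

# Phase 1 (learning): train a model on the Iris dataset bundled with sklearn
iris = load_iris()
model = SVC()
model.fit(iris.data, iris.target)

# Phase 2 (prediction): classify new, unseen measurements with the trained model
new_flower = [[5.1, 3.5, 1.4, 0.2]]  # illustrative sepal/petal measurements in cm
prediction = model.predict(new_flower)
print(iris.target_names[prediction])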

Step 3: What is Supervised Learning

Supervised Learning

  • Given a dataset of input-output pairs, learn a function to map inputs to outputs
  • There are different tasks – but we start to focus on Classification

Classification

  • Supervised learning: the task of learning a function mapping an input point to a discrete category

Step 4: Example with Iris Flower Dataset

The Iris Flower dataset is one of the classic datasets that everyone working with Machine Learning encounters.

  • Kaggle Iris Flower Dataset
  • Consists of three classes: Iris-setosa, Iris-versicolor, and Iris-virginica
  • Given the independent features, can we predict the class?
import pandas as pd
data = pd.read_csv('https://raw.githubusercontent.com/LearnPythonWithRune/DataScienceWithPython/main/files/iris.csv', index_col=0)
print(data.head())
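
Before modelling, it can help to inspect the data; for instance (this exploration step is an extra, not part of the lesson), check the shape and how many samples each class has.

print(data.shape)                      # number of rows and columns
print(data['Species'].value_counts()) # number of samples per class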

Step 5: Create a Machine Learning Model

  • A Few Machine Learning Models

The Machine Learning process is divided into a few steps – including dividing the data into a train and test dataset. The train dataset is used to train the model, while the test dataset is used to check the accuracy of the model.

  • Steps
    • Step 1: Assign independent features (those predicting) to X
    • Step 2: Assign classes (labels/dependent features) to y
    • Step 3: Divide into training and test sets: X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    • Step 4: Create the model: svc = SVC()
    • Step 5: Fit the model: svc.fit(X_train, y_train)
    • Step 6: Predict with the model: y_pred = svc.predict(X_test)
    • Step 7: Test the accuracy: accuracy_score(y_test, y_pred)

Full code example:

from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
X = data.drop('Species', axis=1)
y = data['Species']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
svc = SVC()
svc.fit(X_train, y_train)
y_pred = svc.predict(X_test)
accuracy_score(y_test, y_pred)

This gives a model with high accuracy on the test data.

You can do the same with KNeighborsClassifier.

from sklearn.neighbors import KNeighborsClassifier
kn = KNeighborsClassifier()
kn.fit(X_train, y_train)
y_pred = kn.predict(X_test)
accuracy_score(y_test, y_pred)
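
If you want to compare more models side by side, a simple loop over classifiers works; this sketch reuses the train/test split from above, and the choice of models here is just an example.

from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Fit each model on the same training data and compare test accuracy
models = {
    'SVC': SVC(),
    'KNeighborsClassifier': KNeighborsClassifier(),
    'DecisionTreeClassifier': DecisionTreeClassifier(),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print(name, accuracy_score(y_test, y_pred))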

Step 6: Find the most important features

  • permutation_importance – Permutation importance for feature evaluation.
  • Use permutation_importance to calculate it: perm_importance = permutation_importance(svc, X_test, y_test)
  • The results will be found in perm_importance.importances_mean
from sklearn.inspection import permutation_importance
perm_importance = permutation_importance(svc, X_test, y_test)
perm_importance.importances_mean

Visualize the features by importance

  • The most important features are given by perm_importance.importances_mean.argsort()
    • HINT: assign it to sorted_idx
  • To visualize it we can create a DataFrame: pd.DataFrame(perm_importance.importances_mean[sorted_idx], X_test.columns[sorted_idx], columns=['Value'])
  • Then make a barh plot (use figsize)
sorted_idx = perm_importance.importances_mean.argsort()
df = pd.DataFrame(perm_importance.importances_mean[sorted_idx], X_test.columns[sorted_idx], columns=['Value'])
df.plot.barh(figsize=(10, 5))
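# Bonus visualization: scatter plot of petal length vs petal width, colored by species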
color_map = {'Iris-setosa': 'b', 'Iris-versicolor': 'r', 'Iris-virginica': 'y'}
colors = data['Species'].apply(lambda x: color_map[x])
data.plot.scatter(x='PetalLengthCm', y='PetalWidthCm', c=colors)

Want to learn more?

Want to learn more about Data Science to become a successful Data Scientist?

This is one lesson of a 15 part Expert Data Science Blueprint course with the following resources.

  • 15 video lessons – covering the Data Science Workflow and concepts, demonstrating everything on real data, introducing projects and showing a solution (YouTube video).
  • 30 Jupyter Notebooks – with the full code and explanation from the lectures and projects (GitHub).
  • 15 projects – structured with the Data Science Workflow and a solution explained at the end of the video lessons (GitHub).
