How to Choose the Best Machine Learning Model

What will you learn?

This guide will help you choose the right Machine Learning model for your project. It will also show you that there is no single best model, since all models have predictive error. This means that you should seek a model that is good enough.

You will learn about Model Selection Techniques like Probabilistic Measures and Resampling Methods.

Step 1: Problem type

The process of selecting a Machine Learning model starts with the type of problem you are working on.

There are three high-level types of problems.

  • What kind of problem are you looking into?
    • Classification: Predict labels on data with predefined classes
      • Supervised Machine Learning
    • Clustering: Identify similarities between objects and group them in clusters
      • Unsupervised Machine Learning
    • Regression: Predict continuous values
      • Supervised Machine Learning

A great guide is the Sklearn cheat sheet, which helps you narrow down the choice based on the problem type.
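As a rough sketch, the three problem types map to typical scikit-learn estimators as follows (the estimator choices here are illustrative examples, not recommendations):

```python
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.cluster import KMeans

# Classification (supervised): predict labels from predefined classes
clf = LogisticRegression()

# Clustering (unsupervised): group similar objects without labels
clu = KMeans(n_clusters=3, n_init=10)

# Regression (supervised): predict continuous values
reg = LinearRegression()
```

All three follow the same `fit`/`predict` interface, which makes it easy to swap estimators once you know the problem type.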

Step 2: Model Selection Techniques

As said, all models have predictive errors, and the goal isn't to fit a model 100% to your training and test datasets. Your goal is to create a simple model that can predict future values.

This means, that you should seek a model that is good enough for the task.

But how do you do that?

You should use a model selection technique to find a good enough model.

Model Selection Techniques

  • Probabilistic Measures: Scoring by performance and complexity of model.
  • Resampling Methods: Splitting in sub-train and sub-test datasets and scoring by mean values of repeated runs.
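A minimal sketch of a resampling method: k-fold cross-validation with scikit-learn, which repeatedly splits the data into sub-train and sub-test folds and scores the model by the mean over the runs (the iris dataset and the decision tree here are just placeholders):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# 5-fold cross-validation: the data is split into 5 folds, and the model
# is trained on 4 folds and scored on the held-out fold, 5 times over.
scores = cross_val_score(DecisionTreeClassifier(random_state=42), X, y, cv=5)

print(scores.mean())  # mean accuracy across the 5 folds
```

The mean score is a more stable estimate of out-of-sample performance than a single train/test split.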

Step 3: Example of testing a model

We will look at a dataset and run a few tests. This will not be an in-depth example of the above methods. Instead, we will tweak the problem and convert it from one problem type into another, which can actually sometimes be a good approach.

Hence, we take a Regression problem and turn it into a classification problem.

Even though the data fits a regression type of problem, maybe you are not looking for specific values. In that case, you can turn the problem into a classification problem and get more valuable results from your model.

Let’s try it.

import pandas as pd

# Load the house sales dataset
data = pd.read_parquet('https://github.com/LearnPythonWithRune/DataScienceWithPython/raw/main/files/house_sales.parquet')

# Inspect the distribution of the sale price
data['SalePrice'].plot.hist(bins=20)

Now – let’s convert it into categories.

  • cut(): Bin values into discrete intervals.
    • Bins of equal width based on the value range.
  • qcut(): Quantile-based discretization function.
    • Bins with an equal number of data points.
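To see the difference, here is a small sketch on synthetic right-skewed data (the values are made up for illustration, roughly mimicking a house-price distribution):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# Right-skewed synthetic "prices"
s = pd.Series(rng.exponential(scale=100_000, size=1_000))

# cut(): equal-width bins -> skewed data piles up in the lowest bin
print(pd.cut(s, bins=3).value_counts())

# qcut(): quantile bins -> roughly a third of the rows in each bin
print(pd.qcut(s, q=3).value_counts())
```

With skewed data, `cut` leaves the upper bins almost empty, while `qcut` keeps the classes balanced.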

In this case qcut is more appropriate, as the data is skewed, as the histogram above shows.

from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

# Turn the continuous sale price into 3 equal-size categories
data['Target'] = pd.qcut(data['SalePrice'], q=3, labels=[1, 2, 3])
data['Target'].value_counts()/len(data)

# Features: everything except the price and the derived target
X = data.drop(['SalePrice', 'Target'], axis=1).fillna(-1)
y = data['Target']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=42)

# Fit a linear Support Vector Classifier and score it on the test set
svc = LinearSVC()
svc.fit(X_train, y_train)
y_pred = svc.predict(X_test)

accuracy_score(y_test, y_pred)

This makes 3 target groups of equal size and runs a Linear SVC model on it. The accuracy score is around 0.73.

To see if that is good, we will need to experiment a bit.

Also, notice that the group division we used might not be ideal, as it simply assigns 33% of the data to each group.
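To judge whether an accuracy of 0.73 is good, one common sanity check is a naive baseline: with three equal-size groups, always predicting a single class gives about 0.33 accuracy. A sketch using scikit-learn's DummyClassifier on toy stand-in data (the arrays below are synthetic, not the house data):

```python
import numpy as np
from sklearn.dummy import DummyClassifier

rng = np.random.default_rng(42)
# Toy stand-in for features and three equal-size target groups
X = rng.normal(size=(900, 4))
y = np.repeat([1, 2, 3], 300)

# Always predicts the most frequent class; with balanced classes
# that yields about 1/3 accuracy
baseline = DummyClassifier(strategy='most_frequent').fit(X, y)
print(baseline.score(X, y))
```

Any useful model should clearly beat this baseline, so 0.73 against a 0.33 baseline is a meaningful improvement.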

from sklearn.neighbors import KNeighborsClassifier

# Try a k-nearest neighbors classifier for comparison
neigh = KNeighborsClassifier()
neigh.fit(X_train, y_train)
y_pred = neigh.predict(X_test)

accuracy_score(y_test, y_pred)

This gives 0.72.

See more experiments in the video at the top of the page.

Want to learn more?

Want to learn more about Data Science to become a successful Data Scientist?

This is one lesson of a 15-part Expert Data Science Blueprint course with the following resources.

  • 15 video lessons – cover the Data Science Workflow and concepts, demonstrate everything on real data, introduce projects and show a solution (YouTube video).
  • 30 Jupyter Notebooks – with the full code and explanation from the lectures and projects (GitHub).
  • 15 projects – structured with the Data Science Workflow and a solution explained at the end of the video lessons (GitHub).
