Learn NumPy Basics with your first Machine Learning Project

What will we cover?

In this tutorial you will learn some basic NumPy. The best way to learn something new is to combine it with something useful, so you will use NumPy while creating your first Machine Learning project.

Step 1: What is NumPy?

NumPy is the fundamental package for scientific computing in Python.

NumPy.org

Well, that is how it is stated on the official NumPy page.

Maybe a better question is, what do you use NumPy for and why?

Well, the main tool you use from NumPy is the NumPy array. Arrays are quite similar to Python lists, just with a few restrictions.

  1. It can only contain one data type. That is, if a NumPy array has integers, then all entries can only be integers.
  2. The size cannot change. That is, you cannot add or remove entries in place, like you can in a Python list. (The values stored in the array can still be changed.)
  3. If it is a multi-dimensional array, all sub-arrays must have the same shape. That is, you cannot have something similar to a Python list of lists, where the first sub-list is of length 3, the second of length 7, and so on. They must all have the same length (or shape).
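The three restrictions above can be seen in a few lines. This is only a small sketch; the variable names (mixed, a, b, grid) are made up for illustration.

```python
import numpy as np

# 1. One data type: mixing integers and a float turns every entry into a float.
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)  # float64

# 2. Fixed size: np.append does not grow the array, it builds a new one.
a = np.array([1, 2, 3])
b = np.append(a, 4)
print(a.shape, b.shape)  # (3,) (4,)

# 3. Same shape for all sub-arrays: two rows with three entries each.
grid = np.array([[1, 2, 3], [4, 5, 6]])
print(grid.shape)  # (2, 3)
```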

Why would anyone use them, you might ask? They are more restrictive than Python lists.

Funnily enough, making a data structure more restrictive, as NumPy arrays are, can make it more efficient (faster).

Why?

Well, think about it. You know more about the data structure (every entry has the same type and size), and hence do not need to make as many additional checks.
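If you want to see the difference yourself, here is a rough timing sketch using the standard timeit module. The size n and the number of repetitions are arbitrary choices, and the measured times depend on your machine, so no numbers are claimed here. On most setups the array version should be noticeably faster.

```python
import timeit

import numpy as np

n = 100_000  # arbitrary size, chosen only for this sketch
numbers_list = list(range(n))
numbers_array = np.arange(n)

# Double every entry: once with a Python list, once with a NumPy array.
list_time = timeit.timeit(lambda: [v * 2 for v in numbers_list], number=20)
array_time = timeit.timeit(lambda: numbers_array * 2, number=20)

print(f"list:  {list_time:.4f} seconds")
print(f"array: {array_time:.4f} seconds")
```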

Step 2: Some NumPy array basics we will use for our Machine Learning project

A NumPy array can be created from a list.

import numpy as np

a1 = np.array([1, 2, 3, 4])
print(a1)

Which will print the following.

[1 2 3 4]

You can get the data type of a NumPy array as follows.

print(a1.dtype)

It will print int64 (on most platforms). That is, the whole array has only one type, int64, which is a 64-bit integer. This is also different from Python integers, whose size you cannot specify. Here you can have int8, int16, int32, int64, and more. Again a restriction, which makes it more efficient.
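As a small sketch of what the integer size means in practice, you can set the type with the dtype argument and look at how many bytes the array uses. The names small and big are only for illustration.

```python
import numpy as np

small = np.array([1, 2, 3, 4], dtype=np.int8)
big = np.array([1, 2, 3, 4], dtype=np.int64)

# itemsize is the number of bytes per entry, nbytes the total for all entries.
print(small.dtype, small.itemsize, small.nbytes)  # int8 1 4
print(big.dtype, big.itemsize, big.nbytes)        # int64 8 32
```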

print(a1.shape)

The above gives the shape, here (4,). Notice that the number of entries is fixed when the array is created: you cannot grow or shrink the array in place.
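A short sketch of how this plays out with reshape: it hands you a new array object with another shape, the original a1 keeps its own shape, and the number of entries has to stay the same. The name square is made up for this example.

```python
import numpy as np

a1 = np.array([1, 2, 3, 4])

# reshape returns a new array object; a1 itself is left as it was.
square = a1.reshape((2, 2))
print(a1.shape)      # (4,)
print(square.shape)  # (2, 2)

# The number of entries must match: 4 entries do not fit into 3 x 2.
try:
    a1.reshape((3, 2))
except ValueError as error:
    print("Cannot reshape:", error)
```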

Let’s create another NumPy array and try a few things.

a1 = np.array([1, 2, 3, 4])
a2 = np.array([5, 6, 7, 8])

print(a1*2)
print(a1*a2)
print(a1 + a2)

Which results in the following.

[2 4 6 8]
[ 5 12 21 32]
[ 6  8 10 12]

With a little inspection you will realize that the first (a1*2) multiplies each entry by 2. The second (a1*a2) multiplies the entries pairwise. The third (a1 + a2) adds the entries pairwise.
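It is worth comparing this with plain Python lists, where the same operators do something quite different. A small sketch, with l1 and l2 as illustrative names:

```python
import numpy as np

l1 = [1, 2, 3, 4]
l2 = [5, 6, 7, 8]

# With lists, * repeats the list and + joins two lists.
print(l1 * 2)   # [1, 2, 3, 4, 1, 2, 3, 4]
print(l1 + l2)  # [1, 2, 3, 4, 5, 6, 7, 8]

a1 = np.array(l1)
a2 = np.array(l2)

# With arrays, the same operators work entry by entry.
print(a1 * 2)   # [2 4 6 8]
print(a1 + a2)  # [ 6  8 10 12]
```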

Step 3: What is Machine Learning?

  • In the classical computing model everything is programmed into the algorithms. This has the limitation that all decision logic needs to be understood before usage. And if things change, we need to modify the program.
  • With the modern computing model (Machine Learning) this paradigm changes. We feed the algorithms with data, and based on that data, the program makes the decisions.

How Machine Learning Works

  • On a high level you can divide Machine Learning into two phases.
    • Phase 1: Learning
    • Phase 2: Prediction
  • The learning phase (Phase 1) can be divided into substeps.
  • It all starts with a training set (training data). This data set should represent the type of data that the Machine Learning model will be used to predict from in Phase 2 (prediction).
  • The pre-processing step is about cleaning up data. While Machine Learning is awesome, it cannot figure out what good data looks like. You need to do the cleaning as well as transform the data into a desired format.
  • Then for the magic, the learning step. There are three main paradigms in machine learning.
    • Supervised: where you tell the algorithm what category each data item is in. Each data item from the training set is tagged with the right answer.
    • Unsupervised: where the learning algorithm is not given the right answers and has to find the structure in the data itself.
    • Reinforcement: where the algorithm learns from the rewards its past actions earned.
  • Finally, testing is done to see if the model is good. For this, the data is divided into a training set and a test set before the learning. The test set is used to see if the model can predict data it has not seen. If not, a new model might be necessary.

Then the prediction begins.
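The two phases can be sketched in code as follows. This is only one possible outline, assuming scikit-learn is installed. The data is invented for illustration (y = 2x + 1), the model is the Linear Regression model from the next step, and train_test_split with test_size=0.25 is just one common way to divide the data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Invented data for illustration only: y = 2x + 1.
X = np.arange(20).reshape((-1, 1))
y = 2 * X.ravel() + 1

# Phase 1: divide the data, learn from the training set, check on the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
model = LinearRegression()
model.fit(X_train, y_train)
test_score = model.score(X_test, y_test)  # R^2, where 1.0 is a perfect fit
print(f"Score on the test set: {test_score:.3f}")

# Phase 2: predict from data the model has not seen.
new_data = np.array([[25], [30]])
print(model.predict(new_data))
```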

Step 4: A Linear Regression Model

Let’s try to use a Machine Learning model. One of the first models you will meet is the Linear Regression model.

Simply said, this model tries to fit data to a straight line. The best way to understand that is to see it visually with one explanatory variable. That is, given a value (the explanatory variable), can you predict the scalar response (the value you want to predict)?

Say, given the temperature (explanatory variable), can you predict the sale of ice cream? Assuming there is a linear relationship, can you determine it? A guess is that the hotter it is, the more ice cream is sold. But whether a linear model is a good predictor is beyond the scope here.
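To make the idea of fitting a straight line concrete, here is a minimal sketch using NumPy's polyfit with degree 1. The temperatures and sales figures are invented for illustration only, and alpha and beta are the names used for the intercept and slope of the line.

```python
import numpy as np

# Invented numbers, only to illustrate the idea.
temperature = np.array([15, 18, 21, 24, 27, 30])
ice_creams_sold = np.array([40, 52, 61, 75, 83, 96])

# A polynomial of degree 1 is a straight line: sold = alpha + beta * temperature.
# polyfit returns the highest power first, so the slope comes before the intercept.
beta, alpha = np.polyfit(temperature, ice_creams_sold, 1)
print(f"alpha (intercept): {alpha:.2f}")
print(f"beta (slope): {beta:.2f}")

# Use the line to estimate the sales on a 33 degree day.
estimate = alpha + beta * 33
print(f"Estimated sales at 33 degrees: {estimate:.1f}")
```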

Let’s try with some simple data.

But first we need to import the libraries.

import numpy as np
from sklearn.linear_model import LinearRegression

Then we generate some simple data.

x = [i for i in range(10)]
y = [i for i in range(10)]

In this case the data is fully correlated (y equals x), which is enough for a demonstration. This part is equivalent to the Get data step.

Here x is the explanatory variable and y is the scalar response we want to predict.

When you train the model, you give it pairs of explanatory values and scalar responses. This is needed, as the model needs something to learn from.

After the learning you can predict data. But first let’s prepare the data for the learning. This is the Pre-processing step.

X = np.array(x).reshape((-1, 1))
Y = np.array(y).reshape((-1, 1))

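Notice that this is a very simple step: we only need to convert the data into the correct format. The model expects a two-dimensional array with one row per data item, which is what reshape((-1, 1)) produces.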
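A quick sketch of what reshape((-1, 1)) does to the data. The -1 tells NumPy to work out the number of rows itself, and the 1 asks for a single column. The names flat and column are made up for this example.

```python
import numpy as np

x = [i for i in range(10)]

flat = np.array(x)
column = flat.reshape((-1, 1))

print(flat.shape)    # (10,)
print(column.shape)  # (10, 1)

# The first three rows: each data item now sits in its own row.
print(column[:3])
```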

Then we can train the model. This is the Train model step.

lin_regressor = LinearRegression()
lin_regressor.fit(X, Y)

Here we will skip the test model step, as the data is simple.
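If you still want a quick sanity check, one option is the model's score method, which returns the R² value (1.0 means a perfect fit). Note that this sketch scores the model on the same data it was trained on, so it is not a real test with a separate test set.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.arange(10).reshape((-1, 1))
Y = np.arange(10).reshape((-1, 1))

lin_regressor = LinearRegression()
lin_regressor.fit(X, Y)

# R^2 on the training data itself; the data lies exactly on a line.
r_squared = lin_regressor.score(X, Y)
print(round(r_squared, 5))  # 1.0
```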

To predict data we can call the predict method of the model.

Y_pred = lin_regressor.predict(X)
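The model can also predict from values it has not seen during training. A small sketch, where new_X and new_pred are illustrative names and the new values 10, 11, and 12 are chosen arbitrarily:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.arange(10).reshape((-1, 1))
Y = np.arange(10).reshape((-1, 1))

lin_regressor = LinearRegression()
lin_regressor.fit(X, Y)

# New values must have the same format as the training data: one column.
new_X = np.array([10, 11, 12]).reshape((-1, 1))
new_pred = lin_regressor.predict(new_X)

# Since y equals x in the training data, the predictions should be
# very close to 10, 11, and 12.
print(new_pred.ravel())
```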

Here is the full code together.

import numpy as np
from sklearn.linear_model import LinearRegression

x = [i for i in range(10)]
y = [i for i in range(10)]

X = np.array(x).reshape((-1, 1))
Y = np.array(y).reshape((-1, 1))

lin_regressor = LinearRegression()
lin_regressor.fit(X, Y)

Y_pred = lin_regressor.predict(X)

Step 5: Visualize the result

You can visualize the data and the prediction as follows.

import matplotlib.pyplot as plt

alpha = str(round(lin_regressor.intercept_[0], 5))
beta = str(round(lin_regressor.coef_[0][0], 5))

fig, ax = plt.subplots()

ax.set_title(f"Alpha {alpha}, Beta {beta}")
ax.scatter(X, Y)
ax.plot(X, Y_pred, c='r')

# Show the figure (needed when running as a script instead of in a notebook).
plt.show()

Alpha is called the constant or intercept and measures the value where the regression line crosses the y-axis.

Beta is called the coefficient or slope and measures the steepness of the regression line.
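To see how alpha and beta make up the prediction, you can compute alpha + beta * x by hand and compare it with what the model predicts. In this sketch the data is invented so that the line is y = 3x + 2, which means alpha should come out close to 2 and beta close to 3.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Invented data on the line y = 3x + 2.
X = np.arange(10).reshape((-1, 1))
Y = 3 * X + 2

lin_regressor = LinearRegression()
lin_regressor.fit(X, Y)

alpha = lin_regressor.intercept_[0]
beta = lin_regressor.coef_[0][0]
print(round(alpha, 5), round(beta, 5))

# The prediction is the straight line alpha + beta * x.
manual = alpha + beta * X
print(np.allclose(manual, lin_regressor.predict(X)))  # True
```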

Next step

If you want a real project with Linear Regression, then check out the video at the top of the post, which is part of a full course.

The project will look at car specs to see if there is a connection.

Want to learn more Python? Then this is part of an 8-hour FREE video course with full explanations, projects on each level, and guided solutions.

The course is structured with the following resources to improve your learning experience.

  • 17 video lessons teaching you everything you need to know to get started with Python.
  • 34 Jupyter Notebooks with lesson code and projects.
  • A FREE 70+ page eBook with all the learnings from the lessons.

See the full FREE course page here.

If you instead want to learn more about Machine Learning, do not worry.

Check out my Machine Learning with Python course.

  • 15 video lessons teaching you all aspects of Machine Learning.
  • 30 Jupyter Notebooks with lesson code and projects.
  • 10 hours of FREE video content to support your learning journey.

Go to the course page for details.
