Batch Process Face Detection in 3 Steps with OpenCV

What will you learn?

You want to extract or identify faces on a bunch of images, but how do you do that without becoming a Machine Learning expert?

Here you will learn how to do it without any Machine Learning skills.

Many Machine Learning tasks are so common that you can just use pre-built Machine Learning models. Here you will learn the task of finding faces in images and extracting their locations.

Step 1: Pre-built OpenCV models to detect faces

When you think of detecting faces on images, you might get scared. I’ve been there, but there is nothing to be scared of, because some awesome people already did all the hard work for you.

They built a model, which can detect faces on images.

All you need to do is feed it with images and let it do all the work.

This boils down to the following.

  1. Knowing which model to use.
  2. Feeding it with images.
  3. Converting the results it returns into something useful.

This is what the rest of this tutorial will teach you.

We will use OpenCV and its pre-built Haar cascade detection model.

First you should download and install the requirements.

This can be done either by cloning this repository, or by downloading the files as a zip file and unpacking them.

You should also install the opencv-python library. This can be done as follows.

pip install opencv-python

You can also use the requirements.txt file to install it.

pip install -r requirements.txt

Step 2: Detect a face

We will use this image to start with.

The picture is part of the repository from step 1.

Now let’s explore the code in face_detection.py.

# importing opencv
import cv2

# using cv2.CascadeClassifier
# See https://docs.opencv.org/3.4/db/d28/tutorial_cascade_classifier.html
# See more Cascade Classifiers https://github.com/opencv/opencv/tree/4.x/data/haarcascades
face_cascade = cv2.CascadeClassifier("haarcascade_frontalface_default.xml")

img = cv2.imread("sample_images/sample-00.jpg")

# changing the image to gray scale for better face detection
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

faces = face_cascade.detectMultiScale(
    gray,
    scaleFactor=2,  # Big reduction
    minNeighbors=5  # 4-6 range
)

# drawing a rectangle to the image.
# for loop is used to access all the coordinates of the rectangle.
for x, y, w, h in faces:
    cv2.rectangle(img, (x, y), (x+w, y+h), (0, 255, 0), 5)

# showing the detected face followed by the waitKey method.
cv2.imshow("image", img)
cv2.waitKey(0)
cv2.destroyAllWindows()

First, notice that the opencv-python package is imported with import cv2.

Also notice that we need to run this code from the directory where the file haarcascade_frontalface_default.xml is located.

After that we read the image into the variable img. Notice that this assumes the files are structured as in the GitHub repository (downloaded in step 1).

When you work with images, you often do not need the level of detail they contain. Therefore, the first thing we do is to gray-scale the image.

After we have gray scaled the image we use the face detection model (face_cascade.detectMultiScale).

This will give the result faces, which is an iterable.

We want to draw rectangles around the detected faces on the original image (not the gray-scaled one).

Finally, we show the image and wait until someone hits a key.

Step 3: Batch process face detection

To batch process face detection, a great idea is to build a class that does the face detection. It could be designed in many ways, but the idea is to decouple the filename handling from the actual face detection.

One way to do it could be as follows.

import os
import cv2


class FaceDetector:
    def __init__(self, scale_factor=2, min_neighbors=5):
        self.face_cascade = cv2.CascadeClassifier("haarcascade_frontalface_default.xml")
        self.scale_factor = scale_factor
        self.min_neighbors = min_neighbors
        self.img = None

    def read_image(self, filename):
        self.img = cv2.imread(filename)

    def detect_faces(self):
        gray = cv2.cvtColor(self.img, cv2.COLOR_BGR2GRAY)

        faces = self.face_cascade.detectMultiScale(
            gray,
            scaleFactor=self.scale_factor,
            minNeighbors=self.min_neighbors
        )

        # drawing a rectangle to the image.
        # for loop is used to access all the coordinates of the rectangle.
        for x, y, w, h in faces:
            cv2.rectangle(self.img, (x, y), (x + w, y + h), (0, 255, 0), 5)

        return self.img


face_detector = FaceDetector()

for filename in os.listdir('sample_images/'):
    print(filename)
    face_detector.read_image(f'sample_images/{filename}')
    img = face_detector.detect_faces()

    cv2.imshow("image", img)
    cv2.waitKey(0)
    cv2.destroyAllWindows()

If you want to write the images with face detections to storage, you should replace the cv2.imshow line with the following.

    cv2.imwrite(filename, img)
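
For example, a minimal sketch of the batch loop that writes the annotated images to a separate folder (the output folder name here is just an example) could look like this, using the FaceDetector class from above.

import os
import cv2

output_dir = 'output_images'  # example output folder, created if it does not exist
os.makedirs(output_dir, exist_ok=True)

face_detector = FaceDetector()

for filename in os.listdir('sample_images/'):
    face_detector.read_image(f'sample_images/{filename}')
    img = face_detector.detect_faces()

    # write the annotated image instead of showing it
    cv2.imwrite(os.path.join(output_dir, filename), img)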

Want to learn more Machine Learning?

You will be surprised how easy Machine Learning has become. There are many great and easy-to-use libraries. All you need to learn is how to train them and use them to predict.

Do you want to learn more?

Then I created this free 10-hour Machine Learning course, which will cover all you need.

  • 15 video lessons – which explain Machine Learning concepts, demonstrate models on real data, introduce projects and show a solution (YouTube playlist).
  • 30 JuPyter Notebooks – with the full code and explanation from the lectures and projects (GitHub).
  • 15 projects – with step guides to help you structure your solutions and solution explained in the end of video lessons (GitHub).

How to make a Formatted Word Cloud in 7 Steps

What will you learn?

At the end of this tutorial you will know how to make a formatted word cloud with Python like this one.

Step 1: Read content

The first thing you need is some content to compute word frequencies on.

In this example we will use the books of Sherlock Holmes – which are available in my GitHub here.

You can clone the repo or just download the full repository as a zip file from the green Code dropdown menu. Then you should see a folder with all the Holmes texts.

We will read them here.

import os

content = []
for filename in os.listdir('holmes/'):
    with open(f'holmes/{filename}') as f:
        content.append(f.read())

Of course you can have any other set of text files.

The result in content is a list with the full text of each file. Each entry is raw text including newlines.

Step 2: Corpus in lower case

Here we will use the NLTK tokenizer to get each word.

import nltk

# The tokenizer data may be missing the first time; if so, run nltk.download('punkt')
corpus = []
for item in content:
    corpus.extend([word.lower() for word in nltk.word_tokenize(item)])

This creates a list of each word in lower case.

We use list comprehension. If you are new to that check this tutorial.

Step 3: Remove stop words

Stop words are words with little or no meaning on their own. We do not want to include them in our word cloud, as they are common and take up a lot of space.

from nltk.corpus import stopwords

# The stop word list may be missing the first time; if so, run nltk.download('stopwords')
corpus = [w for w in corpus if w not in stopwords.words('english')]

Again we use list comprehension.

Step 4: Keep alphanumeric words

This can also be done by list comprehension.

corpus = [w for w in corpus if w.isalnum()]

Step 5: Lemmatize words

To lemmatize words is to get them in their root form. We don’t want to have the same word in different forms. We only need it in the basic form. This is what lemmatizing does.

from nltk.corpus import wordnet
from nltk.stem.wordnet import WordNetLemmatizer

nltk.download('wordnet')
# The POS tagger used below may also require nltk.download('averaged_perceptron_tagger')

def get_wordnet_pos(word):
    """Map POS tag to first character lemmatize() accepts"""
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}

    return tag_dict.get(tag, wordnet.NOUN)

corpus = [WordNetLemmatizer().lemmatize(w, get_wordnet_pos(w)) for w in corpus]

Again we use list comprehension to achieve the result.

Step 6: Create a Word Cloud

First we create a simple word cloud.

from wordcloud import WordCloud

unique_string = " ".join(corpus)

wordcloud = WordCloud(width = 1000, height = 500).generate(unique_string)
wordcloud.to_file("word_cloud.png")

This will create an image word_cloud.png similar to this one.

Step 7: Create a formatted Word Cloud

To do that we need a mask. We will use the cloud.png from the repository.

import numpy as np
from PIL import Image

unique_string_v2 = " ".join(corpus)
cloud_mask = np.array(Image.open("cloud.png"))
wordcloud = WordCloud(width=1000, height=500, background_color="white",
                      mask=cloud_mask, max_words=5000, contour_width=2, contour_color='black')
wordcloud.generate(unique_string_v2)
wordcloud.to_file("word_cloud_masked.png")

This will generate a picture like this one.

Full code

You can get the full code from my GitHub repository.

If you clone it you get the full code as well as all the files you need.

import nltk
from nltk.corpus import stopwords
import os
from nltk.corpus import wordnet
from nltk.stem.wordnet import WordNetLemmatizer
from wordcloud import WordCloud
import numpy as np
from PIL import Image

nltk.download('wordnet')
nltk.download('omw-1.4')

content = []
for filename in os.listdir('holmes/'):
    with open(f'holmes/{filename}') as f:
        content.append(f.read())

corpus = []
for item in content:
    corpus.extend([word.lower() for word in nltk.word_tokenize(item)])

corpus = [w for w in corpus if w not in stopwords.words('english')]

corpus = [w for w in corpus if w.isalnum()]


def get_wordnet_pos(word):
    """Map POS tag to first character lemmatize() accepts"""
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}

    return tag_dict.get(tag, wordnet.NOUN)


corpus = [WordNetLemmatizer().lemmatize(w, get_wordnet_pos(w)) for w in corpus]


unique_string = " ".join(corpus)

wordcloud = WordCloud(width = 1000, height = 500).generate(unique_string)
wordcloud.to_file("word_cloud.png")

unique_string_v2 = " ".join(corpus)
cloud_mask = np.array(Image.open("cloud.png"))
wordcloud = WordCloud(width=1000, height=500, background_color="white",
                      mask=cloud_mask, max_words=5000, contour_width=2, contour_color='black')
wordcloud.generate(unique_string_v2)
wordcloud.to_file("word_cloud_masked.png")

Master Data Visualization for 3 Purposes as a Data Scientist with Full Code Examples

What will we cover?

We will investigate the 3 main purposes of Data Visualization as a Data Scientist.

  • Data Quality: We will demonstrate with examples how you can identify faulty and wrongly formatted data with visualization.
  • Data Exploration: This will teach you how to understand data better with visualization.
  • Data Presentation: Here we explore what newcomers often think is the main purpose of Data Visualization: presenting the findings. This will focus on how to use Data Visualization to confirm your key findings.

But first, we will understand the power of Data Visualization – understand why it is such a powerful tool to master.

Step 1: The Power of Data Visualization

Let’s consider some data – you get it by the following code.

import pandas as pd

sample = pd.read_csv('https://raw.githubusercontent.com/LearnPythonWithRune/DataScienceWithPython/main/files/sample_corr.csv')
print(sample)

The output is – can you spot any connection?

Let’s try to visualize the same data.

Matplotlib is an easy-to-use visualization library for Python.

import matplotlib.pyplot as plt

sample.plot.scatter(x='x', y='y')
plt.show()

Giving the following output.

And here it is easy to spot that there is some kind of correlation. And actually, you would be able to absorb this connection no matter how many data points there were, given the data has the same nature.

What Data Visualization gives

  • Absorb information quickly
  • Improve insights
  • Make faster decisions

Step 2: Data Quality with Visualization

Data Quality is something many like to talk about – but unfortunately there is no precise universal definition of it; it is rather context-specific.

That said, it is a concept you need to understand as a Data Scientist.

In general Data Quality is about (not only)

  • Missing data (often represented as NA-values)
  • Wrong data (data which cannot be used)
  • Different scaled data (e.g. data in different units without being specified)

Sometimes Data Quality is mixed up with aspects of Data Wrangling – for example, extracting values from string representations.

Data Quality requires that you know something about the data.

Imagine we are considering a dataset of human heights in centimeters. Then we can check it with a histogram.

data = pd.read_csv('https://raw.githubusercontent.com/LearnPythonWithRune/DataScienceWithPython/main/files/sample_height.csv')
data.plot.hist()
plt.show()

We immediately realize that some of the data is not correct.

Then we can get the data below 50 cm as follows.

print(data[data['height'] < 50])

Looking at that data, you might realize it could be data entered in meters rather than centimeters. This happens often when data is entered by humans: some might type meters instead of centimeters.

This could mean that the data is valid, it just needs re-scaling.
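
If that interpretation holds, a minimal sketch of the re-scaling could look like this (assuming the column is named height, as in the filter above).

# Hypothetical fix: assume values below 50 were entered in meters and convert them to centimeters
meters = data['height'] < 50
data.loc[meters, 'height'] = data.loc[meters, 'height'] * 100

# Re-plot to check that the distribution now looks reasonable
data.plot.hist()
plt.show()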

Another example is checking for outliers – in this case wrong data.

Consider this dataset of human age.

data = pd.read_csv('files/sample_age.csv')
data.plot.hist()
plt.show()

And you see someone with age around 300.

Similarly, you can get it with data[data['age'] > 150] and see one entry with an age of 314 years.

As you can see, Data Visualization helps you quickly get an idea of Data Quality.

Step 3: Data Exploration with Data Visualization

We already got an idea that Data Visualization helps us absorb information quickly, but not only that: it also helps us gain better insights into the data, enabling us to make faster decisions.

Now we will consider data from the World Bank (The World Bank is a great source of datasets).

Let’s consider the dataset EN.ATM.CO2E.PC.

Now let’s consider some typical Data Visualizations. To get started, we need to read the data.

import pandas as pd
import matplotlib.pyplot as plt

data = pd.read_csv('https://raw.githubusercontent.com/LearnPythonWithRune/DataScienceWithPython/main/files/WorldBank-ATM.CO2E.PC_DS2.csv', index_col=0)

print(data.head())

We see that each year has a row, and each column represents a country (we only see part of them here).

Simple plot

To create a simple plot you can apply the following on a DataFrame.

data['USA'].plot()
plt.show()

A great thing about this is how simple it is to create.

Adding a title and labels is straightforward.

  • title='Title' adds the title
  • xlabel='X label' adds or changes the X-label
  • ylabel='Y label' adds or changes the Y-label
data['USA'].plot(title='CO2 per capita in USA', ylabel='CO2 per capita')
plt.show()

Another thing you can do is adding ranges to the axis.

  • xlim=(min, max) or xlim=min Sets the x-axis range
  • ylim=(min, max) or ylim=min Sets the y-axis range
data['USA'].plot(title='CO2 per capita in USA', ylabel='CO2 per capita', ylim=0)
plt.show()

If you want to compare two columns in the DataFrame you can do it as follows.

data[['USA', 'WLD']].plot(ylim=0)
plt.show()

If you want to set the figure size of the plot, this can be done as follows.

  • figsize=(width, height) in inches
data[['USA', 'DNK', 'WLD']].plot(ylim=0, figsize=(20,6))
plt.show()

Bar Plot

You can create a bar plot as follows.

  • .plot.bar() Create a bar plot
data['USA'].plot.bar(figsize=(20,6))
plt.show()

Bar plot with two columns.

data[['USA', 'WLD']].plot.bar(figsize=(20,6))
plt.show()

Plot a range.

  • .loc[from:to] apply this on the DataFrame to get a range (both inclusive)
data[['USA', 'WLD']].loc[2000:].plot.bar(figsize=(20,6))
plt.show()

Histograms

You can create a histogram as follows.

  • .plot.hist() Create a histogram
  • bins=<number of bins> Specify the number of bins in the histogram.
data['USA'].plot.hist(figsize=(20,6), bins=7)
plt.show()

Pie Chart

You create a Pie Chart as follows.

  • .plot.pie() Creates a Pie Chart
df = pd.Series(data=[3, 5, 7], index=['Data1', 'Data2', 'Data3'])
df.plot.pie()
plt.show()

You can add values counts to your Pie Chart

  • A simple chart of values above/below a threshold
  • .value_counts() Counts occurrences of values in a Series (or DataFrame column)
  • A few arguments to .plot.pie()
    • colors=<list of colors>
    • labels=<list of labels>
    • title='<title>'
    • ylabel='<label>'
    • autopct='%1.1f%%' sets percentages on chart
(data['USA'] < 17.5).value_counts().plot.pie(colors=['r', 'g'], labels=['>=17.5', '<17.5'], title='CO2 per capita', autopct='%1.1f%%')
plt.show()

Scatter Plot

Assume we want to investigate whether GDP per capita and CO2 per capita are correlated. A great way to get an idea about it is by using a scatter plot.

Let’s try to do that. The data is available; we just need to load it.

import pandas as pd
import matplotlib.pyplot as plt

data = pd.read_csv('https://raw.githubusercontent.com/LearnPythonWithRune/DataScienceWithPython/main/files/co2_gdp_per_capita.csv', index_col=0)

data.plot.scatter(x='CO2 per capita', y='GDP per capita')
plt.show()

It seems there is some weak correlation – this can also be confirmed by calculating the correlation with data.corr(), showing a 0.633178 correlation.
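
For completeness, here is a minimal sketch of that check (using the column names from the scatter plot above).

# Correlation matrix of the two columns - the CO2/GDP entry should be around 0.63
print(data.corr())
print(data.corr().loc['CO2 per capita', 'GDP per capita'])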

Step 4: Data Presentation

Data Presentation is about making data easy to digest.

Let’s try to make an example.

Assume we want to give a picture of how US CO2 per capita is compared to the rest of the world.

Preparation

  • Let’s take 2017 (as more recent data is incomplete)
  • What is the mean, max, and min CO2 per capita in the world
import pandas as pd
import matplotlib.pyplot as plt

data = pd.read_csv('https://raw.githubusercontent.com/LearnPythonWithRune/DataScienceWithPython/main/files/WorldBank-ATM.CO2E.PC_DS2.csv', index_col=0)

year = 2017
print(data.loc[year]['USA'])

This gives 14.8.

How can we tell a story?

  • US is above the mean
  • US is not the max
  • It is above 75%
ax = data.loc[year].plot.hist(bins=15, facecolor='green')

ax.set_xlabel('CO2 per capita')
ax.set_ylabel('Number of countries')
ax.annotate("USA", xy=(15, 5), xytext=(15, 30),
            arrowprops=dict(arrowstyle="->",
                            connectionstyle="arc3"))
plt.show()

This is one way to tell a story.

Want to learn more?

Want to learn more about Data Science to become a successful Data Scientist?

This is one lesson of a 15 part Expert Data Science Blueprint course with the following resources.

  • 15 video lessons – covers the Data Science Workflow and concepts, demonstrates everything on real data, introduce projects and shows a solution (YouTube video).
  • 30 JuPyter Notebooks – with the full code and explanation from the lectures and projects (GitHub).
  • 15 projects – structured with the Data Science Workflow and a solution explained in the end of video lessons (GitHub).

Master the Data Science Workflow Blueprint to Get Measurable Data Driven Impact

What will we cover?

In this tutorial we will cover the Data Science Workflow and…

  • Why Data Science?
  • Understand the Problem as a Data Scientist.
  • The Data Science Workflow
  • Explore it with a Student Grade Prediction problem.

We will use Python and pandas with our initial Data Science problem.

Part 1: Why Data Science?

Did you know you check your phone 58 times per day?

Let’s say you are awake 16 hours – that is, you check your phone every 17 minutes during all your waking hours.

It is estimated that 66% of all smartphone users are addicted to their phones.

Does that surprise you?

How do we know that?

Data.

We live in a world where you know that the above statements are not just wild guesses – there is data to confirm them.

This tutorial is not about helping your phone addiction – it is about Data Science.

With a world full of data you can learn just about anything, make your own analysis and understand the aspects better. You can help make data driven decisions, to avoid blind guesses.

This is one reason to love Data Science.

How did Data Science start?

Part 2: Understanding the problem in Data Science

The key to success in Data Science is understanding the problem. Get the right question.

What is the problem we try to solve? This will form the Data Science Problem.

Examples

  • Sales figure and call center logs: evaluate a new product
  • Sensor data from multiple sensors: detect equipment failure
  • Customer data + marketing data: better targeted marketing

Part of understanding the problem includes assessing the situation – this will help you understand your context and your problem better.

In the end, it is all about defining the objective of your Data Science research. What are the success criteria?

The key to a successful Data Science project is to understand the objective and success criteria; this will guide you in your research.

Part 3: Data Science Workflow

Most get Data Science wrong!

At least, at first.

Deadly wrong!

They assume – not to blame them – that Data Science is about knowing the most tools to solve the problem.

This series of tutorials will teach you something different.

The key to a successful Data Scientist is to understand the Data Science Workflow.

Data Science Workflow

Looking at the above flow, you will realize that most beginners only focus on a narrow aspect of it.

That is a big mistake – the real value is in step 5, where you use the insights to create measurable, data-driven impact.

Let’s take an example of how a simple Data Science Workflow could be.

  • Step 1
    • Problem: Predict weather tomorrow
    • Data: Time series on Temperature, Air pressure, Humidity, Rain, Wind speed, Wind direction, etc.
    • Import: Collect data from sources
  • Step 2
    • Explore: Data quality
    • Visualize: A great way to understand data
    • Cleaning: Handle missing or faulty data
  • Step 3
    • Analyze: Build a model that predicts tomorrow’s weather from the prepared data
  • Step 4
    • Present: Weather forecast
    • Visualize: Charts, maps, etc.
    • Credibility: Inaccurate results, too high confidence, not presenting full findings
  • Step 5
    • Insights: What to wear, impact on outside events, etc.
    • Impact: Sales and weather forecast (umbrella, ice cream, etc.)
    • Main goal: This is what makes Data Science valuable

Now, while this looks straightforward, there can be many iterations back to a previous step. Even at step 5, you can consult the client, realize you need more data, and start another iteration from step 1 to enrich the process again.

Part 4: Student Grade Prediction

To get started with a simple project, we will explore the Portuguese high school student dataset from Kaggle.

It consists of features and targets.

The features are the column data for each student. That is, each student is a row in the dataset, and each row has data for each of the features.

Features

The target is what we want to predict from the student data.

That is, given a row of features, can we predict the target?

Target

Here we will look at a smaller problem.

Problem: Propose activities to improve G3 grades.

Our Goal

  • To guide the school in how to help students get higher grades

Yes – we need to explore the data and get ideas on how to help the students to get higher grades.

Now, let’s explore our Data Science Workflow.

Step 1: Acquire

  • Explore problem
  • Identify data
  • Import data

Get the right questions

  • This forms the data science problem
  • What is the problem

We need to understand a bit about the context.

Understand context

  • Student age?
  • What is possible?
  • What is the budget?

We have an idea about these things – not exact figures, but we have an idea about the age (high school students). This tells us what kind of activities we should propose. If they were kids aged 8-10 years, we would propose something different.

What is possible – well, your imagination must guide you with your rational mind. Also, what is the budget – we cannot propose ideas which are too expensive for a normal high school budget.

Let’s get started with some code, to get acquainted with the data.

import pandas as pd

data = pd.read_csv('https://raw.githubusercontent.com/LearnPythonWithRune/DataScienceWithPython/main/files/student-mat.csv')

print(len(data))

We will see it has 395 students in the dataset.

print(data.head())

print(data.columns)

This will show the first 5 lines of the dataset as well as the columns. The columns contain the features and the target.

Step 2: Prepare

  • Explore data
  • Visualize ideas
  • Cleaning data

This step is also about understanding whether the data quality is as expected. We will learn a lot more about this later.

For now explore the data types of the columns.

print(data.dtypes)

This will print out the data types. We see some are integers (int64) others are objects (that is strings/text in this case).

school        object
sex           object
age            int64
address       object
famsize       object
Pstatus       object
Medu           int64
Fedu           int64
Mjob          object
Fjob          object
reason        object
guardian      object
traveltime     int64
studytime      int64
failures       int64
schoolsup     object
famsup        object
paid          object
activities    object
nursery       object
higher        object
internet      object
romantic      object
famrel         int64
freetime       int64
goout          int64
Dalc           int64
Walc           int64
health         int64
absences       int64
G1             int64
G2             int64
G3             int64
dtype: object

And if there are any missing values.

print(data.isnull().any())

The output below tells us (all the False values) that there is no missing data.

school        False
sex           False
age           False
address       False
famsize       False
Pstatus       False
Medu          False
Fedu          False
Mjob          False
Fjob          False
reason        False
guardian      False
traveltime    False
studytime     False
failures      False
schoolsup     False
famsup        False
paid          False
activities    False
nursery       False
higher        False
internet      False
romantic      False
famrel        False
freetime      False
goout         False
Dalc          False
Walc          False
health        False
absences      False
G1            False
G2            False
G3            False
dtype: bool

Step 3: Analyze

  • Feature selection
  • Model selection
  • Analyze data

We are interested in seeing what has an impact on the final grades (G3). We can use correlation for that.

For now, correlation is just a number saying whether two things are correlated or not.

A correlation number is between -1 and 1 (both included). If it is close to -1 or 1 (that is, not close to 0), then the two are correlated.

print(data.corr()['G3'])  # in newer pandas you may need data.corr(numeric_only=True)['G3']
age          -0.161579
Medu          0.217147
Fedu          0.152457
traveltime   -0.117142
studytime     0.097820
failures     -0.360415
famrel        0.051363
freetime      0.011307
goout        -0.132791
Dalc         -0.054660
Walc         -0.051939
health       -0.061335
absences      0.034247
G1            0.801468
G2            0.904868
G3            1.000000
Name: G3, dtype: float64

This shows us two learnings.

First of all, the grades G1, G2, and G3 are highly correlated, while almost none of the others are.

Second, it only considers the numeric features.

But how can we use non-numeric features, you might ask?

Let’s consider the feature higher (wants to take higher education (binary: yes or no)).

print(data.groupby('higher')['G3'].mean())

This gives.

higher
no      6.800
yes    10.608
Name: G3, dtype: float64

This shows that this is a good indicator of whether a student gets good or bad grades. That is, if we assume the question was asked at the beginning of high school, you can say that students answering no get 6.8 on average, while students answering yes get 10.6 on average (grades are in the range 0 – 20).

That is a big indicator.

But how many are in each group?

You can get that by.

print(data.groupby('higher')['G3'].count())

Resulting in.

higher
no      20
yes    375
Name: G3, dtype: int64

Now, that is not many. But maybe this is good enough: finding 20 students whom we can really help to improve their grades.

Later we will learn more about standard deviation, but for now we leave our analysis at this.

Step 4: Report

  • Present findings
  • Visualize results
  • Credibility counts

This is about how to present our results. We have learned nothing visual yet, so we will keep it simple.

We cannot do much more than present the findings.

higher   mean G3
no       6.800
yes      10.608

higher   count
no       20
yes      375
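
If you want to produce these summary tables directly with pandas, a minimal sketch could be:

# Mean G3 grade and number of students per answer to 'higher'
summary = data.groupby('higher')['G3'].agg(['mean', 'count'])
print(summary)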

I am sure you can make a nicer power point presentation than this.

Step 5: Actions

  • Use insights
  • Measure impact
  • Main goal

Now this is where we need to find ideas. We have identified 20 students, now we need to find activities that the high school can use to improve.

This is where I will leave it to your ideas.

How can you measure?

Well, one way is to collect the same data each year and see if the activities have impact.

Now, you can probably do better than I did. Hence, I encourage you to play around with the dataset and find better indicators to get ideas to awesome activities.

Want to learn more?

Want to learn more about Data Science to become a successful Data Scientist?

This is one lesson of a 15 part Expert Data Science Blueprint course with the following resources.

  • 15 video lessons – covers the Data Science Workflow and concepts, demonstrates everything on real data, introduce projects and shows a solution (YouTube video).
  • 30 JuPyter Notebooks – with the full code and explanation from the lectures and projects (GitHub).
  • 15 projects – structured with the Data Science Workflow and a solution explained in the end of video lessons (GitHub).

Learn Information Extraction with Skip-Gram Architecture

What will we cover?

  • What is Information Extraction
  • Extract knowledge from patterns
  • Word representation
  • Skip-Gram architecture
  • To see how words relate to each other (this is surprising)

What is Information Extraction?

Information Extraction is the task of automatically extracting structured information from unstructured and/or semi-structured machine-readable documents (wiki).

Let’s try some different approaches.

Approach 1: Extract Knowledge from Patterns

Given knowledge of data that fits together – then try to find patterns.

This is actually a powerful approach. Assume you know that Amazon was founded in 1994 and Facebook was founded in 2004.

A pattern could be “When {company} was founded in {year},”

Let’s try this in real life.

import pandas as pd
import re
from urllib.request import urlopen

# Reading a knowledge base (here a csv file with a couple of entries)
books = pd.read_csv('https://raw.githubusercontent.com/LearnPythonWithRune/MachineLearningWithPython/main/files/books.csv', header=None)

# Convert it to a list
book_list = books.values.tolist()

# Read some content (here a web page) - open() cannot read URLs, so we use urlopen
with urlopen('https://raw.githubusercontent.com/LearnPythonWithRune/MachineLearningWithPython/main/files/penguin.html') as f:
    corpus = f.read().decode()

corpus = corpus.replace('\n', ' ').replace('\t', ' ')

# Try to look where we find our knowledge, to discover patterns
for val1, val2 in book_list:
    print(val1, '-', val2)
    for i in range(0, len(corpus) - 100, 20):
        pattern = corpus[i:i + 100]
        if val1 in pattern and val2 in pattern:
            print('-:', pattern)

This gives the following.

1984 - George Orwell
-: ge-orwell-with-a-foreword-by-thomas-pynchon/">1984</a></h2>   <h2 class="author">by George Orwell</h
-: eword-by-thomas-pynchon/">1984</a></h2>   <h2 class="author">by George Orwell</h2>    <div class="de
-: hon/">1984</a></h2>   <h2 class="author">by George Orwell</h2>    <div class="desc">We were pretty c
The Help - Kathryn Stockett
-: /the-help-by-kathryn-stockett/">The Help</a></h2>   <h2 class="author">by Kathryn Stockett</h2>    <
-: -stockett/">The Help</a></h2>   <h2 class="author">by Kathryn Stockett</h2>    <div class="desc">Thi

This gives you an idea of some patterns.

prefix = re.escape('/">')
middle = re.escape('</a></h2>   <h2 class="author">by ')
suffix = re.escape('</h2>    <div class="desc">')

regex = f"{prefix}(.{{0,50}}?){middle}(.{{0,50}}?){suffix}"
results = re.findall(regex, corpus)

for result in results:
    print(result)

Giving the following pattern matches with new knowledge.

[('War and Peace', 'Leo Tolstoy'),
 ('Song of Solomon', 'Toni Morrison'),
 ('Ulysses', 'James Joyce'),
 ('The Shadow of the Wind', 'Carlos Ruiz Zafon'),
 ('The Lord of the Rings', 'J.R.R. Tolkien'),
 ('The Satanic Verses', 'Salman Rushdie'),
 ('Don Quixote', 'Miguel de Cervantes'),
 ('The Golden Compass', 'Philip Pullman'),
 ('Catch-22', 'Joseph Heller'),
 ('1984', 'George Orwell'),
 ('The Kite Runner', 'Khaled Hosseini'),
 ('Little Women', 'Louisa May Alcott'),
 ('The Cloud Atlas', 'David Mitchell'),
 ('The Fountainhead', 'Ayn Rand'),
 ('The Picture of Dorian Gray', 'Oscar Wilde'),
 ('Lolita', 'Vladimir Nabokov'),
 ('The Help', 'Kathryn Stockett'),
 ("The Liar's Club", 'Mary Karr'),
 ('Moby-Dick', 'Herman Melville'),
 ("Gravity's Rainbow", 'Thomas Pynchon'),
 ("The Handmaid's Tale", 'Margaret Atwood')]

Approach 2: Skip-Gram Architecture

One-Hot Representation

  • Represents a word as a vector with a single 1 and all other values 0
  • Maybe not that useful on its own – such a vector says nothing about what the word means

Distributed Representation

  • representation of meaning distributed across multiple values

How to define words as vectors

  • A word is defined by the words that surround it
  • Based on the context
  • What words happen to show up around it

word2vec

  • model for generating word vectors
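
As a side note, if you want to train such vectors yourself, the gensim library (not used in this tutorial, so treat this as an assumption) implements word2vec with the Skip-Gram architecture. A minimal sketch on a toy corpus:

# Sketch only: requires 'pip install gensim'; the sentences here are a toy example
from gensim.models import Word2Vec

sentences = [['the', 'king', 'rules', 'the', 'country'],
             ['the', 'queen', 'rules', 'the', 'country']]

# sg=1 selects the Skip-Gram architecture; vector_size is the length of each word vector
model = Word2Vec(sentences, vector_size=5, window=2, sg=1, min_count=1)

print(model.wv['king'])  # the learned vector for 'king'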

Skip-Gram Architecture

  • Neural network architecture for predicting context words given a target word
    • Given a word – what words show up around it in a context
  • Example
    • Given a target word (input word) – train the network on which context words appear around it (right side)
    • Then the weights from input node (target word) to hidden layer (5 weights) give a representation
    • Hence – the word will be represented by a vector
    • The number of hidden nodes represents how big the vector should be (here 5)
  • Idea is as follows
    • Each input word will get weights to the hidden layers
    • The hidden layers will be trained
    • Then each word will be represented as the weights of hidden layers
  • Intuition
    • If two words have similar contexts (they show up in the same places), then they must be similar – and their representations have a small distance from each other
import numpy as np
from scipy.spatial.distance import cosine
from urllib.request import urlopen

# open() cannot read URLs, so we fetch the pre-trained word vectors with urlopen
with urlopen('https://raw.githubusercontent.com/LearnPythonWithRune/MachineLearningWithPython/main/files/words.txt') as f:
    words = {}
    lines = f.read().decode().splitlines()
    for line in lines:
        row = line.split()
        word = row[0]
        vector = np.array([float(x) for x in row[1:]])
        words[word] = vector

def distance(word1, word2):
    return cosine(word1, word2)

def closest_words(word):
    distances = {w: distance(word, words[w]) for w in words}
    return sorted(distances, key=lambda w: distances[w])[:10]

This will amaze you. But first let’s see what it does.

distance(words['king'], words['queen'])

This gives 0.19707422881543946 – a number that does not say much on its own.

distance(words['king'], words['pope'])

Giving 0.42088794105426874. Again, not of much value on its own.

closest_words(words['king'] - words['man'] + words['woman'])

Giving.

['queen',
 'king',
 'empress',
 'prince',
 'duchess',
 'princess',
 'consort',
 'monarch',
 'dowager',
 'throne']

Wow!

Why do I say wow?

Well, king – man + woman becomes queen.

Isn’t that amazing?

Want to learn more?

This is part of a FREE 10h Machine Learning course with Python.

  • 15 video lessons – which explain Machine Learning concepts, demonstrate models on real data, introduce projects and show a solution (YouTube playlist).
  • 30 JuPyter Notebooks – with the full code and explanation from the lectures and projects (GitHub).
  • 15 projects – with step guides to help you structure your solutions and solution explained in the end of video lessons (GitHub).

Implement a Term Frequency by Inverse Document Frequency (TF-IDF) with NLTK

What will we cover?

  • Learn what Information Retrieval is
  • Topic modeling documents
  • How to use Term Frequency and understand the limitations
  • Implement Term Frequency by Inverse Document Frequency (TF-IDF)

Step 1: What is Information Retrieval (IR)?

The task of finding relevant documents in response to a user query. Web search engines are the most visible IR applications (wiki).

Topic modeling is a model for discovering the topics for a set of documents, e.g., it can provide us with methods to organize, understand and summarize large collections of textual information.

Topic modeling can be described as a method for finding a group of words that best represent the information.

Step 2: Approach 1: Term Frequency

Term Frequency: the number of times a term occurs in a document is called its term frequency (wiki).

tf(t, d) = f(t, d): the number of times term t occurs in document d.

There are other ways to define term frequency (see wiki).

Let’s try to write some code to explore this concept.

To follow this code you need to download the files from here: GitHub link. You can also download them as a zip file from here: Zip-download.

import os
import nltk
import math

corpus = {}

# Count the term frequencies
for filename in os.listdir('files/holmes/'):
    with open(f'files/holmes/{filename}') as f:
        content = [word.lower() for word in nltk.word_tokenize(f.read()) if word.isalpha()]
        
        freq = {word: content.count(word) for word in set(content)}
        
        corpus[filename] = freq

for filename in corpus:
    corpus[filename] = sorted(corpus[filename].items(), key=lambda x: x[1], reverse=True)

for filename in corpus:
    print(filename)
    for word, score in corpus[filename][:5]:
        print(f'  {word}: {score}')

This will output (only sample output).

speckled.txt
  the: 600
  and: 281
  of: 276
  a: 252
  i: 233
face.txt
  the: 326
  i: 298
  and: 226
  to: 185
  a: 173

We see that the words most used in each document are so-called stop words.

  • words that have little meaning on their own (wiki)
  • Examples: am, by, do, is, which, ….
  • Student exercise: Remove the stop words and see the result (HINT: NLTK has a list of stop words)

What you will discover is that even if you remove all stop words, you will still not get anything very useful. There are some words that are just more common.
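
A minimal sketch of the student exercise above (removing stop words from the sorted counts in corpus) could look like this:

from nltk.corpus import stopwords

# nltk.download('stopwords') may be needed the first time
stop_words = set(stopwords.words('english'))

for filename in corpus:
    print(filename)
    filtered = [(word, score) for word, score in corpus[filename] if word not in stop_words]
    for word, score in filtered[:5]:
        print(f'  {word}: {score}')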

Step 3: Approach 2: TF-IDF

TF-IDF is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus. (wiki)

Inverse Document Frequency

  • Measure of how common or rare a word is across documents

idf(t, D) = log(N / |{d ∈ D : t ∈ d}|) = log(Total Documents / Number of Documents Containing the term)

  • D: all documents in the corpus
  • N: total number of documents in the corpus, N = |D|

TF-IDF

Ranking of which words are important in a document by multiplying Term Frequency (TF) by Inverse Document Frequency (IDF)

tf-idf(t, d) = tf(t, d) · idf(t, D)

Let’s make a small example.

import math

doc1 = "This is the sample of the day".split()
doc2 = "This is another sample of the day".split()

corpus = [doc1, doc2]

tf1 = {word: doc1.count(word) for word in set(doc1)}
tf2 = {word: doc2.count(word) for word in set(doc2)}

term = 'another'
# idf = log(total documents / number of documents containing the term)
idf = math.log(len(corpus) / sum(term in doc for doc in corpus))

print(tf1.get(term, 0) * idf, tf2.get(term, 0) * idf)
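
If you want to apply the same idea to the Holmes corpus from Step 2, a minimal sketch could look like this – assuming corpus is the {filename: {word: count}} dictionary from before it was converted into sorted lists:

import math

num_documents = len(corpus)

# In how many documents does each word appear?
document_frequency = {}
for freq in corpus.values():
    for word in freq:
        document_frequency[word] = document_frequency.get(word, 0) + 1

# TF-IDF per word per document; print the top 5 words for each file
for filename, freq in corpus.items():
    tfidf = {word: count * math.log(num_documents / document_frequency[word])
             for word, count in freq.items()}
    print(filename)
    for word, score in sorted(tfidf.items(), key=lambda x: x[1], reverse=True)[:5]:
        print(f'  {word}: {score:.2f}')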

Want to learn more?

If you watch the YouTube video you will see how to do it for a bigger corpus of files.

This is part of a FREE 10h Machine Learning course with Python.

  • 15 video lessons – which explain Machine Learning concepts, demonstrate models on real data, introduce projects and show a solution (YouTube playlist).
  • 30 JuPyter Notebooks – with the full code and explanation from the lectures and projects (GitHub).
  • 15 projects – with step guides to help you structure your solutions and solution explained in the end of video lessons (GitHub).

Naive Bayes’ Rule for Sentiment Classification with Full Explanation

What will we cover?

  • What is Text Categorization
  • Learn about the Bag-of-Words Model
  • Understand Naive Bayes’ Rule
  • How to use Naive Bayes’ Rule for sentiment classification (text categorization)
  • What problem smoothing solves

Step 1: What is Text Categorization?

Text categorization (a.k.a. text classification) is the task of assigning predefined categories to free-text documents. It can provide conceptual views of document collections and has important applications in the real world.

http://www.scholarpedia.org/article/Text_categorization

Examples of Text Categorization include:

  • Inbox vs Spam
  • Product review: Positive vs Negative review

Step 2: What is the Bag-of-Words model?

We have already learned from Context-Free Grammars that understanding the full structure of language is not efficient or even possible for Natural Language Processing. One approach was to look at trigrams (3 consecutive words), which can be used to learn about the language and even generate sentences.

Another approach is the Bag-of-Words model.

The Bag-of-Words model is a simplifying representation used in natural language processing and information retrieval (IR). In this model, a text (such as a sentence or a document) is represented as the bag (multiset) of its words, disregarding grammar and even word order but keeping multiplicity.

https://en.wikipedia.org/wiki/Bag-of-words_model

What does that all mean?

  • The structure is not important
  • Works well to classify
  • Examples could be (see the sketch below)
    • I love this product.
    • This product feels cheap.
    • This is the best product ever.
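
To make this concrete, here is a minimal sketch that turns the example sentences above into bags of words with collections.Counter:

from collections import Counter

sentences = [
    "I love this product",
    "This product feels cheap",
    "This is the best product ever",
]

# Each sentence becomes a multiset of its (lower-cased) words - order is discarded
bags = [Counter(sentence.lower().split()) for sentence in sentences]

for bag in bags:
    print(bag)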

Step 3: What is Naive Bayes’ Classifier?

Naive Bayes classifiers are a family of simple “probabilistic classifiers” based on applying Bayes’ theorem with strong (naïve) independence assumptions between the features (wiki).

Bayes’ Rule Theorem

Describes the probability of an event, based on prior knowledge of conditions that might be related to the event (wiki).

P(b|a) = P(a|b) P(b) / P(a)

Explained with Example

What is the probability that the sentiment is positive given the sentence "I love this product"? This can be expressed as follows.

P(positive | "I love this product") = P(positive | "I", "love", "this", "product")

Bayes' Rule implies it is equal to

P("I", "love", "this", "product" | positive) P(positive) / P("I", "love", "this", "product")

Or proportional to

P("I", "love", "this", "product" | positive) P(positive)

The 'naive' part is that we assume the words are independent, which simplifies this to

P(positive) P("I" | positive) P("love" | positive) P("this" | positive) P("product" | positive)

And then we have that

P(positive) = (number of positive samples) / (number of samples)

P("love" | positive) = (number of positive samples with "love") / (number of positive samples)

Let’s try a more concrete example.

P(positive) P("I" | positive) P("love" | positive) P("this" | positive) P("product" | positive) = 0.47 * 0.30 * 0.40 * 0.28 * 0.25 = 0.003948

P(negative) P("I" | negative) P("love" | negative) P("this" | negative) P("product" | negative) = 0.53 * 0.20 * 0.05 * 0.42 * 0.28 = 0.00062328

Calculate the likelihood

"I love this product" is positive: 0.003948 / (0.003948 + 0.00062328) ≈ 86.4%

"I love this product" is negative: 0.00062328 / (0.003948 + 0.00062328) ≈ 13.6%

Step 4: The Problem with Naive Bayes’ Classifier?

Problem

If a word never showed up in a training sentence, this will result in a probability of zero. Say, in the above example, the word "product" never appeared in a positive sentence. This would imply that P("product" | positive) = 0, which in turn would make the calculated probability that "I love this product" is positive equal to 0.

There are different approaches to deal with this problem.

Additive Smoothing

Adding a value to each count in the distribution to smooth the data. This is straightforward and ensures that even if the word "product" never showed up, it will not create a 0 value.

Laplace smoothing

Adding 1 to each count in the distribution. This is just additive smoothing with the concrete value 1.
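
A minimal sketch of the effect, using hypothetical counts (not taken from the dataset used below):

# Hypothetical counts: "product" never occurred in a positive sample
count_word_in_positive = 0
count_positive_samples = 47
vocabulary_size = 1000  # hypothetical number of distinct words

# Without smoothing the estimate is zero and kills the whole product of probabilities
p_unsmoothed = count_word_in_positive / count_positive_samples

# Laplace (add-1) smoothing: add 1 to every count, so no estimate is ever zero
p_laplace = (count_word_in_positive + 1) / (count_positive_samples + vocabulary_size)

print(p_unsmoothed)  # 0.0
print(p_laplace)     # about 0.00096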

Step 5: Use NLTK to classify sentiment

We already introduced the NLTK, which we will use here.

import nltk
import pandas as pd

data = pd.read_csv('https://raw.githubusercontent.com/LearnPythonWithRune/MachineLearningWithPython/main/files/sentiment.csv')

def extract_words(document):
    return set(
        word.lower() for word in nltk.word_tokenize(document)
        if any(c.isalpha() for c in word)
    )

words = set()

for line in data['Text'].to_list():
    words.update(extract_words(line))

features = []
for _, row in data.iterrows():
    # use the same tokenization as the classification step below, instead of a raw substring check
    features.append(({word: (word in extract_words(row['Text'])) for word in words}, row['Label']))

classifier = nltk.NaiveBayesClassifier.train(features)

This creates a classifier (based on a small dataset, don’t expect magic).

To use it, try the following code.

s = input()

feature = {word: (word in extract_words(s)) for word in words}

result = classifier.prob_classify(feature)

for key in result.samples():
    print(key, result.prob(key))

Example could be if you input “this was great”.

this was great
 Negative 0.10747100603951745
 Positive 0.8925289939604821

Want to learn more?

If you followed the video you would also be introduced to a project where we create a sentiment classifier on a big twitter corpus.

This is part of a FREE 10h Machine Learning course with Python.

  • 15 video lessons – which explain Machine Learning concepts, demonstrate models on real data, introduce projects and show a solution (YouTube playlist).
  • 30 JuPyter Notebooks – with the full code and explanation from the lectures and projects (GitHub).
  • 15 projects – with step guides to help you structure your solutions and solution explained in the end of video lessons (GitHub).

How to use Natural Language Processing for Trigrams

What will we cover?

  • How the simple syntax of language can be parsed
  • What Context-Free Grammar (CFG) is
  • Use it to parse text
  • Understand text in trigrams
  • A brief look at Markov Chains
  • See how it can be used to generate predictions

Step 1: What is Natural Language Processing?

Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data. The goal is a computer capable of “understanding” the contents of documents, including the contextual nuances of the language within them.

https://en.wikipedia.org/wiki/Natural_language_processing

Simply said, NLP is automatic computational processing of human language.

This includes.

  • Algorithms that take human written language as input
  • Algorithms that produce natural text

And some examples include.

  • Automatic summarization
  • Language identification
  • Translation

Step 2: What is Context-Free Grammar (CFG)?

What is a Syntax?

One basic description of a language’s syntax is the sequence in which the subject, verb, and object usually appear in sentences.

What is a Formal Grammar?

A system of rules for generating sentences in a language and a grammar is usually thought of as a language generator (wiki).

What is a Context-Free Grammar (CFG)?

A formal grammar is “context free” if its production rules can be applied regardless of the context of a nonterminal (wiki).

Step 3: How to use NLTK and see the Challenge with CFG

What is NLTK?

NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources.

You can install it by the following command.

pip install nltk

Notice that you can do that from inside your Jupyter Notebook with this command.

!pip install nltk

Let’s write a CFG and understand the challenge working with language like that.

import nltk

grammar = nltk.CFG.fromstring("""
    S -> NP VP

    NP -> D N | N
    VP -> V | V NP

    D -> "the" | "a"
    N -> "she" | "city" | "car"
    V -> "saw" | "walked"    
""")

parser = nltk.ChartParser(grammar)

sentence = input().split()

for tree in parser.parse(sentence):
    tree.pretty_print()

If you run that code and type: she saw a car then you will get the following.

Think about CFGs this way: if you are a computer, yes, you can generate all these trees representing the CFG – but there is a challenge.

You need to encode all possibilities. That is, the above grammar only understands the encoded words.

A full grammar of the language becomes very complex – or should we say, impossible.

What to do then?

Step 4: Use N-grams to understand language

The idea behind n-grams is to understand a small subset of the language. Not to focus on the bigger picture, but just a small subset of it.

You could set up as follows.

  • n-gram
    • a contiguous sequence of n items from a sample of text
  • Word n-gram
    • a contiguous sequence of n words from a sample of text
  • unigram
    • 1 item in sequence
  • bigram
    • 2 items in sequence
  • trigram
    • 3 items in sequence

We will focus on 3-grams – and the reason for that is if you need 4-grams or above, then you need a lot of text to make it useful.

Again, a trigram takes 3-word contexts and looks at them in isolation.

Let’s try to work with that.

Step 5: Word Tokenization

Word Tokenization is the task of splitting a sequence of words into tokens. This makes further processing easier.

Notice, we need to consider commas, punctuations etc.

To follow this code you need to download the files from here: GitHub link. You can also download them as a zip file from here: Zip-download.

Here we read all the content and tokenize it.

import os
import nltk
from collections import Counter

# You need to download the tokenizer data
nltk.download('punkt')

content = []
for filename in os.listdir('files/holmes/'):
    with open(f'files/holmes/{filename}') as f:
        content.append(f.read())

corpus = []
for item in content:
    corpus.extend([word.lower() for word in nltk.word_tokenize(item) if any(c.isalpha() for c in word)])

Now we have all the tokens in the corpus.

Step 6: Generating trigrams from the corpus

Now it is straight forward to generate trigrams from the corpus.

ngrams = Counter(nltk.ngrams(corpus, 3))

What to use it for?

Well, you can look at which 3 words are most likely to appear in sequence.

for ngram, freq in ngrams.most_common(10):
    print(f'{freq}: {ngram}')

Giving the following output.

80: ('it', 'was', 'a')
71: ('one', 'of', 'the')
65: ('i', 'think', 'that')
59: ('out', 'of', 'the')
55: ('that', 'it', 'was')
55: ('that', 'he', 'had')
55: ('there', 'was', 'a')
55: ('that', 'he', 'was')
52: ('it', 'is', 'a')
49: ('i', 'can', 'not')

The first time I saw that, I don’t think I really appreciated the full impact of it. But actually, you can learn a lot from it. If you look into the project (see the YouTube video), then you will see that you can predict who the person behind a Twitter account is.

Yes, that is right. You will be surprised.

Step 7: What are Markov Models?

What is the next step?

A Markov chain is a stochastic model describing a sequence of possible events in which the probability of each event depends only on the state attained in the previous event (wiki).

That is exactly the next step of what we did before.

Given any two words, the trigram counts give us the probabilities of the next word.
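
To see the idea, here is a minimal sketch that turns the trigram counts from Step 6 into a naive next-word sampler (assuming the ngrams Counter from above):

import random
from collections import defaultdict

# Turn the trigram counts into a mapping: (word1, word2) -> possible next words with counts
transitions = defaultdict(list)
for (w1, w2, w3), count in ngrams.items():
    transitions[(w1, w2)].append((w3, count))

def next_word(w1, w2):
    """Sample the next word given the two previous words, weighted by trigram frequency."""
    candidates = transitions.get((w1, w2))
    if not candidates:
        return None
    words, counts = zip(*candidates)
    return random.choices(words, weights=counts, k=1)[0]

# Example: continue the most common prefix from the output above
print(next_word('it', 'was'))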

In practice this can be done by using the markovify library. Install it as follows.

pip install markovify

Then you can create an example like this.

import markovify
from urllib.request import urlopen

# open() cannot read URLs, so we fetch the text with urlopen
with urlopen('https://raw.githubusercontent.com/LearnPythonWithRune/MachineLearningWithPython/main/files/shakespeare.txt') as f:
    text = f.read().decode()

model = markovify.Text(text)
model.make_sentence()

This will generate a random sentence from that idea.

'In the wars; defeat thy favor with an ordinary pitch, Who else but I, his forlorn duchess, Was made much poorer by it; but first, how get hence.'

Maybe not that good.

Want to learn more?

This is part of a FREE 10h Machine Learning course with Python.

  • 15 video lessons – which explain Machine Learning concepts, demonstrate models on real data, introduce projects and show a solution (YouTube playlist).
  • 30 JuPyter Notebooks – with the full code and explanation from the lectures and projects (GitHub).
  • 15 projects – with step guides to help you structure your solutions and solution explained in the end of video lessons (GitHub).

Get Started with Recurrent Neural Network (RNN) with Tensorflow

What will we cover?

  • Understand Recurrent Neural Network (RNN)
  • Build a RNN on a timeseries
  • Hover over the theory of RNN (LSTM cells)
  • Use the MinMaxScaler from sklearn.
  • Create a RNN model with tensorflow
  • Apply the Dropout technique.
  • Predict stock prices and make weather forecast using RNN.

Step 1: Feed-forward vs Recurrent Neural Network

A Neural Network that has connections in only one direction is called a Feed-Forward Neural Network (examples: Artificial Neural Network, Deep Neural Network, and Convolutional Neural Network).

A Recurrent Neural Network is a Neural Network that generates output that feeds back into its own inputs. This enables it to model one-to-many and many-to-many relationships (not possible for feed-forward neural networks).

An example of one-to-many is a network that can generate sentences (while a feed-forward neural network can only generate “words” or fixed sets of outputs).

Another example is working with time-series data, which we will explore in this tutorial.

A Recurrent Neural Network can be illustrated as follows.

Examples of Recurrent Neural Network includes also.

  • Google translate
  • Voice recognition
  • Video copy right violation

Step 2: Is RNN too complex to understand?

A Recurrent Neural Network (RNN) is complex – but luckily, you do not need to understand it in depth.

You don’t need to understand everything about the specific architecture of an LSTM cell […] just that LSTM cell is meant to allow past information to be reinjected at a later time.

Quote from the author of Keras (François Chollet)

Let’s just leave it at that and get started.

Step 3: RNN predicting stock price

For the purpose of this tutorial we will use the Apple stock price and try to make an RNN predict the stock price the day after.

For that we will use this file of historic Apple stock prices here. You do not need to download it, we will use it directly in the code.

import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import LSTM
from tensorflow.keras.layers import Dropout
import matplotlib.pyplot as plt

file_url = 'https://raw.githubusercontent.com/LearnPythonWithRune/MachineLearningWithPython/main/files/aapl.csv'
data = pd.read_csv(file_url, parse_dates=True, index_col=0)

# Create a train and test set
data_train = data.loc['2000':'2019', 'Adj Close'].to_numpy()
data_test = data.loc['2020', 'Adj Close'].to_numpy()

# Use the MinMaxScaler to scale the data
scaler = MinMaxScaler()
data_train = scaler.fit_transform(data_train.reshape(-1, 1))
data_test = scaler.transform(data_test.reshape(-1, 1))

# To divide data into x and y set
def data_preparation(data):
    x = []
    y = []
    
    for i in range(40, len(data)):
        x.append(data[i-40:i, 0])
        y.append(data[i])
        
    x = np.array(x)
    y = np.array(y)
    
    x = x.reshape(x.shape[0], x.shape[1], 1)
    
    return x, y

x_train, y_train = data_preparation(data_train)
x_test, y_test = data_preparation(data_test)

# Create the model
model = Sequential()
model.add(LSTM(units=45, return_sequences=True, input_shape=(x_train.shape[1], 1)))
model.add(LSTM(units=45, return_sequences=True))
model.add(LSTM(units=45))
model.add(Dense(units=1))

# Compile the model
model.compile(optimizer='adam', loss='mean_squared_error', metrics=['accuracy'])

# Train the model
model.fit(x_train, y_train, epochs=5, batch_size=32)

# Predict with the model
y_pred = model.predict(x_test)

# Unscale it
y_unscaled = scaler.inverse_transform(y_pred)

# See the prediction accuracy
fig, ax = plt.subplots()
y_real = data.loc['2020', 'Adj Close'].to_numpy()
ax.plot(y_real[40:])
ax.plot(y_unscaled)
plt.show()

Resulting in.

This looks more like a moving average of the price and does not do a particularly good job.

I am not surprised, as predicting stock prices is not anything easy. If you could do it with a simple model like this, then you would become rich really fast.

Want to learn more?

This is part of a FREE 10h Machine Learning course with Python.

  • 15 video lessons – which explain Machine Learning concepts, demonstrate models on real data, introduce projects and show a solution (YouTube playlist).
  • 30 JuPyter Notebooks – with the full code and explanation from the lectures and projects (GitHub).
  • 15 projects – with step guides to help you structure your solutions and solution explained in the end of video lessons (GitHub).

PyTorch Model to Detect Handwriting for Beginners

What will we cover?

  • What is PyTorch
  • PyTorch vs Tensorflow
  • Get started with PyTorch
  • Work with image classification

Step 1: What is PyTorch?

PyTorch is an optimized tensor library for deep learning using GPUs and CPUs.

What does that mean?

Well, PyTorch is an open source machine learning library and is used for computer vision and natural language processing. It is primarily developed by Facebook’s AI Research Lab.

Step 2: PyTorch and Tensorflow

Often people worry about which framework to use not to waste time.

You probably do the same – but don’t worry: if you use either PyTorch or Tensorflow, you are on the right track. They are the most popular Deep Learning frameworks; if you learn one, you will have an easy time switching to the other later.

PyTorch was released in 2016 by Facebook’s AI Research Lab, while Tensorflow was released in 2015 by the Google Brain team.

Both are good choices for Deep Learning.

Step 3: PyTorch and prepared datasets

PyTorch comes with a long list of prepared datasets and you can see them all here.

We will look at the MNIST dataset for handwritten digit-recognition.

In the video above we also look at the CIFAR10 dataset, which consists of 32×32 images of 10 classes.

You can get a dataset by using torchvision.

from torchvision import datasets

data_path = 'downloads/'
mnist = datasets.MNIST(data_path, train=True, download=True)

Step 4: Getting the data and prepare data

First we need to get the data and prepare it by turning the images into tensors and normalizing them.

Transforming and Normalizing

  • Images are PIL objects in the MNIST dataset
  • They need to be transformed to tensors (the datatype for PyTorch)
    • torchvision has the transformation transforms.ToTensor(), which turns NumPy arrays and PIL images into Tensors
  • Then you need to normalize the images
    • You need to determine the mean value and the standard deviation
  • Then we can apply normalization
    • torchvision has transforms.Normalize, which takes the mean and standard deviation
from torchvision import datasets
from torchvision import transforms
import torch
import torch.nn as nn
from torch import optim
import matplotlib.pyplot as plt

data_path = 'downloads/'
mnist = datasets.MNIST(data_path, train=True, download=True)
mnist_val = datasets.MNIST(data_path, train=False, download=True)

mnist = datasets.MNIST(data_path, train=True, download=False, transform=transforms.ToTensor())

imgs = torch.stack([img_t for img_t, _ in mnist], dim=3)

print('get mean')
print(imgs.view(1, -1).mean(dim=1))

print('get standard deviation')
print(imgs.view(1, -1).std(dim=1))

Then we can use those values to make the transformation.

mnist = datasets.MNIST(data_path, train=True, download=False, 
                       transform=transforms.Compose([
                           transforms.ToTensor(),
                           transforms.Normalize((0.1307),
                                               (0.3081))]))

mnist_val = datasets.MNIST(data_path, train=False, download=False, 
                       transform=transforms.Compose([
                           transforms.ToTensor(),
                           transforms.Normalize((0.1307),
                                               (0.3081))]))

Step 5: Creating and testing a Model

The model we will use will be as follows.

We can model that as follows.

input_size = 784  # 28*28 pixels per flattened MNIST image
hidden_sizes = [128, 64]
output_size = 10

model = nn.Sequential(nn.Linear(input_size, hidden_sizes[0]),
                     nn.ReLU(),
                     nn.Linear(hidden_sizes[0], hidden_sizes[1]),
                     nn.ReLU(),
                     nn.Linear(hidden_sizes[1], output_size),
                     nn.LogSoftmax(dim=1))

Then we can train the model as follows

train_loader = torch.utils.data.DataLoader(mnist, batch_size=64,
                                           shuffle=True)

optimizer = optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.NLLLoss()

n_epochs = 10
for epoch in range(n_epochs):
    for imgs, labels in train_loader:
        optimizer.zero_grad()

        batch_size = imgs.shape[0]
        output = model(imgs.view(batch_size, -1))

        loss = loss_fn(output, labels)

        loss.backward()

        optimizer.step()
    print("Epoch: %d, Loss: %f" % (epoch, float(loss)))

And finally, test our model.

val_loader = torch.utils.data.DataLoader(mnist_val, batch_size=64,
                                           shuffle=True)


correct = 0
total = 0
with torch.no_grad():
    for imgs, labels in val_loader:
        batch_size = imgs.shape[0]
        outputs = model(imgs.view(batch_size, -1))
        _, predicted = torch.max(outputs, dim=1)
        total += labels.shape[0]
        correct += int((predicted == labels).sum())
print("Accuracy: %f", correct / total)

Reaching an accuracy of 96.44%

Want to learn more?

Want better results? Try using a CNN model.

This is part of a FREE 10h Machine Learning course with Python.

  • 15 video lessons – which explain Machine Learning concepts, demonstrate models on real data, introduce projects and show a solution (YouTube playlist).
  • 30 JuPyter Notebooks – with the full code and explanation from the lectures and projects (GitHub).
  • 15 projects – with step guides to help you structure your solutions and solution explained in the end of video lessons (GitHub).