Batch Process Face Detection in 3 Steps with OpenCV

What will you learn?

You want to extract or identify faces in a bunch of images, but how do you do that without becoming a Machine Learning expert?

Here you will learn how to do it without any Machine Learning skills.

Many Machine Learning tasks are so common that you can just use pre-built Machine Learning models. Here you will learn how to find faces in images and extract their locations.

Step 1: Pre-built OpenCV models to detect faces

When you think of detecting faces in images, you might get scared. I’ve been there, but there is nothing to be scared of, because some awesome people have already done all the hard work for you.

They built a model which can detect faces in images.

All you need to do is feed it with images and let it do all the work.

This boils down to the following.

  1. Knowing which model to use.
  2. Feeding it with images.
  3. Converting the results it returns into something useful.

This is what the rest of this tutorial will teach you.

We will use OpenCV and its pre-built Haar cascade face detection model.

First you should download and install the requirements.

This can be done either by cloning this repository, or by downloading the files as a zip file and unpacking them.

You should install the opencv-python library. This can be done as follows.

pip install opencv-python

You can also use the requirements.txt file to install it.

pip install -r requirements.txt

Step 2: Detect a face

We will use this image to start with.

The picture is part of the repository from step 1.

Now let’s explore the code in face_detection.py.

# importing opencv
import cv2
# using cv2.CascadeClassifier
# See https://docs.opencv.org/3.4/db/d28/tutorial_cascade_classifier.html
# See more Cascade Classifiers https://github.com/opencv/opencv/tree/4.x/data/haarcascades
face_cascade = cv2.CascadeClassifier("haarcascade_frontalface_default.xml")
img = cv2.imread("sample_images/sample-00.jpg")
# changing the image to gray scale for better face detection
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
faces = face_cascade.detectMultiScale(
    gray,
    scaleFactor=2,  # how much the image is shrunk at each scale (2 is a big reduction)
    minNeighbors=5  # how many neighbor detections a candidate needs (4-6 is a typical range)
)
# draw a rectangle around each detected face
# the loop unpacks the position (x, y) and size (w, h) of each rectangle
for x, y, w, h in faces:
    cv2.rectangle(img, (x, y), (x+w, y+h), (0, 255, 0), 5)
# show the annotated image and wait for a key press
cv2.imshow("image", img)
cv2.waitKey(0)
cv2.destroyAllWindows()

First, notice that the opencv-python package is imported with import cv2.

Also notice that we need to run this code from the directory where the file haarcascade_frontalface_default.xml is located.
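Alternatively, if you installed OpenCV with pip, the opencv-python package ships the cascade files and exposes their location as cv2.data.haarcascades, so you can load the classifier without keeping a local copy of the XML file. A minimal sketch:

import cv2
# load the Haar cascade bundled with the opencv-python package
face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
)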

After that, you read the image into the variable img. Notice that this assumes you run the file with the directory structure from the GitHub repository (downloaded in step 1).

When you work with images, you often do not need the level of detail given in them. Therefore, the first thing we do is to grayscale the image.

After we have grayscaled the image, we use the face detection model (face_cascade.detectMultiScale).

This will give the result faces, which is an iterable of bounding boxes.
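Each bounding box is a tuple (x, y, w, h): the top-left corner plus the width and height in pixels. A quick print shows the shape of the result (the exact numbers will depend on your image):

print(faces)
# e.g. [[214  68 301 301]] -> one face at x=214, y=68, of size 301x301 pixels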

We then draw rectangles around the detected faces on the original image (not the grayscaled one).

Finally, we show the image and wait until someone hits a key (waitKey(0) blocks until a key is pressed).

Step 3: Batch process face detection

To batch process face detection, a good approach is to build a class that does the face detection. It could be designed in many ways, but the idea is to decouple the filename handling from the actual face detection.

One way to do it could be as follows.

import os
import cv2

class FaceDetector:
    def __init__(self, scale_factor=2, min_neighbors=5):
        # load the pre-trained Haar cascade once and reuse it for all images
        self.face_cascade = cv2.CascadeClassifier("haarcascade_frontalface_default.xml")
        self.scale_factor = scale_factor
        self.min_neighbors = min_neighbors
        self.img = None

    def read_image(self, filename):
        self.img = cv2.imread(filename)

    def detect_faces(self):
        # detection works on the grayscale version of the image
        gray = cv2.cvtColor(self.img, cv2.COLOR_BGR2GRAY)
        faces = self.face_cascade.detectMultiScale(
            gray,
            scaleFactor=self.scale_factor,
            minNeighbors=self.min_neighbors
        )
        # draw a rectangle around each detected face on the original image
        for x, y, w, h in faces:
            cv2.rectangle(self.img, (x, y), (x + w, y + h), (0, 255, 0), 5)
        return self.img

face_detector = FaceDetector()
for filename in os.listdir('sample_images/'):
    print(filename)
    face_detector.read_image(f'sample_images/{filename}')
    img = face_detector.detect_faces()
    cv2.imshow("image", img)
    cv2.waitKey(0)
    cv2.destroyAllWindows()

If you want to write the files with face detections to storage, you should exchange the cv2.imshow, cv2.waitKey, and cv2.destroyAllWindows lines with the following.

    cv2.imwrite(filename, img)
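Note that filename here is just the base name returned by os.listdir, so the annotated images end up in the current working directory. If you would rather collect them in a separate folder (a hypothetical detected/ directory), a small sketch:

import os
# create the output folder if it does not exist yet
os.makedirs('detected', exist_ok=True)
cv2.imwrite(f'detected/{filename}', img)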

Want to learn more Machine Learning?

You will be surprised how easy Machine Learning has become. There are many great and easy-to-use libraries. All you need to learn is how to train them and use them to predict.

Want to learn more?

Then I created this free 10-hour Machine Learning course, which will cover all you need.

  • 15 video lessons – which explain Machine Learning concepts, demonstrate models on real data, introduce projects, and show a solution (YouTube playlist).
  • 30 Jupyter Notebooks – with the full code and explanations from the lectures and projects (GitHub).
  • 15 projects – with step guides to help you structure your solutions, and the solutions explained at the end of the video lessons (GitHub).

How to make a Formatted Word Cloud in 7 Steps

What will you learn?

At the end of this tutorial you will know how to make a formatted word cloud with Python like this one.

Step 1: Read content

The first thing you need is some content to compute word frequencies on.

In this example we will use the books of Sherlock Holmes – which are available in my GitHub here.

You can clone the repo or just download the full repository as a zip file from the green Code dropdown menu. Then you should see a folder with all the Sherlock Holmes texts.

We will read them here.

import os

# read the full text of every file in the holmes/ folder
content = []
for filename in os.listdir('holmes/'):
    with open(f'holmes/{filename}') as f:
        content.append(f.read())

Of course you can have any other set of text files.

The result in content is a list with the full text of each file. Each entry is raw text, including newlines.
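A quick sanity check of what was read (the counts and text depend on which files you have):

print(len(content))      # number of files read
print(content[0][:80])   # the first 80 characters of the first file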

Step 2: Corpus in lower case

Here we will use word_tokenize from the NLTK toolkit to get each word.

import nltk
nltk.download('punkt')  # word_tokenize needs the punkt tokenizer models

corpus = []
for item in content:
    corpus.extend([word.lower() for word in nltk.word_tokenize(item)])

This creates a list of all the words in lower case.

We use a list comprehension. If you are new to that, check this tutorial.
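To see what the corpus looks like at this point (the exact tokens depend on your files):

print(corpus[:8])  # e.g. ['the', 'adventures', 'of', 'sherlock', 'holmes', ...]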

Step 3: Remove stop words

Stop words are words with little or no meaning on their own. We do not want to include them in our word cloud, as they are common and take up a lot of space.

from nltk.corpus import stopwords
nltk.download('stopwords')  # the stop word lists are a separate download

stop_words = set(stopwords.words('english'))  # a set makes the membership test fast
corpus = [w for w in corpus if w not in stop_words]

Again we use a list comprehension. Note that we build a set of stop words first: checking membership in a set is fast, and it avoids re-reading the stop word list for every word.

Step 4: Keep alphanumeric words

This can also be done with a list comprehension.

corpus = [w for w in corpus if w.isalnum()]

Step 5: Lemmatize words

To lemmatize a word is to reduce it to its root form. We don’t want to count the same word in different forms; we only need it in its basic form. This is what lemmatizing does.
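For example (a quick sketch; it needs the wordnet data downloaded in the code below):

from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import wordnet
print(WordNetLemmatizer().lemmatize('running', wordnet.VERB))  # run
print(WordNetLemmatizer().lemmatize('mice', wordnet.NOUN))     # mouse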

import nltk
from nltk.corpus import wordnet
from nltk.stem.wordnet import WordNetLemmatizer

nltk.download('wordnet')                     # dictionary used by the lemmatizer
nltk.download('averaged_perceptron_tagger')  # tagger used by nltk.pos_tag

def get_wordnet_pos(word):
    """Map POS tag to first character lemmatize() accepts"""
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}
    return tag_dict.get(tag, wordnet.NOUN)

lemmatizer = WordNetLemmatizer()  # create once instead of once per word
corpus = [lemmatizer.lemmatize(w, get_wordnet_pos(w)) for w in corpus]

Again we use a list comprehension to achieve the result.

Step 6: Create a Word Cloud

First we create a simple word cloud.

from wordcloud import WordCloud

unique_string = " ".join(corpus)
wordcloud = WordCloud(width=1000, height=500).generate(unique_string)
wordcloud.to_file("word_cloud.png")

This will create an image word_cloud.png similar to this one.

Step 7: Create a formatted Word Cloud

To do that we need a mask. We will use the cloud.png from the repository. Note that the wordcloud library treats pure white (255) pixels in the mask as masked out, so words are only drawn in the non-white region.

import numpy as np
from PIL import Image

unique_string_v2 = " ".join(corpus)
cloud_mask = np.array(Image.open("cloud.png"))
wordcloud = WordCloud(width=1000, height=500, background_color="white",
                      mask=cloud_mask, max_words=5000, contour_width=2, contour_color='black')
wordcloud.generate(unique_string_v2)
wordcloud.to_file("word_cloud_masked.png")

This will generate a picture like this one.

Full code

You can get the full code from my GitHub repository.

If you clone it you get the full code as well as all the files you need.

import os

import nltk
from nltk.corpus import stopwords, wordnet
from nltk.stem.wordnet import WordNetLemmatizer
from wordcloud import WordCloud
import numpy as np
from PIL import Image

# download the NLTK data the pipeline needs
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
nltk.download('omw-1.4')

# Step 1: read the content of all files
content = []
for filename in os.listdir('holmes/'):
    with open(f'holmes/{filename}') as f:
        content.append(f.read())

# Step 2: tokenize into a lower-case corpus
corpus = []
for item in content:
    corpus.extend([word.lower() for word in nltk.word_tokenize(item)])

# Step 3: remove stop words
stop_words = set(stopwords.words('english'))
corpus = [w for w in corpus if w not in stop_words]

# Step 4: keep alphanumeric words
corpus = [w for w in corpus if w.isalnum()]

# Step 5: lemmatize
def get_wordnet_pos(word):
    """Map POS tag to first character lemmatize() accepts"""
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}
    return tag_dict.get(tag, wordnet.NOUN)

lemmatizer = WordNetLemmatizer()
corpus = [lemmatizer.lemmatize(w, get_wordnet_pos(w)) for w in corpus]

# Step 6: create a simple word cloud
unique_string = " ".join(corpus)
wordcloud = WordCloud(width=1000, height=500).generate(unique_string)
wordcloud.to_file("word_cloud.png")

# Step 7: create a formatted word cloud with a mask
cloud_mask = np.array(Image.open("cloud.png"))
wordcloud = WordCloud(width=1000, height=500, background_color="white",
                      mask=cloud_mask, max_words=5000, contour_width=2, contour_color='black')
wordcloud.generate(unique_string)
wordcloud.to_file("word_cloud_masked.png")

Master Data Visualization for 3 Purposes as a Data Scientist with Full Code Examples

What will we cover?

We will investigate the 3 main purposes of Data Visualization as a Data Scientist.

  • Data Quality: We will demonstrate with examples how you can identify faulty and wrongly formatted data with visualization.
  • Data Exploration: This will teach you how to understand data better with visualization.
  • Data Presentation: Here we explore the purpose most newcomers associate with Data Visualization: presenting the findings. This will focus on how to use Data Visualization to confirm your key findings.

But first, we will understand the power of Data Visualization – understand why it is such a powerful tool to master.

Step 1: The Power of Data Visualization

Let’s consider some data – you can get it with the following code.

import pandas as pd
sample = pd.read_csv('https://raw.githubusercontent.com/LearnPythonWithRune/DataScienceWithPython/main/files/sample_corr.csv')
print(sample)

The output is – can you spot any connection?

Let’s try to visualize the same data.

Matplotlib is an easy-to-use visualization library for Python.

import matplotlib.pyplot as plt
sample.plot.scatter(x='x', y='y')
plt.show()

Giving the following output.

And here it is easy to spot that there is some kind of correlation. Actually, you would be able to absorb this connection no matter how many data points there were, given data of the same nature.
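You can back the visual impression up with a number (a quick check; the exact value depends on the dataset):

print(sample['x'].corr(sample['y']))  # Pearson correlation between the two columns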

What Data Visualization gives

  • Absorb information quickly
  • Improve insights
  • Make faster decisions

Step 2: Data Quality with Visualization

Data Quality is something many like to talk about – but unfortunately there is no precise universal definition of it, and it is rather context-specific.

That said, it is a concept you need to understand as a Data Scientist.

In general, Data Quality is about (but not limited to) the following.

  • Missing data (often represented as NA-values)
  • Wrong data (data which cannot be used)
  • Different scaled data (e.g. data in different units without being specified)
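Missing data, for instance, is easy to quantify directly in pandas before you visualize anything (a minimal sketch, assuming data is any pandas DataFrame):

print(data.isna().sum())  # number of NA values per column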

Sometimes Data Quality is conflated with aspects of Data Wrangling – for example, extracting values from string representations.

Data Quality requires that you know something about the data.

Imagine we are considering a dataset of human heights in centimeters. Then we can check it in a histogram.

data = pd.read_csv('https://raw.githubusercontent.com/LearnPythonWithRune/DataScienceWithPython/main/files/sample_height.csv')
data.plot.hist()
plt.show()

We immediately realize that some of the data is not correct.

Then we can get the data below 50 cm as follows.

print(data[data['height'] < 50])

Looking at that data, you might realize it could have been entered in meters rather than centimeters. This happens often when data is entered by humans: some will type meters instead of centimeters.

This could mean that the data is valid; it just needs rescaling.
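If you decide that is the right interpretation, the fix is a one-liner (a sketch, assuming values below 50 really are meters):

# convert the suspected meter values to centimeters in place
data.loc[data['height'] < 50, 'height'] *= 100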

Another example is checking for outliers – in this case wrong data.

Consider this dataset of human ages.

data = pd.read_csv('files/sample_age.csv')
data.plot.hist()
plt.show()

And you see someone with age around 300.

Similarly, you can get it with data[data['age'] > 150] and see one entry with an age of 314 years.
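If the value is simply wrong, a straightforward option is to drop it (a sketch; whether to drop or correct depends on your context):

# keep only plausible ages
data = data[data['age'] <= 150]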

As you can see, Data Visualization helps you quickly get an idea of Data Quality.

Step 3: Data Exploration with Data Visualization

We already have an idea that Data Visualization helps us absorb information quickly. But not only that: it also improves our insight into the data, enabling us to make faster decisions.

Now we will consider data from the World Bank (The World Bank is a great source of datasets).

Let’s consider the dataset EN.ATM.CO2E.PC.

Now let’s consider some typical Data Visualizations. To get started, we need to read the data.

import pandas as pd
import matplotlib.pyplot as plt
data = pd.read_csv('https://raw.githubusercontent.com/LearnPythonWithRune/DataScienceWithPython/main/files/WorldBank-ATM.CO2E.PC_DS2.csv', index_col=0)
print(data.head())

We see that each year has a row, and each column represents a country (we only see some of them here).

Simple plot

To create a simple plot you can apply the following on a DataFrame.

data['USA'].plot()
plt.show()

A great thing about this is how simple it is to create.

Adding a title and labels is straightforward.

  • title='Title' adds the title
  • xlabel='X label' adds or changes the X-label
  • ylabel='Y label' adds or changes the Y-label
data['USA'].plot(title='CO2 per capita in USA', ylabel='CO2 per capita')
plt.show()

Another thing you can do is to add ranges to the axes.

  • xlim=(min, max) or xlim=min Sets the x-axis range
  • ylim=(min, max) or ylim=min Sets the y-axis range
data['USA'].plot(title='CO2 per capita in USA', ylabel='CO2 per capita', ylim=0)
plt.show()

If you want to compare two columns in the DataFrame you can do it as follows.

data[['USA', 'WLD']].plot(ylim=0)
plt.show()

If you want to set the figure size of the plot, this can be done as follows.

  • figsize=(width, height) in inches
data[['USA', 'DNK', 'WLD']].plot(ylim=0, figsize=(20,6))
plt.show()

Bar Plot

You can create a bar plot as follows.

  • .plot.bar() Create a bar plot
data['USA'].plot.bar(figsize=(20,6))
plt.show()

Bar plot with two columns.

data[['USA', 'WLD']].plot.bar(figsize=(20,6))
plt.show()

Plot a range.

  • .loc[from:to] apply this on the DataFrame to get a range (both inclusive)
data[['USA', 'WLD']].loc[2000:].plot.bar(figsize=(20,6))
plt.show()

Histograms

You can create a histogram as follows.

  • .plot.hist() Create a histogram
  • bins=<number of bins> Specify the number of bins in the histogram.
data['USA'].plot.hist(figsize=(20,6), bins=7)
plt.show()

Pie Chart

You create a Pie Chart as follows.

  • .plot.pie() Creates a Pie Chart
df = pd.Series(data=[3, 5, 7], index=['Data1', 'Data2', 'Data3'])
df.plot.pie()
plt.show()

You can add value counts to your Pie Chart.

  • A simple chart of values above/below a threshold
  • .value_counts() Counts occurrences of values in a Series (or DataFrame column)
  • A few arguments to .plot.pie()
    • colors=<list of colors>
    • labels=<list of labels>
    • title='<title>'
    • ylabel='<label>'
    • autopct='%1.1f%%' sets percentages on chart
(data['USA'] < 17.5).value_counts().plot.pie(colors=['r', 'g'], labels=['>=17.5', '<17.5'], title='CO2 per capita', autopct='%1.1f%%')
plt.show()

Scatter Plot

Assume we want to investigate whether GDP per capita and CO2 per capita are correlated. Then a great way to get an idea about it is a scatter plot.

Let’s try to do that. The data is available; we just need to load it.

import pandas as pd
import matplotlib.pyplot as plt
data = pd.read_csv('https://raw.githubusercontent.com/LearnPythonWithRune/DataScienceWithPython/main/files/co2_gdp_per_capita.csv', index_col=0)
data.plot.scatter(x='CO2 per capita', y='GDP per capita')
plt.show()

It seems there is some weak correlation – this can also be confirmed by calculating the correlation with data.corr(), which shows a correlation of 0.633178.

Step 4: Data Presentation

Data Presentation is about making data easy to digest.

Let’s try to make an example.

Assume we want to give a picture of how US CO2 per capita is compared to the rest of the world.

Preparation

  • Let’s take 2017 (as more recent data is incomplete)
  • What are the mean, max, and min CO2 per capita in the world?
import pandas as pd
import matplotlib.pyplot as plt
data = pd.read_csv('https://raw.githubusercontent.com/LearnPythonWithRune/DataScienceWithPython/main/files/WorldBank-ATM.CO2E.PC_DS2.csv', index_col=0)
year = 2017
print(data.loc[year]['USA'])

This gives 14.8.
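The other questions from the list above are one-liners as well (a sketch; the exact numbers depend on the dataset):

co2_2017 = data.loc[year]
print(co2_2017.mean(), co2_2017.max(), co2_2017.min())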

How can we tell a story?

  • US is above the mean
  • US is not the max
  • It is above 75% of the countries
ax = data.loc[year].plot.hist(bins=15, facecolor='green')
ax.set_xlabel('CO2 per capita')
ax.set_ylabel('Number of countries')
ax.annotate("USA", xy=(15, 5), xytext=(15, 30),
            arrowprops=dict(arrowstyle="->",
                            connectionstyle="arc3"))
plt.show()

This is one way to tell a story.

Want to learn more?

Want to learn more about Data Science to become a successful Data Scientist?

This is one lesson of a 15-part Expert Data Science Blueprint course with the following resources.

  • 15 video lessons – covering the Data Science Workflow and concepts, demonstrating everything on real data, introducing projects and showing a solution (YouTube video).
  • 30 Jupyter Notebooks – with the full code and explanations from the lectures and projects (GitHub).
  • 15 projects – structured with the Data Science Workflow and with the solutions explained at the end of the video lessons (GitHub).