How to Web Scrape Specific Elements in Details

What will we cover?

Web scarping is a highly sought skill today – the reason is that many companies want to monitor competitors pages and scrape specific data. This is no one solution that can solve that task, and it need special code to specific requirements. Also, pages change all the time, hence, they need someone to adjust the scraping when pages change.

But How do you do it? How do you target web scraping of specific elements. Here you will learn how easy it is – and this can be the start of you earning money as a side hustle.

Step 1: What will you scrape?

In this tutorial we scrape google search page. Actually, google search provides a lot of valuable information for free.

If you search Copenhagen Weather you will get something similar to.

Let’s say you want to scrape the location, time, information (Mostly sunny), and temperature.

How would you do that?

Step 2: Use Request to get Webpage

The first we need to do, is to get the content of the google search.

For this you can use the library requests. It is not a standard lib (meaning you need to install it).

It can be installed in a terminal by the following command.

pip install requests

Then the following code will get the content of the webpage (see description below code).

import requests

# Google: 'what is my user agent' and paste into here
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36'}


def weather_info(city):
    city = city.replace(" ", "+")
    res = requests.get(
        f'https://www.google.com/search?q={city}&hl=en',
        headers=headers)

 weather_info("Copenhagen Weather")

First a note on the header.

When you make a request you need it to look like a browser, otherwise many webpages will not respond.

This will require you to insert a header. You can get a header by searching what is my user agent.

Given the header you can make a google search, which is structured by making requests call as to the following URI.

https://www.google.com/search?q=copenhagen+weather&hl=en

This can be done by a formatted string.

f'https://www.google.com/search?q={city}&hl=en'

If you would investigate the result in res, you would realize it contains a lot of data as well as the content in HTML.

This is not very convenient to use. We need some way to extract the data we want easy. This is where we need a library to do the hard work.

Step 3: Identify and Extract elements with BeautifulSoup

A webpage consists of a lot of HTML codes with some tags. It will get clear in a moment.

Let’s first install a library called BeautifulSoup.

pip install beautifulsoup4

This will help you extract elements easy.

First, let’s look at the code.

from bs4 import BeautifulSoup
import requests

# Google: 'what is my user agent' and paste into here
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/15.3 Safari/605.1.15'}


def weather_info(city):
    city = city.replace(" ", "+")
    res = requests.get(
        f'https://www.google.com/search?q={city}&hl=en',
        headers=headers)

    soup = BeautifulSoup(res.text, 'html.parser')

    # To find these - use Developer view and check Elements
    location = soup.select('#wob_loc')[0].getText().strip()
    time = soup.select('#wob_dts')[0].getText().strip()
    info = soup.select('#wob_dc')[0].getText().strip()
    weather = soup.select('#wob_tm')[0].getText().strip()

    print(location)
    print(time)
    print(info)
    print(weather+"°C")


weather_info("Copenhagen Weather")

What happens is, we input the res.text into a BeautifulSoup and then we simple select elements. A sample output could look similar to this.

Copenhagen
Sunday 10.00
Mostly sunny
22°C

That is perfect. We have successfully extracted the data we wanted.

Bonus: You can change the City to something different in the weather_info(…) call.

But no so fast, you might think. How did we get the elements.

Let’s explore this one as an example.

location = soup.select('#wob_loc')[0].getText().strip()

All the magic lies in the #wob_loc, so how did I find it?

I used my browser in developer mode (Here Chrome: Option + Command + J on Mac and Control+Shift+J on Windows).

Then choose the selection tool and click on the element you want.

You see it shows you #wob_loc. (and some more) in the white box above.

This can be done similarly for all elements.

That is basically it.

Batch Process Face Detection in 3 Steps with OpenCV

What will you learn?

You want to extract or identify faces on a bunch of images, but how do you do that without becoming a Machine Learning expert?

Here you will learn how to do it without any Machine Learning skills.

Many Machine Learning things are done so often you can just use pre-built Machine Learning models. Here you will learn the task of finding faces and extract locations of them.

Step 1: Pre-built OpenCV models to detect faces

When you think of detecting faces on images, you might get scared. I’ve been there, but there is nothing to be scared of, because some awesome people already did all the hard work for you.

They built a model, which can detect faces on images.

All you need to do, is, to feed it with images and let it do all the work.

This boils down to the following.

  1. We need to know what model to use.
  2. How to feed it with images.
  3. How to use the results it brings and convert it to something useful.

This is what the rest of this tutorial will teach you.

We will use OpenCV and their pre-built detection model haarcascade.

First you should download and install the requirements.

This can be done either by cloning this repository.

Or download the files as a zip-file and unpack them.

You should install opencv-python library. This can be done as follows.

pip install opencv-python

You can also use the requirements.txt file to install it.

pip install -r requirements.txt

Step 2: Detect a face

We will use this image to start with.

The picture is part of the repository from step 1.

Now let’s explore the code in face_detection.py.

# importing opencv
import cv2

# using cv2.CascadeClassifier
# See https://docs.opencv.org/3.4/db/d28/tutorial_cascade_classifier.html
# See more Cascade Classifiers https://github.com/opencv/opencv/tree/4.x/data/haarcascades
face_cascade = cv2.CascadeClassifier("haarcascade_frontalface_default.xml")

img = cv2.imread("sample_images/sample-00.jpg")

# changing the image to gray scale for better face detection
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

faces = face_cascade.detectMultiScale(
    gray,
    scaleFactor=2,  # Big reduction
    minNeighbors=5  # 4-6 range
)

# drawing a rectangle to the image.
# for loop is used to access all the coordinates of the rectangle.
for x, y, w, h in faces:
    cv2.rectangle(img, (x, y), (x+w, y+h), (0, 255, 0), 5)

# showing the detected face followed by the waitKey method.
cv2.imshow("image", img)
cv2.waitKey(0)
cv2.destroyAllWindows()

First notice, that the opencv-python package is imported by import cv2.

Then also, notice we need to run this code in from where the file haarcascade_frontalface_default.xml is located.

After that you will read the image into the variable img. Notice, that this assumes you run the file like they are structure in the GitHub (downloaded in step 1).

When you work with images, you often do not need the level of details given in it. Therefore, the first thing we doit to gray scale the image.

After we have gray scaled the image we use the face detection model (face_cascade.detectMultiScale).

This will give the result faces, which is an iterable.

We want to insert rectangles of the images in the original image (not the gray scaled).

Finally, we show the image and wait until someone hist a key.

Step 3: Batch process face detection

To batch process face detection, a great idea is to build a class to do the face detections. It could be designed in many ways. But the idea is to decouple the filename processing from the actual face detection.

One way to do it could be as follows.

import os
import cv2


class FaceDetector:
    def __init__(self, scale_factor=2, min_neighbors=5):
        self.face_cascade = cv2.CascadeClassifier("haarcascade_frontalface_default.xml")
        self.scale_factor = scale_factor
        self.min_neighbors = min_neighbors
        self.img = None

    def read_image(self, filename):
        self.img = cv2.imread(filename)

    def detect_faces(self):
        gray = cv2.cvtColor(self.img, cv2.COLOR_BGR2GRAY)

        faces = self.face_cascade.detectMultiScale(
            gray,
            scaleFactor=self.scale_factor,
            minNeighbors=self.min_neighbors
        )

        # drawing a rectangle to the image.
        # for loop is used to access all the coordinates of the rectangle.
        for x, y, w, h in faces:
            cv2.rectangle(self.img, (x, y), (x + w, y + h), (0, 255, 0), 5)

        return self.img


face_detector = FaceDetector()

for filename in os.listdir('sample_images/'):
    print(filename)
    face_detector.read_image(f'sample_images/{filename}')
    img = face_detector.detect_faces()

    cv2.imshow("image", img)
    cv2.waitKey(0)
    cv2.destroyAllWindows()

If you want to write the files to storage with face detections, you should exchange the the line cv2.imshow with the following.

    cv2.imwrite(filename, img)

Want to learn more Machine Learning?

You will surprised how easy Machine Learning has become. There are many great and easy to use libraries. All you need to learn is how to train them and use them to predict.

If you want to learn more?

Then I created this 10 hours free Machine Learning course, which will cover all you need.

  • 15 video lessons – which explain Machine Learning concepts, demonstrate models on real data, introduce projects and show a solution (YouTube playlist).
  • 30 JuPyter Notebooks – with the full code and explanation from the lectures and projects (GitHub).
  • 15 projects – with step guides to help you structure your solutions and solution explained in the end of video lessons (GitHub).

The Ultimate Pic Chart Guide for Matplotlib

What will you learn?

Pie charts are one of the most powerful visualizations when presenting them. With a few tricks you can make them look professional with a free tool like Matplotlib.

In the end of this tutorial you will know how to make pie charts and customize it even further.

Basic Pie Chart

First you need to make a basic Pie chart with matplotlib.

import matplotlib.pyplot as plt

v = [2, 5, 3, 1, 4]
labels = ["A", "B", "C", "D", "E"]

plt.pie(v, labels=labels)
plt.show()

This will create a chart based on the values in v with the labels in labels.

Based on the above Pie Chart we can continue to build further understanding of how to create more advanced charts.

Exploding Segment in Pie Charts

An exploding segment in a pie chart is simply moving segments of the pie chart out.

The following example will demonstrate it.

import matplotlib.pyplot as plt

v = [2, 5, 3, 1, 4]
labels = ["A", "B", "C", "D", "E"]
explode = [0, 0.1, 0, 0.2, 0]

plt.pie(v, labels=labels, explode=explode)
plt.show()

Though not very pretty, it shows you how to control each segment.

Now let’s learn a bit more about how to style it.

Styling Pie Charts

The following list sets the most used parameters for the pie chart.

  • labels The labels.
  • colors The colors.
  • explode Indicates offset of each segment.
  • startangle Angle to start from.
  • counterclock Default True and sets direction.
  • shadow Enables shadow effect.
  • wedgeprops Example {"edgecolor":"k",'linewidth': 1}.
  • autopct Format indicating percentage labels "%1.1f%%".
  • pctdistance Controls the position of percentage labels.

We already know the labels from above. But let’s add some more to see the effect.

import matplotlib.pyplot as plt

v = [2, 5, 3, 1, 4]
labels = ["A", "B", "C", "D", "E"]
colors = ["blue", "red", "orange", "purple", "brown"]
explode = [0, 0, 0.1, 0, 0]
wedge_properties = {"edgecolor":"k",'linewidth': 1}

plt.pie(v, labels=labels, explode=explode, colors=colors, startangle=30,
           counterclock=False, shadow=True, wedgeprops=wedge_properties,
           autopct="%1.1f%%", pctdistance=0.7)
plt.title("Color pie chart")
plt.show()

This does a decent job.

Donut Chart

A great chart to play with is the Donut chart.

Actually, pretty simple by setting wedgeprops as this example shows.

import matplotlib.pyplot as plt

v1 = [2, 5, 3, 1, 4]
labels1 = ["A", "B", "C", "D", "E"]
width = 0.3
wedge_properties = {"width":width}

plt.pie(v1, labels=labels1, wedgeprops=wedge_properties)
plt.show()

The width is taken from outside and in.

Legends on Pie Chart

You can add a legend, which uses the labels. Also, notice that you can set the placement (loc) of the legend.

import matplotlib.pyplot as plt

labels = 'Dalmatians', 'Beagles', 'Labradors', 'German Shepherds'
sizes = [6, 5, 20, 9]
fig, ax = plt.subplots()
ax.pie(sizes, labels=labels, autopct='%.1f%%')
ax.legend(labels, loc='lower left')
plt.show()

Nested Donut Pie Chart

This one is needed in any situation to show a bit off.

import matplotlib.pyplot as plt

v1 = [2, 5, 3, 1, 4]
labels1 = ["A", "B", "C", "D", "E"]
v2 = [4, 1, 3, 4, 1]
labels2 = ["V", "W", "X", "Y", "Z"]
width = 0.3
wedge_properties = {"width":width, "edgecolor":"w",'linewidth': 2}

plt.pie(v1, labels=labels1, labeldistance=0.85,
        wedgeprops=wedge_properties)
plt.pie(v2, labels=labels2, labeldistance=0.75,
        radius=1-width, wedgeprops=wedge_properties)
plt.show()

Want to learn more?

Actually Data Visualization is an important skill to understand and present data.

This is a key skill in Data Science. If you like to learn more then check my free Expert Data Science Blueprint course with the following resources.

  • 15 video lessons – covers the Data Science Workflow and concepts, demonstrates everything on real data, introduce projects and shows a solution (YouTube video).
  • 30 JuPyter Notebooks – with the full code and explanation from the lectures and projects (GitHub).
  • 15 projects – structured with the Data Science Workflow and a solution explained in the end of video lessons (GitHub).
Data Science
Exit mobile version