How to Web Scrape Specific Elements in Detail

What will we cover?

Web scraping is a highly sought-after skill today, because many companies want to monitor competitors' pages and extract specific data. There is no one-size-fits-all solution for this task; each job needs code written for its specific requirements. Also, pages change all the time, so companies need someone to adjust the scraping when they do.

But how do you do it? How do you target specific elements when web scraping? Here you will learn how easy it is, and it can be the start of earning money as a side hustle.

Step 1: What will you scrape?

In this tutorial we scrape the Google search page. Google search actually provides a lot of valuable information for free.

If you search for Copenhagen Weather you will get something similar to this.

Let’s say you want to scrape the location, time, information (Mostly sunny), and temperature.

How would you do that?

Step 2: Use requests to Get the Webpage

The first thing we need to do is get the content of the Google search.

For this you can use the requests library. It is not part of the standard library (meaning you need to install it).

It can be installed in a terminal with the following command.

pip install requests

Then the following code will get the content of the webpage (see description below code).

import requests
# Google: 'what is my user agent' and paste into here
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36'}

def weather_info(city):
    city = city.replace(" ", "+")
    res = requests.get(
        f'https://www.google.com/search?q={city}&hl=en',
        headers=headers)
    return res

res = weather_info("Copenhagen Weather")

First a note on the header.

When you make a request, it needs to look like it comes from a browser; otherwise many webpages will not respond properly.

This requires you to insert a User-Agent header. You can find your own by searching for what is my user agent.

Given the header, you can make a Google search by requesting a URL structured like the following.

https://www.google.com/search?q=copenhagen+weather&hl=en

This can be done with a formatted string (an f-string).

f'https://www.google.com/search?q={city}&hl=en'
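The replace(" ", "+") call in the code handles spaces; if you also want to handle other special characters in a query, the standard library's urllib.parse.quote_plus is a more robust alternative (a sketch of an option, not what the code above uses):

```python
from urllib.parse import quote_plus

# quote_plus encodes spaces as '+' and escapes other special characters,
# so arbitrary search terms produce a valid URL.
city = "Copenhagen Weather"
url = f'https://www.google.com/search?q={quote_plus(city)}&hl=en'
print(url)  # https://www.google.com/search?q=Copenhagen+Weather&hl=en
```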

If you investigate the result in res, you will see it contains a lot of metadata as well as the page content in HTML.

This is not very convenient to work with. We need some way to extract the data we want easily. This is where a library can do the hard work for us.

Step 3: Identify and Extract elements with BeautifulSoup

A webpage consists of a lot of HTML code organized with tags. This will become clear in a moment.

Let’s first install a library called BeautifulSoup.

pip install beautifulsoup4

This will help you extract elements easily.

First, let’s look at the code.

from bs4 import BeautifulSoup
import requests
# Google: 'what is my user agent' and paste into here
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/15.3 Safari/605.1.15'}

def weather_info(city):
    city = city.replace(" ", "+")
    res = requests.get(
        f'https://www.google.com/search?q={city}&hl=en',
        headers=headers)
    soup = BeautifulSoup(res.text, 'html.parser')
    # To find these - use Developer view and check Elements
    location = soup.select('#wob_loc')[0].getText().strip()
    time = soup.select('#wob_dts')[0].getText().strip()
    info = soup.select('#wob_dc')[0].getText().strip()
    weather = soup.select('#wob_tm')[0].getText().strip()
    print(location)
    print(time)
    print(info)
    print(weather+"°C")

weather_info("Copenhagen Weather")

What happens is, we feed res.text into a BeautifulSoup object and then simply select elements. A sample output could look similar to this.

Copenhagen
Sunday 10.00
Mostly sunny
22°C

That is perfect. We have successfully extracted the data we wanted.

Bonus: you can change the city to something different in the weather_info(…) call.

But not so fast, you might think. How did we find those elements?

Let’s explore this one as an example.

location = soup.select('#wob_loc')[0].getText().strip()

All the magic lies in the #wob_loc, so how did I find it?

I used my browser in developer mode (Here Chrome: Option + Command + J on Mac and Control+Shift+J on Windows).

Then choose the selection tool and click on the element you want.

You will see that it shows you #wob_loc (and some more) in the box above the element.

This can be done similarly for all elements.
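Since pages change their markup over time, it can pay off to guard each lookup. The sketch below wraps the selection in a small helper (select_text is a hypothetical name, not part of BeautifulSoup) that returns None instead of crashing when a selector no longer matches:

```python
from bs4 import BeautifulSoup

def select_text(soup, selector):
    # Return the stripped text of the first match,
    # or None if the selector no longer exists on the page.
    matches = soup.select(selector)
    return matches[0].getText().strip() if matches else None

# Small HTML sample standing in for a real results page:
soup = BeautifulSoup('<div id="wob_loc"> Copenhagen </div>', 'html.parser')
print(select_text(soup, '#wob_loc'))   # Copenhagen
print(select_text(soup, '#wob_dc'))    # None (selector not on this page)
```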

That is basically it.

Batch Process Face Detection in 3 Steps with OpenCV

What will you learn?

You want to extract or identify faces on a bunch of images, but how do you do that without becoming a Machine Learning expert?

Here you will learn how to do it without any Machine Learning skills.

Many Machine Learning tasks are so common that you can just use pre-built models. Here you will learn how to find faces and extract their locations.

Step 1: Pre-built OpenCV models to detect faces

When you think of detecting faces on images, you might get scared. I’ve been there, but there is nothing to be scared of, because some awesome people already did all the hard work for you.

They built a model, which can detect faces on images.

All you need to do is feed it images and let it do all the work.

This boils down to the following.

  1. We need to know what model to use.
  2. How to feed it with images.
  3. How to use the results and convert them into something useful.

This is what the rest of this tutorial will teach you.

We will use OpenCV and its pre-built Haar cascade detection model (haarcascade).

First you should download and install the requirements.

This can be done either by cloning this repository, or by downloading the files as a zip-file and unpacking them.

You should also install the opencv-python library. This can be done as follows.

pip install opencv-python

You can also use the requirements.txt file to install it.

pip install -r requirements.txt

Step 2: Detect a face

We will use this image to start with.

The picture is part of the repository from step 1.

Now let’s explore the code in face_detection.py.

# importing opencv
import cv2
# using cv2.CascadeClassifier
# See https://docs.opencv.org/3.4/db/d28/tutorial_cascade_classifier.html
# See more Cascade Classifiers https://github.com/opencv/opencv/tree/4.x/data/haarcascades
face_cascade = cv2.CascadeClassifier("haarcascade_frontalface_default.xml")
img = cv2.imread("sample_images/sample-00.jpg")
# changing the image to gray scale for better face detection
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
faces = face_cascade.detectMultiScale(
    gray,
    scaleFactor=2,  # Big reduction
    minNeighbors=5  # 4-6 range
)
# draw a green rectangle around each detected face
# each face is given as (x, y, width, height)
for x, y, w, h in faces:
    cv2.rectangle(img, (x, y), (x+w, y+h), (0, 255, 0), 5)
# show the image and wait for a key press
cv2.imshow("image", img)
cv2.waitKey(0)
cv2.destroyAllWindows()

First notice, that the opencv-python package is imported by import cv2.

Also notice that we need to run this code from the directory where the file haarcascade_frontalface_default.xml is located.

After that you read the image into the variable img. Notice that this assumes the files are structured as in the GitHub repository (downloaded in step 1).

When you work with images, you often do not need the level of detail in them. Therefore, the first thing we do is to grayscale the image.

After we have grayscaled the image, we use the face detection model (face_cascade.detectMultiScale).

This will give the result faces, which is an iterable.

We want to draw the rectangles on the original image (not the grayscaled one).

Finally, we show the image and wait until someone hits a key.
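To see what detectMultiScale actually returns, you can print faces. It is an array of (x, y, width, height) rectangles, one per detected face. The values below are made-up, illustrative coordinates, not output from a real detection:

```python
import numpy as np

# Illustrative result of detectMultiScale: one row per detected face.
faces = np.array([[120, 45, 80, 80],
                  [300, 60, 75, 75]])
for x, y, w, h in faces:
    print(f"face at ({x}, {y}), size {w}x{h}")
```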

Step 3: Batch process face detection

To batch process face detection, a great approach is to build a class that does the face detection. It could be designed in many ways, but the idea is to decouple the filename handling from the actual face detection.

One way to do it could be as follows.

import os
import cv2

class FaceDetector:
    def __init__(self, scale_factor=2, min_neighbors=5):
        self.face_cascade = cv2.CascadeClassifier("haarcascade_frontalface_default.xml")
        self.scale_factor = scale_factor
        self.min_neighbors = min_neighbors
        self.img = None
    def read_image(self, filename):
        self.img = cv2.imread(filename)
    def detect_faces(self):
        gray = cv2.cvtColor(self.img, cv2.COLOR_BGR2GRAY)
        faces = self.face_cascade.detectMultiScale(
            gray,
            scaleFactor=self.scale_factor,
            minNeighbors=self.min_neighbors
        )
        # draw a green rectangle around each detected face
        # each face is given as (x, y, width, height)
        for x, y, w, h in faces:
            cv2.rectangle(self.img, (x, y), (x + w, y + h), (0, 255, 0), 5)
        return self.img

face_detector = FaceDetector()
for filename in os.listdir('sample_images/'):
    print(filename)
    face_detector.read_image(f'sample_images/{filename}')
    img = face_detector.detect_faces()
    cv2.imshow("image", img)
    cv2.waitKey(0)
    cv2.destroyAllWindows()

If you want to write the files with face detections to storage, you should exchange the line cv2.imshow with the following.

    cv2.imwrite(filename, img)
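Note that cv2.imwrite(filename, img) writes into the current working directory. To keep the originals untouched, you could write the results into a separate output folder; the helper below (output_path and output_images are hypothetical names for this sketch) is one way to do that:

```python
import os

def output_path(filename, out_dir="output_images"):
    # Create the output directory if needed and build the target path,
    # so the originals in sample_images/ are never overwritten.
    os.makedirs(out_dir, exist_ok=True)
    return os.path.join(out_dir, filename)

print(output_path("sample-00.jpg"))  # e.g. output_images/sample-00.jpg
```

You would then call cv2.imwrite(output_path(filename), img) inside the loop.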

Want to learn more Machine Learning?

You will be surprised how easy Machine Learning has become. There are many great and easy-to-use libraries. All you need to learn is how to train models and use them to predict.

Do you want to learn more?

Then I created this free 10-hour Machine Learning course, which covers all you need.

  • 15 video lessons – which explain Machine Learning concepts, demonstrate models on real data, introduce projects and show a solution (YouTube playlist).
  • 30 Jupyter Notebooks – with the full code and explanations from the lectures and projects (GitHub).
  • 15 projects – with step guides to help you structure your solutions, and solutions explained at the end of the video lessons (GitHub).

The Ultimate Pie Chart Guide for Matplotlib

What will you learn?

Pie charts are one of the most powerful visualizations for presenting data. With a few tricks you can make them look professional with a free tool like Matplotlib.

By the end of this tutorial you will know how to make pie charts and customize them even further.

Basic Pie Chart

First you need to make a basic pie chart with Matplotlib.

import matplotlib.pyplot as plt
v = [2, 5, 3, 1, 4]
labels = ["A", "B", "C", "D", "E"]
plt.pie(v, labels=labels)
plt.show()

This will create a chart based on the values in v with the labels in labels.

Based on the above Pie Chart we can continue to build further understanding of how to create more advanced charts.

Exploding Segment in Pie Charts

An exploding segment in a pie chart simply means moving segments of the pie chart outward.

The following example will demonstrate it.

import matplotlib.pyplot as plt
v = [2, 5, 3, 1, 4]
labels = ["A", "B", "C", "D", "E"]
explode = [0, 0.1, 0, 0.2, 0]
plt.pie(v, labels=labels, explode=explode)
plt.show()

Though not very pretty, it shows you how to control each segment.

Now let’s learn a bit more about how to style it.

Styling Pie Charts

The following list covers the most used parameters for the pie chart.

  • labels – the labels for each segment.
  • colors – the colors for each segment.
  • explode – the offset of each segment from the center.
  • startangle – the angle to start from.
  • counterclock – defaults to True and sets the direction of the segments.
  • shadow – enables a shadow effect.
  • wedgeprops – properties of the wedges, for example {"edgecolor": "k", 'linewidth': 1}.
  • autopct – format string for the percentage labels, for example "%1.1f%%".
  • pctdistance – controls the position of the percentage labels.

We already know the labels from above. But let’s add some more to see the effect.

import matplotlib.pyplot as plt
v = [2, 5, 3, 1, 4]
labels = ["A", "B", "C", "D", "E"]
colors = ["blue", "red", "orange", "purple", "brown"]
explode = [0, 0, 0.1, 0, 0]
wedge_properties = {"edgecolor":"k",'linewidth': 1}
plt.pie(v, labels=labels, explode=explode, colors=colors, startangle=30,
           counterclock=False, shadow=True, wedgeprops=wedge_properties,
           autopct="%1.1f%%", pctdistance=0.7)
plt.title("Color pie chart")
plt.show()

This does a decent job.

Donut Chart

A great chart to play with is the Donut chart.

It is actually pretty simple: just set wedgeprops, as this example shows.

import matplotlib.pyplot as plt
v1 = [2, 5, 3, 1, 4]
labels1 = ["A", "B", "C", "D", "E"]
width = 0.3
wedge_properties = {"width":width}
plt.pie(v1, labels=labels1, wedgeprops=wedge_properties)
plt.show()

The width is measured from the outer edge inward.

Legends on Pie Chart

You can add a legend, which uses the labels. Also, notice that you can set the placement (loc) of the legend.

import matplotlib.pyplot as plt
labels = 'Dalmatians', 'Beagles', 'Labradors', 'German Shepherds'
sizes = [6, 5, 20, 9]
fig, ax = plt.subplots()
ax.pie(sizes, labels=labels, autopct='%.1f%%')
ax.legend(labels, loc='lower left')
plt.show()

Nested Donut Pie Chart

This one is handy when you want to show off a bit.

import matplotlib.pyplot as plt
v1 = [2, 5, 3, 1, 4]
labels1 = ["A", "B", "C", "D", "E"]
v2 = [4, 1, 3, 4, 1]
labels2 = ["V", "W", "X", "Y", "Z"]
width = 0.3
wedge_properties = {"width":width, "edgecolor":"w",'linewidth': 2}
plt.pie(v1, labels=labels1, labeldistance=0.85,
        wedgeprops=wedge_properties)
plt.pie(v2, labels=labels2, labeldistance=0.75,
        radius=1-width, wedgeprops=wedge_properties)
plt.show()

Want to learn more?

Data visualization is an important skill for understanding and presenting data.

This is a key skill in Data Science. If you would like to learn more, then check out my free Expert Data Science Blueprint course with the following resources.

  • 15 video lessons – covering the Data Science Workflow and concepts, demonstrating everything on real data, introducing projects and showing a solution (YouTube video).
  • 30 Jupyter Notebooks – with the full code and explanations from the lectures and projects (GitHub).
  • 15 projects – structured with the Data Science Workflow and a solution explained at the end of the video lessons (GitHub).

11 Useful pandas Charts with One Line of Code

What will you learn?

When trying to understand data, visualization is the key to understanding it fast!

Data visualization has 3 purposes.

  1. Data Quality: Finding outliers and missing data.
  2. Data Exploration: Understand the data.
  3. Data Presentation: Present the result.

Here you will learn 11 useful charts for understanding your data, each done in one line of code.

The data we will work with

We need some data to work with.

You can either download the Notebook and csv-file (GitHub repo) or read it directly from the repository as follows.

import pandas as pd
import matplotlib.pyplot as plt
file_url = 'https://raw.githubusercontent.com/LearnPythonWithRune/pandas_charts/main/air_quality.csv'
data = pd.read_csv(file_url, index_col=0, parse_dates=True)
print(data)

This will output a preview of the data (pandas truncates long DataFrames to the first and last rows).

Now let’s use the data we have in the DataFrame data.

If you are new to pandas, I suggest you get an understanding of them from this guide.

#1 Simple plot

A simple plot is the default to use unless you know what you want. It will demonstrate the nature of the data.

Let’s try to do it here.

data.plot()

As you notice, there are three columns of data for the 3 stations: Antwerp, Paris, and London.

The data has a datetime index, meaning each data point belongs to a time series (the x-axis).

It is a bit difficult to see if station Antwerp has a full dataset.

Let’s try to figure that out.

#2 Isolated plot

This leads us to making an isolated plot of only one column. This is handy for understanding each individual column better.

Here we were a bit curious whether the data for station Antwerp covers all dates.

data['station_antwerp'].plot()

This shows that our suspicion was correct: the time series does not cover the full range for station Antwerp.

This tells us about the data quality, which might be crucial for further analysis.

You can do the same for the other two columns.
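You can also quantify the gap instead of eyeballing it. With the tutorial DataFrame you would call data.isna().sum() directly; the sketch below uses a small stand-in DataFrame so it runs on its own:

```python
import pandas as pd

# Stand-in data with a gap like the one seen for station_antwerp.
data = pd.DataFrame({
    'station_antwerp': [20.0, None, None, 24.5],
    'station_paris': [24.4, 27.4, 26.5, 28.1],
})
# Count missing values per column.
print(data.isna().sum())
```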

#3 Scatter Plot

A great way to see whether two columns are correlated is to make a scatter plot.

Let’s demonstrate what that looks like.

data.plot.scatter(x='station_london', y='station_paris', alpha=.25)

You see that the data is not scattered all over, but it is not fully correlated either. This means there is some weak correlation, and the columns are not fully independent of each other.
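To put a number on that correlation, pandas can compute the pairwise correlation matrix with data.corr(). The sketch below uses stand-in values so it is self-contained; with the tutorial DataFrame you would call data.corr() directly:

```python
import pandas as pd

data = pd.DataFrame({
    'station_london': [21.8, 24.4, 26.5, 22.1, 25.0],
    'station_paris': [24.4, 27.4, 26.5, 28.1, 27.0],
})
# Pearson correlation: close to 1 means strong linear correlation,
# close to 0 means little correlation.
print(data.corr())
```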

#4 Box Plot

One way to understand data better is with a box plot. It requires a bit of understanding of simple statistics.

Let’s first take a look at it.

data.plot.box()

The box plot shows the following.

To understand what outliers, min, median, max, and so forth mean, I would suggest you read this simple statistics guide.

#5 Area Plot

An area plot shows how the values follow each other in a visually easy way, giving an understanding of values, correlation, and missing data.

data.plot.area(figsize=(12,4), subplots=True)

#6 Bar plots

Bar plots can be useful, but mostly when the data is limited.

Here you see a bar plot of the first 15 rows of data.

data.iloc[:15].plot.bar()

#7 Histograms for single column

Histograms show you which values are most common. They show the frequencies of the data divided into bins. By default there are 10 bins.

It is an amazing tool to get a fast view of the number of occurrences of each data range.

Here first for an isolated station.

data['station_paris'].plot.hist()
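The default of 10 bins can hide detail; the bins parameter controls the granularity. A self-contained sketch with stand-in values (the Agg backend is set only so the sketch runs headless):

```python
import matplotlib
matplotlib.use('Agg')  # headless backend so the sketch runs anywhere
import pandas as pd

data = pd.DataFrame({'station_paris': [24.4, 27.4, 26.5, 28.1, 27.0, 25.2]})
# More bins give a finer-grained view of the distribution.
ax = data['station_paris'].plot.hist(bins=30)
```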

#8 Histograms for multiple columns

Then for all three stations, where you see it with transparency (alpha).

data.plot.hist(alpha=.5)

#9 Pie

Pie charts are very powerful, when you want to show a division of data.

That is, how many percent belong to each category.

Here you see the mean value of each station.

data.mean().plot.pie()
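To show the actual percentages on the segments, you can add the autopct parameter from the styling section above. A self-contained sketch with stand-in values (the Agg backend is set only so the sketch runs headless):

```python
import matplotlib
matplotlib.use('Agg')  # headless backend so the sketch runs anywhere
import pandas as pd

data = pd.DataFrame({
    'station_antwerp': [22.0, 24.0],
    'station_paris': [26.0, 28.0],
    'station_london': [21.0, 23.0],
})
# Mean value per station, shown as a pie with percentage labels.
ax = data.mean().plot.pie(autopct='%1.1f%%')
```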

#10 Scatter Matrix Plot

This is a great tool for showing the columns combined in all possible pairs. It will show you correlations and how the data is distributed.

You need to import an additional function, but it gives you a fast understanding of the data.

from pandas.plotting import scatter_matrix
scatter_matrix(data, alpha=0.2, figsize=(6, 6))

#11 Secondary y-axis

Finally, sometimes you want two plots on the same chart. The problem can be that the two plots have very different ranges; hence, you would like two different y-axes.

This will enable you to have plots on the same chart with different ranges.

data['station_london'].plot()
data['station_paris'].cumsum().plot(secondary_y=True)

Want to learn more?

Want to learn more about Data Science to become a successful Data Scientist?

Then check my free Expert Data Science Blueprint course with the following resources.

  • 15 video lessons – covering the Data Science Workflow and concepts, demonstrating everything on real data, introducing projects and showing a solution (YouTube video).
  • 30 Jupyter Notebooks – with the full code and explanations from the lectures and projects (GitHub).
  • 15 projects – structured with the Data Science Workflow and a solution explained at the end of the video lessons (GitHub).