How to Web Scrape Specific Elements in Detail

What will we cover?

Web scraping is a highly sought-after skill today. Many companies want to monitor competitors' pages and scrape specific data from them. There is no single solution for that task; each case needs code written to its specific requirements. Also, pages change all the time, so someone has to adjust the scraper when they do.

But how do you do it? How do you target specific elements when web scraping? Here you will learn how easy it is, and this can be the start of earning money as a side hustle.

Step 1: What will you scrape?

In this tutorial we scrape the Google search page. Google search actually provides a lot of valuable information for free.

If you search for Copenhagen Weather, you will get a weather summary for the city at the top of the results.

Let’s say you want to scrape the location, time, information (Mostly sunny), and temperature.

How would you do that?

Step 2: Use Requests to get the Webpage

The first thing we need to do is to get the content of the Google search.

For this you can use the library requests. It is not part of the standard library, meaning you need to install it.

It can be installed in a terminal by the following command.

pip install requests

Then the following code will get the content of the webpage (see description below code).

import requests
# Google: 'what is my user agent' and paste into here
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36'}

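def weather_info(city):
    # Google separates search terms with '+'
    city = city.replace(" ", "+")
    res = requests.get(
        f'https://www.google.com/search?q={city}&hl=en',
        headers=headers)
    # Return the response so it can be inspected afterwards
    return res

res = weather_info("Copenhagen Weather")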

First a note on the header.

When you make a request, it needs to look like it comes from a browser; otherwise many webpages will not respond.

This requires you to send a User-Agent header. You can find yours by searching for what is my user agent.

With the header in place you can make a Google search by requesting the following URL.

https://www.google.com/search?q=copenhagen+weather&hl=en

This URL can be built with a formatted string.

f'https://www.google.com/search?q={city}&hl=en'
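Replacing spaces by hand works for simple searches, but it breaks if the search text contains characters like & or ?. As a minimal sketch, the standard library can build the same URL and escape such characters for you. The function name build_search_url is made up for this example.

```python
from urllib.parse import urlencode


def build_search_url(query, language="en"):
    """Build a Google search URL with a properly escaped query string.

    build_search_url is a hypothetical helper for illustration only.
    """
    params = {"q": query, "hl": language}
    # urlencode turns spaces into '+' and escapes special characters
    return "https://www.google.com/search?" + urlencode(params)


print(build_search_url("Copenhagen Weather"))
```

The returned string can be passed to requests.get(...) in place of the formatted string.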

If you investigate the result in res, you will realize it contains a lot of data, including the page content as HTML.

This is not very convenient to use. We need a way to extract the data we want easily. This is where a library does the hard work.

Step 3: Identify and Extract elements with BeautifulSoup

A webpage consists of a lot of HTML code structured with tags. It will become clear in a moment.

Let’s first install a library called BeautifulSoup.

pip install beautifulsoup4

This will help you extract elements easily.

First, let’s look at the code.

from bs4 import BeautifulSoup
import requests
# Google: 'what is my user agent' and paste into here
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/15.3 Safari/605.1.15'}

def weather_info(city):
    city = city.replace(" ", "+")
    res = requests.get(
        f'https://www.google.com/search?q={city}&hl=en',
        headers=headers)
    soup = BeautifulSoup(res.text, 'html.parser')
    # To find these - use Developer view and check Elements
    location = soup.select('#wob_loc')[0].getText().strip()
    time = soup.select('#wob_dts')[0].getText().strip()
    info = soup.select('#wob_dc')[0].getText().strip()
    weather = soup.select('#wob_tm')[0].getText().strip()
    print(location)
    print(time)
    print(info)
    print(weather+"°C")

weather_info("Copenhagen Weather")

What happens is that we pass res.text into BeautifulSoup and then simply select elements. A sample output could look similar to this.

Copenhagen
Sunday 10.00
Mostly sunny
22°C

That is perfect. We have successfully extracted the data we wanted.

Bonus: You can change the city to something different in the weather_info(…) call.

But not so fast, you might think. How did we get the elements?

Let’s explore this one as an example.

location = soup.select('#wob_loc')[0].getText().strip()

All the magic lies in the #wob_loc, so how did I find it?

I used my browser in developer mode (Here Chrome: Option + Command + J on Mac and Control+Shift+J on Windows).

Then choose the selection tool and click on the element you want.

You will see that it shows #wob_loc (and some more) in the box next to the selected element.

This can be done similarly for all elements.
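Since pages change all the time, a selector that works today may match nothing tomorrow, and indexing with [0] will then raise an error. Here is a small sketch of a more forgiving lookup. The HTML is a made-up, heavily simplified stand-in for the real page, and select_text is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Made-up, simplified HTML; the real Google page is far larger.
html = """
<div>
  <span id="wob_loc"> Copenhagen </span>
  <span id="wob_dc">Mostly sunny</span>
  <span id="wob_tm">22</span>
</div>
"""


def select_text(soup, css_selector):
    """Return the stripped text of the first match, or None if nothing matches."""
    element = soup.select_one(css_selector)
    if element is None:
        return None
    return element.get_text().strip()


soup = BeautifulSoup(html, "html.parser")
print(select_text(soup, "#wob_loc"))      # element exists
print(select_text(soup, "#wob_missing"))  # element does not exist
```

Checking for None lets you notice a changed page and adjust the selector instead of crashing.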

That is basically it.

The Ultimate Pie Chart Guide for Matplotlib

What will you learn?

Pie charts are one of the most powerful visualizations when presenting data. With a few tricks you can make them look professional with a free tool like Matplotlib.

At the end of this tutorial you will know how to make pie charts and how to customize them even further.

Basic Pie Chart

First you need to make a basic Pie chart with matplotlib.

import matplotlib.pyplot as plt
v = [2, 5, 3, 1, 4]
labels = ["A", "B", "C", "D", "E"]
plt.pie(v, labels=labels)
plt.show()

This will create a chart based on the values in v with the labels in labels.

Based on the above Pie Chart we can continue to build further understanding of how to create more advanced charts.

Exploding Segment in Pie Charts

Exploding a segment simply means moving that segment out from the center of the pie chart.

The following example will demonstrate it.

import matplotlib.pyplot as plt
v = [2, 5, 3, 1, 4]
labels = ["A", "B", "C", "D", "E"]
explode = [0, 0.1, 0, 0.2, 0]
plt.pie(v, labels=labels, explode=explode)
plt.show()

Though not very pretty, it shows you how to control each segment.
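Writing the explode list by hand gets error-prone when there are many segments. As a small sketch, you can build it from the labels instead. The helper name explode_for is made up for this example and is not part of Matplotlib.

```python
def explode_for(labels, offsets):
    """Build an explode list: the given offset for named labels, 0 for the rest.

    explode_for is a hypothetical helper for illustration only.
    """
    return [offsets.get(label, 0) for label in labels]


labels = ["A", "B", "C", "D", "E"]
explode = explode_for(labels, {"B": 0.1, "D": 0.2})
print(explode)

# The result can be passed on unchanged:
# plt.pie(v, labels=labels, explode=explode)
```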

Now let’s learn a bit more about how to style it.

Styling Pie Charts

The following list shows the most used parameters for the pie chart.

  • labels The labels.
  • colors The colors.
  • explode Indicates offset of each segment.
  • startangle Angle to start from.
  • counterclock Default True and sets direction.
  • shadow Enables shadow effect.
  • wedgeprops Example {"edgecolor":"k",'linewidth': 1}.
  • autopct Format indicating percentage labels "%1.1f%%".
  • pctdistance Controls the position of percentage labels.

We already know the labels from above. But let’s add some more to see the effect.

import matplotlib.pyplot as plt
v = [2, 5, 3, 1, 4]
labels = ["A", "B", "C", "D", "E"]
colors = ["blue", "red", "orange", "purple", "brown"]
explode = [0, 0, 0.1, 0, 0]
wedge_properties = {"edgecolor":"k",'linewidth': 1}
plt.pie(v, labels=labels, explode=explode, colors=colors, startangle=30,
           counterclock=False, shadow=True, wedgeprops=wedge_properties,
           autopct="%1.1f%%", pctdistance=0.7)
plt.title("Color pie chart")
plt.show()

This does a decent job.
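If you wonder where the percentage labels come from: each value is divided by the sum of all values, and the result is formatted with the autopct string. The sketch below redoes that arithmetic by hand for the values used above.

```python
v = [2, 5, 3, 1, 4]
total = sum(v)

# Same format string as passed to autopct above
percent_labels = ["%1.1f%%" % (100 * value / total) for value in v]
print(percent_labels)
```

This also shows why the values themselves do not need to add up to 100.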

Donut Chart

A great chart to play with is the Donut chart.

It is actually pretty simple: set a width in wedgeprops, as this example shows.

import matplotlib.pyplot as plt
v1 = [2, 5, 3, 1, 4]
labels1 = ["A", "B", "C", "D", "E"]
width = 0.3
wedge_properties = {"width":width}
plt.pie(v1, labels=labels1, wedgeprops=wedge_properties)
plt.show()

The width is measured from the outer edge inward.

Legends on Pie Chart

You can add a legend, which uses the labels. Also, notice that you can set the placement (loc) of the legend.

import matplotlib.pyplot as plt
labels = 'Dalmatians', 'Beagles', 'Labradors', 'German Shepherds'
sizes = [6, 5, 20, 9]
fig, ax = plt.subplots()
ax.pie(sizes, labels=labels, autopct='%.1f%%')
ax.legend(labels, loc='lower left')
plt.show()

Nested Donut Pie Chart

This one is handy whenever you want to show off a bit.

import matplotlib.pyplot as plt
v1 = [2, 5, 3, 1, 4]
labels1 = ["A", "B", "C", "D", "E"]
v2 = [4, 1, 3, 4, 1]
labels2 = ["V", "W", "X", "Y", "Z"]
width = 0.3
wedge_properties = {"width":width, "edgecolor":"w",'linewidth': 2}
plt.pie(v1, labels=labels1, labeldistance=0.85,
        wedgeprops=wedge_properties)
plt.pie(v2, labels=labels2, labeldistance=0.75,
        radius=1-width, wedgeprops=wedge_properties)
plt.show()
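The reason the inner ring uses radius=1-width is that each ring is drawn from its radius inward by width. A quick sketch of that arithmetic, with a made-up helper name ring_bounds, shows that the two rings meet exactly.

```python
def ring_bounds(radius, width):
    """Return (inner_edge, outer_edge) of a donut ring.

    ring_bounds is a hypothetical helper for illustration only.
    """
    return (radius - width, radius)


width = 0.3
outer_ring = ring_bounds(1, width)          # default pie radius is 1
inner_ring = ring_bounds(1 - width, width)  # starts where the outer ring ends

print(outer_ring)
print(inner_ring)
```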

Want to learn more?

Data visualization is an important skill for understanding and presenting data.

This is a key skill in Data Science. If you would like to learn more, then check my free Expert Data Science Blueprint course with the following resources.

  • 15 video lessons – cover the Data Science Workflow and concepts, demonstrate everything on real data, introduce projects and show a solution (YouTube video).
  • 30 Jupyter Notebooks – with the full code and explanation from the lectures and projects (GitHub).
  • 15 projects – structured with the Data Science Workflow and a solution explained at the end of the video lessons (GitHub).

11 Useful pandas Charts with One Line of Code

What will you learn?

When trying to understand data, visualization is the key to understanding it fast!

Data visualization has 3 purposes.

  1. Data Quality: Finding outliers and missing data.
  2. Data Exploration: Understand the data.
  3. Data Presentation: Present the result.

Here you will learn 11 useful charts for understanding your data, each done in one line of code.

The data we will work with

We need some data to work with.

You can either download the Notebook and csv-file (GitHub repo) or read it directly from the repository as follows.

import pandas as pd
import matplotlib.pyplot as plt
file_url = 'https://raw.githubusercontent.com/LearnPythonWithRune/pandas_charts/main/air_quality.csv'
data = pd.read_csv(file_url, index_col=0, parse_dates=True)
print(data.head())

This will output the first 5 lines of the data.

Now let’s use the data we have in the DataFrame data.

If you are new to pandas, I suggest you get an understanding of it from this guide.

#1 Simple plot

A simple plot is the default to use unless you know what you want. It will demonstrate the nature of the data.

Let’s try to do it here.

data.plot()

As you notice, there are three columns of data for the 3 stations: Antwerp, Paris, and London.

The data is a datetime series, meaning that each data point is part of a time series (the x-axis).

It is a bit difficult to see if station Antwerp has a full dataset.

Let’s try to figure that out.

#2 Isolated plot

This leads us to making an isolated plot of only one column. This is handy to understand each individual column of data better.

Here we were curious whether the data of station Antwerp was given for all dates.

data['station_antwerp'].plot()

This shows that our suspicion was correct. The time series is not covering the full range for station Antwerp.

This tells us about the data quality, which might be crucial for further analysis.
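You can back up what the plot shows with numbers. The sketch below uses a tiny made-up DataFrame (not the real air quality file) with the same kind of gaps, counts the missing values per column, and finds the first and last date that actually hold a value.

```python
import pandas as pd

# Made-up stand-in for the air quality data: the Antwerp column has gaps.
nan = float("nan")
index = pd.date_range("2019-05-07", periods=6, freq="D")
sample = pd.DataFrame(
    {
        "station_antwerp": [nan, nan, 26.5, nan, 21.0, nan],
        "station_paris": [25.0, 27.5, 50.5, 61.5, 72.5, 77.0],
    },
    index=index,
)

# Number of missing values per column
missing = sample.isna().sum()
print(missing)

# First and last timestamp with an actual value for Antwerp
first_valid = sample["station_antwerp"].first_valid_index()
last_valid = sample["station_antwerp"].last_valid_index()
print(first_valid, last_valid)
```

The same calls should work on the DataFrame data from above.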

You can do the same for the other two columns.

#3 Scatter Plot

A great way to see if the data is correlated is to make a scatter plot.

Let’s demonstrate what that looks like.

data.plot.scatter(x='station_london', y='station_paris', alpha=.25)

You see that the data is not scattered all over, but it is not fully correlated either. This means that there is some weak correlation in the data and the two columns are not fully independent of each other.
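If you want a number to go with the visual impression, pandas can compute the correlation coefficient of two columns. This is a sketch on made-up values, not the real measurements: a value near 1 or -1 means strongly correlated, a value near 0 means little linear relationship.

```python
import pandas as pd

# Made-up values for illustration only
sample = pd.DataFrame(
    {
        "station_london": [1, 2, 3, 4, 5],
        "station_paris": [2, 1, 4, 3, 5],
    }
)

# Pearson correlation coefficient between the two columns
r = sample["station_london"].corr(sample["station_paris"])
print(round(r, 2))
```

On the real data the call would be data['station_london'].corr(data['station_paris']).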

#4 Box Plot

One way to understand data better is by a box plot. It might need a bit of understanding of simple statistics.

Let’s first take a look at it.

data.plot.box()

For each column, the box plot shows the median, the quartiles (the box), the whiskers, and the outliers.

To understand what outliers, min, median, max, and so forth mean, I would suggest you read this simple statistics guide.
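To make these terms a bit more concrete, here is a sketch that computes them by hand on made-up numbers. It assumes the common 1.5 × IQR rule for outliers, which is also the default whisker setting of Matplotlib's box plot.

```python
import statistics

# Made-up measurements with one suspiciously large value
values = [1, 2, 3, 4, 5, 6, 7, 8, 100]

# Quartiles, using linear interpolation between data points
q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
iqr = q3 - q1  # interquartile range: the height of the box

# Points beyond 1.5 * IQR from the box are treated as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [v for v in values if v < lower_fence or v > upper_fence]

print(q1, median, q3)
print(outliers)
```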

#5 Area Plot

An area plot is a great way to see how the values follow each other. It gives you an easy visual understanding of values, correlation, and missing data.

data.plot.area(figsize=(12,4), subplots=True)

#6 Bar plots

Bar plots can be useful, but mostly when the amount of data is limited.

Here you see a bar plot of the first 15 rows of data.

data.iloc[:15].plot.bar()

#7 Histograms for single column

Histograms will show you what data is most common. It shows the frequencies of data divided into bins. By default there are 10 bins of data.

It is an amazing tool to get a fast view of the number of occurrences of each data range.

Here first for an isolated station.

data['station_paris'].plot.hist()
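To see what dividing data into bins means, here is a simplified sketch that counts values in 10 equal-width bins by hand. The values are made up, and the function name histogram_counts is hypothetical; the real implementation handles edge cases more carefully.

```python
def histogram_counts(values, bins=10):
    """Count values in equal-width bins between min and max.

    Simplified illustration; assumes at least two distinct values.
    """
    low, high = min(values), max(values)
    if low == high:
        raise ValueError("need at least two distinct values")
    width = (high - low) / bins
    counts = [0] * bins
    for value in values:
        # The largest value belongs to the last bin
        index = min(int((value - low) / width), bins - 1)
        counts[index] += 1
    return counts


values = [0, 5, 12, 18, 25, 33, 47, 51, 58, 100]
counts = histogram_counts(values)
print(counts)
```

Each number in the result is the height of one bar in the histogram.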

#8 Histograms for multiple columns

Then for all three stations, where you see it with transparency (alpha).

data.plot.hist(alpha=.5)

#9 Pie

Pie charts are very powerful when you want to show how data is divided.

That is, what percentage belongs to each category.

Here you see the mean value of each station.

data.mean().plot.pie()

#10 Scatter Matrix Plot

This is a great tool for showing the columns combined in all possible ways. It will show you correlations and how the data is distributed.

You need an additional import, but it gives you a fast understanding of the data.

from pandas.plotting import scatter_matrix
scatter_matrix(data, alpha=0.2, figsize=(6, 6))

#11 Secondary y-axis

Finally, sometimes you want two plots on the same chart. The problem can be that the two plots have very different ranges. Hence, you would like to have two different y-axes with different ranges.

This will enable you to have plots on the same chart with different ranges.

data['station_london'].plot()
data['station_paris'].cumsum().plot(secondary_y=True)
