How to Web Scrape Specific Elements in Details

What will we cover?

Web scarping is a highly sought skill today – the reason is that many companies want to monitor competitors pages and scrape specific data. This is no one solution that can solve that task, and it need special code to specific requirements. Also, pages change all the time, hence, they need someone to adjust the scraping when pages change.

But How do you do it? How do you target web scraping of specific elements. Here you will learn how easy it is – and this can be the start of you earning money as a side hustle.

Step 1: What will you scrape?

In this tutorial we scrape google search page. Actually, google search provides a lot of valuable information for free.

If you search Copenhagen Weather you will get something similar to.

Let’s say you want to scrape the location, time, information (Mostly sunny), and temperature.

How would you do that?

Step 2: Use Request to get Webpage

The first we need to do, is to get the content of the google search.

For this you can use the library requests. It is not a standard lib (meaning you need to install it).

It can be installed in a terminal by the following command.

pip install requests

Then the following code will get the content of the webpage (see description below code).

import requests

# Google: 'what is my user agent' and paste into here
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36'}


def weather_info(city):
    city = city.replace(" ", "+")
    res = requests.get(
        f'https://www.google.com/search?q={city}&hl=en',
        headers=headers)

 weather_info("Copenhagen Weather")

First a note on the header.

When you make a request you need it to look like a browser, otherwise many webpages will not respond.

This will require you to insert a header. You can get a header by searching what is my user agent.

Given the header you can make a google search, which is structured by making requests call as to the following URI.

https://www.google.com/search?q=copenhagen+weather&hl=en

This can be done by a formatted string.

f'https://www.google.com/search?q={city}&hl=en'

If you would investigate the result in res, you would realize it contains a lot of data as well as the content in HTML.

This is not very convenient to use. We need some way to extract the data we want easy. This is where we need a library to do the hard work.

Step 3: Identify and Extract elements with BeautifulSoup

A webpage consists of a lot of HTML codes with some tags. It will get clear in a moment.

Let’s first install a library called BeautifulSoup.

pip install beautifulsoup4

This will help you extract elements easy.

First, let’s look at the code.

from bs4 import BeautifulSoup
import requests

# Google: 'what is my user agent' and paste into here
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/15.3 Safari/605.1.15'}


def weather_info(city):
    city = city.replace(" ", "+")
    res = requests.get(
        f'https://www.google.com/search?q={city}&hl=en',
        headers=headers)

    soup = BeautifulSoup(res.text, 'html.parser')

    # To find these - use Developer view and check Elements
    location = soup.select('#wob_loc')[0].getText().strip()
    time = soup.select('#wob_dts')[0].getText().strip()
    info = soup.select('#wob_dc')[0].getText().strip()
    weather = soup.select('#wob_tm')[0].getText().strip()

    print(location)
    print(time)
    print(info)
    print(weather+"°C")


weather_info("Copenhagen Weather")

What happens is, we input the res.text into a BeautifulSoup and then we simple select elements. A sample output could look similar to this.

Copenhagen
Sunday 10.00
Mostly sunny
22°C

That is perfect. We have successfully extracted the data we wanted.

Bonus: You can change the City to something different in the weather_info(…) call.

But no so fast, you might think. How did we get the elements.

Let’s explore this one as an example.

location = soup.select('#wob_loc')[0].getText().strip()

All the magic lies in the #wob_loc, so how did I find it?

I used my browser in developer mode (Here Chrome: Option + Command + J on Mac and Control+Shift+J on Windows).

Then choose the selection tool and click on the element you want.

You see it shows you #wob_loc. (and some more) in the white box above.

This can be done similarly for all elements.

That is basically it.

Leave a Reply Cancel reply

Exit mobile version