Web Scraping and Data Wrangling with pandas in 3 Steps

What will we cover?

In this tutorial you will learn how to

  • Scrape the web with pandas
  • Do common Data Wrangling on data from the web

Step 1: What is Web Scraping and Considerations Before You Start

Web Scraping is extracting data from websites.

The legality of Web Scraping varies across the world. In general, Web Scraping may be against the terms of use of some websites, but the enforceability of these terms is unclear (source: Wikipedia).

In general, if you only use the scraped data privately, you should be on the safe side.

Step 2: Web Scraping with pandas

Look at this page: Fundraising statistics.

This page has a table of data, and what makes it interesting is that the values are written with dollar signs and commas. When we scrape them, you will see that they are represented as strings.

Are you ready to try?

import pandas as pd

url = "https://en.wikipedia.org/wiki/Wikipedia:Fundraising_statistics"

data = pd.read_html(url)

fundraising = data[0]

print(fundraising.head())
It might not be clear from the printed output, but the columns Revenue, Expenses, Asset rise, and Total assets all contain strings.
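One way to confirm this without relying on the printed table is to inspect the column dtypes. The sketch below uses a small hand-made stand-in frame so that it runs without network access. The values in sample are illustrative only and are not taken from the page.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Hand-made stand-in for the scraped table; the values are illustrative only.
sample = pd.DataFrame({
    "Year": [2020, 2021],
    "Revenue": ["$ 1,000,000", "$ 2,500,000"],
    "Expenses": ["$ 800,000", "$ 1,900,000"],
})

# Text columns show up as object (or a string dtype, depending on the pandas version).
print(sample.dtypes)

# is_numeric_dtype gives a yes/no answer for each column.
numeric_flags = {col: is_numeric_dtype(sample[col]) for col in sample.columns}
print(numeric_flags)
```

Running the same check on fundraising should flag the four money columns as non-numeric.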

A note on the code: read_html(url) returns a list with one DataFrame for each table on the webpage. In this case the first table is the one we are using. Finding the right table might take some experimentation from time to time.
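The experimentation can be made less manual by searching the list for a table that has the columns you expect. The following is only a sketch: find_table is a hypothetical helper (not part of pandas), and the tables list is an illustrative stand-in for what pd.read_html(url) returns.

```python
import pandas as pd

def find_table(tables, required_columns):
    """Return (index, table) for the first table that has all required columns."""
    required = set(required_columns)
    for index, table in enumerate(tables):
        # str() keeps the comparison working when column labels are not plain strings.
        if required <= {str(col) for col in table.columns}:
            return index, table
    raise ValueError(f"No table contains the columns {sorted(required)}")

# Stand-in for the list returned by pd.read_html(url); the contents are illustrative.
tables = [
    pd.DataFrame({"Note": ["unrelated table"]}),
    pd.DataFrame({"Year": ["2020/2021"], "Revenue": ["$ 1,000"], "Expenses": ["$ 900"]}),
]

index, fundraising_like = find_table(tables, ["Revenue", "Expenses"])
print(index, list(fundraising_like.columns))
```

pd.read_html also accepts a match argument that restricts the result to tables containing the given text, which can narrow the list before you search it.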

Step 3: Data Wrangling

Data Wrangling (also called Data Munging) is transforming and mapping data from one raw form into another format, with the intent of making it more appropriate and valuable for downstream purposes such as analytics.

We want to convert the values in the columns to numbers, so that we can use them for further processing.

This can be done in a few steps.

# Remove the dollar sign and space at the beginning of the string
fundraising['Expenses'] = fundraising['Expenses'].str[2:]
# Remove all commas (regex=False treats ',' as literal text)
fundraising['Expenses'] = fundraising['Expenses'].str.replace(',', '', regex=False)
# Convert to numeric
fundraising['Expenses'] = pd.to_numeric(fundraising['Expenses'])

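The same steps can be combined into one line per column. Run this instead of the step-by-step version above, not after it, because the .str accessor fails on a column that has already been converted to numbers.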

fundraising['Revenue'] = pd.to_numeric(fundraising['Revenue'].str[2:].str.replace(',', ''))
fundraising['Expenses'] = pd.to_numeric(fundraising['Expenses'].str[2:].str.replace(',', ''))
fundraising['Asset rise'] = pd.to_numeric(fundraising['Asset rise'].str[2:].str.replace(',', ''))
fundraising['Total assets'] = pd.to_numeric(fundraising['Total assets'].str[2:].str.replace(',', ''))

This converts all four columns to numeric values.
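The four near-identical lines can also be written as a loop over the column names. The sketch below swaps the str[2:] slice for a regular expression, so it does not assume that the prefix is exactly two characters long. currency_to_number is a hypothetical helper, and sample holds illustrative values rather than data from the page.

```python
import pandas as pd

def currency_to_number(column):
    """Strip dollar signs, commas, and whitespace, then convert to numbers."""
    cleaned = column.astype(str).str.replace(r"[$,\s]", "", regex=True)
    # errors="coerce" turns anything that still is not a number into NaN.
    return pd.to_numeric(cleaned, errors="coerce")

# Illustrative stand-in for the scraped table.
sample = pd.DataFrame({
    "Revenue": ["$ 1,000,000", "$ 2,500,000"],
    "Expenses": ["$ 800,000", "$ 1,900,000"],
})

for name in ["Revenue", "Expenses"]:
    sample[name] = currency_to_number(sample[name])

print(sample.dtypes)
print(sample["Revenue"].sum())
```

With the real table, the list of names would be the four columns used above: Revenue, Expenses, Asset rise, and Total assets.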

Want to learn more?

Want to learn more about Data Science to become a successful Data Scientist?

This is one lesson of a 15-part Expert Data Science Blueprint course with the following resources.

  • 15 video lessons – cover the Data Science Workflow and concepts, demonstrate everything on real data, introduce projects, and show a solution (YouTube video).
  • 30 Jupyter Notebooks – with the full code and explanation from the lectures and projects (GitHub).
  • 15 projects – structured with the Data Science Workflow and a solution explained at the end of the video lessons (GitHub).