Read HTML Tables with Pandas – Count and Sum Values in Groups

What will this tutorial cover

  • Pandas has a built-in library function to read html tables directly. The library call depends on other libraries, which needs to be installed first.
  • Often data needs to be cleaned and transformed into the right format to proceed with processing.
  • Finally, we will group the data and count occurrences and sum up a value attribute.

We will cover all that with real life example by using an example of a table from a wikipedia page.

Step 1: What you need to get started

First of, you need to install the pandas library, which can be done by using pip.

pip install pandas

Or see here.

Using the read_html call (with will read the URL argument, parse the data from all tables and return them) from the library need installation of further supporting libraries. If you run this example.

import pandas as pd


url = 'https://en.wikipedia.org/wiki/List_of_elevator_accidents'
tables = pd.read_html(url)

print(tables)

You will get the following error.

Traceback (most recent call last):
  File "/Users/admin/PycharmProjects/MyLearningSpace/Pandas_read_html.py", line 5, in <module>
    table = pd.read_html(url)
  File "/Users/admin/PycharmProjects/MyLearningSpace/venv/lib/python3.8/site-packages/pandas/io/html.py", line 1085, in read_html
    return _parse(
  File "/Users/admin/PycharmProjects/MyLearningSpace/venv/lib/python3.8/site-packages/pandas/io/html.py", line 891, in _parse
    parser = _parser_dispatch(flav)
  File "/Users/admin/PycharmProjects/MyLearningSpace/venv/lib/python3.8/site-packages/pandas/io/html.py", line 848, in _parser_dispatch
    raise ImportError("lxml not found, please install it")
ImportError: lxml not found, please install it

Which states you also need to install the lxml library to use the call. You can install that by.

pip install lxml.

The lxml library is the most feature-rich and easy-to-use library for processing XML and HTML in the Python language.

Then you should get the expected result.

The read_html call returns a list of DataFrames.

Hence you can see how many by the following code.

import pandas as pd


url = 'https://en.wikipedia.org/wiki/List_of_elevator_accidents'
tables = pd.read_html(url)

print(len(tables))

Which tells you it contains 9 (at the time of writing – changes happens to wikipedia pages).

In this tutorial we are interested in the main table on the page, which you can find in the first entry (at the time of writing).

Hence, you can access the DataFrame through this code.

import pandas as pd


url = 'https://en.wikipedia.org/wiki/List_of_elevator_accidents'
tables = pd.read_html(url)

table = tables[0]

Step 2: Clean data to prepare it for sum and count

The next thing you want to do is to use the values in the table. First thing you notice is, that you both need to clean and cast values.

Let’s say you want to organize the data into countries. This can be done by help of Lambda function. Also see this tutorial if you are new to Lambda functions.

import pandas as pd


url = 'https://en.wikipedia.org/wiki/List_of_elevator_accidents'
tables = pd.read_html(url)

table = tables[0]

table['Country'] = table.apply(lambda row: row['Location'].split(', ')[-1], axis=1)

Where the magic last line will add a new column to the data with the country in it.

               Deaths Injuries  ...                       Location       Country
0  200-400 (estimate)        1  ...  New York City, New York, U.S.          U.S.
1                 104        0  ...           Orkney, South Africa  South Africa
2                  52        0  ...           Welkom, South Africa  South Africa
3                  19        0  ...                   Wuhan, China         China
4                  19        1  ...                 Leigh, England       England

As you see, it uses the split on Location and only adds the last value, the country itself.

Next up, you see that if we want to sum up the deaths by country we need to make a decisions, as some are in intervals. For this purpose, we are conservative and take the lower bound of it.

import pandas as pd


url = 'https://en.wikipedia.org/wiki/List_of_elevator_accidents'
tables = pd.read_html(url)

table = tables[0]

table['Country'] = table.apply(lambda row: row['Location'].split(', ')[-1], axis=1)
table['Deaths (clean)'] = table.apply(lambda row: int(row['Deaths']) if row['Deaths'].isdigit() else int(row['Deaths'].split('-')[0]), axis=1)
print(table.head())

Which results in the following.

               Deaths Injuries  ...       Country Deaths (clean)
0  200-400 (estimate)        1  ...          U.S.           200
1                 104        0  ...  South Africa           104
2                  52        0  ...  South Africa            52
3                  19        0  ...         China            19
4                  19        1  ...       England            19

Also notice, that the Deaths (clean) column is converted to an int by the lambda function.

Step 3: Count and sum by country

Now we need to group it by Country first to get the sum of deaths and then count of how many accidents were there.

import pandas as pd


url = 'https://en.wikipedia.org/wiki/List_of_elevator_accidents'
tables = pd.read_html(url)

table = tables[0]

table['Country'] = table.apply(lambda row: row['Location'].split(', ')[-1], axis=1)
table['Death (clean)'] = table.apply(lambda row: int(row['Deaths']) if row['Deaths'].isdigit() else int(row['Deaths'].split('-')[0]), axis=1)

country = table.groupby('Country').count()[['Location']].sort_values(by='Location', ascending=False)
country = country.rename(columns={'Location': 'Events'})
country['Total Deaths'] = table.groupby('Country')['Death (clean)'].sum()

Which will result in the following.

                         Events  Total Deaths
Country                                      
U.S.                         13           239
China                         5            55
England                       3            41
Spain                         2             9
South Korea                   2             8
South Africa                  2           156
Brazil                        2            13
Australia                     1             4
Netherlands                   1             2
Turkey                        1            10
Pakistan                      1             6
Iran                          1             6
Mogok Township. Myanmar       1             6
India                         1             6
Hong Kong                     1            12
Germany                       1            10
Canada                        1             4
Cambodia                      1             4
Ukraine                       1            11

That is it.

Leave a Reply