Learn how you can become a Python programmer in just 12 weeks.

    We respect your privacy. Unsubscribe at anytime.

    Read HTML Tables with Pandas – Count and Sum Values in Groups

    What will this tutorial cover

    • Pandas has a built-in library function to read html tables directly. The library call depends on other libraries, which needs to be installed first.
    • Often data needs to be cleaned and transformed into the right format to proceed with processing.
    • Finally, we will group the data and count occurrences and sum up a value attribute.

    We will cover all that with real life example by using an example of a table from a wikipedia page.

    Or see it on YouTube.

    Step 1: What you need to get started

    First of, you need to install the pandas library, which can be done by using pip.

    pip install pandas
    

    Or see here.

    Using the read_html call (with will read the URL argument, parse the data from all tables and return them) from the library need installation of further supporting libraries. If you run this example.

    import pandas as pd
    
    url = 'https://en.wikipedia.org/wiki/List_of_elevator_accidents'
    tables = pd.read_html(url)
    print(tables)
    

    You will get the following error.

    Traceback (most recent call last):
      File "/Users/admin/PycharmProjects/MyLearningSpace/Pandas_read_html.py", line 5, in <module>
        table = pd.read_html(url)
      File "/Users/admin/PycharmProjects/MyLearningSpace/venv/lib/python3.8/site-packages/pandas/io/html.py", line 1085, in read_html
        return _parse(
      File "/Users/admin/PycharmProjects/MyLearningSpace/venv/lib/python3.8/site-packages/pandas/io/html.py", line 891, in _parse
        parser = _parser_dispatch(flav)
      File "/Users/admin/PycharmProjects/MyLearningSpace/venv/lib/python3.8/site-packages/pandas/io/html.py", line 848, in _parser_dispatch
        raise ImportError("lxml not found, please install it")
    ImportError: lxml not found, please install it
    

    Which states you also need to install the lxml library to use the call. You can install that by.

    pip install lxml.
    

    The lxml library is the most feature-rich and easy-to-use library for processing XML and HTML in the Python language.

    Then you should get the expected result.

    The read_html call returns a list of DataFrames.

    Hence you can see how many by the following code.

    import pandas as pd
    
    url = 'https://en.wikipedia.org/wiki/List_of_elevator_accidents'
    tables = pd.read_html(url)
    print(len(tables))
    

    Which tells you it contains 9 (at the time of writing – changes happens to wikipedia pages).

    In this tutorial we are interested in the main table on the page, which you can find in the first entry (at the time of writing).

    Hence, you can access the DataFrame through this code.

    import pandas as pd
    
    url = 'https://en.wikipedia.org/wiki/List_of_elevator_accidents'
    tables = pd.read_html(url)
    table = tables[0]
    

    Step 2: Clean data to prepare it for sum and count

    The next thing you want to do is to use the values in the table. First thing you notice is, that you both need to clean and cast values.

    Let’s say you want to organize the data into countries. This can be done by help of Lambda function. Also see this tutorial if you are new to Lambda functions.

    import pandas as pd
    
    url = 'https://en.wikipedia.org/wiki/List_of_elevator_accidents'
    tables = pd.read_html(url)
    table = tables[0]
    table['Country'] = table.apply(lambda row: row['Location'].split(', ')[-1], axis=1)
    

    Where the magic last line will add a new column to the data with the country in it.

                   Deaths Injuries  ...                       Location       Country
    0  200-400 (estimate)        1  ...  New York City, New York, U.S.          U.S.
    1                 104        0  ...           Orkney, South Africa  South Africa
    2                  52        0  ...           Welkom, South Africa  South Africa
    3                  19        0  ...                   Wuhan, China         China
    4                  19        1  ...                 Leigh, England       England
    

    As you see, it uses the split on Location and only adds the last value, the country itself.

    Next up, you see that if we want to sum up the deaths by country we need to make a decisions, as some are in intervals. For this purpose, we are conservative and take the lower bound of it.

    import pandas as pd
    
    url = 'https://en.wikipedia.org/wiki/List_of_elevator_accidents'
    tables = pd.read_html(url)
    table = tables[0]
    table['Country'] = table.apply(lambda row: row['Location'].split(', ')[-1], axis=1)
    table['Deaths (clean)'] = table.apply(lambda row: int(row['Deaths']) if row['Deaths'].isdigit() else int(row['Deaths'].split('-')[0]), axis=1)
    print(table.head())
    

    Which results in the following.

                   Deaths Injuries  ...       Country Deaths (clean)
    0  200-400 (estimate)        1  ...          U.S.           200
    1                 104        0  ...  South Africa           104
    2                  52        0  ...  South Africa            52
    3                  19        0  ...         China            19
    4                  19        1  ...       England            19
    

    Also notice, that the Deaths (clean) column is converted to an int by the lambda function.

    Step 3: Count and sum by country

    Now we need to group it by Country first to get the sum of deaths and then count of how many accidents were there.

    import pandas as pd
    
    url = 'https://en.wikipedia.org/wiki/List_of_elevator_accidents'
    tables = pd.read_html(url)
    table = tables[0]
    table['Country'] = table.apply(lambda row: row['Location'].split(', ')[-1], axis=1)
    table['Death (clean)'] = table.apply(lambda row: int(row['Deaths']) if row['Deaths'].isdigit() else int(row['Deaths'].split('-')[0]), axis=1)
    country = table.groupby('Country').count()[['Location']].sort_values(by='Location', ascending=False)
    country = country.rename(columns={'Location': 'Events'})
    country['Total Deaths'] = table.groupby('Country')['Death (clean)'].sum()
    

    Which will result in the following.

                             Events  Total Deaths
    Country                                      
    U.S.                         13           239
    China                         5            55
    England                       3            41
    Spain                         2             9
    South Korea                   2             8
    South Africa                  2           156
    Brazil                        2            13
    Australia                     1             4
    Netherlands                   1             2
    Turkey                        1            10
    Pakistan                      1             6
    Iran                          1             6
    Mogok Township. Myanmar       1             6
    India                         1             6
    Hong Kong                     1            12
    Germany                       1            10
    Canada                        1             4
    Cambodia                      1             4
    Ukraine                       1            11
    

    That is it.

    Python for Finance: Unlock Financial Freedom and Build Your Dream Life

    Discover the key to financial freedom and secure your dream life with Python for Finance!

    Say goodbye to financial anxiety and embrace a future filled with confidence and success. If you’re tired of struggling to pay bills and longing for a life of leisure, it’s time to take action.

    Imagine breaking free from that dead-end job and opening doors to endless opportunities. With Python for Finance, you can acquire the invaluable skill of financial analysis that will revolutionize your life.

    Make informed investment decisions, unlock the secrets of business financial performance, and maximize your money like never before. Gain the knowledge sought after by companies worldwide and become an indispensable asset in today’s competitive market.

    Don’t let your dreams slip away. Master Python for Finance and pave your way to a profitable and fulfilling career. Start building the future you deserve today!

    Python for Finance a 21 hours course that teaches investing with Python.

    Learn pandas, NumPy, Matplotlib for Financial Analysis & learn how to Automate Value Investing.

    “Excellent course for anyone trying to learn coding and investing.” – Lorenzo B.

    Leave a Comment