Learn how you can become a Python programmer in just 12 weeks.

    We respect your privacy. Unsubscribe at anytime.

    How To Extract Numbers From Strings in HTML Table and Export to Excel from Python

    What will we cover in this tutorial?

    How to import a HTML table to Excel.

    But that is easy? You can do that directly from Excel.

    Yes, but what if entries contains numbers and string together, then the import will convert it to a string and makes it difficult to get the number extracted from the string.

    Luckily, we will cover how to do that easy with Python.

    Step 1: Get the dataset

    Find your favorite HTML table online. For the purpose of this tutorial I will use this one from Wikipedia with List of Metro Systems.

    View of HTML table of interest

    Say, what if we wanted to sum how many stations are in this table (please notice that the table contains more rows than shown in the above picture).

    If you import that directly into Excel, with the import functionality you will realize that the column of stations will be interpreted as strings. The problem is, that it will look like 19[13], while we are only interested in the number 19.

    There is no build in functionality to do that directly in Excel.

    But let’s try to import this into Python. We will use Pandas to do that. If you are new to Pandas, please see this tutorial.

    import pandas as pd
    
    url = "https://en.wikipedia.org/wiki/List_of_metro_systems"
    tables = pd.read_html(url)
    print(tables[0].head())
    

    Which will result in the following output.

    /Users/admin/PycharmProjects/LearningSpace/venv/bin/python /Users/admin/PycharmProjects/LearningSpace/test.py
               City    Country  ...          System length Annual ridership(millions)
    0       Algiers    Algeria  ...  18.5 km (11.5 mi)[14]           45.3 (2019)[R 1]
    1  Buenos Aires  Argentina  ...  56.7 km (35.2 mi)[16]          337.7 (2018)[R 2]
    2       Yerevan    Armenia  ...   13.4 km (8.3 mi)[17]           20.2 (2019)[R 3]
    3        Sydney  Australia  ...  36 km (22 mi)[19][20]  14.2 (2019) [R 4][R Nb 1]
    4        Vienna    Austria  ...  83.3 km (51.8 mi)[21]          459.8 (2019)[R 6]
    

    Where we have the same problem. If we inspect the type of the columns we get the following.

    City                          object
    Country                       object
    Name                          object
    Yearopened                    object
    Year of lastexpansion         object
    Stations                      object
    System length                 object
    Annual ridership(millions)    object
    dtype: object
    

    Where actually all columns are of type object, which here is equivalent to a string.

    Step 2: Extract the numbers from Stations and System length column

    The DataStructure of the tables in tables is a DataFrame, which is Pandas main data structure.

    As the strings we want to convert from string to integers are containing more information than just the numbers, we cannot use the DataFrame method to_numeric().

    We want to convert something of the form 19[13] to 19.

    To do that easily, we will use the apply(…) method on the DataFrame.

    The apply-method takes a function as argument and applies it on each row.

    We will use a lambda function as argument. If you are not familiar with lambda functions, please read this tutorial.

    import pandas as pd
    
    url = "https://en.wikipedia.org/wiki/List_of_metro_systems"
    tables = pd.read_html(url)
    table = tables[0]
    table['Stations'] = table.apply(lambda row: int(row['Stations'].split('[')[0]), axis=1)
    table['System length'] = table.apply(lambda row: float(row['System length'].split()[0]), axis=1)
    print(table[['Stations', 'System length']].head())
    

    Which will result in the following output.

       Stations  System length
    0        19           18.5
    1        90           56.7
    2        10           13.4
    3        13           36.0
    4        98           83.3
    

    This is what we want.

    Step 3: Export to Excel

    Wow. This needs an entire step?

    Well, of course it does.

    Here we need to unleash the power of Pandas and use the to_excel(…) method.

    import pandas as pd
    
    url = "https://en.wikipedia.org/wiki/List_of_metro_systems"
    tables = pd.read_html(url)
    table = tables[0]
    table['Stations'] = table.apply(lambda row: int(row['Stations'].split('[')[0]), axis=1)
    table['System length'] = table.apply(lambda row: float(row['System length'].split()[0]), axis=1)
    table.to_excel('output.xlsx')
    

    This will result in an Excel file looking similar to this, where the Stations and System length columns are numeric and not string.

    Excel file now with Stations and System length as numbers and not strings

    Python for Finance: Unlock Financial Freedom and Build Your Dream Life

    Discover the key to financial freedom and secure your dream life with Python for Finance!

    Say goodbye to financial anxiety and embrace a future filled with confidence and success. If you’re tired of struggling to pay bills and longing for a life of leisure, it’s time to take action.

    Imagine breaking free from that dead-end job and opening doors to endless opportunities. With Python for Finance, you can acquire the invaluable skill of financial analysis that will revolutionize your life.

    Make informed investment decisions, unlock the secrets of business financial performance, and maximize your money like never before. Gain the knowledge sought after by companies worldwide and become an indispensable asset in today’s competitive market.

    Don’t let your dreams slip away. Master Python for Finance and pave your way to a profitable and fulfilling career. Start building the future you deserve today!

    Python for Finance a 21 hours course that teaches investing with Python.

    Learn pandas, NumPy, Matplotlib for Financial Analysis & learn how to Automate Value Investing.

    “Excellent course for anyone trying to learn coding and investing.” – Lorenzo B.

    Leave a Comment