Pandas: Data Preparation with Vectorized Strings vs Lambda Functions

What will we cover in this tutorial?

Understand the challenge

Most of the time when you read data into a pandas DataFrame it need to be prepared.

To be more concrete, let’s look at an example. Let’s consider we want to look at the List of largest companies by revenue on Wikipedia.

From wikipedia.org

You can read an inspect the data by the following code by using a DataFrame from the pandas library.

import pandas as pd

pd.set_option('display.max_rows', 300)
pd.set_option('display.max_columns', 10)
pd.set_option('display.width', 1000)


url = 'https://en.wikipedia.org/wiki/List_of_largest_companies_by_revenue'
tables = pd.read_html(url)

table = tables[0]

print(table)
print(table.dtypes)

Notice that we use pd.set_option calls to get the full view. If you are new to read_html from the pandas library, we recommend you read this tutorial.

The top of the output will be as follows.

    Rank                        Name                     Industry Revenue(USD millions) Profit(USD millions)  Employees                       Country   Ref
0      1                     Walmart                       Retail              $514,405               $6,670    2200000                 United States   [5]
1      2               Sinopec Group                  Oil and gas              $414,650               $5,845     619151                         China   [6]
2      3           Royal Dutch Shell                  Oil and gas              $396,556              $23,352      81000  Netherlands / United Kingdom   [7]
3      4    China National Petroleum                  Oil and gas              $392,976               $2,270    1382401                         China   [8]
4      5                  State Grid                  Electricity              $387,056               $8,174     917717                         China   [9]
5      6                Saudi Aramco                  Oil and gas              $355,905             $110,974      76418                  Saudi Arabia  [10]
6      7                          BP                  Oil and gas              $303,738               $9,383      73000                United Kingdom  [11]
7      8                  ExxonMobil                  Oil and gas              $290,212              $20,840      71000                 United States  [12]
8      9                  Volkswagen                   Automotive              $278,341              $14,332     664496                       Germany  [13]

And the last lines.

Rank                      int64
Name                     object
Industry                 object
Revenue(USD millions)    object
Profit(USD millions)     object
Employees                 int64
Country                  object
Ref                      object
dtype: object

Where we see interesting information about what data types each column has. Not surprisingly, the Revenue and Profit columns are of type object (which are strings in this case).

Hence, if we want to sum up values, we need to transform them to floats. This is a bit tricky, as the output shows above. An example is $6,670, where there are two issues to transform them to floats. First, there is a dollars ($) sign in the beginning. Second, there is comma (,) in the number, which a simple cast to float does not handle.

Now let’s deal with them in each their method.

Method 1: Using pandas DataFrame/Series vectorized string functions

Vectorization with pandas data structures is the process of executing operations on entire data structure. This is handy, as the alternative would be to make a loop-function.

Also, the pandas has many string functions available for vectorization as you can see in the documentation.

First of, we can access the string object by using the .str, then we can apply the string function. In our case, we will use the substring with square brackets to remove the dollar sign.

index_r = 'Revenue(USD millions)'
table[index_r] = table[index_r].str[1:]

Which will give the following output.

    Rank                        Name                     Industry Revenue(USD millions) Profit(USD millions)  Employees                       Country   Ref
0      1                     Walmart                       Retail               514,405               $6,670    2200000                 United States   [5]
1      2               Sinopec Group                  Oil and gas               414,650               $5,845     619151                         China   [6]
2      3           Royal Dutch Shell                  Oil and gas               396,556              $23,352      81000  Netherlands / United Kingdom   [7]
3      4    China National Petroleum                  Oil and gas               392,976               $2,270    1382401                         China   [8]
4      5                  State Grid                  Electricity               387,056               $8,174     917717                         China   [9]
5      6                Saudi Aramco                  Oil and gas               355,905             $110,974      76418                  Saudi Arabia  [10]
6      7                          BP                  Oil and gas               303,738               $9,383      73000                United Kingdom  [11]
7      8                  ExxonMobil                  Oil and gas               290,212              $20,840      71000                 United States  [12]
8      9                  Volkswagen                   Automotive               278,341              $14,332     664496                       Germany  [13]

Then we need to remove the comma (,). This can be done by using replace.

index_r = 'Revenue(USD millions)'
table[index_r] = table[index_r].str[1:] .str.replace(',', '')

Which will result in the following output.

    Rank                        Name                     Industry Revenue(USD millions) Profit(USD millions)  Employees                       Country   Ref
0      1                     Walmart                       Retail                514405               $6,670    2200000                 United States   [5]
1      2               Sinopec Group                  Oil and gas                414650               $5,845     619151                         China   [6]
2      3           Royal Dutch Shell                  Oil and gas                396556              $23,352      81000  Netherlands / United Kingdom   [7]
3      4    China National Petroleum                  Oil and gas                392976               $2,270    1382401                         China   [8]
4      5                  State Grid                  Electricity                387056               $8,174     917717                         China   [9]
5      6                Saudi Aramco                  Oil and gas                355905             $110,974      76418                  Saudi Arabia  [10]
6      7                          BP                  Oil and gas                303738               $9,383      73000                United Kingdom  [11]
7      8                  ExxonMobil                  Oil and gas                290212              $20,840      71000                 United States  [12]
8      9                  Volkswagen                   Automotive                278341              $14,332     664496                       Germany  [13]

Finally, we need to convert the string to a float.

index_r = 'Revenue(USD millions)'
table[index_r] = table[index_r].str[1:] .str.replace(',', '').astype(float)

Which does not change the printed output, but the type of the column.

Nice and easy, to prepare the data in one line. Notice, that you could chose to make it in multiple lines. It is a matter of taste.

Method 2: Using pandas DataFrame lambda function

Another way to prepare data is by using a lambda function. If you are new to lambda functions, we recommend you read this tutorial.

Here you can do it row by row and apply your defined lambda function.

The next column has the same challenge as the first one. So let’s apply it on that.

In this case, we cannot use the substring with square brackets like in the case above, as some figures are negative and contain that minus sign before the dollar sign. But using the replace call will do fine.

index_p = 'Profit(USD millions)'
table[index_p] = table.apply(lambda row: row[index_p].replace('$', ''), axis=1)

Which would result in the following output.

    Rank                        Name                     Industry  Revenue(USD millions) Profit(USD millions)  Employees                       Country   Ref
0      1                     Walmart                       Retail               514405.0                6,670    2200000                 United States   [5]
1      2               Sinopec Group                  Oil and gas               414650.0                5,845     619151                         China   [6]
2      3           Royal Dutch Shell                  Oil and gas               396556.0               23,352      81000  Netherlands / United Kingdom   [7]
3      4    China National Petroleum                  Oil and gas               392976.0                2,270    1382401                         China   [8]
4      5                  State Grid                  Electricity               387056.0                8,174     917717                         China   [9]
5      6                Saudi Aramco                  Oil and gas               355905.0              110,974      76418                  Saudi Arabia  [10]
6      7                          BP                  Oil and gas               303738.0                9,383      73000                United Kingdom  [11]
7      8                  ExxonMobil                  Oil and gas               290212.0               20,840      71000                 United States  [12]
8      9                  Volkswagen                   Automotive               278341.0               14,332     664496                       Germany  [13]

Then we do the same to remove the comma (,).

index_p = 'Profit(USD millions)'
table[index_p] = table.apply(lambda row: row[index_p].replace('$', '').replace(',', ''), axis=1)

Which result in the following output.

    Rank                        Name                     Industry  Revenue(USD millions) Profit(USD millions)  Employees                       Country   Ref
0      1                     Walmart                       Retail               514405.0                 6670    2200000                 United States   [5]
1      2               Sinopec Group                  Oil and gas               414650.0                 5845     619151                         China   [6]
2      3           Royal Dutch Shell                  Oil and gas               396556.0                23352      81000  Netherlands / United Kingdom   [7]
3      4    China National Petroleum                  Oil and gas               392976.0                 2270    1382401                         China   [8]
4      5                  State Grid                  Electricity               387056.0                 8174     917717                         China   [9]
5      6                Saudi Aramco                  Oil and gas               355905.0               110974      76418                  Saudi Arabia  [10]
6      7                          BP                  Oil and gas               303738.0                 9383      73000                United Kingdom  [11]
7      8                  ExxonMobil                  Oil and gas               290212.0                20840      71000                 United States  [12]
8      9                  Volkswagen                   Automotive               278341.0                14332     664496                       Germany  [13]

Finally, we will do the same for casting it to a float.

index_p = 'Profit(USD millions)'
table[index_p] = table.apply(lambda row: float(row[index_p].replace('$', '').replace(',', '')), axis=1)

Which will produce the same output.

Comparing the two methods

To be honest, it is a matter of taste in this case. When things can be achieved by simple string manipulation calls that are available through the vectorized calls, there is nothing to gain by lambda functions.

The strength of lambda functions is the flexibility. You can actually do anything function in there, which is a big strength. The vectorized functions are limited to simple operations, which covers a lot of use cases.

Putting it all together

Well, now we came so far, let’s put it all together and get some nice data. Sum it up and print it sorted out and make a horizontal bar plot.

import pandas as pd
import matplotlib.pyplot as plt

pd.set_option('display.max_rows', 300)
pd.set_option('display.max_columns', 10)
pd.set_option('display.width', 1000)


url = 'https://en.wikipedia.org/wiki/List_of_largest_companies_by_revenue'
tables = pd.read_html(url)

table = tables[0]

index_r = 'Revenue(USD millions)'
table[index_r] = table[index_r].str[1:] .str.replace(',', '').astype(float)

index_p = 'Profit(USD millions)'
table[index_p] = table.apply(lambda row: float(row[index_p].replace('$', '').replace(',', '')), axis=1)

table = table.drop(['Rank'], axis=1)


print(table.groupby('Country').sum().sort_values([index_r, index_p], ascending=False))

table.groupby('Industry').sum().sort_values([index_r, index_p], ascending=False).plot.barh()
plt.show()

The output graph.

Resulting output graph.

And the output from the program.

                              Revenue(USD millions)  Profit(USD millions)  Employees
Country                                                                             
United States                             4169049.0              243970.0    6585076
China                                     2263521.0              182539.0    5316321
Germany                                    602635.0               31693.0    1105639
Japan                                      561157.0               27814.0     670636
Netherlands / United Kingdom               396556.0               23352.0      81000
Saudi Arabia                               355905.0              110974.0      76418
France                                     309684.0               13971.0     208525
United Kingdom                             303738.0                9383.0      73000
Russia                                     250447.0               33062.0     568600
South Korea                                221579.0               39895.0     221579
Switzerland                                219754.0                3408.0      85504
Singapore                                  180744.0                 849.0       4316
Taiwan                                     175617.0                4281.0     667680
Netherlands                                175009.0                1589.0     314790

Pandas: How to Sum Groups from HTML Tables

What will we cover in this tutorial?

  • How to collect data from a HTML table into a Pandas DataFrame.
  • The cleaning process and how to convert the data into the correct type.
  • Also, dealing with some data points that are not in correct representation.
  • Finally, how to sum up by countries.

Step 1: Collect the data from the table

Pandas is an amazing library with a lot of useful data analysis functionality right out of the box. First step in any data analysis is to collect the data. In this tutorial we will collect the data from wikipedia’s page on List of metro systems.

If you are new to the pandas library we recommend you read the this tutorial.

The objective will be to find the sums of Stations, Systems length, and Annual ridership per each country.

From wikipedia.org

At first glance this looks simple, but looking further down we see that some countries have various rows.

From wikipedia.org

Also, some rows do not have all the values needed.

First challenge first. Read the data from the table into a DataFrame, which is the main data structure of the pandas library. The read_html call from a pandas will return a list of DataFrames.

If you use read_html for the first time, we recommend you read this tutorial.

import pandas as pd

url = 'https://en.wikipedia.org/wiki/List_of_metro_systems'
tables = pd.read_html(url)
table = tables[0]
print(table)

Which results in the following output (or the top of it).

                 City               Country                                Name        Yearopened Year of lastexpansion             Stations                       System length             Annual ridership(millions)
0             Algiers               Algeria                       Algiers Metro          2011[13]              2018[14]               19[14]               18.5 km (11.5 mi)[15]                       45.3 (2019)[R 1]
1        Buenos Aires             Argentina            Buenos Aires Underground        1926[Nb 1]              2019[16]               90[17]               56.7 km (35.2 mi)[17]                      337.7 (2018)[R 2]
2             Yerevan               Armenia                       Yerevan Metro          1981[18]              1996[19]               10[18]                13.4 km (8.3 mi)[18]                       18.7 (2018)[R 3]
3              Sydney             Australia                        Sydney Metro          2019[20]                     –               13[20]               36 km (22 mi)[20][21]              14.2 (2019) [R 4][R Nb 1]
4              Vienna               Austria                       Vienna U-Bahn    1976[22][Nb 2]              2017[23]               98[24]               83.3 km (51.8 mi)[22]                      463.1 (2018)[R 6]
5                Baku            Azerbaijan                          Baku Metro          1967[25]              2016[25]               25[25]               36.6 km (22.7 mi)[25]                      231.0 (2018)[R 3]

We have now have the data in a DataFrame.

Step 2: Clean and convert the data

At first glance, we see that we do not need the rows City, Name, Yearopened, Year of last expansion. To make it easier to work with the data, let’s remove them and inspect the data again.

import pandas as pd

url = 'https://en.wikipedia.org/wiki/List_of_metro_systems'
tables = pd.read_html(url)
table = tables[0]
table = table.drop(['City', 'Name', 'Yearopened', 'Year of lastexpansion'], axis=1)
print(table)

Which result in the following output.

                  Country             Stations                       System length             Annual ridership(millions)
0                 Algeria               19[14]               18.5 km (11.5 mi)[15]                       45.3 (2019)[R 1]
1               Argentina               90[17]               56.7 km (35.2 mi)[17]                      337.7 (2018)[R 2]
2                 Armenia               10[18]                13.4 km (8.3 mi)[18]                       18.7 (2018)[R 3]
3               Australia               13[20]               36 km (22 mi)[20][21]              14.2 (2019) [R 4][R Nb 1]
4                 Austria               98[24]               83.3 km (51.8 mi)[22]                      463.1 (2018)[R 6]
5              Azerbaijan               25[25]               36.6 km (22.7 mi)[25]                      231.0 (2018)[R 3]
6                 Belarus               29[27]               37.3 km (23.2 mi)[27]                      283.4 (2018)[R 3]
7                 Belgium         59[28][Nb 5]               39.9 km (24.8 mi)[29]                      165.3 (2019)[R 7]

This makes it easier to see the next steps.

Let’s take them one by one. Stations need to remove the data after ‘[‘-symbol and convert the number to an integer. This can be done by using a lambda function to a row.

table['Stations'] = table.apply(lambda row: int(row['Stations'].split('[')[0]), axis=1)

If you are new to lambda functions we recommend you read this tutorial.

The next thing we need to do is to convert the System length to floats. The length will be in km (I live in Denmark, where we use km and not mi). This can also be done by using a lambda function

table['System length'] = table.apply(lambda row: float(row['System length'].split()[0]), axis=1)

Finally, and a bit more tricky, we need to convert the column of Annual ridership. The challenge is that lines have n/a which are converted to np.nan, but there are also some lines where the input is not easy to convert, as the images show.

From wikipedia.org
From wikipedia.org

These lines are can be dealt with by using a helper function.

def to_float(obj):
    try:
        return float(obj)
    except:
        return np.nan

index = 'Annual ridership(millions)'
table[index] = table.apply(lambda row: to_float(row[index].split()[0]) if row[index] is not np.nan else np.nan, axis=1)

Adding this all together we get the following code.

import pandas as pd
import numpy as np

def to_float(obj):
    try:
        return float(obj)
    except:
        return np.nan

url = 'https://en.wikipedia.org/wiki/List_of_metro_systems'
tables = pd.read_html(url)
table = tables[0]
table = table.drop(['City', 'Name', 'Yearopened', 'Year of lastexpansion'], axis=1)

table['Stations'] = table.apply(lambda row: int(row['Stations'].split('[')[0]), axis=1)
table['System length'] = table.apply(lambda row: float(row['System length'].split()[0]), axis=1)
index = 'Annual ridership(millions)'
table[index] = table.apply(lambda row: to_float(row[index].split()[0]) if row[index] is not np.nan else np.nan, axis=1)

print(table)

Which results in the following output (or the first few lines).

                  Country  Stations  System length  Annual ridership(millions)
0                 Algeria        19          18.50                       45.30
1               Argentina        90          56.70                      337.70
2                 Armenia        10          13.40                       18.70
3               Australia        13          36.00                       14.20
4                 Austria        98          83.30                      463.10
5              Azerbaijan        25          36.60                      231.00
6                 Belarus        29          37.30                      283.40
7                 Belgium        59          39.90                      165.30
8                  Brazil        19          28.10                       58.40
9                  Brazil        25          42.40                       42.80
10                 Brazil        22          43.80                       51.70

Step 3: Sum rows by country

Say, now we want to get the country with the most metro stations. This can be achieved by using the groupby and sum function from the pandas DataFrame data structure.

import pandas as pd
import numpy as np

def to_float(obj):
    try:
        return float(obj)
    except:
        return np.nan

url = 'https://en.wikipedia.org/wiki/List_of_metro_systems'
tables = pd.read_html(url)
table = tables[0]
table = table.drop(['City', 'Name', 'Yearopened', 'Year of lastexpansion'], axis=1)

table['Stations'] = table.apply(lambda row: int(row['Stations'].split('[')[0]), axis=1)
table['System length'] = table.apply(lambda row: float(row['System length'].split()[0]), axis=1)
index = 'Annual ridership(millions)'
table[index] = table.apply(lambda row: to_float(row[index].split()[0]) if row[index] is not np.nan else np.nan, axis=1)

# Sum up
table_sum = table.groupby(['Country']).sum()

print(table_sum.sort_values(['Stations'], ascending=False))

Where the result will be China.

                      Stations  System length  Annual ridership(millions)
Country                                                                  
China                     3738        6312.16                    25519.23
United States             1005        1325.90                     2771.50
South Korea                714         839.90                     4054.90
Japan[Nb 34]               669         791.20                     6489.60
India                      499         675.97                     1377.00
France                     483         350.90                     2113.50
Spain                      438         474.40                     1197.90

If we want to sort by km of System length, you will only need to change the last line to the following.

print(table_sum.sort_values(['System length'], ascending=False))

Resulting in the following.

                      Stations  System length  Annual ridership(millions)
Country                                                                  
China                     3738        6312.16                    25519.23
United States             1005        1325.90                     2771.50
South Korea                714         839.90                     4054.90
Japan[Nb 34]               669         791.20                     6489.60
India                      499         675.97                     1377.00
Russia                     368         611.50                     3507.60
United Kingdom             390         523.90                     1555.30

Finally, if you want it by Annual ridership, you will need to change the last line to.

print(table_sum.sort_values([index], ascending=False))

Remember, we assigned that to index. You should get the following output.

                      Stations  System length  Annual ridership(millions)
Country                                                                  
China                     3738        6312.16                    25519.23
Japan[Nb 34]               669         791.20                     6489.60
South Korea                714         839.90                     4054.90
Russia                     368         611.50                     3507.60
United States             1005        1325.90                     2771.50
France                     483         350.90                     2113.50
Brazil                     243         345.40                     2106.20