What will we cover in this tutorial?
- Compare the art of data preparation using pandas DataFrames.
- By using the vectorized string functions provided by pandas DataFrames.
- …and by applying lambda functions
Understand the challenge
Most of the time when you read data into a pandas DataFrame it need to be prepared.
To be more concrete, let’s look at an example. Let’s consider we want to look at the List of largest companies by revenue on Wikipedia.

You can read an inspect the data by the following code by using a DataFrame from the pandas library.
import pandas as pd
pd.set_option('display.max_rows', 300)
pd.set_option('display.max_columns', 10)
pd.set_option('display.width', 1000)
url = 'https://en.wikipedia.org/wiki/List_of_largest_companies_by_revenue'
tables = pd.read_html(url)
table = tables[0]
print(table)
print(table.dtypes)
Notice that we use pd.set_option calls to get the full view. If you are new to read_html from the pandas library, we recommend you read this tutorial.
The top of the output will be as follows.
Rank Name Industry Revenue(USD millions) Profit(USD millions) Employees Country Ref
0 1 Walmart Retail $514,405 $6,670 2200000 United States [5]
1 2 Sinopec Group Oil and gas $414,650 $5,845 619151 China [6]
2 3 Royal Dutch Shell Oil and gas $396,556 $23,352 81000 Netherlands / United Kingdom [7]
3 4 China National Petroleum Oil and gas $392,976 $2,270 1382401 China [8]
4 5 State Grid Electricity $387,056 $8,174 917717 China [9]
5 6 Saudi Aramco Oil and gas $355,905 $110,974 76418 Saudi Arabia [10]
6 7 BP Oil and gas $303,738 $9,383 73000 United Kingdom [11]
7 8 ExxonMobil Oil and gas $290,212 $20,840 71000 United States [12]
8 9 Volkswagen Automotive $278,341 $14,332 664496 Germany [13]
And the last lines.
Rank int64
Name object
Industry object
Revenue(USD millions) object
Profit(USD millions) object
Employees int64
Country object
Ref object
dtype: object
Where we see interesting information about what data types each column has. Not surprisingly, the Revenue and Profit columns are of type object (which are strings in this case).
Hence, if we want to sum up values, we need to transform them to floats. This is a bit tricky, as the output shows above. An example is $6,670, where there are two issues to transform them to floats. First, there is a dollars ($) sign in the beginning. Second, there is comma (,) in the number, which a simple cast to float does not handle.
Now let’s deal with them in each their method.
Method 1: Using pandas DataFrame/Series vectorized string functions
Vectorization with pandas data structures is the process of executing operations on entire data structure. This is handy, as the alternative would be to make a loop-function.
Also, the pandas has many string functions available for vectorization as you can see in the documentation.
First of, we can access the string object by using the .str, then we can apply the string function. In our case, we will use the substring with square brackets to remove the dollar sign.
index_r = 'Revenue(USD millions)'
table[index_r] = table[index_r].str[1:]
Which will give the following output.
Rank Name Industry Revenue(USD millions) Profit(USD millions) Employees Country Ref
0 1 Walmart Retail 514,405 $6,670 2200000 United States [5]
1 2 Sinopec Group Oil and gas 414,650 $5,845 619151 China [6]
2 3 Royal Dutch Shell Oil and gas 396,556 $23,352 81000 Netherlands / United Kingdom [7]
3 4 China National Petroleum Oil and gas 392,976 $2,270 1382401 China [8]
4 5 State Grid Electricity 387,056 $8,174 917717 China [9]
5 6 Saudi Aramco Oil and gas 355,905 $110,974 76418 Saudi Arabia [10]
6 7 BP Oil and gas 303,738 $9,383 73000 United Kingdom [11]
7 8 ExxonMobil Oil and gas 290,212 $20,840 71000 United States [12]
8 9 Volkswagen Automotive 278,341 $14,332 664496 Germany [13]
Then we need to remove the comma (,). This can be done by using replace.
index_r = 'Revenue(USD millions)'
table[index_r] = table[index_r].str[1:] .str.replace(',', '')
Which will result in the following output.
Rank Name Industry Revenue(USD millions) Profit(USD millions) Employees Country Ref
0 1 Walmart Retail 514405 $6,670 2200000 United States [5]
1 2 Sinopec Group Oil and gas 414650 $5,845 619151 China [6]
2 3 Royal Dutch Shell Oil and gas 396556 $23,352 81000 Netherlands / United Kingdom [7]
3 4 China National Petroleum Oil and gas 392976 $2,270 1382401 China [8]
4 5 State Grid Electricity 387056 $8,174 917717 China [9]
5 6 Saudi Aramco Oil and gas 355905 $110,974 76418 Saudi Arabia [10]
6 7 BP Oil and gas 303738 $9,383 73000 United Kingdom [11]
7 8 ExxonMobil Oil and gas 290212 $20,840 71000 United States [12]
8 9 Volkswagen Automotive 278341 $14,332 664496 Germany [13]
Finally, we need to convert the string to a float.
index_r = 'Revenue(USD millions)'
table[index_r] = table[index_r].str[1:] .str.replace(',', '').astype(float)
Which does not change the printed output, but the type of the column.
Nice and easy, to prepare the data in one line. Notice, that you could chose to make it in multiple lines. It is a matter of taste.
Method 2: Using pandas DataFrame lambda function
Another way to prepare data is by using a lambda function. If you are new to lambda functions, we recommend you read this tutorial.
Here you can do it row by row and apply your defined lambda function.
The next column has the same challenge as the first one. So let’s apply it on that.
In this case, we cannot use the substring with square brackets like in the case above, as some figures are negative and contain that minus sign before the dollar sign. But using the replace call will do fine.
index_p = 'Profit(USD millions)'
table[index_p] = table.apply(lambda row: row[index_p].replace('$', ''), axis=1)
Which would result in the following output.
Rank Name Industry Revenue(USD millions) Profit(USD millions) Employees Country Ref
0 1 Walmart Retail 514405.0 6,670 2200000 United States [5]
1 2 Sinopec Group Oil and gas 414650.0 5,845 619151 China [6]
2 3 Royal Dutch Shell Oil and gas 396556.0 23,352 81000 Netherlands / United Kingdom [7]
3 4 China National Petroleum Oil and gas 392976.0 2,270 1382401 China [8]
4 5 State Grid Electricity 387056.0 8,174 917717 China [9]
5 6 Saudi Aramco Oil and gas 355905.0 110,974 76418 Saudi Arabia [10]
6 7 BP Oil and gas 303738.0 9,383 73000 United Kingdom [11]
7 8 ExxonMobil Oil and gas 290212.0 20,840 71000 United States [12]
8 9 Volkswagen Automotive 278341.0 14,332 664496 Germany [13]
Then we do the same to remove the comma (,).
index_p = 'Profit(USD millions)'
table[index_p] = table.apply(lambda row: row[index_p].replace('$', '').replace(',', ''), axis=1)
Which result in the following output.
Rank Name Industry Revenue(USD millions) Profit(USD millions) Employees Country Ref
0 1 Walmart Retail 514405.0 6670 2200000 United States [5]
1 2 Sinopec Group Oil and gas 414650.0 5845 619151 China [6]
2 3 Royal Dutch Shell Oil and gas 396556.0 23352 81000 Netherlands / United Kingdom [7]
3 4 China National Petroleum Oil and gas 392976.0 2270 1382401 China [8]
4 5 State Grid Electricity 387056.0 8174 917717 China [9]
5 6 Saudi Aramco Oil and gas 355905.0 110974 76418 Saudi Arabia [10]
6 7 BP Oil and gas 303738.0 9383 73000 United Kingdom [11]
7 8 ExxonMobil Oil and gas 290212.0 20840 71000 United States [12]
8 9 Volkswagen Automotive 278341.0 14332 664496 Germany [13]
Finally, we will do the same for casting it to a float.
index_p = 'Profit(USD millions)'
table[index_p] = table.apply(lambda row: float(row[index_p].replace('$', '').replace(',', '')), axis=1)
Which will produce the same output.
Comparing the two methods
To be honest, it is a matter of taste in this case. When things can be achieved by simple string manipulation calls that are available through the vectorized calls, there is nothing to gain by lambda functions.
The strength of lambda functions is the flexibility. You can actually do anything function in there, which is a big strength. The vectorized functions are limited to simple operations, which covers a lot of use cases.
Putting it all together
Well, now we came so far, let’s put it all together and get some nice data. Sum it up and print it sorted out and make a horizontal bar plot.
import pandas as pd
import matplotlib.pyplot as plt
pd.set_option('display.max_rows', 300)
pd.set_option('display.max_columns', 10)
pd.set_option('display.width', 1000)
url = 'https://en.wikipedia.org/wiki/List_of_largest_companies_by_revenue'
tables = pd.read_html(url)
table = tables[0]
index_r = 'Revenue(USD millions)'
table[index_r] = table[index_r].str[1:] .str.replace(',', '').astype(float)
index_p = 'Profit(USD millions)'
table[index_p] = table.apply(lambda row: float(row[index_p].replace('$', '').replace(',', '')), axis=1)
table = table.drop(['Rank'], axis=1)
print(table.groupby('Country').sum().sort_values([index_r, index_p], ascending=False))
table.groupby('Industry').sum().sort_values([index_r, index_p], ascending=False).plot.barh()
plt.show()
The output graph.

And the output from the program.
Revenue(USD millions) Profit(USD millions) Employees
Country
United States 4169049.0 243970.0 6585076
China 2263521.0 182539.0 5316321
Germany 602635.0 31693.0 1105639
Japan 561157.0 27814.0 670636
Netherlands / United Kingdom 396556.0 23352.0 81000
Saudi Arabia 355905.0 110974.0 76418
France 309684.0 13971.0 208525
United Kingdom 303738.0 9383.0 73000
Russia 250447.0 33062.0 568600
South Korea 221579.0 39895.0 221579
Switzerland 219754.0 3408.0 85504
Singapore 180744.0 849.0 4316
Taiwan 175617.0 4281.0 667680
Netherlands 175009.0 1589.0 314790
Python for Finance: Unlock Financial Freedom and Build Your Dream Life
Discover the key to financial freedom and secure your dream life with Python for Finance!
Say goodbye to financial anxiety and embrace a future filled with confidence and success. If you’re tired of struggling to pay bills and longing for a life of leisure, it’s time to take action.
Imagine breaking free from that dead-end job and opening doors to endless opportunities. With Python for Finance, you can acquire the invaluable skill of financial analysis that will revolutionize your life.
Make informed investment decisions, unlock the secrets of business financial performance, and maximize your money like never before. Gain the knowledge sought after by companies worldwide and become an indispensable asset in today’s competitive market.
Don’t let your dreams slip away. Master Python for Finance and pave your way to a profitable and fulfilling career. Start building the future you deserve today!
Python for Finance a 21 hours course that teaches investing with Python.
Learn pandas, NumPy, Matplotlib for Financial Analysis & learn how to Automate Value Investing.
“Excellent course for anyone trying to learn coding and investing.” – Lorenzo B.
