What will we cover in this tutorial?
- How to collect data from a HTML table into a Pandas DataFrame.
- The cleaning process and how to convert the data into the correct type.
- Also, dealing with some data points that are not in correct representation.
- Finally, how to sum up by countries.
Step 1: Collect the data from the table
Pandas is an amazing library with a lot of useful data analysis functionality right out of the box. First step in any data analysis is to collect the data. In this tutorial we will collect the data from wikipedia’s page on List of metro systems.
If you are new to the pandas library we recommend you read the this tutorial.
The objective will be to find the sums of Stations, Systems length, and Annual ridership per each country.

At first glance this looks simple, but looking further down we see that some countries have various rows.

Also, some rows do not have all the values needed.
First challenge first. Read the data from the table into a DataFrame, which is the main data structure of the pandas library. The read_html call from a pandas will return a list of DataFrames.
If you use read_html for the first time, we recommend you read this tutorial.
import pandas as pd
url = 'https://en.wikipedia.org/wiki/List_of_metro_systems'
tables = pd.read_html(url)
table = tables[0]
print(table)
Which results in the following output (or the top of it).
City Country Name Yearopened Year of lastexpansion Stations System length Annual ridership(millions)
0 Algiers Algeria Algiers Metro 2011[13] 2018[14] 19[14] 18.5 km (11.5 mi)[15] 45.3 (2019)[R 1]
1 Buenos Aires Argentina Buenos Aires Underground 1926[Nb 1] 2019[16] 90[17] 56.7 km (35.2 mi)[17] 337.7 (2018)[R 2]
2 Yerevan Armenia Yerevan Metro 1981[18] 1996[19] 10[18] 13.4 km (8.3 mi)[18] 18.7 (2018)[R 3]
3 Sydney Australia Sydney Metro 2019[20] – 13[20] 36 km (22 mi)[20][21] 14.2 (2019) [R 4][R Nb 1]
4 Vienna Austria Vienna U-Bahn 1976[22][Nb 2] 2017[23] 98[24] 83.3 km (51.8 mi)[22] 463.1 (2018)[R 6]
5 Baku Azerbaijan Baku Metro 1967[25] 2016[25] 25[25] 36.6 km (22.7 mi)[25] 231.0 (2018)[R 3]
We have now have the data in a DataFrame.
Step 2: Clean and convert the data
At first glance, we see that we do not need the rows City, Name, Yearopened, Year of last expansion. To make it easier to work with the data, let’s remove them and inspect the data again.
import pandas as pd
url = 'https://en.wikipedia.org/wiki/List_of_metro_systems'
tables = pd.read_html(url)
table = tables[0]
table = table.drop(['City', 'Name', 'Yearopened', 'Year of lastexpansion'], axis=1)
print(table)
Which result in the following output.
Country Stations System length Annual ridership(millions)
0 Algeria 19[14] 18.5 km (11.5 mi)[15] 45.3 (2019)[R 1]
1 Argentina 90[17] 56.7 km (35.2 mi)[17] 337.7 (2018)[R 2]
2 Armenia 10[18] 13.4 km (8.3 mi)[18] 18.7 (2018)[R 3]
3 Australia 13[20] 36 km (22 mi)[20][21] 14.2 (2019) [R 4][R Nb 1]
4 Austria 98[24] 83.3 km (51.8 mi)[22] 463.1 (2018)[R 6]
5 Azerbaijan 25[25] 36.6 km (22.7 mi)[25] 231.0 (2018)[R 3]
6 Belarus 29[27] 37.3 km (23.2 mi)[27] 283.4 (2018)[R 3]
7 Belgium 59[28][Nb 5] 39.9 km (24.8 mi)[29] 165.3 (2019)[R 7]
This makes it easier to see the next steps.
Let’s take them one by one. Stations need to remove the data after ‘[‘-symbol and convert the number to an integer. This can be done by using a lambda function to a row.
table['Stations'] = table.apply(lambda row: int(row['Stations'].split('[')[0]), axis=1)
If you are new to lambda functions we recommend you read this tutorial.
The next thing we need to do is to convert the System length to floats. The length will be in km (I live in Denmark, where we use km and not mi). This can also be done by using a lambda function
table['System length'] = table.apply(lambda row: float(row['System length'].split()[0]), axis=1)
Finally, and a bit more tricky, we need to convert the column of Annual ridership. The challenge is that lines have n/a which are converted to np.nan, but there are also some lines where the input is not easy to convert, as the images show.


These lines are can be dealt with by using a helper function.
def to_float(obj):
try:
return float(obj)
except:
return np.nan
index = 'Annual ridership(millions)'
table[index] = table.apply(lambda row: to_float(row[index].split()[0]) if row[index] is not np.nan else np.nan, axis=1)
Adding this all together we get the following code.
import pandas as pd
import numpy as np
def to_float(obj):
try:
return float(obj)
except:
return np.nan
url = 'https://en.wikipedia.org/wiki/List_of_metro_systems'
tables = pd.read_html(url)
table = tables[0]
table = table.drop(['City', 'Name', 'Yearopened', 'Year of lastexpansion'], axis=1)
table['Stations'] = table.apply(lambda row: int(row['Stations'].split('[')[0]), axis=1)
table['System length'] = table.apply(lambda row: float(row['System length'].split()[0]), axis=1)
index = 'Annual ridership(millions)'
table[index] = table.apply(lambda row: to_float(row[index].split()[0]) if row[index] is not np.nan else np.nan, axis=1)
print(table)
Which results in the following output (or the first few lines).
Country Stations System length Annual ridership(millions)
0 Algeria 19 18.50 45.30
1 Argentina 90 56.70 337.70
2 Armenia 10 13.40 18.70
3 Australia 13 36.00 14.20
4 Austria 98 83.30 463.10
5 Azerbaijan 25 36.60 231.00
6 Belarus 29 37.30 283.40
7 Belgium 59 39.90 165.30
8 Brazil 19 28.10 58.40
9 Brazil 25 42.40 42.80
10 Brazil 22 43.80 51.70
Step 3: Sum rows by country
Say, now we want to get the country with the most metro stations. This can be achieved by using the groupby and sum function from the pandas DataFrame data structure.
import pandas as pd
import numpy as np
def to_float(obj):
try:
return float(obj)
except:
return np.nan
url = 'https://en.wikipedia.org/wiki/List_of_metro_systems'
tables = pd.read_html(url)
table = tables[0]
table = table.drop(['City', 'Name', 'Yearopened', 'Year of lastexpansion'], axis=1)
table['Stations'] = table.apply(lambda row: int(row['Stations'].split('[')[0]), axis=1)
table['System length'] = table.apply(lambda row: float(row['System length'].split()[0]), axis=1)
index = 'Annual ridership(millions)'
table[index] = table.apply(lambda row: to_float(row[index].split()[0]) if row[index] is not np.nan else np.nan, axis=1)
# Sum up
table_sum = table.groupby(['Country']).sum()
print(table_sum.sort_values(['Stations'], ascending=False))
Where the result will be China.
Stations System length Annual ridership(millions)
Country
China 3738 6312.16 25519.23
United States 1005 1325.90 2771.50
South Korea 714 839.90 4054.90
Japan[Nb 34] 669 791.20 6489.60
India 499 675.97 1377.00
France 483 350.90 2113.50
Spain 438 474.40 1197.90
If we want to sort by km of System length, you will only need to change the last line to the following.
print(table_sum.sort_values(['System length'], ascending=False))
Resulting in the following.
Stations System length Annual ridership(millions)
Country
China 3738 6312.16 25519.23
United States 1005 1325.90 2771.50
South Korea 714 839.90 4054.90
Japan[Nb 34] 669 791.20 6489.60
India 499 675.97 1377.00
Russia 368 611.50 3507.60
United Kingdom 390 523.90 1555.30
Finally, if you want it by Annual ridership, you will need to change the last line to.
print(table_sum.sort_values([index], ascending=False))
Remember, we assigned that to index. You should get the following output.
Stations System length Annual ridership(millions)
Country
China 3738 6312.16 25519.23
Japan[Nb 34] 669 791.20 6489.60
South Korea 714 839.90 4054.90
Russia 368 611.50 3507.60
United States 1005 1325.90 2771.50
France 483 350.90 2113.50
Brazil 243 345.40 2106.20
Python Circle
Do you know what the 5 key success factors every programmer must have?
How is it possible that some people become programmer so fast?
While others struggle for years and still fail.
Not only do they learn python 10 times faster they solve complex problems with ease.
What separates them from the rest?
I identified these 5 success factors that every programmer must have to succeed:
- Collaboration: sharing your work with others and receiving help with any questions or challenges you may have.
- Networking: the ability to connect with the right people and leverage their knowledge, experience, and resources.
- Support: receive feedback on your work and ask questions without feeling intimidated or judged.
- Accountability: stay motivated and accountable to your learning goals by surrounding yourself with others who are also committed to learning Python.
- Feedback from the instructor: receiving feedback and support from an instructor with years of experience in the field.
I know how important these success factors are for growth and progress in mastering Python.
That is why I want to make them available to anyone struggling to learn or who just wants to improve faster.
With the Python Circle community, you can take advantage of 5 key success factors every programmer must have.

Be part of something bigger and join the Python Circle community.