What will we cover in this tutorial?
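- How to read a population table from Wikipedia with Pandas, clean it, add latitudes and longitudes with GeoPy, and plot where people live as weighted histograms.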
Step 1: Collect the data
The data we want to use is from Wikipedia’s List of countries and dependencies by population.

When you work with data, it helps to use a library made for it. Pandas, a powerful data analysis and manipulation tool, comes in handy here.
With Pandas, the data can be read into a DataFrame, which is the main data structure of the library. The read_html function returns a list of DataFrames, one for each table found at the URL passed as its argument. If you are new to read_html, we recommend you read this tutorial.
import pandas as pd
url = 'https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population'
tables = pd.read_html(url)
table = tables[0]
print(table)
This prints the table. The first rows are shown here.
Rank Country (or dependent territory) Population % of worldpopulation Date Source
0 1 China[b] 1403554760 NaN 16 Jul 2020 National population clock[3]
1 2 India[c] 1364764203 NaN 16 Jul 2020 National population clock[4]
2 3 United States[d] 329963086 NaN 16 Jul 2020 National population clock[5]
3 4 Indonesia 269603400 NaN 1 Jul 2020 National annual projection[6]
4 5 Pakistan[e] 220892331 NaN 1 Jul 2020 UN Projection[2]
5 6 Brazil 211800078 NaN 16 Jul 2020 National population clock[7]
6 7 Nigeria 206139587 NaN 1 Jul 2020 UN Projection[2]
7 8 Bangladesh 168962650 NaN 16 Jul 2020 National population clock[8]
Step 2: Remove unnecessary columns from your data
A good second step is to remove columns you do not need. This can be done by a call to drop. As we only need the country names and populations, we can remove the rest of the columns.
import pandas as pd
url = 'https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population'
tables = pd.read_html(url)
table = tables[0]
table = table.drop(['Rank', '% of worldpopulation', 'Date', 'Source'], axis=1)
print(table)
Which will result in the following output.
Country (or dependent territory) Population
0 China[b] 1403554760
1 India[c] 1364764203
2 United States[d] 329963086
3 Indonesia 269603400
4 Pakistan[e] 220892331
5 Brazil 211800078
6 Nigeria 206139587
7 Bangladesh 168962650
This makes it easier to understand the data.
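An alternative to dropping columns is to select only the ones you want to keep. The sketch below runs without a network connection because it uses a small hand-made DataFrame; the name sample and its contents are illustrative stand-ins for the Wikipedia table, not the real data set.

```python
import pandas as pd

# Hand-made stand-in for the Wikipedia table (illustrative only).
sample = pd.DataFrame({
    'Rank': [1, 2],
    'Country (or dependent territory)': ['China[b]', 'India[c]'],
    'Population': [1403554760, 1364764203],
    'Date': ['16 Jul 2020', '16 Jul 2020'],
})

# Keep only the two columns we need instead of listing the ones to drop.
kept = sample[['Country (or dependent territory)', 'Population']]
print(kept)
```

With selection, any extra column that shows up on the page later is simply ignored, whereas with drop it would stay in the table.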
Another thing you can do is to rename the column Country (or dependent territory) to Country. This makes your code easier to write when you need to access that column of data.
Let’s just do that.
import pandas as pd
url = 'https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population'
tables = pd.read_html(url)
table = tables[0]
table = table.drop(['Rank', '% of worldpopulation', 'Date', 'Source'], axis=1)
table.columns = ['Country', 'Population']
print(table)
Resulting in the following output.
Country Population
0 China[b] 1403554760
1 India[c] 1364764203
2 United States[d] 329963086
3 Indonesia 269603400
4 Pakistan[e] 220892331
5 Brazil 211800078
6 Nigeria 206139587
7 Bangladesh 168962650
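Assigning to table.columns replaces all column names at once, so it relies on the column order. If you prefer to name the column you are changing, rename is an option. This is a small sketch on a hand-made DataFrame (sample is an illustrative stand-in for the real table).

```python
import pandas as pd

# Hand-made stand-in for the table after the drop step (illustrative only).
sample = pd.DataFrame({
    'Country (or dependent territory)': ['China[b]', 'India[c]'],
    'Population': [1403554760, 1364764203],
})

# rename maps old names to new ones, so the column order does not matter.
renamed = sample.rename(columns={'Country (or dependent territory)': 'Country'})
print(renamed.columns.tolist())   # ['Country', 'Population']
```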
Step 3: Cleaning the data
We see that the Country column can contain two types of extra information. See the examples here.
Country Population
0 China[b] 1403554760
195 Jersey (UK) 107800
It is either a letter in square brackets (for example [b]) or a space followed by a country in parentheses (for example (UK)).
This can be cleaned by using a lambda function. If you are new to lambda functions we recommend you read this tutorial.
import pandas as pd
url = 'https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population'
tables = pd.read_html(url)
table = tables[0]
table = table.drop(['Rank', '% of worldpopulation', 'Date', 'Source'], axis=1)
table.columns = ['Country', 'Population']
table['Country'] = table.apply(lambda row: row['Country'].split('[')[0], axis=1)
table['Country'] = table.apply(lambda row: row['Country'].split(' (')[0], axis=1)
print(table)
Which results in the following output.
Country Population
0 China 1403554760
1 India 1364764203
2 United States 329963086
3 Indonesia 269603400
4 Pakistan 220892331
5 Brazil 211800078
6 Nigeria 206139587
7 Bangladesh 168962650
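The same cleaning can also be written without a row-wise lambda by using the string methods Pandas offers on a column. This is only a sketch of an alternative, not the approach used in this tutorial; the sample DataFrame is hand-made and the regular expression is my own assumption about what needs to be cut off.

```python
import pandas as pd

# Hand-made examples of the two kinds of extra information (illustrative only).
sample = pd.DataFrame({
    'Country': ['China[b]', 'Jersey (UK)', 'Brazil'],
    'Population': [1403554760, 107800, 211800078],
})

# Remove everything from the first '[' or the first ' (' to the end of the name.
cleaned = sample['Country'].str.replace(r'(\[| \().*$', '', regex=True)
print(cleaned.tolist())   # ['China', 'Jersey', 'Brazil']
```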
Finally, look at the last line of the table.
241 World 7799525000
It is the sum of all the populations. This row is not part of the dataset and should be removed.
import pandas as pd
url = 'https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population'
tables = pd.read_html(url)
table = tables[0]
table = table.drop(['Rank', '% of worldpopulation', 'Date', 'Source'], axis=1)
table.columns = ['Country', 'Population']
table['Country'] = table.apply(lambda row: row['Country'].split('[')[0], axis=1)
table['Country'] = table.apply(lambda row: row['Country'].split(' (')[0], axis=1)
table = table[table.Country != 'World']
print(table)
And it is gone. The line table = table[table.Country != 'World'] removes it.
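If the filtering line looks like magic, it helps to split it into two steps. The sketch below uses a hand-made DataFrame (sample is illustrative) to show the intermediate True/False values.

```python
import pandas as pd

# Hand-made stand-in with a total row at the end (illustrative only).
sample = pd.DataFrame({
    'Country': ['China', 'India', 'World'],
    'Population': [1403554760, 1364764203, 7799525000],
})

# The comparison yields one True/False value per row.
mask = sample['Country'] != 'World'
print(mask.tolist())   # [True, True, False]

# Indexing with the mask keeps only the rows marked True.
filtered = sample[mask]
print(filtered)
```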
Step 4: Adding latitude and longitudes to the data
This is where the GeoPy library comes in handy.
geopy makes it easy for Python developers to locate the coordinates of addresses, cities, countries, and landmarks across the globe using third-party geocoders and other data sources.
web: https://geopy.readthedocs.io/en/stable/
It is easy to use, but… it is slow.
When you run the code several times while debugging, you want to avoid waiting for 200+ lookups. Hence, I have created a small persistent cache that reuses locations that have already been looked up.
import numpy as np
from geopy.exc import GeocoderTimedOut
from geopy.geocoders import Nominatim
import pickle
import os
class Locator:
    def __init__(self):
        self.pickle_name = "location_store.pickle"
        self.geo_locator = Nominatim(user_agent="LearnPython")
        self.location_store = {}
        if os.path.isfile(self.pickle_name):
            with open(self.pickle_name, "rb") as f:
                self.location_store = pickle.load(f)
    def get_location(self, location_name):
        if location_name in self.location_store:
            return self.location_store[location_name]
        try:
            location = self.geo_locator.geocode(location_name, language='en')
            self.location_store[location_name] = location
            with open(self.pickle_name, 'wb') as f:
                pickle.dump(self.location_store, f)
        except GeocoderTimedOut:
            location = None
        return location
    def get_latitude(self, location_name):
        location = self.get_location(location_name)
        if location:
            return location.latitude
        else:
            return np.nan
    def get_longitude(self, location_name):
        location = self.get_location(location_name)
        if location:
            return location.longitude
        else:
            return np.nan
What we want to do with this class is to look up latitudes and longitudes and add them to our data source. Since I ran the code many times, I got tired of waiting several long seconds (probably more than a minute) on each run. To save you and the planet those wasted seconds, I am sharing this code with you.
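The caching idea does not depend on GeoPy. Here is a minimal sketch of the same pattern using only the standard library, assuming a stand-in lookup function instead of a real geocoder. CachedLookup, fake_geocode, and the file name demo_store.pickle are hypothetical names used only for this illustration.

```python
import os
import pickle
import tempfile

class CachedLookup:
    """Persist the results of a slow lookup function in a pickle file."""

    def __init__(self, lookup, path):
        self.lookup = lookup
        self.path = path
        self.store = {}
        # Reuse results saved by an earlier run, if there are any.
        if os.path.isfile(path):
            with open(path, 'rb') as f:
                self.store = pickle.load(f)

    def get(self, key):
        if key not in self.store:
            self.store[key] = self.lookup(key)
            # Save after every new lookup so nothing is lost on a crash.
            with open(self.path, 'wb') as f:
                pickle.dump(self.store, f)
        return self.store[key]

calls = []

def fake_geocode(name):
    # Hypothetical stand-in for a real geocoder; it records each call.
    calls.append(name)
    return (35.0, 105.0)

cache_path = os.path.join(tempfile.mkdtemp(), 'demo_store.pickle')

first = CachedLookup(fake_geocode, cache_path)
first.get('China')
first.get('China')      # served from memory

second = CachedLookup(fake_geocode, cache_path)   # simulates a new run
second.get('China')     # served from the pickle file

print(len(calls))       # 1
```

The slow function is called only once, even though the value is requested three times across two "runs".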
And now we can use the Locator class to add data to our DataFrame.
import pandas as pd
import numpy as np
from geopy.exc import GeocoderTimedOut
from geopy.geocoders import Nominatim
import pickle
import os
class Locator:
    def __init__(self):
        self.pickle_name = "location_store.pickle"
        self.geo_locator = Nominatim(user_agent="LearnPython")
        self.location_store = {}
        if os.path.isfile(self.pickle_name):
            with open(self.pickle_name, "rb") as f:
                self.location_store = pickle.load(f)
    def get_location(self, location_name):
        if location_name in self.location_store:
            return self.location_store[location_name]
        try:
            location = self.geo_locator.geocode(location_name, language='en')
            self.location_store[location_name] = location
            with open(self.pickle_name, 'wb') as f:
                pickle.dump(self.location_store, f)
        except GeocoderTimedOut:
            location = None
        return location
    def get_latitude(self, location_name):
        location = self.get_location(location_name)
        if location:
            return location.latitude
        else:
            return np.nan
    def get_longitude(self, location_name):
        location = self.get_location(location_name)
        if location:
            return location.longitude
        else:
            return np.nan
url = 'https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population'
tables = pd.read_html(url)
table = tables[0]
table = table.drop(['Rank', '% of worldpopulation', 'Date', 'Source'], axis=1)
table.columns = ['Country', 'Population']
table['Country'] = table.apply(lambda row: row['Country'].split('[')[0], axis=1)
table['Country'] = table.apply(lambda row: row['Country'].split(' (')[0], axis=1)
table = table[table.Country != 'World']
locator = Locator()
table['Lat'] = table.apply(lambda row: locator.get_latitude(row['Country']), axis=1)
table['Lon'] = table.apply(lambda row: locator.get_longitude(row['Country']), axis=1)
print(table)
Which results in the following output.
Country Population Lat Lon
0 China 1403554760 35.000074 104.999927
1 India 1364764203 22.351115 78.667743
2 United States 329963086 39.783730 -100.445882
3 Indonesia 269603400 -2.483383 117.890285
4 Pakistan 220892331 30.330840 71.247499
5 Brazil 211800078 -10.333333 -53.200000
6 Nigeria 206139587 9.600036 7.999972
7 Bangladesh 168962650 24.476878 90.293243
There is actually one location that GeoPy does not recognize.
231 Saint Helena, Ascensionand Tristan da Cunha 5633 NaN NaN
Instead of doing the right thing for these 5,633 people, who also count toward the world population, I did the wrong thing.
table = table.dropna()
This call to dropna() does what you think. It removes the rows containing NaN, like the one above.
With my deepest respect to the people in Saint Helena, I apologize for my incorrect behavior.
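Before dropping rows, it can be worth checking which ones will disappear. The sketch below does that on a hand-made DataFrame (sample is illustrative and the country name is shortened), and it limits the NaN check to the coordinate columns with the subset parameter.

```python
import numpy as np
import pandas as pd

# Hand-made stand-in: one row with coordinates, one without (illustrative only).
sample = pd.DataFrame({
    'Country': ['China', 'Saint Helena'],
    'Population': [1403554760, 5633],
    'Lat': [35.000074, np.nan],
    'Lon': [104.999927, np.nan],
})

# Inspect which rows would be lost before dropping them.
missing = sample[sample['Lat'].isna()]
print(missing['Country'].tolist())   # ['Saint Helena']

# subset limits the NaN check to the coordinate columns.
kept = sample.dropna(subset=['Lat', 'Lon'])
print(kept['Country'].tolist())      # ['China']
```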
Step 5: Create the plots
Now this is smart. Pandas can do all the work of making a histogram for you. A plain histogram only counts occurrences, though. What you really want is an accumulated histogram, which is called a weighted histogram.
It adds together values instead of counting rows, so we can see at which latitudes (and longitudes) people live.
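The difference between counting and weighting is easy to see with NumPy. This is a small sketch with made-up numbers (the latitudes and populations are illustrative, not real data).

```python
import numpy as np

# Made-up latitudes and populations in millions (illustrative only).
lats = np.array([5.0, 15.0, 35.0, 38.0])
pops = np.array([10.0, 20.0, 300.0, 70.0])
bins = [0, 10, 20, 30, 40]

# Plain histogram: how many locations fall in each bin.
counts, _ = np.histogram(lats, bins=bins)

# Weighted histogram: how many people fall in each bin.
weighted, _ = np.histogram(lats, bins=bins, weights=pops)

print(counts.tolist())     # [1, 1, 0, 2]
print(weighted.tolist())   # [10.0, 20.0, 0.0, 370.0]
```

The last bin holds two locations, but the weighted version shows that it holds most of the people.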
We want the values as percentages, to make life easier.
total_sum = table['Population'].sum()
table['Percentage'] = table.apply(lambda row: row['Population']/total_sum*100, axis=1)
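The percentage can also be computed on the whole column at once, without a lambda. A minimal sketch on a hand-made DataFrame (sample and its numbers are illustrative):

```python
import pandas as pd

# Made-up populations that sum to 4 (illustrative only).
sample = pd.DataFrame({
    'Country': ['A', 'B', 'C'],
    'Population': [2, 1, 1],
})

total_sum = sample['Population'].sum()

# Arithmetic on a column is applied to every row.
sample['Percentage'] = sample['Population'] / total_sum * 100
print(sample['Percentage'].tolist())   # [50.0, 25.0, 25.0]
```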
Then we are ready for the final two plots as weighted histograms.
import pandas as pd
import numpy as np
from geopy.exc import GeocoderTimedOut
from geopy.geocoders import Nominatim
import pickle
import os
import matplotlib.pyplot as plt
class Locator:
    def __init__(self):
        self.pickle_name = "location_store.pickle"
        self.geo_locator = Nominatim(user_agent="LearnPython")
        self.location_store = {}
        if os.path.isfile(self.pickle_name):
            with open(self.pickle_name, "rb") as f:
                self.location_store = pickle.load(f)
    def get_location(self, location_name):
        if location_name in self.location_store:
            return self.location_store[location_name]
        try:
            location = self.geo_locator.geocode(location_name, language='en')
            self.location_store[location_name] = location
            with open(self.pickle_name, 'wb') as f:
                pickle.dump(self.location_store, f)
        except GeocoderTimedOut:
            location = None
        return location
    def get_latitude(self, location_name):
        location = self.get_location(location_name)
        if location:
            return location.latitude
        else:
            return np.nan
    def get_longitude(self, location_name):
        location = self.get_location(location_name)
        if location:
            return location.longitude
        else:
            return np.nan
pd.set_option('display.max_rows', 300)
pd.set_option('display.max_columns', 10)
pd.set_option('display.width', 1000)
url = 'https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population'
tables = pd.read_html(url)
table = tables[0]
table = table.drop(['Rank', '% of worldpopulation', 'Date', 'Source'], axis=1)
table.columns = ['Country', 'Population']
table['Country'] = table.apply(lambda row: row['Country'].split('[')[0], axis=1)
table['Country'] = table.apply(lambda row: row['Country'].split(' (')[0], axis=1)
table = table[table.Country != 'World']
locator = Locator()
table['Lat'] = table.apply(lambda row: locator.get_latitude(row['Country']), axis=1)
table['Lon'] = table.apply(lambda row: locator.get_longitude(row['Country']), axis=1)
# Sorry Saint Helena, please forgive me
table = table.dropna()
total_sum = table['Population'].sum()
table['Percentage'] = table.apply(lambda row: row['Population']/total_sum*100, axis=1)
table.hist(column='Lat', weights=table['Percentage'], orientation='horizontal', bins=[-50, -40, -30, -20, -10, 0, 10, 20, 30, 40, 50, 60, 70, 80])
plt.title('Percentage of World Population')
plt.xlabel('%')
plt.ylabel('Latitude')
plt.show()
bins = [i for i in range(-180, 181, 20)]
table.hist(column='Lon', weights=table['Percentage'], bins=bins)
plt.title('Percentage of World Population')
plt.xlabel('Longitude')
plt.ylabel('%')
plt.show()
Which should output the following two plots.


