Pandas and GeoPy: Plot World Population by Latitude and Longitude using Weighted Histograms – 5 Step Tutorial

What will we cover in this tutorial?

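  • Reading Wikipedia’s population table into a Pandas DataFrame.
  • Cleaning the data and looking up the latitude and longitude of each country with GeoPy.
  • Plotting the share of world population by latitude and by longitude as weighted histograms.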

Step 1: Collect the data

The data we want to use is from Wikipedia’s List of countries and dependencies by population.


When you work with data, it helps to use a library made for it. Here the Pandas library, a powerful data analysis and manipulation tool, comes in handy.

Using the Pandas library, the data can be read into a DataFrame, which is the main data structure in the library. The read_html function returns a list of DataFrames, one for each table found at the URL passed as its argument. If you are new to read_html, we recommend you read this tutorial.

import pandas as pd

url = 'https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population'
tables = pd.read_html(url)

table = tables[0]

print(table)

This prints the table. The first rows look like this.

    Rank                  Country (or dependent territory)  Population % of worldpopulation             Date                                             Source
0      1                                          China[b]  1403554760                  NaN      16 Jul 2020                       National population clock[3]
1      2                                             India  1364764203                  NaN      16 Jul 2020                       National population clock[4]
2      3                                  United States[d]   329963086                  NaN      16 Jul 2020                       National population clock[5]
3      4                                         Indonesia   269603400                  NaN       1 Jul 2020                      National annual projection[6]
4      5                                       Pakistan[e]   220892331                  NaN       1 Jul 2020                                   UN Projection[2]
5      6                                            Brazil   211800078                  NaN      16 Jul 2020                       National population clock[7]
6      7                                           Nigeria   206139587                  NaN       1 Jul 2020                                   UN Projection[2]
7      8                                        Bangladesh   168962650                  NaN      16 Jul 2020                       National population clock[8]
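
Indexing with tables[0] assumes the population table is the first one on the page, which may stop being true if the page layout changes. Here is a minimal sketch of picking the table by its column names instead. The helper find_table is hypothetical (it is not part of Pandas), and the two small DataFrames stand in for what read_html would return.

```python
import pandas as pd


def find_table(tables, required_columns):
    """Return the first DataFrame whose columns include every required name."""
    for candidate in tables:
        if set(required_columns).issubset(candidate.columns):
            return candidate
    raise ValueError(f"No table has all of the columns {required_columns}")


# Stand-ins for the list that pd.read_html(url) would return.
tables = [
    pd.DataFrame({"Note": ["This article needs updating"]}),
    pd.DataFrame({
        "Country (or dependent territory)": ["China[b]", "India"],
        "Population": [1403554760, 1364764203],
    }),
]

# Pick the table by content, not by position in the list.
table = find_table(tables, ["Country (or dependent territory)", "Population"])
print(table)
```

If no table matches, the helper raises an error instead of silently handing you the wrong table.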

Step 2: Remove unnecessary columns from your data

A good second step is to remove columns you do not need. This can be done by a call to drop, where axis=1 tells Pandas to drop columns rather than rows. As we only need the country names and populations, we can remove the rest of the columns.

import pandas as pd

url = 'https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population'
tables = pd.read_html(url)

table = tables[0]

table = table.drop(['Rank', '% of worldpopulation', 'Date', 'Source'], axis=1)

print(table)

Which will result in the following output.

                     Country (or dependent territory)  Population
0                                            China[b]  1403554760
1                                               India  1364764203
2                                    United States[d]   329963086
3                                           Indonesia   269603400
4                                         Pakistan[e]   220892331
5                                              Brazil   211800078
6                                             Nigeria   206139587
7                                          Bangladesh   168962650

This makes it easier to understand the data.
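
If you prefer, you can also go the other way and name the columns you want to keep. A small sketch on two toy rows (the values are copied from the output above) suggests both approaches give the same result here.

```python
import pandas as pd

# Toy rows with the same column names as the Wikipedia table.
table = pd.DataFrame({
    "Rank": [1, 2],
    "Country (or dependent territory)": ["China[b]", "India"],
    "Population": [1403554760, 1364764203],
    "% of worldpopulation": [None, None],
    "Date": ["16 Jul 2020", "16 Jul 2020"],
    "Source": ["National population clock[3]", "National population clock[4]"],
})

# Option 1: drop what you do not need (columns= is the same as axis=1).
dropped = table.drop(columns=["Rank", "% of worldpopulation", "Date", "Source"])

# Option 2: select what you do need.
kept = table[["Country (or dependent territory)", "Population"]]

print(dropped.equals(kept))
```

Selecting the columns to keep has the advantage that the code keeps working if Wikipedia adds a new column to the table.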

Another thing you can do is to rename the column Country (or dependent territory) to Country. This makes your code easier to write when you need to access that column of data.

Let’s just do that.

import pandas as pd

url = 'https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population'
tables = pd.read_html(url)

table = tables[0]

table = table.drop(['Rank', '% of worldpopulation', 'Date', 'Source'], axis=1)
table.columns = ['Country', 'Population']

print(table)

Resulting in the following output.

                                              Country  Population
0                                            China[b]  1403554760
1                                               India  1364764203
2                                    United States[d]   329963086
3                                           Indonesia   269603400
4                                         Pakistan[e]   220892331
5                                              Brazil   211800078
6                                             Nigeria   206139587
7                                          Bangladesh   168962650
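
Assigning to table.columns replaces all the names by position, so it only works if you know the exact number and order of the columns. As a sketch of an alternative, rename with a mapping touches only the column you name.

```python
import pandas as pd

table = pd.DataFrame({
    "Country (or dependent territory)": ["China[b]", "India"],
    "Population": [1403554760, 1364764203],
})

# Rename one column by name and leave the others untouched.
table = table.rename(columns={"Country (or dependent territory)": "Country"})
print(list(table.columns))
```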

Step 3: Cleaning the data

We see that the Country column can carry two kinds of extra information. See the examples here.

                                              Country  Population
0                                            China[b]  1403554760
195                                       Jersey (UK)      107800

It can have either a letter in square brackets (for example [b]) or a space followed by a country in parentheses (for example (UK)).

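This can be cleaned up with a lambda function that keeps only the text before the first square bracket or before the first space and opening parenthesis. If you are new to lambda functions we recommend you read this tutorial.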

import pandas as pd

url = 'https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population'
tables = pd.read_html(url)

table = tables[0]

table = table.drop(['Rank', '% of worldpopulation', 'Date', 'Source'], axis=1)
table.columns = ['Country', 'Population']

table['Country'] = table.apply(lambda row: row['Country'].split('[')[0], axis=1)
table['Country'] = table.apply(lambda row: row['Country'].split(' (')[0], axis=1)
print(table)

Which results in the following output.

                                         Country  Population
0                                          China  1403554760
1                                          India  1364764203
2                                  United States   329963086
3                                      Indonesia   269603400
4                                       Pakistan   220892331
5                                         Brazil   211800078
6                                        Nigeria   206139587
7                                     Bangladesh   168962650
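
The lambda version works fine. As an alternative sketch, the same cleanup can be written with the vectorized string methods of Pandas, which avoids calling apply row by row. The regular expressions below are my own choice and are only tested on the three toy names shown.

```python
import pandas as pd

table = pd.DataFrame({
    "Country": ["China[b]", "Jersey (UK)", "India"],
    "Population": [1403554760, 107800, 1364764203],
})

table["Country"] = (
    table["Country"]
    .str.replace(r"\[.*$", "", regex=True)     # drop "[b]" style footnote markers
    .str.replace(r"\s+\(.*$", "", regex=True)  # drop " (UK)" style suffixes
    .str.strip()                               # remove any leftover whitespace
)
print(table)
```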

Finally, take a look at the last line of the output.

241                                        World  7799525000

You see it is the sum of all the populations. This row is not part of the dataset and should be removed.

import pandas as pd

url = 'https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population'
tables = pd.read_html(url)

table = tables[0]

table = table.drop(['Rank', '% of worldpopulation', 'Date', 'Source'], axis=1)
table.columns = ['Country', 'Population']

table['Country'] = table.apply(lambda row: row['Country'].split('[')[0], axis=1)
table['Country'] = table.apply(lambda row: row['Country'].split(' (')[0], axis=1)
table = table[table.Country != 'World']
print(table)

And it is gone. The line table = table[table.Country != 'World'] removes it.
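
If the filtering looks like magic, here is a small sketch of what happens, using three toy rows. The comparison produces a boolean mask with one True or False per row, and indexing with that mask keeps only the True rows. The .copy() at the end is my addition; it makes the result an independent DataFrame, which avoids warnings from Pandas when you add columns to it later.

```python
import pandas as pd

table = pd.DataFrame({
    "Country": ["China", "India", "World"],
    "Population": [1403554760, 1364764203, 7799525000],
})

# One boolean per row: True for rows to keep.
mask = table["Country"] != "World"
print(list(mask))

# Keep only the rows where the mask is True.
countries = table[mask].copy()
print(countries)
```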

Step 4: Adding latitude and longitudes to the data

This is where the GeoPy library comes in handy.

geopy makes it easy for Python developers to locate the coordinates of addresses, cities, countries, and landmarks across the globe using third-party geocoders and other data sources.

web: https://geopy.readthedocs.io/en/stable/

It is easy to use, but… it is slow.

When you run the code several times while debugging, you want to avoid waiting for 200+ lookups each time. Hence, I have created a small persistence layer that reuses locations that have already been looked up.

import numpy as np
from geopy.exc import GeocoderTimedOut
from geopy.geocoders import Nominatim
import pickle
import os


class Locator:
    def __init__(self):
        self.pickle_name = "location_store.pickle"
        self.geo_locator = Nominatim(user_agent="LearnPython")
        self.location_store = {}
        if os.path.isfile(self.pickle_name):
            # Reload the lookups saved by an earlier run
            with open(self.pickle_name, "rb") as f:
                self.location_store = pickle.load(f)

    def get_location(self, location_name):
        if location_name in self.location_store:
            return self.location_store[location_name]
        try:
            location = self.geo_locator.geocode(location_name, language='en')
            self.location_store[location_name] = location
            # Save the updated store so the next run can skip this lookup
            with open(self.pickle_name, 'wb') as f:
                pickle.dump(self.location_store, f)
        except GeocoderTimedOut:
            location = None
        return location

    def get_latitude(self, location_name):
        location = self.get_location(location_name)
        if location:
            return location.latitude
        else:
            return np.nan

    def get_longitude(self, location_name):
        location = self.get_location(location_name)
        if location:
            return location.longitude
        else:
            return np.nan

We use this class to look up latitudes and longitudes and add them to our data. As I ran the code several times, I got tired of waiting (probably more than a minute) each time. To save you and the planet from wasted seconds, I am sharing this code with you.
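
To see the caching idea on its own, without any network calls, here is a minimal sketch. The class CachedLookup and the function fake_geocode are hypothetical stand-ins (they are not part of GeoPy), and the cache file lives in a temporary directory so the example does not touch your real location_store.pickle.

```python
import os
import pickle
import tempfile


class CachedLookup:
    """Persist the results of a slow lookup function in a pickle file."""

    def __init__(self, lookup, cache_path):
        self.lookup = lookup
        self.cache_path = cache_path
        self.store = {}
        if os.path.isfile(cache_path):
            with open(cache_path, "rb") as f:
                self.store = pickle.load(f)

    def get(self, key):
        # Answer from the cache when possible.
        if key in self.store:
            return self.store[key]
        # Otherwise do the slow lookup and save the result to disk.
        value = self.lookup(key)
        self.store[key] = value
        with open(self.cache_path, "wb") as f:
            pickle.dump(self.store, f)
        return value


calls = []


def fake_geocode(name):
    """Stand-in for a slow geocoder; records every real lookup."""
    calls.append(name)
    return {"China": (35.0, 105.0)}.get(name)


cache_path = os.path.join(tempfile.mkdtemp(), "location_store.pickle")

first = CachedLookup(fake_geocode, cache_path)
first.get("China")   # slow lookup, result saved to disk
first.get("China")   # answered from memory

second = CachedLookup(fake_geocode, cache_path)  # simulates a new run
second.get("China")  # answered from the pickle file

print(len(calls))
```

Only the very first call reaches the lookup function. Everything after that, including the simulated second run, is served from the store.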

And now we can use it for adding data to our DataFrame.

import pandas as pd
import numpy as np
from geopy.exc import GeocoderTimedOut
from geopy.geocoders import Nominatim
import pickle
import os

class Locator:
    def __init__(self):
        self.pickle_name = "location_store.pickle"
        self.geo_locator = Nominatim(user_agent="LearnPython")
        self.location_store = {}
        if os.path.isfile(self.pickle_name):
            # Reload the lookups saved by an earlier run
            with open(self.pickle_name, "rb") as f:
                self.location_store = pickle.load(f)

    def get_location(self, location_name):
        if location_name in self.location_store:
            return self.location_store[location_name]
        try:
            location = self.geo_locator.geocode(location_name, language='en')
            self.location_store[location_name] = location
            # Save the updated store so the next run can skip this lookup
            with open(self.pickle_name, 'wb') as f:
                pickle.dump(self.location_store, f)
        except GeocoderTimedOut:
            location = None
        return location

    def get_latitude(self, location_name):
        location = self.get_location(location_name)
        if location:
            return location.latitude
        else:
            return np.nan

    def get_longitude(self, location_name):
        location = self.get_location(location_name)
        if location:
            return location.longitude
        else:
            return np.nan

url = 'https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population'
tables = pd.read_html(url)

table = tables[0]

table = table.drop(['Rank', '% of worldpopulation', 'Date', 'Source'], axis=1)
table.columns = ['Country', 'Population']

table['Country'] = table.apply(lambda row: row['Country'].split('[')[0], axis=1)
table['Country'] = table.apply(lambda row: row['Country'].split(' (')[0], axis=1)
table = table[table.Country != 'World']

locator = Locator()

table['Lat'] = table.apply(lambda row: locator.get_latitude(row['Country']), axis=1)
table['Lon'] = table.apply(lambda row: locator.get_longitude(row['Country']), axis=1)
print(table)

Which results in the following output.

                                         Country  Population        Lat         Lon
0                                          China  1403554760  35.000074  104.999927
1                                          India  1364764203  22.351115   78.667743
2                                  United States   329963086  39.783730 -100.445882
3                                      Indonesia   269603400  -2.483383  117.890285
4                                       Pakistan   220892331  30.330840   71.247499
5                                         Brazil   211800078 -10.333333  -53.200000
6                                        Nigeria   206139587   9.600036    7.999972
7                                     Bangladesh   168962650  24.476878   90.293243

There is actually one location that GeoPy does not recognize.

231  Saint Helena, Ascensionand Tristan da Cunha        5633        NaN         NaN

Instead of doing the right thing for these 5,633 people, who also count in the world population, I did the wrong thing.

table = table.dropna()

This call to dropna() does what you think. It removes the rows containing NaN, like the one above.

With my deepest respect to the people in Saint Helena, I apologize for my incorrect behavior.
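
A slightly more careful variant, sketched here on two toy rows: called without arguments, dropna() removes a row if any column holds NaN. Passing subset limits the check to the coordinate columns, and listing the affected rows first shows you exactly who you are about to leave out.

```python
import numpy as np
import pandas as pd

table = pd.DataFrame({
    "Country": ["China", "Saint Helena, Ascensionand Tristan da Cunha"],
    "Population": [1403554760, 5633],
    "Lat": [35.000074, np.nan],
    "Lon": [104.999927, np.nan],
})

# Look at the rows without coordinates before removing them.
missing = table[table["Lat"].isna() | table["Lon"].isna()]
print(missing[["Country", "Population"]])

# Drop a row only when its latitude or longitude is missing.
table = table.dropna(subset=["Lat", "Lon"])
print(table)
```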

Step 5: Create the plots

Now this is smart. Pandas can do all the work of making histograms for you. What you really want is an accumulated histogram, which is called a weighted histogram.

A plain histogram only counts occurrences. What we want instead is to add up the population values to see at which latitudes (and longitudes) people live.

We want it as a percentage, to make life easier.

total_sum = table['Population'].sum()
table['Percentage'] = table.apply(lambda row: row['Population']/total_sum*100, axis=1)
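
To make the difference between counting and weighting concrete, here is a small sketch that calls numpy.histogram directly on four toy rows taken from the output above. With only four countries the percentages are of course not the real world shares. The percentage column is computed with plain column arithmetic, which gives the same values as the apply call.

```python
import numpy as np
import pandas as pd

table = pd.DataFrame({
    "Country": ["China", "India", "Indonesia", "Brazil"],
    "Population": [1403554760, 1364764203, 269603400, 211800078],
    "Lat": [35.000074, 22.351115, -2.483383, -10.333333],
})

# Share of the (toy) total population, in percent.
table["Percentage"] = table["Population"] / table["Population"].sum() * 100

bins = list(range(-50, 81, 10))

# Plain histogram: how many countries fall into each latitude band.
counts, _ = np.histogram(table["Lat"], bins=bins)

# Weighted histogram: how much population share falls into each band.
weighted, _ = np.histogram(table["Lat"], bins=bins, weights=table["Percentage"])

for low, count, share in zip(bins[:-1], counts, weighted):
    if count:
        print(f"{low:>4} to {low + 10:>3}: {count} country, {share:.1f} %")
```

Every non-empty band holds exactly one country in the plain histogram, while the weighted one shows that the bands are far from equal in population.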

Then we are ready for the final two plots, both weighted histograms.

import pandas as pd
import numpy as np
from geopy.exc import GeocoderTimedOut
from geopy.geocoders import Nominatim
import pickle
import os
import matplotlib.pyplot as plt


class Locator:
    def __init__(self):
        self.pickle_name = "location_store.pickle"
        self.geo_locator = Nominatim(user_agent="LearnPython")
        self.location_store = {}
        if os.path.isfile(self.pickle_name):
            # Reload the lookups saved by an earlier run
            with open(self.pickle_name, "rb") as f:
                self.location_store = pickle.load(f)

    def get_location(self, location_name):
        if location_name in self.location_store:
            return self.location_store[location_name]
        try:
            location = self.geo_locator.geocode(location_name, language='en')
            self.location_store[location_name] = location
            # Save the updated store so the next run can skip this lookup
            with open(self.pickle_name, 'wb') as f:
                pickle.dump(self.location_store, f)
        except GeocoderTimedOut:
            location = None
        return location

    def get_latitude(self, location_name):
        location = self.get_location(location_name)
        if location:
            return location.latitude
        else:
            return np.nan

    def get_longitude(self, location_name):
        location = self.get_location(location_name)
        if location:
            return location.longitude
        else:
            return np.nan


pd.set_option('display.max_rows', 300)
pd.set_option('display.max_columns', 10)
pd.set_option('display.width', 1000)

url = 'https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population'
tables = pd.read_html(url)

table = tables[0]

table = table.drop(['Rank', '% of worldpopulation', 'Date', 'Source'], axis=1)
table.columns = ['Country', 'Population']

table['Country'] = table.apply(lambda row: row['Country'].split('[')[0], axis=1)
table['Country'] = table.apply(lambda row: row['Country'].split(' (')[0], axis=1)
table = table[table.Country != 'World']

locator = Locator()

table['Lat'] = table.apply(lambda row: locator.get_latitude(row['Country']), axis=1)
table['Lon'] = table.apply(lambda row: locator.get_longitude(row['Country']), axis=1)
# Sorry Saint Helena, please forgive me
table = table.dropna()

total_sum = table['Population'].sum()
table['Percentage'] = table.apply(lambda row: row['Population']/total_sum*100, axis=1)

table.hist(column='Lat', weights=table['Percentage'], orientation='horizontal', bins=[-50, -40, -30, -20, -10, 0, 10, 20, 30, 40, 50, 60, 70, 80])
plt.title('Percentage of World Population')
plt.xlabel('%')
plt.ylabel('Latitude')
plt.show()

bins = [i for i in range(-180, 181, 20)]

table.hist(column='Lon', weights=table['Percentage'], bins=bins)
plt.title('Percentage of World Population')
plt.xlabel('Longitude')
plt.ylabel('%')
plt.show()

Which should output the following two plots.

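[Figure: weighted histogram of the percentage of world population by latitude]
[Figure: weighted histogram of the percentage of world population by longitude]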
