    Pandas and GeoPy: Plot World Population by Latitude and Longitude using Weighted Histograms – 5 Step Tutorial

    What will we cover in this tutorial?

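    • Reading a table from Wikipedia into a Pandas DataFrame.
    • Removing unneeded columns and cleaning the data.
    • Looking up latitudes and longitudes with GeoPy.
    • Plotting the world population by latitude and longitude as weighted histograms.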

    Step 1: Collect the data

    The data we want to use is from Wikipedia’s List of countries and dependencies by population.

    When you work with data, it helps to use a library made for it. Here the Pandas library comes in handy; it is a powerful data analysis and manipulation tool.

    Using Pandas, the data can be read into a DataFrame, which is the main data structure in the library. The read_html function returns a list of DataFrames, one for each table found at the URL given as argument. If you are new to read_html, we recommend you read this tutorial.

    import pandas as pd
    url = 'https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population'
    tables = pd.read_html(url)
    table = tables[0]
    print(table)
    

    This prints the table. The first rows are shown here.

        Rank                  Country (or dependent territory)  Population % of worldpopulation             Date                                             Source
    0      1                                          China[b]  1403554760                  NaN      16 Jul 2020                       National population clock[3]
    1      2                                          India[c]  1364764203                  NaN      16 Jul 2020                       National population clock[4]
    2      3                                  United States[d]   329963086                  NaN      16 Jul 2020                       National population clock[5]
    3      4                                         Indonesia   269603400                  NaN       1 Jul 2020                      National annual projection[6]
    4      5                                       Pakistan[e]   220892331                  NaN       1 Jul 2020                                   UN Projection[2]
    5      6                                            Brazil   211800078                  NaN      16 Jul 2020                       National population clock[7]
    6      7                                           Nigeria   206139587                  NaN       1 Jul 2020                                   UN Projection[2]
    7      8                                        Bangladesh   168962650                  NaN      16 Jul 2020                       National population clock[8]
    

    Step 2: Remove unnecessary columns from your data

    A good second step is to remove columns you do not need. This can be done by a call to drop. As we only need the country names and populations, we can remove the rest of the columns.

    import pandas as pd
    url = 'https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population'
    tables = pd.read_html(url)
    table = tables[0]
    table = table.drop(['Rank', '% of worldpopulation', 'Date', 'Source'], axis=1)
    print(table)
    

    Which will result in the following output.

                         Country (or dependent territory)  Population
    0                                            China[b]  1403554760
    1                                            India[c]  1364764203
    2                                    United States[d]   329963086
    3                                           Indonesia   269603400
    4                                         Pakistan[e]   220892331
    5                                              Brazil   211800078
    6                                             Nigeria   206139587
    7                                          Bangladesh   168962650
    

    This makes it easier to understand the data.
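
    If the table on Wikipedia gains, loses, or renames one of the dropped columns, the drop call above fails with a KeyError. A minimal sketch of an alternative is to select the columns you want to keep instead. It assumes a small hand-made DataFrame with rows copied from the output above; the names sample and kept are made up for this example.

```python
import pandas as pd

# Hand-made stand-in for the Wikipedia table (values copied from the output above).
sample = pd.DataFrame({
    'Rank': [1, 2],
    'Country (or dependent territory)': ['China[b]', 'India[c]'],
    'Population': [1403554760, 1364764203],
    'Date': ['16 Jul 2020', '16 Jul 2020'],
})

# Keep only the two columns we need; all other columns are left out.
kept = sample[['Country (or dependent territory)', 'Population']]
print(kept.columns.tolist())
```

    Selecting by name still fails if one of the two kept columns is renamed, but it no longer depends on the columns you do not care about.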

    Another thing you can do is to rename the column Country (or dependent territory) to Country. This makes your code easier to write when you need to access that column of data.

    Let’s just do that.

    import pandas as pd
    url = 'https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population'
    tables = pd.read_html(url)
    table = tables[0]
    table = table.drop(['Rank', '% of worldpopulation', 'Date', 'Source'], axis=1)
    table.columns = ['Country', 'Population']
    print(table)
    

    Resulting in the following output.

                                                  Country  Population
    0                                            China[b]  1403554760
    1                                            India[c]  1364764203
    2                                    United States[d]   329963086
    3                                           Indonesia   269603400
    4                                         Pakistan[e]   220892331
    5                                              Brazil   211800078
    6                                             Nigeria   206139587
    7                                          Bangladesh   168962650
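
    Assigning a new list to table.columns depends on the number and order of the columns. As a hedged alternative, rename maps old names to new names explicitly. The sketch below uses a hand-made DataFrame (sample and renamed are names invented for this example).

```python
import pandas as pd

# Hand-made stand-in with the same two columns as the table above.
sample = pd.DataFrame({
    'Country (or dependent territory)': ['China[b]', 'India[c]'],
    'Population': [1403554760, 1364764203],
})

# Rename by name; columns not mentioned in the dict keep their names.
renamed = sample.rename(columns={'Country (or dependent territory)': 'Country'})
print(renamed.columns.tolist())
```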
    

    Step 3: Cleaning the data

    We see that the Country column can contain two types of extra information. See the examples here.

                                                  Country  Population
    0                                            China[b]  1403554760
    195                                       Jersey (UK)      107800
    

    The name is followed either by square brackets with a letter (for example [b]) or by a space and parentheses with a country (for example (UK)).

    This can be cleaned up by using lambda functions. If you are new to lambda functions, we recommend you read this tutorial.

    import pandas as pd
    url = 'https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population'
    tables = pd.read_html(url)
    table = tables[0]
    table = table.drop(['Rank', '% of worldpopulation', 'Date', 'Source'], axis=1)
    table.columns = ['Country', 'Population']
    table['Country'] = table.apply(lambda row: row['Country'].split('[')[0], axis=1)
    table['Country'] = table.apply(lambda row: row['Country'].split(' (')[0], axis=1)
    print(table)
    

    Which results in the following output.

                                             Country  Population
    0                                          China  1403554760
    1                                          India  1364764203
    2                                  United States   329963086
    3                                      Indonesia   269603400
    4                                       Pakistan   220892331
    5                                         Brazil   211800078
    6                                        Nigeria   206139587
    7                                     Bangladesh   168962650
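
    The same cleanup can also be done on the whole column at once with the .str accessor. This is a minimal sketch, not the code used in this tutorial: it assumes a hand-made DataFrame named sample and uses one regular expression in place of the two split calls.

```python
import pandas as pd

# Hand-made rows showing both kinds of extra information.
sample = pd.DataFrame({
    'Country': ['China[b]', 'Jersey (UK)', 'Brazil'],
    'Population': [1403554760, 107800, 211800078],
})

# Remove everything from the first '[' or ' (' to the end of the name.
cleaned = sample['Country'].str.replace(r'(\[| \().*$', '', regex=True)
print(cleaned.tolist())
```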
    

    Finally, take a look at the last line of the output.

    241                                        World  7799525000
    

    It is the sum of all the populations. This row is not a country or territory and should be removed.

    import pandas as pd
    url = 'https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population'
    tables = pd.read_html(url)
    table = tables[0]
    table = table.drop(['Rank', '% of worldpopulation', 'Date', 'Source'], axis=1)
    table.columns = ['Country', 'Population']
    table['Country'] = table.apply(lambda row: row['Country'].split('[')[0], axis=1)
    table['Country'] = table.apply(lambda row: row['Country'].split(' (')[0], axis=1)
    table = table[table.Country != 'World']
    print(table)
    

    And it is gone. The line table = table[table.Country != 'World'] keeps only the rows where Country is not 'World'.
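
    To see how this filter works, it can help to look at the comparison on its own. A small sketch with a hand-made DataFrame (the names sample, mask and filtered are assumptions for this example):

```python
import pandas as pd

sample = pd.DataFrame({
    'Country': ['China', 'India', 'World'],
    'Population': [1403554760, 1364764203, 7799525000],
})

# The comparison gives one True/False value per row.
mask = sample['Country'] != 'World'
print(mask.tolist())  # [True, True, False]

# Indexing with the mask keeps only the rows marked True.
filtered = sample[mask]
print(filtered['Country'].tolist())  # ['China', 'India']
```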

    Step 4: Adding latitudes and longitudes to the data

    This is where the GeoPy library comes in handy.

    geopy makes it easy for Python developers to locate the coordinates of addresses, cities, countries, and landmarks across the globe using third-party geocoders and other data sources.

    web: https://geopy.readthedocs.io/en/stable/

    It is easy to use, but… it is slow.

    When you run and debug the code several times, you want to avoid waiting for 200+ lookups each time. Hence, I have created a small class that persists the locations already looked up, so they can be reused.

    import numpy as np
    from geopy.exc import GeocoderTimedOut
    from geopy.geocoders import Nominatim
    import pickle
    import os
    
    class Locator:
        def __init__(self):
            self.pickle_name = "location_store.pickle"
            self.geo_locator = Nominatim(user_agent="LearnPython")
            self.location_store = {}
            # Load previously looked up locations, if the file exists
            if os.path.isfile(self.pickle_name):
                with open(self.pickle_name, "rb") as f:
                    self.location_store = pickle.load(f)
        def get_location(self, location_name):
            # Reuse a stored location to avoid a slow lookup
            if location_name in self.location_store:
                return self.location_store[location_name]
            try:
                location = self.geo_locator.geocode(location_name, language='en')
                # Store the result (also None, if nothing was found) for the next run
                self.location_store[location_name] = location
                with open(self.pickle_name, 'wb') as f:
                    pickle.dump(self.location_store, f)
            except GeocoderTimedOut:
                # Timeouts are not stored, so the lookup is retried next time
                location = None
            return location
        def get_latitude(self, location_name):
            location = self.get_location(location_name)
            if location:
                return location.latitude
            else:
                return np.nan
        def get_longitude(self, location_name):
            location = self.get_location(location_name)
            if location:
                return location.longitude
            else:
                return np.nan
    

    We want to use this class to look up latitudes and longitudes and add them to our data. As I ran the code several times, I got tired of waiting (probably more than a minute) each time I ran it. To save you and the planet from wasted seconds, I share this code with you.
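
    The idea behind the class can be tried without any network access. The sketch below is an illustration only: cached_lookup and fake_lookup are hypothetical helpers, the coordinates are made up, and the store is written to a temporary directory. Unlike the Locator class, it reloads the pickle file on every call, which keeps the example short.

```python
import os
import pickle
import tempfile


def cached_lookup(name, lookup, store_path):
    """Return lookup(name), reusing results saved in a pickle file."""
    store = {}
    if os.path.isfile(store_path):
        with open(store_path, 'rb') as f:
            store = pickle.load(f)
    if name in store:
        return store[name]
    # Not seen before: do the slow lookup and save the result.
    result = lookup(name)
    store[name] = result
    with open(store_path, 'wb') as f:
        pickle.dump(store, f)
    return result


# Stand-in for the slow geocoder call; records how often it is really invoked.
calls = []


def fake_lookup(name):
    calls.append(name)
    return (35.0, 105.0)  # made-up coordinates, not a real geocoder answer


store_path = os.path.join(tempfile.mkdtemp(), 'location_store.pickle')
first = cached_lookup('China', fake_lookup, store_path)
second = cached_lookup('China', fake_lookup, store_path)
print(first == second, len(calls))  # True 1
```

    The second call is answered from the pickle file, so the slow lookup runs only once.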

    And now we can use it for adding data to our DataFrame.

    import pandas as pd
    import numpy as np
    from geopy.exc import GeocoderTimedOut
    from geopy.geocoders import Nominatim
    import pickle
    import os
    class Locator:
        def __init__(self):
            self.pickle_name = "location_store.pickle"
            self.geo_locator = Nominatim(user_agent="LearnPython")
            self.location_store = {}
            if os.path.isfile(self.pickle_name):
                f = open(self.pickle_name, "rb")
                self.location_store = pickle.load(f)
                f.close()
        def get_location(self, location_name):
            if location_name in self.location_store:
                return self.location_store[location_name]
            try:
                location = self.geo_locator.geocode(location_name, language='en')
                self.location_store[location_name] = location
                f = open(self.pickle_name, 'wb')
                pickle.dump(self.location_store, f)
                f.close()
            except GeocoderTimedOut:
                location = None
            return location
        def get_latitude(self, location_name):
            location = self.get_location(location_name)
            if location:
                return location.latitude
            else:
                return np.nan
        def get_longitude(self, location_name):
            location = self.get_location(location_name)
            if location:
                return location.longitude
            else:
                return np.nan
    url = 'https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population'
    tables = pd.read_html(url)
    table = tables[0]
    table = table.drop(['Rank', '% of worldpopulation', 'Date', 'Source'], axis=1)
    table.columns = ['Country', 'Population']
    table['Country'] = table.apply(lambda row: row['Country'].split('[')[0], axis=1)
    table['Country'] = table.apply(lambda row: row['Country'].split(' (')[0], axis=1)
    table = table[table.Country != 'World']
    locator = Locator()
    table['Lat'] = table.apply(lambda row: locator.get_latitude(row['Country']), axis=1)
    table['Lon'] = table.apply(lambda row: locator.get_longitude(row['Country']), axis=1)
    print(table)
    

    Which results in the following output.

                                             Country  Population        Lat         Lon
    0                                          China  1403554760  35.000074  104.999927
    1                                          India  1364764203  22.351115   78.667743
    2                                  United States   329963086  39.783730 -100.445882
    3                                      Indonesia   269603400  -2.483383  117.890285
    4                                       Pakistan   220892331  30.330840   71.247499
    5                                         Brazil   211800078 -10.333333  -53.200000
    6                                        Nigeria   206139587   9.600036    7.999972
    7                                     Bangladesh   168962650  24.476878   90.293243
    

    There is actually one location which GeoPy does not recognize.

    231  Saint Helena, Ascensionand Tristan da Cunha        5633        NaN         NaN
    

    Instead of doing the right thing for these 5,633 people, who also count in the world population, I did the wrong thing.

    table = table.dropna()
    

    This call to dropna() does what you think. It removes the rows containing NaN, like the one above.
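
    Note that dropna() without arguments removes a row if any of its columns holds NaN. A slightly more targeted variant, sketched here on a hand-made DataFrame (sample and located are names made up for this example), checks only the coordinate columns through the subset parameter.

```python
import numpy as np
import pandas as pd

# Two rows copied from the output above: one with coordinates, one without.
sample = pd.DataFrame({
    'Country': ['China', 'Saint Helena, Ascensionand Tristan da Cunha'],
    'Population': [1403554760, 5633],
    'Lat': [35.000074, np.nan],
    'Lon': [104.999927, np.nan],
})

# Drop only the rows where the latitude or the longitude is missing.
located = sample.dropna(subset=['Lat', 'Lon'])
print(len(sample), len(located))  # 2 1
```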

    With my deepest respect to the people in Saint Helena, I apologize for my incorrect behavior.

    Step 5: Create the plots

    Now this is smart: Pandas can do all the work of making histograms for you. What you really want is an accumulated histogram, which is called a weighted histogram.

    A plain histogram only counts occurrences, here the number of countries in each bin. What we want is to add together the values, to see at which latitudes (and longitudes) people live.
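
    The difference is easy to see with NumPy on a few numbers. This is a sketch with values rounded from the table above (populations in millions); the variable names and the bins are chosen for this example only.

```python
import numpy as np

# China, India, United States, Indonesia (rounded from the table above).
latitudes = np.array([35.0, 22.4, 39.8, -2.5])
populations = np.array([1400, 1360, 330, 270])  # in millions
bins = [-10, 0, 10, 20, 30, 40]

# Plain histogram: number of countries in each latitude band.
counts, _ = np.histogram(latitudes, bins=bins)

# Weighted histogram: population added up in each latitude band.
sums, _ = np.histogram(latitudes, bins=bins, weights=populations)

print(counts.tolist())  # [1, 0, 0, 1, 2]
print(sums.tolist())    # [270, 0, 0, 1360, 1730]
```

    The band from 30 to 40 holds two countries, but the weighted version shows what matters here: the 1730 million people living in them.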

    We want it in percentage, to make life easier.

    total_sum = table['Population'].sum()
    table['Percentage'] = table.apply(lambda row: row['Population']/total_sum*100, axis=1)
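
    The percentage can also be computed on the whole column at once, without a lambda function. A minimal sketch with a hand-made DataFrame (the countries A, B and C and their populations are invented for this example):

```python
import pandas as pd

sample = pd.DataFrame({
    'Country': ['A', 'B', 'C'],
    'Population': [50, 25, 25],
})

# Divide every population by the total and scale to percent.
sample['Percentage'] = sample['Population'] / sample['Population'].sum() * 100
print(sample['Percentage'].tolist())  # [50.0, 25.0, 25.0]
```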
    

    Then we are ready for the final two plots, both weighted histograms.

    import pandas as pd
    import numpy as np
    from geopy.exc import GeocoderTimedOut
    from geopy.geocoders import Nominatim
    import pickle
    import os
    import matplotlib.pyplot as plt
    
    class Locator:
        def __init__(self):
            self.pickle_name = "location_store.pickle"
            self.geo_locator = Nominatim(user_agent="LearnPython")
            self.location_store = {}
            if os.path.isfile(self.pickle_name):
                f = open(self.pickle_name, "rb")
                self.location_store = pickle.load(f)
                f.close()
        def get_location(self, location_name):
            if location_name in self.location_store:
                return self.location_store[location_name]
            try:
                location = self.geo_locator.geocode(location_name, language='en')
                self.location_store[location_name] = location
                f = open(self.pickle_name, 'wb')
                pickle.dump(self.location_store, f)
                f.close()
            except GeocoderTimedOut:
                location = None
            return location
        def get_latitude(self, location_name):
            location = self.get_location(location_name)
            if location:
                return location.latitude
            else:
                return np.nan
        def get_longitude(self, location_name):
            location = self.get_location(location_name)
            if location:
                return location.longitude
            else:
                return np.nan
    
    pd.set_option('display.max_rows', 300)
    pd.set_option('display.max_columns', 10)
    pd.set_option('display.width', 1000)
    url = 'https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population'
    tables = pd.read_html(url)
    table = tables[0]
    table = table.drop(['Rank', '% of worldpopulation', 'Date', 'Source'], axis=1)
    table.columns = ['Country', 'Population']
    table['Country'] = table.apply(lambda row: row['Country'].split('[')[0], axis=1)
    table['Country'] = table.apply(lambda row: row['Country'].split(' (')[0], axis=1)
    table = table[table.Country != 'World']
    locator = Locator()
    table['Lat'] = table.apply(lambda row: locator.get_latitude(row['Country']), axis=1)
    table['Lon'] = table.apply(lambda row: locator.get_longitude(row['Country']), axis=1)
    # Sorry Saint Helena, please forgive me
    table = table.dropna()
    total_sum = table['Population'].sum()
    table['Percentage'] = table.apply(lambda row: row['Population']/total_sum*100, axis=1)
    table.hist(column='Lat', weights=table['Percentage'], orientation='horizontal', bins=[-50, -40, -30, -20, -10, 0, 10, 20, 30, 40, 50, 60, 70, 80])
    plt.title('Percentage of World Population')
    plt.xlabel('%')
    plt.ylabel('Latitude')
    plt.show()
    bins = list(range(-180, 181, 20))
    table.hist(column='Lon', weights=table['Percentage'], bins=bins)
    plt.title('Percentage of World Population')
    plt.xlabel('Longitude')
    plt.ylabel('%')
    plt.show()
    

    Which should output the following two plots.

    [Figure: percentage of world population by latitude]
    [Figure: percentage of world population by longitude]
