    Pandas and Folium: Categorize GDP Growth by Country and Visualize on Map in 3 Easy Steps

    What will we cover in this tutorial?

    • We will gather data from the Wikipedia page List of countries by past and projected GDP using pandas.
    • The first step is to get the data and merge the correct tables together.
    • The next step is to use Machine Learning, in the form of a linear regression model, to estimate the GDP growth of each country.
    • The final step is to visualize the growth rates on a Leaflet map using folium.

    Step 1: Get the data and merge it

    The data is available on Wikipedia in the List of countries by past and projected GDP. We will focus on data from 1990 to 2019.

    At first glance at the page, you notice that the data is not gathered in one table.

    From wikipedia.org

    The first task is to merge the three tables with the data from 1990-1999, 2000-2009, and 2010-2019.

    The data can be collected with the pandas read_html function. If you are new to this you can read this tutorial.

    import pandas as pd
    # The URL we will read our data from
    url = 'https://en.wikipedia.org/wiki/List_of_countries_by_past_and_projected_GDP_(nominal)'
    # read_html returns a list of tables from the URL
    tables = pd.read_html(url)
    # Merge the tables into one table
    merge_index = 'Country (or dependent territory)'
    table = tables[9].merge(tables[12], how="left", left_on=[merge_index], right_on=[merge_index])
    table = table.merge(tables[15], how="left", left_on=[merge_index], right_on=[merge_index])
    print(table)
    

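    The call to read_html returns all the tables on the page as a list. By inspecting the result you will find that the tables we need are at index 9, 12 and 15, which we merge on the country column. Note that these indices depend on the layout of the Wikipedia page and may change when the page is edited. The output of the above is: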

         Country (or dependent territory)       1990       1991       1992       1993       1994       1995       1996       1997       1998       1999        2000        2001        2002        2003        2004        2005        2006        2007        2008        2009        2010        2011        2012        2013        2014        2015        2016        2017        2018        2019
    0                         Afghanistan        NaN        NaN        NaN        NaN        NaN        NaN        NaN        NaN        NaN        NaN         NaN         NaN      4367.0      4514.0      5146.0      6167.0      6925.0      8556.0     10297.0     12066.0     15325.0     17890.0     20296.0     20170.0     20352.0     19687.0     19454.0     20235.0     19585.0     19990.0
    1                             Albania     2221.0     1333.0      843.0     1461.0     2361.0     2882.0     3200.0     2259.0     2560.0     3209.0      3483.0      3928.0      4348.0      5611.0      7185.0      8052.0      8905.0     10675.0     12901.0     12093.0     11938.0     12896.0     12323.0     12784.0     13238.0     11393.0     11865.0     13055.0     15202.0     15960.0
    2                             Algeria    61892.0    46670.0    49217.0    50963.0    42426.0    42066.0    46941.0    48178.0    48188.0    48845.0     54749.0     54745.0     56761.0     67864.0     85327.0    103198.0    117027.0    134977.0    171001.0    137054.0    161207.0    199394.0    209005.0    209703.0    213518.0    164779.0    159049.0    167555.0    180441.0    183687.0
    3                              Angola    11236.0    10891.0     8398.0     6095.0     4438.0     5539.0     6535.0     7675.0     6506.0     6153.0      9130.0      8936.0     12497.0     14189.0     19641.0     28234.0     41789.0     60449.0     84178.0     75492.0     82471.0    104116.0    115342.0    124912.0    126777.0    102962.0     95337.0    122124.0    107316.0     92191.0
    4                 Antigua and Barbuda      459.0      482.0      499.0      535.0      589.0      577.0      634.0      681.0      728.0      766.0       825.0       796.0       810.0       850.0       912.0      1013.0      1147.0      1299.0      1358.0      1216.0      1146.0      1140.0      1214.0      1194.0      1273.0      1353.0      1460.0      1516.0      1626.0      1717.0
    5                           Argentina   153205.0   205515.0   247987.0   256365.0   279150.0   280080.0   295120.0   317549.0   324242.0   307673.0    308491.0    291738.0    108731.0    138151.0    164922.0    199273.0    232892.0    287920.0    363545.0    334633.0    424728.0    527644.0    579666.0    611471.0    563614.0    631621.0    554107.0    642928.0    518092.0    477743.0
    6                             Armenia        NaN        NaN      108.0      835.0      648.0     1287.0     1597.0     1639.0     1892.0     1845.0      1912.0      2118.0      2376.0      2807.0      3577.0      4900.0      6384.0      9206.0     11662.0      8648.0      9260.0     10142.0     10619.0     11121.0     11610.0     10529.0     10572.0     11537.0     12411.0     13105.0
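
    The merge step can also be tried without downloading anything. The following is a minimal sketch on made-up data: the three small DataFrames and their numbers are hypothetical stand-ins for the decade tables, and only the key column name is taken from the code above.

```python
import pandas as pd

key = 'Country (or dependent territory)'

# Hypothetical stand-ins for the three decade tables (made-up numbers)
t1990s = pd.DataFrame({key: ['Country A', 'Country B'], '1990': [10.0, 20.0]})
t2000s = pd.DataFrame({key: ['Country A', 'Country B'], '2000': [15.0, 30.0]})
t2010s = pd.DataFrame({key: ['Country A'], '2010': [25.0]})

# A left merge keeps every row of the first table;
# a country missing from a later table gets NaN in those columns
merged = t1990s.merge(t2000s, how='left', on=key)
merged = merged.merge(t2010s, how='left', on=key)
print(merged)
```

    Since the key column has the same name in all tables, on=key is equivalent to passing the same name to both left_on and right_on, as done in the tutorial code.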
    

    Step 2: Use linear regression to estimate the growth over the last 30 years

    In this section we will use linear regression from the scikit-learn library, a simple model that fits a straight line to the data.

    If you are new to Machine Learning we recommend you read this tutorial on Linear regression.

    import pandas as pd
    import matplotlib.pyplot as plt
    from sklearn.linear_model import LinearRegression
    import numpy as np
    # The URL we will read our data from
    url = 'https://en.wikipedia.org/wiki/List_of_countries_by_past_and_projected_GDP_(nominal)'
    # read_html returns a list of tables from the URL
    tables = pd.read_html(url)
    # Merge the tables into one table
    merge_index = 'Country (or dependent territory)'
    table = tables[9].merge(tables[12], how="left", left_on=[merge_index], right_on=[merge_index])
    table = table.merge(tables[15], how="left", left_on=[merge_index], right_on=[merge_index])
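    # Pick one country as an example (row 1 is Albania)
    row = table.iloc[1]
    # X: the years (the column names after the country column) as a column vector
    X = table.columns[1:].to_numpy().reshape(-1, 1)
    X = X.astype(int)
    # Y: GDP relative to the first year with data
    Y = 1 + row.iloc[1:].pct_change()
    Y = Y.cumprod().fillna(1.0).to_numpy()
    Y = Y.reshape(-1, 1)
    # Fit a straight line through the normalized values
    regr = LinearRegression()
    regr.fit(X, Y)
    Y_pred = regr.predict(X)
    # Plot the data points and the fitted line
    plt.scatter(X, Y)
    plt.plot(X, Y_pred, color='red')
    plt.show()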
    

    This results in the following plot.

    Linear regression model applied on data from wikipedia.org

    The plot shows how the model fits a line through the 30 years of data to estimate the growth of the country's GDP.

    Notice that we use the cumulative product (cumprod) of 1 + pct_change, which expresses each year's GDP relative to the first year with data. If we used the raw values directly, we could not compare countries, since their economies differ greatly in size.
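
    To see what this normalization does, here is a small sketch on made-up numbers. The two series and the helper name normalize are hypothetical and only serve as illustration; the calculation itself is the same as in the code above.

```python
import pandas as pd

# Made-up GDP series for two hypothetical countries of very different size,
# both growing 10% per year
small = pd.Series([100.0, 110.0, 121.0])
large = pd.Series([5000.0, 5500.0, 6050.0])

def normalize(gdp):
    # Yearly growth factors (1 + percentage change), multiplied up over time.
    # The first value has no predecessor, so it is NaN and is set to 1.0.
    return (1 + gdp.pct_change()).cumprod().fillna(1.0)

print(normalize(small).tolist())
print(normalize(large).tolist())
```

    Both series end up as roughly 1.0, 1.1, 1.21, which is the same as dividing each series by its first value. After this step, the slope of a fitted line depends on the relative growth and not on the size of the economy.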

    We will do that for all countries to get a view of the growth. We use the coefficient (slope) of the fitted line as an indication of the growth rate.

    import pandas as pd
    from sklearn.linear_model import LinearRegression
    import numpy as np
    # The URL we will read our data from
    url = 'https://en.wikipedia.org/wiki/List_of_countries_by_past_and_projected_GDP_(nominal)'
    # read_html returns a list of tables from the URL
    tables = pd.read_html(url)
    # Merge the tables into one table
    merge_index = 'Country (or dependent territory)'
    table = tables[9].merge(tables[12], how="left", left_on=[merge_index], right_on=[merge_index])
    table = table.merge(tables[15], how="left", left_on=[merge_index], right_on=[merge_index])
    coef = []
    countries = []
    # The years are the same for every country, so build X once
    X = table.columns[1:].to_numpy().reshape(-1, 1)
    X = X.astype(int)
    for index, row in table.iterrows():
        # GDP relative to the first year with data
        Y = 1 + row.iloc[1:].pct_change()
        Y = Y.cumprod().fillna(1.0).to_numpy()
        Y = Y.reshape(-1, 1)
        regr = LinearRegression()
        regr.fit(X, Y)
        # The slope of the fitted line is our growth estimate
        coef.append(regr.coef_[0][0])
        countries.append(row[merge_index])
    data = pd.DataFrame(list(zip(countries, coef)), columns=['Country', 'Coef'])
    print(data)
    

    This results in the following output (only the first few lines are shown).

                                  Country      Coef
    0                         Afghanistan  0.161847
    1                             Albania  0.243493
    2                             Algeria  0.103907
    3                              Angola  0.423919
    4                 Antigua and Barbuda  0.087863
    5                           Argentina  0.090837
    6                             Armenia  4.699598
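
    To get a feeling for what such a coefficient means, consider the sketch below. It uses a made-up country and, as a stand-in for scikit-learn's LinearRegression, NumPy's polyfit with degree 1, which is also an ordinary least-squares line fit. The variable names are only for illustration.

```python
import numpy as np

# 30 years, as in the tutorial
years = np.arange(1990, 2020)
# Hypothetical country: normalized GDP starts at 1.0 and grows by 0.1 per year
normalized_gdp = 1.0 + 0.1 * (years - 1990)

# A degree-1 least-squares fit returns the slope and the intercept of the line
slope, intercept = np.polyfit(years, normalized_gdp, 1)
print(round(slope, 3))
```

    A coefficient of 0.1 therefore means that the fitted line rises by 10% of the first-year GDP per year. In the table above, Armenia's first recorded value (108 in 1992) is very small compared to the following years, which is probably why its coefficient is so much larger than the others.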
    

    Step 3: Merge the data to a leaflet map using folium

    The last step is to merge the data together with the leaflet map using the folium library. If you are new to folium we recommend you read this tutorial.

    import pandas as pd
    import folium
    import geopandas
    from sklearn.linear_model import LinearRegression
    import numpy as np
    # The URL we will read our data from
    url = 'https://en.wikipedia.org/wiki/List_of_countries_by_past_and_projected_GDP_(nominal)'
    # read_html returns a list of tables from the URL
    tables = pd.read_html(url)
    # Merge the tables into one table
    merge_index = 'Country (or dependent territory)'
    table = tables[9].merge(tables[12], how="left", left_on=[merge_index], right_on=[merge_index])
    table = table.merge(tables[15], how="left", left_on=[merge_index], right_on=[merge_index])
    coef = []
    countries = []
    for index, row in table.iterrows():
        X = table.columns[1:].to_numpy().reshape(-1, 1)
        X = X.astype(int)
        Y = 1 + row.iloc[1:].pct_change()
        Y = Y.cumprod().fillna(1.0).to_numpy()
        Y = Y.reshape(-1, 1)
        regr = LinearRegression()
        regr.fit(X, Y)
        coef.append(regr.coef_[0][0])
        countries.append(row[merge_index])
    data = pd.DataFrame(list(zip(countries, coef)), columns=['Country', 'Coef'])
    # Read the geopandas dataset
    # (geopandas.datasets was removed in GeoPandas 1.0; with newer versions, download the
    # Natural Earth low resolution countries file and pass its path to read_file instead)
    world = geopandas.read_file(geopandas.datasets.get_path('naturalearth_lowres'))
    # Replace 'United States of America' with 'United States' to match the naming in the table
    world = world.replace('United States of America', 'United States')
    # Merge the two DataFrames together
    table = world.merge(data, how="left", left_on=['name'], right_on=['Country'])
    
    # Clean data: remove rows with no data
    table = table.dropna(subset=['Coef'])
    # The 10 threshold values used below delimit 9 color bins, so we make 9 categories (0-8).
    table['Cat'] = pd.qcut(table['Coef'], 9, labels=[0, 1, 2, 3, 4, 5, 6, 7, 8])
    print(table)
    # Create a map
    my_map = folium.Map()
    # Add the data
    folium.Choropleth(
        geo_data=table,
        name='choropleth',
        data=table,
        columns=['Country', 'Cat'],
        key_on='feature.properties.name',
        fill_color='YlGn',
        fill_opacity=0.7,
        line_opacity=0.2,
        legend_name='Growth of GDP since 1990',
        threshold_scale=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
    ).add_to(my_map)
    my_map.save('gdp_growth.html')
    

    There is a twist in the way it is done. Instead of mapping the growth rate to the colors on a linear scale, we chose to put the countries in categories. The reason is that otherwise most countries would group in a small segment of the scale.

    Here we have used qcut to place them in groups of equal size.
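
    The difference between equal-width bins and the equal-sized groups from qcut can be seen in this small sketch. The coefficients are made up (most are small and one is very large), and 3 groups are used instead of 9 to keep it short.

```python
import pandas as pd

# Made-up, skewed coefficients: most are small, one is very large
coef = pd.Series([0.08, 0.09, 0.10, 0.16, 0.24, 0.42, 4.70, 0.11, 0.12])

# cut: 3 bins of equal width between the smallest and the largest value
by_width = pd.cut(coef, 3, labels=[0, 1, 2])
# qcut: 3 bins chosen so that each holds the same number of values
by_count = pd.qcut(coef, 3, labels=[0, 1, 2])

# Number of values in each category, in category order
width_counts = by_width.value_counts().sort_index().tolist()
count_counts = by_count.value_counts().sort_index().tolist()
print(width_counts)
print(count_counts)
```

    With equal-width bins almost all values land in the first category, so the map would show nearly one color. With qcut each category holds the same number of countries, which spreads the colors evenly.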

    This should result in an interactive HTML page looking something like this.

    End result.
