Pandas and Folium: Categorize GDP Growth by Country and Visualize on Map in 3 Easy Steps

What will we cover in this tutorial?

  • We will gather data from wikipedia.org List of countries by past and projected GDP using pandas.
  • First step will be get the data and merge the correct tables together.
  • Next step is using Machine Learning with Linear regression model to estimate the growth of each country GDP.
  • Final step is to visualize the growth rates on a leaflet map using folium.

Step 1: Get the data and merge it

The data is available on wikipedia on List of countries by past and projected GDP. We will focus on data from 1990 to 2019.

At first glance on the page you notice that the date is not gathered in one table.

From wikipedia.org

The first task will be to merge the three tables with the data from 1990-1999, 2000-2009, and 2010-2019.

The data can be collected by pandas read_html function. If you are new to this you can read this tutorial.

import pandas as pd

# The URL we will read our data from
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_past_and_projected_GDP_(nominal)'
# read_html returns a list of tables from the URL
tables = pd.read_html(url)

# Merge the tables into one table
merge_index = 'Country (or dependent territory)'
table = tables[9].merge(tables[12], how="left", left_on=[merge_index], right_on=[merge_index])
table = table.merge(tables[15], how="left", left_on=[merge_index], right_on=[merge_index])

print(table)

The call to read_html will return all the tables in a list. By inspecting the results you will notice that we are interested in table 9, 12 and 15 and merge them. The output of the above will be.

     Country (or dependent territory)       1990       1991       1992       1993       1994       1995       1996       1997       1998       1999        2000        2001        2002        2003        2004        2005        2006        2007        2008        2009        2010        2011        2012        2013        2014        2015        2016        2017        2018        2019
0                         Afghanistan        NaN        NaN        NaN        NaN        NaN        NaN        NaN        NaN        NaN        NaN         NaN         NaN      4367.0      4514.0      5146.0      6167.0      6925.0      8556.0     10297.0     12066.0     15325.0     17890.0     20296.0     20170.0     20352.0     19687.0     19454.0     20235.0     19585.0     19990.0
1                             Albania     2221.0     1333.0      843.0     1461.0     2361.0     2882.0     3200.0     2259.0     2560.0     3209.0      3483.0      3928.0      4348.0      5611.0      7185.0      8052.0      8905.0     10675.0     12901.0     12093.0     11938.0     12896.0     12323.0     12784.0     13238.0     11393.0     11865.0     13055.0     15202.0     15960.0
2                             Algeria    61892.0    46670.0    49217.0    50963.0    42426.0    42066.0    46941.0    48178.0    48188.0    48845.0     54749.0     54745.0     56761.0     67864.0     85327.0    103198.0    117027.0    134977.0    171001.0    137054.0    161207.0    199394.0    209005.0    209703.0    213518.0    164779.0    159049.0    167555.0    180441.0    183687.0
3                              Angola    11236.0    10891.0     8398.0     6095.0     4438.0     5539.0     6535.0     7675.0     6506.0     6153.0      9130.0      8936.0     12497.0     14189.0     19641.0     28234.0     41789.0     60449.0     84178.0     75492.0     82471.0    104116.0    115342.0    124912.0    126777.0    102962.0     95337.0    122124.0    107316.0     92191.0
4                 Antigua and Barbuda      459.0      482.0      499.0      535.0      589.0      577.0      634.0      681.0      728.0      766.0       825.0       796.0       810.0       850.0       912.0      1013.0      1147.0      1299.0      1358.0      1216.0      1146.0      1140.0      1214.0      1194.0      1273.0      1353.0      1460.0      1516.0      1626.0      1717.0
5                           Argentina   153205.0   205515.0   247987.0   256365.0   279150.0   280080.0   295120.0   317549.0   324242.0   307673.0    308491.0    291738.0    108731.0    138151.0    164922.0    199273.0    232892.0    287920.0    363545.0    334633.0    424728.0    527644.0    579666.0    611471.0    563614.0    631621.0    554107.0    642928.0    518092.0    477743.0
6                             Armenia        NaN        NaN      108.0      835.0      648.0     1287.0     1597.0     1639.0     1892.0     1845.0      1912.0      2118.0      2376.0      2807.0      3577.0      4900.0      6384.0      9206.0     11662.0      8648.0      9260.0     10142.0     10619.0     11121.0     11610.0     10529.0     10572.0     11537.0     12411.0     13105.0

Step 2: Use linear regression to estimate the growth over the last 30 years

In this section we will use Linear regression from the scikit-learn library, which is a simple prediction tool.

If you are new to Machine Learning we recommend you read this tutorial on Linear regression.

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

import numpy as np

# The URL we will read our data from
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_past_and_projected_GDP_(nominal)'
# read_html returns a list of tables from the URL
tables = pd.read_html(url)

# Merge the tables into one table
merge_index = 'Country (or dependent territory)'
table = tables[9].merge(tables[12], how="left", left_on=[merge_index], right_on=[merge_index])
table = table.merge(tables[15], how="left", left_on=[merge_index], right_on=[merge_index])

row = table.iloc[1]
X = table.columns[1:].to_numpy().reshape(-1, 1)
X = X.astype(int)
Y = 1 + row.iloc[1:].pct_change()
Y = Y.cumprod().fillna(1.0).to_numpy()
Y = Y.reshape(-1, 1)

regr = LinearRegression()
regr.fit(X, Y)

Y_pred = regr.predict(X)

plt.scatter(X, Y)
plt.plot(X, Y_pred, color='red')
plt.show()

Which will result in the following plot.

Linear regression model applied on data from wikipedia.org

Which shows that the model approximates a line through the 30 years of data to estimate the growth of the country’s GDP.

Notice that we use the product (cumprod) of pct_change to be able to compare the data. If we used the data directly, we would not be possible to compare it.

We will do that for all countries to get a view of the growth. We are using the coefficient of the line, which indicates the growth rate.

import pandas as pd
from sklearn.linear_model import LinearRegression
import numpy as np

# The URL we will read our data from
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_past_and_projected_GDP_(nominal)'
# read_html returns a list of tables from the URL
tables = pd.read_html(url)

# Merge the tables into one table
merge_index = 'Country (or dependent territory)'
table = tables[9].merge(tables[12], how="left", left_on=[merge_index], right_on=[merge_index])
table = table.merge(tables[15], how="left", left_on=[merge_index], right_on=[merge_index])

coef = []
countries = []

for index, row in table.iterrows():
    #print(row)
    X = table.columns[1:].to_numpy().reshape(-1, 1)
    X = X.astype(int)
    Y = 1 + row.iloc[1:].pct_change()
    Y = Y.cumprod().fillna(1.0).to_numpy()
    Y = Y.reshape(-1, 1)

    regr = LinearRegression()
    regr.fit(X, Y)

    coef.append(regr.coef_[0][0])
    countries.append(row[merge_index])

data = pd.DataFrame(list(zip(countries, coef)), columns=['Country', 'Coef'])

print(data)

Which results in the following output (or the first few lines).

                              Country      Coef
0                         Afghanistan  0.161847
1                             Albania  0.243493
2                             Algeria  0.103907
3                              Angola  0.423919
4                 Antigua and Barbuda  0.087863
5                           Argentina  0.090837
6                             Armenia  4.699598

Step 3: Merge the data to a leaflet map using folium

The last step is to merge the data together with the leaflet map using the folium library. If you are new to folium we recommend you read this tutorial.

import pandas as pd
import folium
import geopandas
from sklearn.linear_model import LinearRegression
import numpy as np

# The URL we will read our data from
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_past_and_projected_GDP_(nominal)'
# read_html returns a list of tables from the URL
tables = pd.read_html(url)

# Merge the tables into one table
merge_index = 'Country (or dependent territory)'
table = tables[9].merge(tables[12], how="left", left_on=[merge_index], right_on=[merge_index])
table = table.merge(tables[15], how="left", left_on=[merge_index], right_on=[merge_index])

coef = []
countries = []

for index, row in table.iterrows():
    X = table.columns[1:].to_numpy().reshape(-1, 1)
    X = X.astype(int)
    Y = 1 + row.iloc[1:].pct_change()
    Y = Y.cumprod().fillna(1.0).to_numpy()
    Y = Y.reshape(-1, 1)

    regr = LinearRegression()
    regr.fit(X, Y)

    coef.append(regr.coef_[0][0])
    countries.append(row[merge_index])

data = pd.DataFrame(list(zip(countries, coef)), columns=['Country', 'Coef'])

# Read the geopandas dataset
world = geopandas.read_file(geopandas.datasets.get_path('naturalearth_lowres'))
# Replace United States of America to United States to fit the naming in the table
world = world.replace('United States of America', 'United States')

# Merge the two DataFrames together
table = world.merge(data, how="left", left_on=['name'], right_on=['Country'])


# Clean data: remove rows with no data
table = table.dropna(subset=['Coef'])

# We have 10 colors available resulting into 9 cuts.
table['Cat'] = pd.qcut(table['Coef'], 9, labels=[0, 1, 2, 3, 4, 5, 6, 7, 8])

print(table)

# Create a map
my_map = folium.Map()

# Add the data
folium.Choropleth(
    geo_data=table,
    name='choropleth',
    data=table,
    columns=['Country', 'Cat'],
    key_on='feature.properties.name',
    fill_color='YlGn',
    fill_opacity=0.7,
    line_opacity=0.2,
    legend_name='Growth of GDP since 1990',
    threshold_scale=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
).add_to(my_map)
my_map.save('gdp_growth.html')

There is a twist in the way it is done. Instead of using a linear model to represent the growth rate on the map, we chose to add them in categories. The reason is that otherwise most countries group in small segment.

Here we have used the qcut to add them in each equal sized group.

This should result in an interactive html page looking something like this.

End result.

5 Steps to Master the Reinforcement Learning with a Q-Learning Python Example

What will we learn in this article?

The Q-Learning algorithm is a nice and easy to understand algorithm used with Reinforcement Learning paradigm in Machine Learning. It can be implemented from scratch and we will do that in this article.

After you go through this article you will know what Reinforcement Learning is, the main types of algorithm used, fully understand the Q-learning algorithm and implement an awesome example from scratch in Python.

The steps towards that are.

  • Learn and understand what reinforcement learning in machine learning?
  • What are the main algorithm in reinforcement learning?
  • Deep dive to understand the Q-learning algorithm
  • Implement a task that we want the Q-learning algorithm to learn – first we let a random choices try (1540 steps on average).
  • Then we implement the Q-learning algorithm from scratch and let it solve learn how to solve it (22 steps).

Step 1: What is Reinforcement Learning?

Reinforcement learning teaches the machine to think for itself based on past action rewards.

Reinforcement Learning (in Machine Learning) teaches the machine to think based on past action rewards.

Basically, the Reinforcement Learning algorithm tries to predict actions that gives rewards and avoids punishment.

It is like training a dog. You and the dog do not talk the same language, but the dogs learns how to act based on rewards (and punishment, which I do not advise or advocate).

Hence, if a dog is rewarded for a certain action in a given situation, then next time it is exposed to a similar situation it will act the same.

Translate that to Reinforcement Learning.

  • The agent is the dog that is exposed to the environment.
  • Then the agent encounters a state.
  • The agent performs an action to transition from that state to a new state.
  • Then after the transition the agent receives a reward or penalty (punishment).
  • This forms a policy to create a strategy to choose actions in a given state.

Step 2: What are the algorithm used for Reinforcement Learning?

The most common algorithm for Reinforcement Learning are.

We will focus on the Q-learning algorithm as it is easy to understand as well as powerful.

Step 3: Understand the Q-Learning algorithm

As already noted, I just love this algorithm. It is “easy” to understand and seems very powerful.

Q-Learning algorithm (Reinforcement / Machine Learning) – exploit or explore – Update Q-table

The Q-Learning algorithm has a Q-table (a Matrix of dimension state x actions – don’t worry if you do not understand what a Matrix is, you will not need the mathematical aspects of it – it is just an indexed “container” with numbers).

  • The agent (or Q-Learning algorithm) will be in a state.
  • Then in each iteration the agent needs take an action.
  • The agent will continuously update the reward in the Q-table.
  • The learning can come from either exploiting or exploring.

This translates into the following pseudo algorithm for the Q-Learning.

The agent is in a given state and needs to choose an action.

  • Initialise the Q-table to all zeros
  • Iterate:
    • Agent is in state state.
    • With probability epsilon choose to explore, else exploit.
      • If explore, then choose a random action.
      • If exploit, then choose the best action based on the current Q-table.
    • Update the Q-table from the new reward to the previous state.
      • Q[state, action] = (1 – alpha) * Q[state, action] + alpha * (reward + gamma * max(Q[new_state]) — Q[state, action])

As you can se, we have introduced the following variables.

  • epsilon: the probability to take a random action, which is done to explore new territory.
  • alpha: is the learning rate that the algorithm should make in each iteration and should be in the interval from 0 to 1.
  • gamma: is the discount factor used to balance the immediate and future reward. This value is usually between 0.8 and 0.99
  • reward: is the feedback on the action and can be any number. Negative is penalty (or punishment) and positive is a reward.

Step 4: A task we want the Q-learning algorithm to master

We need to test and understand our the above algorithm. So far, it is quite abstract. To do that we will create a simple task to show how the Q-learning algorithm will solve it efficient by learning by rewards.

To keep it simple, we create a field of size 10×10 positions. In that field there is an item that needs to be picket up and moved to a drop-off point.

At each position there are 6 different actions that can be taken.

  • Action 0: Go south if on field.
  • Action 1: Go north if on field.
  • Action 2: Go east if on field.
  • Action 3: Go west if on field.
  • Action 4: Pickup item (it can try even if it is not there)
  • Action 5: Drop-off item (it can try even if it does not have it)

Based on these action we will make a reward system.

  • If the agent tries to go off the field, punish with -10 in reward.
  • If the agent makes a (legal) move, punish with -1 in reward, as we do not want to encourage endless walking around.
  • If the agent tries to pick up item, but it is not there or it has it already, punish with 10.
  • If the agent picks up the item correct place, reward with 20.
  • If agent tries to drop-off item in wrong place or does not have the item, punish with 10.
  • If the agent drops-off item in correct place, reward with 20.

That translates into the following code. I prefer to implement this code, as I think the standard libraries that provide similar frameworks hide some important details. As an example, and shown later, how do you map this into a state in the Q-table?

class Field:
    def __init__(self, size, item_pickup, item_drop_off, start_position):
        self.size_x = size
        self.size_y = size
        self.item_in_car = False
        self.item_position = item_pickup
        self.item_drop_off = item_drop_off
        self.position = start_position

    def move_driver(self, action):
        (x, y) = self.item_position
        if action == 0: # south
            if y == 0:
                return -10, False
            else:
                self.item_position = (x, y-1)
                return -1, False
        elif action == 1: # north
            if y == self.size_y - 1:
                return -10, False
            else:
                self.item_position = (x, y+1)
                return -1, False
        elif action == 2: # east
            if x == self.size_x - 1:
                return -10, False
            else:
                self.item_position = (x+1, y)
                return -1, False
        elif action == 3: # west
            if x == 0:
                return -10, False
            else:
                self.item_position = (x-1, y)
                return -1, False
        elif action == 4: # pickup
            if self.item_in_car:
                return -10, False
            elif self.item_position != (x, y):
                return -10, False
            else:
                self.item_in_car = True
                return 20, False
        elif action == 5: # drop-off
            if not self.item_in_car:
                return -10, False
            elif self.item_drop_off != (x, y):
                self.item_position = (x, y)
                return -20, False
            else:
                return 20, True

If you let the agent just do random actions, how long will it take for it to succeed (to be done)? Let us try that out.

import random


size = 10
item_start = (0, 0)
item_drop_off = (9, 9)
start_position = (9, 0)

field = Field(size, item_start, item_drop_off, start_position)
done = False
steps = 0
while not done:
    action = random.randrange(0, 6)
    reward, done = field.move_driver(action)
    steps += 1
print(steps)

A single run of that resulted in 2756 steps. That seems to be inefficient. I ran it 1000 times to find an average, which resulted to 1540 steps on average.

Step 5: How the Q-learning algorithm can improve that

There is a learning phase where the Q-table is updated iteratively. But before that, we need to add two helper functions to our Field.

  • We need to be able to map the current it to a state to an index in the Q-table.
  • Further, we need to a get the number of states needed in the Q-table, which we need to know when we initialise the Q-table.
import numpy as np
import random


class Field:
    def __init__(self, size, item_pickup, item_drop_off, start_position):
        self.size_x = size
        self.size_y = size
        self.item_in_car = False
        self.item_position = item_pickup
        self.item_drop_off = item_drop_off
        self.position = start_position

    def get_number_of_states(self):
        return self.size_x*self.size_y*self.size_x*self.size_y*2

    def get_state(self):
        state = self.item_position[0]*(self.size_y*self.size_x*self.size_y*2)
        state += self.item_position[1]*(self.size_x*self.size_y*2)
        state += self.position[0] * (self.size_y * 2)
        state += self.position[1] * (2)
        if self.item_in_car:
            state += 1
        return state

    def move_driver(self, action):
        (x, y) = self.item_position
        if action == 0: # south
            if y == 0:
                return -10, False
            else:
                self.item_position = (x, y-1)
                return -1, False
        elif action == 1: # north
            if y == self.size_y - 1:
                return -10, False
            else:
                self.item_position = (x, y+1)
                return -1, False
        elif action == 2: # east
            if x == self.size_x - 1:
                return -10, False
            else:
                self.item_position = (x+1, y)
                return -1, False
        elif action == 3: # west
            if x == 0:
                return -10, False
            else:
                self.item_position = (x-1, y)
                return -1, False
        elif action == 4: # pickup
            if self.item_in_car:
                return -10, False
            elif self.item_position != (x, y):
                return -10, False
            else:
                self.item_in_car = True
                return 20, False
        elif action == 5: # drop-off
            if not self.item_in_car:
                return -10, False
            elif self.item_drop_off != (x, y):
                self.item_position = (x, y)
                return -20, False
            else:
                return 20, True

Then we can generate our Q-table by iterating over the task 1000 times (it is just an arbitrary number I chose). As you see, it simply just runs over the task again and again, but updates the Q-table with the “learnings” based on the reward.

states = field.get_number_of_states()
actions = 6

q_table = np.zeros((states, actions))

alpha = 0.1
gamma = 0.6
epsilon = 0.1

for i in range(1000):
    field = Field(size, item_start, item_drop_off, start_position)
    done = False
    steps = 0
    while not done:
        state = field.get_state()
        if random.uniform(0, 1) < epsilon:
            action = random.randrange(0, 6)
        else:
            action = np.argmax(q_table[state])

        reward, done = field.move_driver(action)
        next_state = field.get_state()

        old_value = q_table[state, action]
        next_max = np.max(q_table[next_state])

        new_value = (1 - alpha) * old_value + alpha * (reward + gamma * next_max)
        q_table[state, action] = new_value

        steps += 1

After that we can use it, our Q-table is updated. To test it, we will run the same code again, just with the updated Q-table.

alpha = 0.1
gamma = 0.6
epsilon = 0.1

field = Field(size, item_start, item_drop_off, start_position)
done = False
steps = 0
while not done:
    state = field.get_state()
    if random.uniform(0, 1) < epsilon:
        action = random.randrange(0, 6)
    else:
        action = np.argmax(q_table[state])

    reward, done = field.move_driver(action)
    next_state = field.get_state()

    old_value = q_table[state, action]
    next_max = np.max(q_table[next_state])

    new_value = (1 - alpha) * old_value + alpha * (reward + gamma * next_max)
    q_table[state, action] = new_value

    steps += 1

print(steps)

This resulted in 22 steps. That is awesome.

4 Easy Steps to Understand Unsupervised Machine Learning with an Example in Python

Step 1: Learn what is unsupervised machine learning?

An unsupervised machine learning model takes unlabelled (or categorised) data and lets the algorithm determined the answer for us.

Unsupervised Machine Learning model – takes unstructured data and finds patterns itself

The unsupervised machine learning model data without apparent structures and tries to identify some patterns itself to create categories.

Step 2: Understand the main types of unsupervised machine learning

There are two main types of unsupervised machine learning types.

  • Clustering: Is used for grouping data into categories without knowing any labels before hand.
  • Association: Is a rule-based for discovering interesting relations between variables in large databases.

In clustering the main algorithms used are K-means, hierarchy clustering, and hidden Markov model.

And in the association the main algorithm used are Apriori and FP-growth.

Step 3: How does K-means work

The K-means works in iterative steps

The k-means algorithm starts is an NP-hard problem, which mean there is no efficient way to solve in the general case. For this problem there are heuristics algorithms that converge fast to local optimum, which means you can find some optimum fast, but it might not be the best one, but often they can do just fine.

Enough, theory.

How does the algorithm work.

  • Step 1: Start by a set of k means. These can be chosen by taking k random point from the dataset (called the Random Partition initialisation method).
  • Step 2: Group each data point into the cluster of the nearest mean. Hence, each data point will be assigned to exactly one cluster.
  • Step 3: Recalculate the the means (also called centroids) to converge towards local optimum.

Steps 2 and 3 are repeated until the grouping in Step 2 does not change any more.

Step 4: A simple Python example with the k-means algorithm

In this example we are going to start assuming you have the basic knowledge how to install the needed libraries. If not, then see the following article.

First of, you need to import the needed libraries.

import numpy as np
import matplotlib.pyplot as plt
from matplotlib import style
from sklearn.cluster import KMeans

In the first basic example we are only going to plot some points on a graph.

style.use('ggplot')

x = [1, 2, 0.3, 9.2, 2.4,  9, 12]
y = [2, 4, 2.5, 8.5, 0.3, 11, 10]
plt.scatter(x, y)
plt.show()

The first line sets a style of the graph. Then we have the coordinates in the arrays x and y. This format is used to feed the scatter.

Output of the plot from scatter plotter in Python.

An advantage of plotting the points before you figure out how many clusters you want to use. Here it looks like there are two “groups” of plots, which translates into using to clusters.

To continue, we want to use the k means algorithm with two clusters.

import numpy as np
import matplotlib.pyplot as plt
from matplotlib import style
from sklearn.cluster import KMeans

style.use('ggplot')

x = [1, 2, 0.3, 9.2, 2.4,  9, 12]
y = [2, 4, 2.5, 8.5, 0.3, 11, 10]

# We need to transform the input coordinates to plot use the k means algorithm
X = []
for i in range(len(x)):
    X.append([x[i], y[i]])
X = np.array(X)

# The number of clusters
kmeans = KMeans(n_clusters=2)
kmeans.fit(X)
labels = kmeans.labels_

# Then we want to have different colors for each type.
colors = ['g.', 'r.']
for i in range(len(X)):
    # And plot them one at the time
    plt.plot(X[i][0], X[i][1], colors[labels[i]], markersize=10)

# Plot the centres (or means)
plt.scatter(centroids[:, 0], centroids[:, 1], marker= "x", s=150, linewidths=5, zorder=10)
plt.show()

This results in the following result.

Example of k means algorithm used on simple dataset

Considerations when using K-Means algorithm

We could have changed to use 3 clusters. That would have resulted in the following output.

Using 3 clusters instead of two in the k-mean algorithm

This is not optimal for this dataset, but could be hard to predict without this visual representation of the dataset.

Uses of K-Means algorithm

Here are some interesting uses of the K-means algorithms:

  • Personalised marketing to users
  • Identifying fake news
  • Spam filter in your inbox
Exit mobile version