Pandas and Folium: Categorize GDP Growth by Country and Visualize on Map in 3 Easy Steps

What will we cover in this tutorial?

  • We will gather data from wikipedia.org List of countries by past and projected GDP using pandas.
  • First step will be get the data and merge the correct tables together.
  • Next step is using Machine Learning with Linear regression model to estimate the growth of each country GDP.
  • Final step is to visualize the growth rates on a leaflet map using folium.

Step 1: Get the data and merge it

The data is available on wikipedia on List of countries by past and projected GDP. We will focus on data from 1990 to 2019.

At first glance on the page you notice that the date is not gathered in one table.

From wikipedia.org

The first task will be to merge the three tables with the data from 1990-1999, 2000-2009, and 2010-2019.

The data can be collected by pandas read_html function. If you are new to this you can read this tutorial.

import pandas as pd

# The URL we will read our data from
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_past_and_projected_GDP_(nominal)'
# read_html returns a list of tables from the URL
tables = pd.read_html(url)

# Merge the tables into one table
merge_index = 'Country (or dependent territory)'
table = tables[9].merge(tables[12], how="left", left_on=[merge_index], right_on=[merge_index])
table = table.merge(tables[15], how="left", left_on=[merge_index], right_on=[merge_index])

print(table)

The call to read_html will return all the tables in a list. By inspecting the results you will notice that we are interested in table 9, 12 and 15 and merge them. The output of the above will be.

     Country (or dependent territory)       1990       1991       1992       1993       1994       1995       1996       1997       1998       1999        2000        2001        2002        2003        2004        2005        2006        2007        2008        2009        2010        2011        2012        2013        2014        2015        2016        2017        2018        2019
0                         Afghanistan        NaN        NaN        NaN        NaN        NaN        NaN        NaN        NaN        NaN        NaN         NaN         NaN      4367.0      4514.0      5146.0      6167.0      6925.0      8556.0     10297.0     12066.0     15325.0     17890.0     20296.0     20170.0     20352.0     19687.0     19454.0     20235.0     19585.0     19990.0
1                             Albania     2221.0     1333.0      843.0     1461.0     2361.0     2882.0     3200.0     2259.0     2560.0     3209.0      3483.0      3928.0      4348.0      5611.0      7185.0      8052.0      8905.0     10675.0     12901.0     12093.0     11938.0     12896.0     12323.0     12784.0     13238.0     11393.0     11865.0     13055.0     15202.0     15960.0
2                             Algeria    61892.0    46670.0    49217.0    50963.0    42426.0    42066.0    46941.0    48178.0    48188.0    48845.0     54749.0     54745.0     56761.0     67864.0     85327.0    103198.0    117027.0    134977.0    171001.0    137054.0    161207.0    199394.0    209005.0    209703.0    213518.0    164779.0    159049.0    167555.0    180441.0    183687.0
3                              Angola    11236.0    10891.0     8398.0     6095.0     4438.0     5539.0     6535.0     7675.0     6506.0     6153.0      9130.0      8936.0     12497.0     14189.0     19641.0     28234.0     41789.0     60449.0     84178.0     75492.0     82471.0    104116.0    115342.0    124912.0    126777.0    102962.0     95337.0    122124.0    107316.0     92191.0
4                 Antigua and Barbuda      459.0      482.0      499.0      535.0      589.0      577.0      634.0      681.0      728.0      766.0       825.0       796.0       810.0       850.0       912.0      1013.0      1147.0      1299.0      1358.0      1216.0      1146.0      1140.0      1214.0      1194.0      1273.0      1353.0      1460.0      1516.0      1626.0      1717.0
5                           Argentina   153205.0   205515.0   247987.0   256365.0   279150.0   280080.0   295120.0   317549.0   324242.0   307673.0    308491.0    291738.0    108731.0    138151.0    164922.0    199273.0    232892.0    287920.0    363545.0    334633.0    424728.0    527644.0    579666.0    611471.0    563614.0    631621.0    554107.0    642928.0    518092.0    477743.0
6                             Armenia        NaN        NaN      108.0      835.0      648.0     1287.0     1597.0     1639.0     1892.0     1845.0      1912.0      2118.0      2376.0      2807.0      3577.0      4900.0      6384.0      9206.0     11662.0      8648.0      9260.0     10142.0     10619.0     11121.0     11610.0     10529.0     10572.0     11537.0     12411.0     13105.0

Step 2: Use linear regression to estimate the growth over the last 30 years

In this section we will use Linear regression from the scikit-learn library, which is a simple prediction tool.

If you are new to Machine Learning we recommend you read this tutorial on Linear regression.

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

import numpy as np

# The URL we will read our data from
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_past_and_projected_GDP_(nominal)'
# read_html returns a list of tables from the URL
tables = pd.read_html(url)

# Merge the tables into one table
merge_index = 'Country (or dependent territory)'
table = tables[9].merge(tables[12], how="left", left_on=[merge_index], right_on=[merge_index])
table = table.merge(tables[15], how="left", left_on=[merge_index], right_on=[merge_index])

row = table.iloc[1]
X = table.columns[1:].to_numpy().reshape(-1, 1)
X = X.astype(int)
Y = 1 + row.iloc[1:].pct_change()
Y = Y.cumprod().fillna(1.0).to_numpy()
Y = Y.reshape(-1, 1)

regr = LinearRegression()
regr.fit(X, Y)

Y_pred = regr.predict(X)

plt.scatter(X, Y)
plt.plot(X, Y_pred, color='red')
plt.show()

Which will result in the following plot.

Linear regression model applied on data from wikipedia.org

Which shows that the model approximates a line through the 30 years of data to estimate the growth of the country’s GDP.

Notice that we use the product (cumprod) of pct_change to be able to compare the data. If we used the data directly, we would not be possible to compare it.

We will do that for all countries to get a view of the growth. We are using the coefficient of the line, which indicates the growth rate.

import pandas as pd
from sklearn.linear_model import LinearRegression
import numpy as np

# The URL we will read our data from
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_past_and_projected_GDP_(nominal)'
# read_html returns a list of tables from the URL
tables = pd.read_html(url)

# Merge the tables into one table
merge_index = 'Country (or dependent territory)'
table = tables[9].merge(tables[12], how="left", left_on=[merge_index], right_on=[merge_index])
table = table.merge(tables[15], how="left", left_on=[merge_index], right_on=[merge_index])

coef = []
countries = []

for index, row in table.iterrows():
    #print(row)
    X = table.columns[1:].to_numpy().reshape(-1, 1)
    X = X.astype(int)
    Y = 1 + row.iloc[1:].pct_change()
    Y = Y.cumprod().fillna(1.0).to_numpy()
    Y = Y.reshape(-1, 1)

    regr = LinearRegression()
    regr.fit(X, Y)

    coef.append(regr.coef_[0][0])
    countries.append(row[merge_index])

data = pd.DataFrame(list(zip(countries, coef)), columns=['Country', 'Coef'])

print(data)

Which results in the following output (or the first few lines).

                              Country      Coef
0                         Afghanistan  0.161847
1                             Albania  0.243493
2                             Algeria  0.103907
3                              Angola  0.423919
4                 Antigua and Barbuda  0.087863
5                           Argentina  0.090837
6                             Armenia  4.699598

Step 3: Merge the data to a leaflet map using folium

The last step is to merge the data together with the leaflet map using the folium library. If you are new to folium we recommend you read this tutorial.

import pandas as pd
import folium
import geopandas
from sklearn.linear_model import LinearRegression
import numpy as np

# The URL we will read our data from
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_past_and_projected_GDP_(nominal)'
# read_html returns a list of tables from the URL
tables = pd.read_html(url)

# Merge the tables into one table
merge_index = 'Country (or dependent territory)'
table = tables[9].merge(tables[12], how="left", left_on=[merge_index], right_on=[merge_index])
table = table.merge(tables[15], how="left", left_on=[merge_index], right_on=[merge_index])

coef = []
countries = []

for index, row in table.iterrows():
    X = table.columns[1:].to_numpy().reshape(-1, 1)
    X = X.astype(int)
    Y = 1 + row.iloc[1:].pct_change()
    Y = Y.cumprod().fillna(1.0).to_numpy()
    Y = Y.reshape(-1, 1)

    regr = LinearRegression()
    regr.fit(X, Y)

    coef.append(regr.coef_[0][0])
    countries.append(row[merge_index])

data = pd.DataFrame(list(zip(countries, coef)), columns=['Country', 'Coef'])

# Read the geopandas dataset
world = geopandas.read_file(geopandas.datasets.get_path('naturalearth_lowres'))
# Replace United States of America to United States to fit the naming in the table
world = world.replace('United States of America', 'United States')

# Merge the two DataFrames together
table = world.merge(data, how="left", left_on=['name'], right_on=['Country'])


# Clean data: remove rows with no data
table = table.dropna(subset=['Coef'])

# We have 10 colors available resulting into 9 cuts.
table['Cat'] = pd.qcut(table['Coef'], 9, labels=[0, 1, 2, 3, 4, 5, 6, 7, 8])

print(table)

# Create a map
my_map = folium.Map()

# Add the data
folium.Choropleth(
    geo_data=table,
    name='choropleth',
    data=table,
    columns=['Country', 'Cat'],
    key_on='feature.properties.name',
    fill_color='YlGn',
    fill_opacity=0.7,
    line_opacity=0.2,
    legend_name='Growth of GDP since 1990',
    threshold_scale=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
).add_to(my_map)
my_map.save('gdp_growth.html')

There is a twist in the way it is done. Instead of using a linear model to represent the growth rate on the map, we chose to add them in categories. The reason is that otherwise most countries group in small segment.

Here we have used the qcut to add them in each equal sized group.

This should result in an interactive html page looking something like this.

End result.

Simple Machine Learning Trading Bot in Python – Evaluating how it Performs

What will we cover in this tutorial

  • To create a machine learning trading bot in Python
  • How to build a simple Reinforcement Learning Trading bot.
  • The idea behind the Reinforcement Learning trading bot
  • Evaluate how the trading bot performs

Machine Learning and Trading?

First thing first. Machine Learning trading bot? Machine Learning can be used for various things in regards to trading.

Well, good to set our expectations. This tutorial is also experimental and does not claim to make a bullet-proof Machine Learning Trading bot that will make you rich. I strongly advice you not to use it for automated trading.

This tutorial is only intended to test and learn about how a Reinforcement Learning strategy can be used to build a Machine Learning Trading Bot.

Step 1: The idea behind the Reinforcement Learning strategy

I wanted to test how a Reinforcement Learning algorithm would do in the market.

First let us understand what Reinforcement Learning is. Reinforcement learning teaches the machine to think for itself based on past action rewards.

Reinforcement Learning (in Machine Learning) teaches the machine to think based on past action rewards.
Reinforcement Learning (in Machine Learning) teaches the machine to think based on past action rewards.

It is like training a dog. You and the dog do not talk the same language, but the dogs learns how to act based on rewards (and punishment, which I do not advise or advocate). 

Hence, if a dog is rewarded for a certain action in a given situation, then next time it is exposed to a similar situation it will act the same. 

Translate that to Reinforcement Learning. 

  • The agent is the dog that is exposed to the environment
  • Then the agent encounters a state
  • The agent performs an action to transition from that state to a new state
  • Then after the transition the agent receives a reward or penalty(punishment).
  • This forms a policy to create a strategy to choose actions in a given state

That turns out to fit well with trading, or potentially? That is what I want to investigate.

Step 2: The idea behind how to use Reinforcement Learning in Trading

The environment in trading could be translated to rewards and penalties (punishment). You win or loose on the stock market, right?

But we also want to simplify the environment for the bot, not to make it too complex. Hence, in this experiment, the bot is only knows 1 stock and has to decide to buy, keep or sell.

Said differently.

  • The trading bot (agent) is exposed to the stock history (environment).
  • Then the trading bot (agent) encounters the new stock price (state).
  • The trading bot (agent) then performs a choice to keep, sell or buy (action), which brings it to a new state.
  • Then the trading bot (agent) will receives a reward based on the value difference from day to day.

The reward will often first be encountered after some time, hence, the feedback from steps after should be set high. Or at least, that is my expectation.

Step 3: Understand Q-learning as the Reinforcement Learning model

The Q-learning model is easy to understand and has potential to be very powerful. Of course, it is not better than the design of it. But before we can design it, we need to understand the mechanism behind it.

Q-Learning algorithm (Reinforcement / Machine Learning) - exploit or explore - Update Q-table
Q-Learning algorithm (Reinforcement / Machine Learning) – exploit or explore – Update Q-table

The Q-Learning algorithm has a Q-table (a Matrix of dimension state x actions – don’t worry if you do not understand what a Matrix is, you will not need the mathematical aspects of it – it is just an indexed “container” with numbers).

  • The agent (or Q-Learning algorithm) will be in a state.
  • Then in each iteration the agent needs take an action.
  • The agent will continuously update the reward in the Q-table.
  • The learning can come from either exploiting or exploring.

This translates into the following pseudo algorithm for the Q-Learning. 

The agent is in a given state and needs to choose an action.

  • Initialise the Q-table to all zeros
  • Iterate:
    • Agent is in state state.
    • With probability epsilon choose to explore, else exploit.
      • If explore, then choose a random action.
      • If exploit, then choose the best action based on the current Q-table.
    • Update the Q-table from the new reward to the previous state.
      • Q[stateaction] = (1 – alpha) * Q[stateaction] + alpha * (rewardgamma * max(Q[new_state]) — Q[state, action])

As you can se, we have introduced the following variables.

  • epsilon: the probability to take a random action, which is done to explore new territory.
  • alpha: is the learning rate that the algorithm should make in each iteration and should be in the interval from 0 to 1.
  • gamma: is the discount factor used to balance the immediate and future reward. This value is usually between 0.8 and 0.99
  • reward: is the feedback on the action and can be any number. Negative is penalty (or punishment) and positive is a reward.

Step 4: The choices we need to take

Based on that, we need to see how the algorithm should map the stock information to a state. We want the model to be fairly simple and not have too many states, as it will take long time to populate it with data.

There are many parameters to choose from here. As we do not want to tell the algorithm what to do, we still need to feed it what what we find as relevant data.

In this case it was the following.

  • Volatility of the share.
  • The percentage change of the daily short mean (average over last 20 days).
  • Then the percentage of the daily long mean (average over the last 100 days).
  • The daily long mean, which is the average over the last 100 days.
  • The volume of the sales that day.

These values need to be calculated for the share we use. That can be done by the following code.

import pandas_datareader as pdr
import numpy as np


VALUE = 'Adj Close'
ID = 'id'
NAME = 'name'
DATA = 'data'


def get_data(name, years_ago):
    start = dt.datetime.now() - relativedelta(years=years_ago)
    end = dt.datetime.now()
    df = pdr.get_data_yahoo(name, start, end)
    return df


def process():
    stock = {ID: stock, NAME: 'AAPL'}

    stock[DATA] = get_data(stock[ID], 20)

# Updatea it will all values
    stock[DATA]['Short Mean'] = stock[DATA][VALUE].rolling(window=short_window).mean()
    stock[DATA]['Long Mean'] = stock[DATA][VALUE].rolling(window=long_window).mean()

    stock[DATA]['Daily Change'] = stock[DATA][VALUE].pct_change()
    stock[DATA]['Daily Short Change'] = stock[DATA]['Short Mean'].pct_change()
    stock[DATA]['Daily Long Change'] = stock[DATA]['Long Mean'].pct_change()
    stock[DATA]['Volatility'] = stock[DATA]['Daily Change'].rolling(75).std()*np.sqrt(75)

As you probably notice, this will create a challenge. You need to put them into bins, that is a fixed number of “boxes” to fit in.

def process():
    #...
    # Let's put data in bins
    stock[DATA]['Vla bin'] = pd.cut(stock[DATA]['Volatility'], bins=STATES_DIM, labels=False)
    stock[DATA]['Srt ch bin'] = pd.cut(stock[DATA]['Daily Short Change'], bins=STATES_DIM, labels=False)
    stock[DATA]['Lng ch bin'] = pd.cut(stock[DATA]['Daily Long Change'], bins=STATES_DIM, labels=False)
    # stock[DATA]['Srt mn bin'] = pd.cut(stock[DATA]['Short Mean'], bins=DIM, labels=False)
    stock[DATA]['Lng mn bin'] = pd.cut(stock[DATA]['Long Mean'], bins=STATES_DIM, labels=False)
    stock[DATA]['Vol bin'] = pd.cut(stock[DATA]['Volume'], bins=STATES_DIM, labels=False)

This will quantify the 5 dimensions into STATES_DIM, which you can define to what you think is appropriate.

Step 5: How to model it

This can be done by creating an environment, that will play the role as your trading account.

class Account:
    def __init__(self, cash=1000000, brokerage=0.001):
        self.cash = cash
        self.brokerage = brokerage
        self.stocks = 0
        self.stock_id = None
        self.has_stocks = False

    def get_value(self, row):
        if self.has_stocks:
            return self.cash + row[VALUE] * self.stocks
        else:
            return self.cash

    def buy_stock(self, stock_id, row):
        if self.has_stocks:
            return
        self.stock_id = stock_id
        self.stocks = int(self.cash // (row[VALUE]*(1.0 + self.brokerage)))
        self.cash -= self.stocks*row[VALUE]*1.001
        self.has_stocks = True
        self.print_status(row, "Buy")

    def sell_stock(self, row):
        if not self.has_stocks:
            return
        self.print_status(row, "Sell")
        self.cash += self.stocks * (row[VALUE]*(1.0 - self.brokerage))
        self.stock_id = None
        self.stocks = 0
        self.has_stocks = False

    def print_status(self, row, title="Status"):
        if self.has_stocks:
            print(title, self.stock_id, "TOTAL:", self.cash + self.stocks*float(row[VALUE]))
            print(" - ", row.name, "price", row[VALUE])
            print(" - ", "Short", row['Daily Short Change'])
            print(" - ", "Long", row['Daily Long Change'])
        else:
            print(title, "TOTAL", self.cash)

Then it should be iterated over a time where the trading bot can decide what to do.

def process():
    # Now let's prepare our model
    q_learning = QModel()
    account = Account()

    state = None
    reward = 0.0
    action = 0
    last_value = 0.0
    for index, row in stock[DATA].iterrows():
        if state is not None:
            # The reward is the immediate return
            reward = account.get_value(row) - last_value
            # You update the day after the action, when you know the results of your actions
            q_learning.update_reward(row, account.has_stocks, action, state, reward)
        action, state = q_learning.get_action(row, account.has_stocks)

        if action == 0:
            pass
        elif action == 1:
            if account.has_stocks:
                account.sell_stock(row)
            else:
                account.buy_stock(stock[ID], row)
        last_value = account.get_value(row)
    account.print_status(row)
    q_learning.save_pickle()
    return last_value

This code will do what ever the trading bot tells you to do.

Step 6: The Q-learning model

Now to the core of the thing. The actual trading bot, that knows nothing about trading. But can we train it to earn money on trading and how much? We will see that later.

class QModel:
    def __init__(self, alpha=0.5, gamma=0.7, epsilon=0.1):
        self.alpha = alpha
        self.gamma = gamma
        self.epsilon = epsilon

        self.states_per_dim = STATES_DIM
        self.dim = 5
        self.states = (self.states_per_dim ** self.dim) * 2
        self.actions = 2
        self.pickle = "q_model7.pickle"
        self.q_table = np.zeros((self.states, self.actions))
        if os.path.isfile(self.pickle):
            print("Loading pickle")
            with open(self.pickle, "rb") as f:
                self.q_table = pickle.load(f)

    def save_pickle(self):
        with open(self.pickle, "wb") as f:
            pickle.dump(self.q_table, f)

    def get_state(self, row, has_stock):
        dim = []
        dim.append(int(row['Vla bin']))
        dim.append(int(row['Srt ch bin']))
        dim.append(int(row['Lng ch bin']))
        dim.append(int(row['Lng mn bin']))
        dim.append(int(row['Vol bin']))
        for i in range(len(dim)):
            if dim[i] is None:
                dim[i] = 0
        dimension = 0
        if has_stock:
            dimension = 1 * (self.states_per_dim ** self.dim)
        dimension += dim[4] * (self.states_per_dim ** 4)
        dimension += dim[3] * (self.states_per_dim ** 3)
        dimension += dim[2] * (self.states_per_dim ** 2)
        dimension += dim[1] * (self.states_per_dim ** 1)
        dimension += dim[0]
        return dimension

    def get_action(self, row, has_stock):
        state = self.get_state(row, has_stock)

        if random.uniform(0, 1) < self.epsilon:
            action = random.randrange(0, self.actions)
        else:
            action = np.argmax(self.q_table[state])
        return action, state

    def update_reward(self, row, has_stock, last_action, last_state, reward):
        next_state = self.get_state(row, has_stock)

        old_value = self.q_table[last_state, last_action]
        next_max = np.max(self.q_table[next_state])

        new_value = (1 - self.alpha) * old_value + self.alpha * (reward + self.gamma * next_max)
        self.q_table[last_state, last_action] = new_value

Now we have the full code to try it out (the full code is at the end of the tutorial).

Step 7: Training the model

Now we need to train the model.

For that purpose, I have made a list of 134 stocks that I used and placed them in a CSV file.

Then the training is simply to read 1 of the 134 stocks in with 10 years of historical data. Find an 1 year window and run the algorithm on it.

The repeat.

f __name__ == "__main__":
    # source: http://www.nasdaqomxnordic.com/shares/listed-companies/copenhagen
    csv_stock_file = 'DK-Stocks.csv'

    while True:
        iterations = 1000
        for i in range(iterations):
            # Go at most 9 years back, as we only have 10 years available and need 1 year of data
            days_back = random.randrange(0, 9*365)
            process(csv_stock_file)

Then let it run and run and run and run again.

Step 8: Testing the algorithm

Of course, the testing should be done on unknown data. That is a stock it does not know. But you cannot also re-run on the same stock, as it will learn from it (unless you do not save the state from it).

Hence, I chose a good performing stock to see how it would do, to see if it could beat the buy-first-day-and-sell-last-day strategy.

The results of the trading bot on Apple stocks.

The return of 1,000,000$ investment with the Trading Bot was approximately 1,344,500$. This is a return on 34% for one year. Comparing that with the stock price itself.

Stock price was 201.55$ on July 1st 2019 and 362.09$ on June 30th, 2020. This would give the following return (0,10% in brokerage should be included in calculations as the Trading bot pays that on each sell and buy).

  • 1,792,847$

That does not look that good. That means that a simple strategy to buy on day one and sell on the last day would return more than the bot.

Of course, you can’t conclude it is not possible to do better on other stocks, but for this case it was not impressive.

Variations and next step

There are many variable to adjust, I especially think I set the gamma too low. There are other parameters to use to make the state. Can remove some, that might be making noice, and add ones that are more relevant. Also, the number of bins can be adjusted. That the bins are made independent of each other, might also be a problem.

Also read the tutorial on reinforcement learning.

How to Create a Sentiment Analysis model to Predict the Mood of Tweets with Python – 4 Steps to Compare the Mood of Python vs Java

What will we cover in this tutorial?

  • We will learn how the supervised Machine Learning algorithm Sentiment Analysis can be used on twitter data (also, called tweets).
  • The model we use will be Naive Bayes Classifier.
  • The tutorial will help install the necessary Python libraries to get started and how to download training data.
  • Then it will give you a full script to train the model.
  • Finally, we will use the trained model to compare the “mood” of Python with Java.

Step 1: Install the Natural Language Toolkit Library and Download Collections

We will use the Natural Language Toolkit (nltk) library in this tutorial.

NLTK is a leading platform for building Python programs to work with human language data.

http://www.nltk.org

To install the library you should run the following command in a terminal or see here for other alternatives.

pip install nltk

To have the data available that you need to run the following program or see installing NLTK Data.

import nltk
nltk.download()

This will prompt you with a screen similar to this. And select all packages you want to install (I took them all).

Download all packages to NLTK (Natural Language Toolkit)
Download all packages to NLTK (Natural Language Toolkit)

After download you can use the twitter_samples as you need in the example.

Step 2: Reminder of the Sentiment Analysis learning process (Machine Learning)

On a high level you can divide Machine Learning into two phases.

  • Phase 1: Learning
  • Phase 2: Prediction

The Sentiment Analysis model is supervised learning process. The process is defined in the picture below.

The Sentiment Analysis model (Machine Learning) Learning phase
The Sentiment Analysis model (Supervised Machine Learning) Learning phase

On a high level the the learning process of Sentiment Analysis model has the following steps.

  • Training & test data
    • The Sentiment Analysis model is a supervised learning and needs data representing the data that the model should predict. We will use tweets.
    • The data should be categorized into the groups it should be able to distinguish. In our example it will be in positive tweets and negative tweets.
  • Pre-processing
    • First you need to remove “noise”. In our case we remove URL links and Twitter user names.
    • Then you Lemmatize the data to have the words in the same form.
    • Further, you remove stop words as they have no impact of the mood in the tweet.
    • The data then needs to be formatted for the algorithm.
    • Finally, you need to divide it into a training data and testing data.
  • Learning
    • This is where the algorithm builds the model using the training data.
  • Testing
    • Then we test the accuracy of the model with the categorized test data.

Step 3: Train the Sample Data

The twitter_sample contains 5000 positive and 5000 negative tweets, all ready and classified to use in for your training model.

import random
import pickle

from nltk.corpus import twitter_samples
from nltk.stem import WordNetLemmatizer
from nltk.tag import pos_tag
from nltk.corpus import stopwords
from nltk import NaiveBayesClassifier
from nltk import classify


def clean_data(token):
    return [item for item in token if not item.startswith("http") and not item.startswith("@")]


def lemmatization(token):
    lemmatizer = WordNetLemmatizer()

    result = []
    for token, tag in pos_tag(token):
        tag = tag[0].lower()
        token = token.lower()
        if tag in "nva":
            result.append(lemmatizer.lemmatize(token, pos=tag))
        else:
            result.append(lemmatizer.lemmatize(token))
    return result


def remove_stop_words(token, stop_words):
    return [item for item in token if item not in stop_words]


def transform(token):
    result = {}
    for item in token:
        result[item] = True
    return result


def main():
    # Step 1: Gather data
    positive_tweets_tokens = twitter_samples.tokenized('positive_tweets.json')
    negative_tweets_tokens = twitter_samples.tokenized('negative_tweets.json')

    # Step 2: Clean, Lemmatize, and remove Stop Words
    stop_words = stopwords.words('english')
    positive_tweets_tokens_cleaned = [remove_stop_words(lemmatization(clean_data(token)), stop_words) for token in positive_tweets_tokens]
    negative_tweets_tokens_cleaned = [remove_stop_words(lemmatization(clean_data(token)), stop_words) for token in negative_tweets_tokens]

    # Step 3: Transform data
    positive_tweets_tokens_transformed = [(transform(token), "Positive") for token in positive_tweets_tokens_cleaned]
    negative_tweets_tokens_transformed = [(transform(token), "Negative") for token in negative_tweets_tokens_cleaned]


    # Step 4: Create data set
    dataset = positive_tweets_tokens_transformed + negative_tweets_tokens_transformed
    random.shuffle(dataset)

    train_data = dataset[:7000]
    test_data = dataset[7000:]

    # Step 5: Train data
    classifier = NaiveBayesClassifier.train(train_data)

    # Step 6: Test accuracy
    print("Accuracy is:", classify.accuracy(classifier, test_data))
    print(classifier.show_most_informative_features(10))

    # Step 7: Save the pickle
    f = open('my_classifier.pickle', 'wb')
    pickle.dump(classifier, f)
    f.close()


if __name__ == "__main__":
    main()

The code is structured in steps. If you are not comfortable how a the flow of a general machine learning flow is, I can recommend to read this tutorial here or this one.

  • Step 1: Collect and categorize It reads the 5000 positive and 5000 negative twitter samples we downloaded with the nltk.download() call.
  • Step 2: The data needs to be cleaned, Lemmatized and removed for stop words.
    • The clean_data call removes links and twitter users.
    • The call to lemmatization puts words in their base form.
    • The call to remove_stop_words removes all the stop words that have no affect on the mood of the sentence.
  • Step 3: Format data This step transforms the data to the desired format for the NaiveBayesClassifier module.
  • Step 4: Divide data Creates the full data set. Makes a shuffle to take them in different order. Then takes 70% as training data and 30% test data.
    • This data is mixed different from run to run. Hence, it might happen that you will not get the same accuracy like I will in my run.
    • The training data is used to make the model to predict from.
    • The test data is used to compute the accuracy of the model to predict.
  • Step 5: Training model This is the training of the NaiveBayesClassifier model.
    • This is where all the magic happens.
  • Step 6: Accuracy This is testing the accuracy of the model.
  • Step 7: Persist To save the model for use.

I got the following output from the above program.

Accuracy is: 0.9973333333333333
Most Informative Features
                      :) = True           Positi : Negati =   1010.7 : 1.0
                     sad = True           Negati : Positi =     25.4 : 1.0
                     bam = True           Positi : Negati =     20.2 : 1.0
                  arrive = True           Positi : Negati =     18.3 : 1.0
                     x15 = True           Negati : Positi =     17.2 : 1.0
               community = True           Positi : Negati =     14.7 : 1.0
                    glad = True           Positi : Negati =     12.6 : 1.0
                   enjoy = True           Positi : Negati =     12.0 : 1.0
                    kill = True           Negati : Positi =     12.0 : 1.0
                     ugh = True           Negati : Positi =     11.3 : 1.0
None

Step 4: Use the Sentiment Analysis prediction model

Now we can determine the mood of a tweet. To have some fun let us try to figure out the mood of tweets with Python and compare it with Java.

To do that, you need to have setup your twitter developer account. If you do not have that already, then see the this tutorial on how to do that.

In the code below you need to fill out your consumer_key, consumer_secret, access_token, and access_token_secret.

import pickle
import tweepy


def get_twitter_api():
    # personal details
    consumer_key = "___INSERT YOUR DATA HERE___"
    consumer_secret = "___INSERT YOUR DATA HERE___"
    access_token = "___INSERT YOUR DATA HERE___"
    access_token_secret = "___INSERT YOUR DATA HERE___"

    # authentication of consumer key and secret
    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)

    # authentication of access token and secret
    auth.set_access_token(access_token, access_token_secret)
    api = tweepy.API(auth)
    return api


# This function uses the functions from the learner code above
def tokenize(tweet):
    return remove_noise(word_tokenize(tweet))


def get_classifier(pickle_name):
    f = open(pickle_name, 'rb')
    classifier = pickle.load(f)
    f.close()
    return classifier


def find_mood(search):
    classifier = get_classifier('my_classifier.pickle')

    api = get_twitter_api()

    stat = {
        "Positive": 0,
        "Negative": 0
    }
    for tweet in tweepy.Cursor(api.search, q=search).items(1000):
        custom_tokens = tokenize(tweet.text)

        category = classifier.classify(dict([token, True] for token in custom_tokens))
        stat[category] += 1

    print("The mood of", search)
    print(" - Positive", stat["Positive"], round(stat["Positive"]*100/(stat["Positive"] + stat["Negative"]), 1))
    print(" - Negative", stat["Negative"], round(stat["Negative"]*100/(stat["Positive"] + stat["Negative"]), 1))


if __name__ == "__main__":
    find_mood("#java")
    find_mood("#python")

That is it. Obviously the mood of Python is better. It is easier than Java.

The mood of #java
 - Positive 524 70.4
 - Negative 220 29.6
The mood of #python
 - Positive 753 75.3
 - Negative 247 24.7

If you want to learn more about Python I can encourage you to take my course here.

5 Steps to Master the Reinforcement Learning with a Q-Learning Python Example

What will we learn in this article?

The Q-Learning algorithm is a nice and easy to understand algorithm used with Reinforcement Learning paradigm in Machine Learning. It can be implemented from scratch and we will do that in this article.

After you go through this article you will know what Reinforcement Learning is, the main types of algorithm used, fully understand the Q-learning algorithm and implement an awesome example from scratch in Python.

The steps towards that are.

  • Learn and understand what reinforcement learning in machine learning?
  • What are the main algorithm in reinforcement learning?
  • Deep dive to understand the Q-learning algorithm
  • Implement a task that we want the Q-learning algorithm to learn – first we let a random choices try (1540 steps on average).
  • Then we implement the Q-learning algorithm from scratch and let it solve learn how to solve it (22 steps).

Step 1: What is Reinforcement Learning?

Reinforcement learning teaches the machine to think for itself based on past action rewards.

Reinforcement Learning (in Machine Learning) teaches the machine to think based on past action rewards.
Reinforcement Learning (in Machine Learning) teaches the machine to think based on past action rewards.

Basically, the Reinforcement Learning algorithm tries to predict actions that gives rewards and avoids punishment.

It is like training a dog. You and the dog do not talk the same language, but the dogs learns how to act based on rewards (and punishment, which I do not advise or advocate).

Hence, if a dog is rewarded for a certain action in a given situation, then next time it is exposed to a similar situation it will act the same.

Translate that to Reinforcement Learning.

  • The agent is the dog that is exposed to the environment.
  • Then the agent encounters a state.
  • The agent performs an action to transition from that state to a new state.
  • Then after the transition the agent receives a reward or penalty (punishment).
  • This forms a policy to create a strategy to choose actions in a given state.

Step 2: What are the algorithm used for Reinforcement Learning?

The most common algorithm for Reinforcement Learning are.

We will focus on the Q-learning algorithm as it is easy to understand as well as powerful.

Step 3: Understand the Q-Learning algorithm

As already noted, I just love this algorithm. It is “easy” to understand and seems very powerful.

Q-Learning algorithm (Reinforcement / Machine Learning) - exploit or explore - Update Q-table
Q-Learning algorithm (Reinforcement / Machine Learning) – exploit or explore – Update Q-table

The Q-Learning algorithm has a Q-table (a Matrix of dimension state x actions – don’t worry if you do not understand what a Matrix is, you will not need the mathematical aspects of it – it is just an indexed “container” with numbers).

  • The agent (or Q-Learning algorithm) will be in a state.
  • Then in each iteration the agent needs take an action.
  • The agent will continuously update the reward in the Q-table.
  • The learning can come from either exploiting or exploring.

This translates into the following pseudo algorithm for the Q-Learning.

The agent is in a given state and needs to choose an action.

  • Initialise the Q-table to all zeros
  • Iterate:
    • Agent is in state state.
    • With probability epsilon choose to explore, else exploit.
      • If explore, then choose a random action.
      • If exploit, then choose the best action based on the current Q-table.
    • Update the Q-table from the new reward to the previous state.
      • Q[state, action] = (1 – alpha) * Q[state, action] + alpha * (reward + gamma * max(Q[new_state]) — Q[state, action])

As you can se, we have introduced the following variables.

  • epsilon: the probability to take a random action, which is done to explore new territory.
  • alpha: is the learning rate that the algorithm should make in each iteration and should be in the interval from 0 to 1.
  • gamma: is the discount factor used to balance the immediate and future reward. This value is usually between 0.8 and 0.99
  • reward: is the feedback on the action and can be any number. Negative is penalty (or punishment) and positive is a reward.

Step 4: A task we want the Q-learning algorithm to master

We need to test and understand our the above algorithm. So far, it is quite abstract. To do that we will create a simple task to show how the Q-learning algorithm will solve it efficient by learning by rewards.

To keep it simple, we create a field of size 10×10 positions. In that field there is an item that needs to be picket up and moved to a drop-off point.

At each position there are 6 different actions that can be taken.

  • Action 0: Go south if on field.
  • Action 1: Go north if on field.
  • Action 2: Go east if on field.
  • Action 3: Go west if on field.
  • Action 4: Pickup item (it can try even if it is not there)
  • Action 5: Drop-off item (it can try even if it does not have it)

Based on these action we will make a reward system.

  • If the agent tries to go off the field, punish with -10 in reward.
  • If the agent makes a (legal) move, punish with -1 in reward, as we do not want to encourage endless walking around.
  • If the agent tries to pick up item, but it is not there or it has it already, punish with 10.
  • If the agent picks up the item correct place, reward with 20.
  • If agent tries to drop-off item in wrong place or does not have the item, punish with 10.
  • If the agent drops-off item in correct place, reward with 20.

That translates into the following code. I prefer to implement this code, as I think the standard libraries that provide similar frameworks hide some important details. As an example, and shown later, how do you map this into a state in the Q-table?

class Field:
    def __init__(self, size, item_pickup, item_drop_off, start_position):
        self.size_x = size
        self.size_y = size
        self.item_in_car = False
        self.item_position = item_pickup
        self.item_drop_off = item_drop_off
        self.position = start_position

    def move_driver(self, action):
        (x, y) = self.item_position
        if action == 0: # south
            if y == 0:
                return -10, False
            else:
                self.item_position = (x, y-1)
                return -1, False
        elif action == 1: # north
            if y == self.size_y - 1:
                return -10, False
            else:
                self.item_position = (x, y+1)
                return -1, False
        elif action == 2: # east
            if x == self.size_x - 1:
                return -10, False
            else:
                self.item_position = (x+1, y)
                return -1, False
        elif action == 3: # west
            if x == 0:
                return -10, False
            else:
                self.item_position = (x-1, y)
                return -1, False
        elif action == 4: # pickup
            if self.item_in_car:
                return -10, False
            elif self.item_position != (x, y):
                return -10, False
            else:
                self.item_in_car = True
                return 20, False
        elif action == 5: # drop-off
            if not self.item_in_car:
                return -10, False
            elif self.item_drop_off != (x, y):
                self.item_position = (x, y)
                return -20, False
            else:
                return 20, True

If you let the agent just do random actions, how long will it take for it to succeed (to be done)? Let us try that out.

import random


size = 10
item_start = (0, 0)
item_drop_off = (9, 9)
start_position = (9, 0)

field = Field(size, item_start, item_drop_off, start_position)
done = False
steps = 0
while not done:
    action = random.randrange(0, 6)
    reward, done = field.move_driver(action)
    steps += 1
print(steps)

A single run of that resulted in 2756 steps. That seems to be inefficient. I ran it 1000 times to find an average, which resulted to 1540 steps on average.

Step 5: How the Q-learning algorithm can improve that

There is a learning phase where the Q-table is updated iteratively. But before that, we need to add two helper functions to our Field.

  • We need to be able to map the current it to a state to an index in the Q-table.
  • Further, we need to a get the number of states needed in the Q-table, which we need to know when we initialise the Q-table.
import numpy as np
import random


class Field:
    def __init__(self, size, item_pickup, item_drop_off, start_position):
        self.size_x = size
        self.size_y = size
        self.item_in_car = False
        self.item_position = item_pickup
        self.item_drop_off = item_drop_off
        self.position = start_position

    def get_number_of_states(self):
        return self.size_x*self.size_y*self.size_x*self.size_y*2

    def get_state(self):
        state = self.item_position[0]*(self.size_y*self.size_x*self.size_y*2)
        state += self.item_position[1]*(self.size_x*self.size_y*2)
        state += self.position[0] * (self.size_y * 2)
        state += self.position[1] * (2)
        if self.item_in_car:
            state += 1
        return state

    def move_driver(self, action):
        (x, y) = self.item_position
        if action == 0: # south
            if y == 0:
                return -10, False
            else:
                self.item_position = (x, y-1)
                return -1, False
        elif action == 1: # north
            if y == self.size_y - 1:
                return -10, False
            else:
                self.item_position = (x, y+1)
                return -1, False
        elif action == 2: # east
            if x == self.size_x - 1:
                return -10, False
            else:
                self.item_position = (x+1, y)
                return -1, False
        elif action == 3: # west
            if x == 0:
                return -10, False
            else:
                self.item_position = (x-1, y)
                return -1, False
        elif action == 4: # pickup
            if self.item_in_car:
                return -10, False
            elif self.item_position != (x, y):
                return -10, False
            else:
                self.item_in_car = True
                return 20, False
        elif action == 5: # drop-off
            if not self.item_in_car:
                return -10, False
            elif self.item_drop_off != (x, y):
                self.item_position = (x, y)
                return -20, False
            else:
                return 20, True

Then we can generate our Q-table by iterating over the task 1000 times (it is just an arbitrary number I chose). As you see, it simply just runs over the task again and again, but updates the Q-table with the “learnings” based on the reward.

states = field.get_number_of_states()
actions = 6

q_table = np.zeros((states, actions))

alpha = 0.1
gamma = 0.6
epsilon = 0.1

for i in range(1000):
    field = Field(size, item_start, item_drop_off, start_position)
    done = False
    steps = 0
    while not done:
        state = field.get_state()
        if random.uniform(0, 1) < epsilon:
            action = random.randrange(0, 6)
        else:
            action = np.argmax(q_table[state])

        reward, done = field.move_driver(action)
        next_state = field.get_state()

        old_value = q_table[state, action]
        next_max = np.max(q_table[next_state])

        new_value = (1 - alpha) * old_value + alpha * (reward + gamma * next_max)
        q_table[state, action] = new_value

        steps += 1

After that we can use it, our Q-table is updated. To test it, we will run the same code again, just with the updated Q-table.

alpha = 0.1
gamma = 0.6
epsilon = 0.1

field = Field(size, item_start, item_drop_off, start_position)
done = False
steps = 0
while not done:
    state = field.get_state()
    if random.uniform(0, 1) < epsilon:
        action = random.randrange(0, 6)
    else:
        action = np.argmax(q_table[state])

    reward, done = field.move_driver(action)
    next_state = field.get_state()

    old_value = q_table[state, action]
    next_max = np.max(q_table[next_state])

    new_value = (1 - alpha) * old_value + alpha * (reward + gamma * next_max)
    q_table[state, action] = new_value

    steps += 1

print(steps)

This resulted in 22 steps. That is awesome.

4 Easy Steps to Understand Unsupervised Machine Learning with an Example in Python

Step 1: Learn what is unsupervised machine learning?

An unsupervised machine learning model takes unlabelled (or categorised) data and lets the algorithm determined the answer for us.

Unsupervised Machine Learning model - takes unstructured data and finds patterns itself
Unsupervised Machine Learning model – takes unstructured data and finds patterns itself

The unsupervised machine learning model data without apparent structures and tries to identify some patterns itself to create categories.

Step 2: Understand the main types of unsupervised machine learning

There are two main types of unsupervised machine learning types.

  • Clustering: Is used for grouping data into categories without knowing any labels before hand.
  • Association: Is a rule-based for discovering interesting relations between variables in large databases.

In clustering the main algorithms used are K-means, hierarchy clustering, and hidden Markov model.

And in the association the main algorithm used are Apriori and FP-growth.

Step 3: How does K-means work

The K-means works in iterative steps

The k-means algorithm starts is an NP-hard problem, which mean there is no efficient way to solve in the general case. For this problem there are heuristics algorithms that converge fast to local optimum, which means you can find some optimum fast, but it might not be the best one, but often they can do just fine.

Enough, theory.

How does the algorithm work.

  • Step 1: Start by a set of k means. These can be chosen by taking k random point from the dataset (called the Random Partition initialisation method).
  • Step 2: Group each data point into the cluster of the nearest mean. Hence, each data point will be assigned to exactly one cluster.
  • Step 3: Recalculate the the means (also called centroids) to converge towards local optimum.

Steps 2 and 3 are repeated until the grouping in Step 2 does not change any more.

Step 4: A simple Python example with the k-means algorithm

In this example we are going to start assuming you have the basic knowledge how to install the needed libraries. If not, then see the following article.

First of, you need to import the needed libraries.

import numpy as np
import matplotlib.pyplot as plt
from matplotlib import style
from sklearn.cluster import KMeans

In the first basic example we are only going to plot some points on a graph.

style.use('ggplot')

x = [1, 2, 0.3, 9.2, 2.4,  9, 12]
y = [2, 4, 2.5, 8.5, 0.3, 11, 10]
plt.scatter(x, y)
plt.show()

The first line sets a style of the graph. Then we have the coordinates in the arrays x and y. This format is used to feed the scatter.

Output of the plot from scatter plotter in Python.
Output of the plot from scatter plotter in Python.

An advantage of plotting the points before you figure out how many clusters you want to use. Here it looks like there are two “groups” of plots, which translates into using to clusters.

To continue, we want to use the k means algorithm with two clusters.

import numpy as np
import matplotlib.pyplot as plt
from matplotlib import style
from sklearn.cluster import KMeans

style.use('ggplot')

x = [1, 2, 0.3, 9.2, 2.4,  9, 12]
y = [2, 4, 2.5, 8.5, 0.3, 11, 10]

# We need to transform the input coordinates to plot use the k means algorithm
X = []
for i in range(len(x)):
    X.append([x[i], y[i]])
X = np.array(X)

# The number of clusters
kmeans = KMeans(n_clusters=2)
kmeans.fit(X)
labels = kmeans.labels_

# Then we want to have different colors for each type.
colors = ['g.', 'r.']
for i in range(len(X)):
    # And plot them one at the time
    plt.plot(X[i][0], X[i][1], colors[labels[i]], markersize=10)

# Plot the centres (or means)
plt.scatter(centroids[:, 0], centroids[:, 1], marker= "x", s=150, linewidths=5, zorder=10)
plt.show()

This results in the following result.

Example of k means algorithm used on simple dataset
Example of k means algorithm used on simple dataset

Considerations when using K-Means algorithm

We could have changed to use 3 clusters. That would have resulted in the following output.

Using 3 clusters instead of two in the k-mean algorithm
Using 3 clusters instead of two in the k-mean algorithm

This is not optimal for this dataset, but could be hard to predict without this visual representation of the dataset.

Uses of K-Means algorithm

Here are some interesting uses of the K-means algorithms:

  • Personalised marketing to users
  • Identifying fake news
  • Spam filter in your inbox

3 Easy Steps to Get Started With Machine Learning: Understand the Concept and Implement Linear Regression in Python

What will we cover in this article?

  • What is Machine Learning and how it can help you?
  • How does Machine Learning work?
  • A first example of Linear Regression in Python

Step 1: How can Machine Learning help you?

Machine Learning is a hot topic these days and it is easy to get confused when people talk about it. But what is Machine Learning and how can it you?

I found the following explanation quite good.

Classical vs modern (No machine learning vs machine learning) approach to predictions.
Classical vs modern (No machine learning vs machine learning) approach to predictions.

In the classical computing model every thing is programmed into the algorithms. This has the limitation that all decision logic need to be understood before usage. And if things change, we need to modify the program.

With the modern computing model (Machine Learning) this paradigm is changes. We feed the algorithms with data, and based on that data, we do the decisions in the program.

While this can seem abstract, this is a big change in thinking programming. Machine Learning has helped computers to have solutions to problems like:

  • Improved search engine results.
  • Voice recognition.
  • Number plate recognition.
  • Categorisation of pictures.
  • …and the list goes on.

Step 2: How does Machine Learning work?

I’m glad you asked. I was wondering about that myself.

On a high level you can divide Machine Learning into two phases.

  • Phase 1: Learning
  • Phase 2: Prediction

The Learning phase is divided into steps.

Machine Learning: The Learning Phase: Training data, Pre-processing, Learning, Testing
Machine Learning: The Learning Phase: Training data, Pre-processing, Learning, Testing

It all starts with a training set (training data). This data set should represent the type of data that the Machine Learn model should be used to predict from in Phase 2 (predction).

The pre-processing step is about cleaning up data. While the Machine Learning is awesome, it cannot figure out what good data looks like. You need to do the cleaning as well as transforming data into a desired format.

Then for the magic, the learning step. There are three main paradigms in machine learning.

  • Supervised: where you tell the algorithm what categories each data item is in. Each data item from the training set is tagged with the right answer.
  • Unsupervised: is when the learning algorithm is not told what to do with it and it should make the structure itself.
  • Reinforcement: teaches the machine to think for itself based on past action rewards.

Finally, the testing is done to see if the model is good. The training data was divided into a test set and training set. The test set is used to see if the model can predict from it. If not, a new model might be necessary.

After that the Prediction Phase begins.

How Machine Learning predicts new data.
How Machine Learning predicts new data.

When the model has been created it will be used to predict based on it from new data.

Step 3: For our first example of Linear Regression in Python

Installing the libraries

Linear regression is a linear approach to modelling the relationship between a scalar response to one or more variables. In the case we try to model, we will do it for one single variable. Said in another way, we want map points on a graph to a line (y = a*x + b).

To do that, we need to import various libraries.

# Importing matplotlib to make a plot
import matplotlib.pyplot as plt
# work with number as array
import numpy as np
# we want to use linear_model (that uses datasets)
from sklearn import linear_model

The matplotlib library is used to make a plot, but is a comprehensive library for creating static, animated, and interactive visualizations in Python. If you do not have it installed you can do that by typing in the following command in a terminal.

pip install matplotlib

The numpy is a powerful library to calculate with N-dimensional arrays. If needed, you can install it by typing the following command in a terminal.

pip install numpy

Finally, you need the linear_model from the sklearn library, which you can install by typing the following command in a terminal.

pip install scikit-learn

Training data set

This simple example will let you make a linear regression of an input of the following data set.

# data set
prices = [245, 312, 279, 308, 199, 409, 200, 400, 230]
size = [50, 60, 35, 55, 30, 65, 30, 75, 25]

Here some items are sold, but each item has a size. The first item was sold for 245 ($) and had a size of 50 (something). The next item was sold to 312 ($) and had a size of 60 (something).

The sizes needs to be reshaped before we model it.

# Importing matplotlib and numpy and sklearn
import matplotlib.pyplot as plt
# work with number as array
import numpy as np
# we want to use linear_model (that uses datasets)
from sklearn import linear_model

# data set
prices = [245, 312, 279, 308, 199, 409, 200, 400, 230]
size = [50, 60, 35, 55, 30, 65, 30, 75, 25]

# reshape the input for regression ( second argument how many items
size2 = np.array(size).reshape((-1, 1))
print(size2)

Which results in the following output.

[[50]
 [60]
 [35]
 [55]
 [30]
 [65]
 [30]
 [75]
 [25]]

Hence, the reshape((-1, 1)) transforms it from a row to a single array.

Then for the linear regression.

# Importing matplotlib and numpy and sklearn
import matplotlib.pyplot as plt
# work with number as array
import numpy as np
# we want to use linear_model (that uses datasets)
from sklearn import linear_model

# data set
prices = [245, 312, 279, 308, 199, 409, 200, 400, 230]
size = [50, 60, 35, 55, 30, 65, 30, 75, 25]

# reshape the input for regression ( second argument how many items
size2 = np.array(size).reshape((-1, 1))
print(size2)

regr = linear_model.LinearRegression()
regr.fit(size2, prices)
print("Coefficients", regr.coef_)
print("intercepts", regr.intercept_)

Which prints out the coefficient (a) and the intercept (b) of a formula y = a*x + b.

Now you can predict future prices, when given a size.

# How to predict
size_new = 60
price = size_new * regr.coef_ + regr.intercept_
print(price)
print(regr.predict([[size_new]]))

Where you both can compute it directly (2nd line) or use the regression model (4th line).

Finally, you can plot the linear regression as a graph.

# Importing matplotlib and numpy and sklearn
import matplotlib.pyplot as plt
# work with number as array
import numpy as np
# we want to use linear_model (that uses datasets)
from sklearn import linear_model

# data set
prices = [245, 312, 279, 308, 199, 409, 200, 400, 230]
size = [50, 60, 35, 55, 30, 65, 30, 75, 25]

# reshape the input for regression ( second argument how many items
size2 = np.array(size).reshape((-1, 1))
print(size2)

regr = linear_model.LinearRegression()
regr.fit(size2, prices)

# Here we plot the graph
x = np.array(range(20, 100))
y = eval('regr.coef_*x + regr.intercept_')
plt.plot(x, y)
plt.scatter(size, prices, color='black')
plt.ylabel('prices')
plt.xlabel('size')
plt.show()

Which results in the following graph.

Example of linear regression in Python
Example of linear regression in Python

Conclusion

This is obviously a simple example of linear regression, as it only has one variable. This simple example shows you how to setup the environment in Python and how to make a simple plot.

A Simple 7 Step Guide to Implement a Prediction Model to Filter Tweets Based on Dataset Interactively Read from Twitter

What will we learn in this tutorial

  • How Machine Learning works and predicts.
  • What you need to install to implement your Prediction Model in Python
  • A simple way to implement a Prediction Model in Python with persistence
  • How to simplify the connection to the Twitter API using tweepy
  • Collect the training dataset from twitter interactively in a Python program
  • Use the persistent model to predict the tweets you like

Step 1: Quick introduction to Machine Learning

Machine Learning: Input to Learner is Features X (data set) with Targets Y. The Learner outputs a Model, which can predict (Y) future inputs (X).
Machine Learning: Input to Learner is Features X (data set) with Targets Y. The Learner outputs a Model, which can predict (Y) future inputs (X).
  • The Leaner (or Machine Learning Algorithm) is the program that creates a machine learning model from the input data.
  • The Features X is the dataset used by the Learner to generate the Model.
  • The Target Y contains the categories for each data item in the Feature X dataset.
  • The Model takes new inputs X (similar to those in Features) and predicts a target Y, from the categories in Target Y.

We will implement a simple model, that can predict Twitter feeds into two categories: allow and refuse.

Step 2: Install sklearn library (skip if you already have it)

The Python code will be using the sklearn library.

You can install it, simply write the following in the command line (also see here).

pip install scikit-learn

Alternatively, you might want to install it locally in your user space.

pip install scikit-learn --user

Step 3: Create a simple Prediction Model in Python to Train and Predict on tweets

The implementation accomplishes the the machine learning model in a class. The class has the following features.

  • create_dataset: It creates a dataset by taking a list of data that are representing allow, and a list of data that represent the reject. The dataset is divided into features and targets
  • train_dataset: When your dataset is loaded it should be trained to create the model, consisting of the predictor (transfer and estimator)
  • predict: Is called after the model is trained. It can predict an input if it is in the allow category.
  • persist: Is called to save the model for later use, such that we do not need to collect data and train it again. It should only be called after dataset has been created and the model has been train (after create_dataset and train_dataset)
  • load: This will load a saved model and be ready to predict new input.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
import joblib


class PredictionModel:
    def __init__(self):
        self.predictor = {}
        self.dataset = {'features': [], 'targets': []}
        self.allow_id = 0
        self.reject_id = 1

    def create_dataset(self, allow_data, reject_data):
        features_y = allow_data + reject_data
        targets_x = [self.allow_id]*len(allow_data) + [self.reject_id]*len(reject_data)
        self.dataset = {'features': features_y, 'targets': targets_x}

    def train_dataset(self):
        x_train, x_test, y_train, y_test = train_test_split(self.dataset['features'], self.dataset['targets'])

        transfer = TfidfVectorizer()
        x_train = transfer.fit_transform(x_train)
        x_test = transfer.transform(x_test)

        estimator = MultinomialNB()
        estimator.fit(x_train, y_train)

        score = estimator.score(x_test, y_test)
        self.predictor = {'transfer': transfer, 'estimator': estimator}

    def predict(self, text):
        sentence_x = self.predictor['transfer'].transform()
        y_predict = self.predictor['estimator'].predict(sentence_x)
        return y_predict[0] == self.allow_id

    def persist(self, output_name):
        joblib.dump(self.predictor['transfer'], output_name+".transfer")
        joblib.dump(self.predictor['estimator'], output_name+".estimator")

    def load(self, input_name):
        self.predictor['transfer'] = joblib.load(input_name+'.transfer')
        self.predictor['estimator'] = joblib.load(input_name+'.estimator')

Step 4: Get a Twitter API access

Go to https://developer.twitter.com/en and get your consumer_key, consumer_secret, access_token, and access_token_secret.

api_key = {
    'consumer_key': "",
    'consumer_secret': "",
    'access_token': "",
    'access_token_secret': ""
}

Also see here for a deeper tutorial on how to get them if in doubt.

Step 5: Simplify your Twitter connection

If you do not already have the tweepy library, then install it by.

pip install tweepy

As you will only read tweets from users, the following class will help you to simplify your code.

import tweepy


class TwitterConnection:
    def __init__(self, api_key):
        # authentication of consumer key and secret
        auth = tweepy.OAuthHandler(api_key['consumer_key'], api_key['consumer_secret'])

        # authentication of access token and secret
        auth.set_access_token(api_key['access_token'], api_key['access_token_secret'])
        self.api = tweepy.API(auth)

    def get_tweets(self, user_name, number=0):
        if number > 0:
            return tweepy.Cursor(self.api.user_timeline, screen_name=user_name, tweet_mode="extended").items(number)
        else:
            return tweepy.Cursor(self.api.user_timeline, screen_name=user_name, tweet_mode="extended").items()
  • __init__: The class sets up the Twitter API in the init-function.
  • get_tweets: Returns the tweets from a user_name (screen_name).

Step 6: Collect the dataset (Features X and Target Y) from Twitter

To simplify your life you will use the above TwitterConnection class and and PredictionModel class.

def get_features(auth, user_name, output_name):
    positives = []
    negatives = []
    twitter_con = TwitterConnection(auth)
    tweets = twitter_con.get_tweets(user_name)
    for tweet in tweets:
        print(tweet.full_text)
        print("a/r/e (allow/reject/end)? ", end='')
        response = input()
        if response.lower() == 'y':
            positives.append(tweet.full_text)
        elif response.lower() == 'e':
            break
        else:
            negatives.append(tweet.full_text)
    model = PredictionModel()
    model.create_dataset(positives, negatives)
    model.train_dataset()
    model.persist(output_name)

The function reads the tweets from user_name and prompts for each one of them whether it should be added to tweets you allow or reject.

When you do not feel like “training” your set more (i.e. collect more training data), then you can press e.

Then it will create the dataset and train it to finally persist it.

Step 7: See how good it predicts your tweets based on your model

The following code will print the first number tweets that your model will allow by user_name.

def fetch_tweets_prediction(auth, user_name, input_name, number):
    model = PredictionModel()
    model.load(input_name)
    twitter_con = TwitterConnection(auth)
    tweets = twitter_con.get_tweets(user_name)
    for tweet in tweets:
        if model.predict(tweet.full_text):
            print(tweet.full_text)
            number -= 1
        if number < 0:
            break

Then your final piece is to call it. Remember to fill out your values for the api_key.

api_key = {
    'consumer_key': "",
    'consumer_secret': "",
    'access_token': "",
    'access_token_secret': ""
}

get_features(api_key, "@cnnbrk", "cnnbrk")
fetch_tweets_prediction(api_key, "@cnnbrk", "cnnbrk", 10)

Conclusion

I trained my set by 30-40 tweets with the above code. From the training set it did not have any false positives (that is an allow which was a reject int eh dataset), but it did have false rejects.

The full code is here.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
import joblib
import tweepy


class PredictionModel:
    def __init__(self):
        self.predictor = {}
        self.dataset = {'features': [], 'targets': []}
        self.allow_id = 0
        self.reject_id = 1

    def create_dataset(self, allow_data, reject_data):
        features_y = allow_data + reject_data
        targets_x = [self.allow_id]*len(allow_data) + [self.reject_id]*len(reject_data)
        self.dataset = {'features': features_y, 'targets': targets_x}

    def train_dataset(self):
        x_train, x_test, y_train, y_test = train_test_split(self.dataset['features'], self.dataset['targets'])

        transfer = TfidfVectorizer()
        x_train = transfer.fit_transform(x_train)
        x_test = transfer.transform(x_test)

        estimator = MultinomialNB()
        estimator.fit(x_train, y_train)

        score = estimator.score(x_test, y_test)
        self.predictor = {'transfer': transfer, 'estimator': estimator}

    def predict(self, text):
        sentence_x = self.predictor['transfer'].transform()
        y_predict = self.predictor['estimator'].predict(sentence_x)
        return y_predict[0] == self.allow_id

    def persist(self, output_name):
        joblib.dump(self.predictor['transfer'], output_name+".transfer")
        joblib.dump(self.predictor['estimator'], output_name+".estimator")

    def load(self, input_name):
        self.predictor['transfer'] = joblib.load(input_name+'.transfer')
        self.predictor['estimator'] = joblib.load(input_name+'.estimator')


class TwitterConnection:
    def __init__(self, api_key):
        # authentication of consumer key and secret
        auth = tweepy.OAuthHandler(api_key['consumer_key'], api_key['consumer_secret'])

        # authentication of access token and secret
        auth.set_access_token(api_key['access_token'], api_key['access_token_secret'])
        self.api = tweepy.API(auth)

    def get_tweets(self, user_name, number=0):
        if number > 0:
            return tweepy.Cursor(self.api.user_timeline, screen_name=user_name, tweet_mode="extended").items(number)
        else:
            return tweepy.Cursor(self.api.user_timeline, screen_name=user_name, tweet_mode="extended").items()


def get_features(auth, user_name, output_name):
    positives = []
    negatives = []
    twitter_con = TwitterConnection(auth)
    tweets = twitter_con.get_tweets(user_name)
    for tweet in tweets:
        print(tweet.full_text)
        print("y/n/e (positive/negative/end)? ", end='')
        response = input()
        if response.lower() == 'y':
            positives.append(tweet.full_text)
        elif response.lower() == 'e':
            break
        else:
            negatives.append(tweet.full_text)
    model = PredictionModel()
    model.create_dataset(positives, negatives)
    model.train_dataset()
    model.persist(output_name)


def fetch_tweets_prediction(auth, user_name, input_name, number):
    model = PredictionModel()
    model.load(input_name)
    twitter_con = TwitterConnection(auth)
    tweets = twitter_con.get_tweets(user_name)
    for tweet in tweets:
        if model.predict(tweet.full_text):
            print("POS", tweet.full_text)
            number -= 1
        else:
            pass
            # print("NEG", tweet.full_text)
        if number < 0:
            break

api_key = {
    'consumer_key': "_",
    'consumer_secret': "_",
    'access_token': "_-_",
    'access_token_secret': "_"
}

get_features(api_key, "@cnnbrk", "cnnbrk")
fetch_tweets_prediction(api_key, "@cnnbrk", "cnnbrk", 10)

How To Get Started with a Predictive Machine Learning Program in Python in 5 Easy Steps

What will you learn?

  • How to predict from a dataset with Machine Learning
  • How to implement that in Python
  • How to get data from Twitter
  • How to install the necessary libraries to do Machine Learning in Python

Step 1: Install the necessary libraries

The sklearn library is a simple and efficient tools for predictive data analysis.

You can install it by typing in the following in your command line.

pip install sklearn

It will most likely install a couple of more needed libraries.

Collecting sklearn
  Downloading sklearn-0.0.tar.gz (1.1 kB)
Collecting scikit-learn
  Downloading scikit_learn-0.23.1-cp38-cp38-macosx_10_9_x86_64.whl (7.2 MB)
     |████████████████████████████████| 7.2 MB 5.0 MB/s 
Collecting numpy>=1.13.3
  Downloading numpy-1.18.4-cp38-cp38-macosx_10_9_x86_64.whl (15.2 MB)
     |████████████████████████████████| 15.2 MB 12.6 MB/s 
Collecting joblib>=0.11
  Downloading joblib-0.15.1-py3-none-any.whl (298 kB)
     |████████████████████████████████| 298 kB 8.1 MB/s 
Collecting threadpoolctl>=2.0.0
  Downloading threadpoolctl-2.1.0-py3-none-any.whl (12 kB)
Collecting scipy>=0.19.1
  Downloading scipy-1.4.1-cp38-cp38-macosx_10_9_x86_64.whl (28.8 MB)
     |████████████████████████████████| 28.8 MB 5.8 MB/s 
Using legacy setup.py install for sklearn, since package 'wheel' is not installed.
Installing collected packages: numpy, joblib, threadpoolctl, scipy, scikit-learn, sklearn
    Running setup.py install for sklearn ... done
Successfully installed joblib-0.15.1 numpy-1.18.4 scikit-learn-0.23.1 scipy-1.4.1 sklearn-0.0 threadpoolctl-2.1.0

As in my installation with numpy, joblib, threadpoolctl, scipy, and scikit-learn.

Step 2: The dataset

The machine learning algorithm needs a dataset to train on. To make this tutorial simple, I only used a limited set. I looked through the top tweets from CNN Breaking and categorised them in positive and negative tweets (I know it can be subjective).

negative = [
    "Protesters who were marching from Minneapolis to St. Paul were tear gassed by police as they tried to cross the Lake Street Marshall Bridge ",
    "The National Guard has been activated in Washington, D.C. to assist police handling protests around the White House",
    "Police have been firing tear gas at the protesters near the 5th Precinct in Minneapolis, where some in the crowd have responded with projectiles of their own",
    "Texas and Colorado have activated the National Guard respond to protests",
    "The mayor of Rochester, New York, has declared a state of emergency and ordered a curfew from 9 p.m. Saturday to 7 a.m. Sunday",
    "Cleveland, Ohio, has enacted a curfew that will go into effect at 8 p.m. Saturday and last through 8 a.m. Sunday",
    "A police car appears to be on fire in Los Angeles. Police officers are holding back a line of demonstrators to prevent them from getting close to the car."
            ]

positive = [
    "Two NASA astronauts make history with their successful launch into space aboard a SpaceX rocket",
    "After questionable weather, officials give the all clear for the SpaceX launch",
    "NASA astronauts Bob Behnken and Doug Hurley climb aboard SpaceX's Crew Dragon spacecraft as they prepare for a mission to the International Space Station",
    "New York Gov. Andrew Cuomo signs a bill giving death benefits to families of frontline workers who died battling the coronavirus pandemic"
]

Step 3: Train the model

The data needs to be categorised to be fed into the training algorithm. Hence, we will make the required structure of the data set.

def prepare_data(positive, negative):
    data = positive + negative
    target = [0]*len(positive) + [1]*len(negative)
    return {'data': data, 'target': target}

The actual training is done by using the sklearn library.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split

def train_data_set(data_set):
    x_train, x_test, y_train, y_test = train_test_split(data_set['data'], data_set['target'])

    transfer = TfidfVectorizer()
    x_train = transfer.fit_transform(x_train)
    x_test = transfer.transform(x_test)

    estimator = MultinomialNB()
    estimator.fit(x_train, y_train)

    score = estimator.score(x_test, y_test)
    print("score:\n", score)
    return {'transfer': transfer, 'estimator': estimator}

Step 4: Get some tweets from CNN Breaking and predict

In order for this step to work you need to set up tokens for the twitter api. You can follow this tutorial in order to do that.

When you have that you can use the following code to get it running.

import tweepy


def setup_twitter():
    consumer_key = "REPLACE WITH YOUR KEY"
    consumer_secret = "REPLACE WITH YOUR SECRET"
    access_token = "REPLACE WITH YOUR TOKEN"
    access_token_secret = "REPLACE WITH YOUR TOKEN SECRET"

    # authentication of consumer key and secret
    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)

    # authentication of access token and secret
    auth.set_access_token(access_token, access_token_secret)
    api = tweepy.API(auth)
    return api


def mood_on_cnn(api, predictor):
    stat = [0, 0]
    for status in tweepy.Cursor(api.user_timeline, screen_name='@cnnbrk', tweet_mode="extended").items():
        sentence_x = predictor['transfer'].transform([status.full_text])
        y_predict = predictor['estimator'].predict(sentence_x)

        stat[y_predict[0]] += 1

    return stat

Step 5: Putting it all together

That is it.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
import tweepy


negative = [
    "Protesters who were marching from Minneapolis to St. Paul were tear gassed by police as they tried to cross the Lake Street Marshall Bridge ",
    "The National Guard has been activated in Washington, D.C. to assist police handling protests around the White House",
    "Police have been firing tear gas at the protesters near the 5th Precinct in Minneapolis, where some in the crowd have responded with projectiles of their own",
    "Texas and Colorado have activated the National Guard respond to protests",
    "The mayor of Rochester, New York, has declared a state of emergency and ordered a curfew from 9 p.m. Saturday to 7 a.m. Sunday",
    "Cleveland, Ohio, has enacted a curfew that will go into effect at 8 p.m. Saturday and last through 8 a.m. Sunday",
    "A police car appears to be on fire in Los Angeles. Police officers are holding back a line of demonstrators to prevent them from getting close to the car."
            ]

positive = [
    "Two NASA astronauts make history with their successful launch into space aboard a SpaceX rocket",
    "After questionable weather, officials give the all clear for the SpaceX launch",
    "NASA astronauts Bob Behnken and Doug Hurley climb aboard SpaceX's Crew Dragon spacecraft as they prepare for a mission to the International Space Station",
    "New York Gov. Andrew Cuomo signs a bill giving death benefits to families of frontline workers who died battling the coronavirus pandemic"
]


def prepare_data(positive, negative):
    data = positive + negative
    target = [0]*len(positive) + [1]*len(negative)
    return {'data': data, 'target': target}


def train_data_set(data_set):
    x_train, x_test, y_train, y_test = train_test_split(data_set['data'], data_set['target'])

    transfer = TfidfVectorizer()
    x_train = transfer.fit_transform(x_train)
    x_test = transfer.transform(x_test)

    estimator = MultinomialNB()
    estimator.fit(x_train, y_train)

    score = estimator.score(x_test, y_test)
    print("score:\n", score)
    return {'transfer': transfer, 'estimator': estimator}


def setup_twitter():
    consumer_key = "REPLACE WITH YOUR KEY"
    consumer_secret = "REPLACE WITH YOUR SECRET"
    access_token = "REPLACE WITH YOUR TOKEN"
    access_token_secret = "REPLACE WITH YOUR TOKEN SECRET"

    # authentication of consumer key and secret
    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)

    # authentication of access token and secret
    auth.set_access_token(access_token, access_token_secret)
    api = tweepy.API(auth)
    return api


def mood_on_cnn(api, predictor):
    stat = [0, 0]
    for status in tweepy.Cursor(api.user_timeline, screen_name='@cnnbrk', tweet_mode="extended").items():
        sentence_x = predictor['transfer'].transform([status.full_text])
        y_predict = predictor['estimator'].predict(sentence_x)

        stat[y_predict[0]] += 1

    return stat


data_set = prepare_data(positive, negative)
predictor = train_data_set(data_set)

api = setup_twitter()
stat = mood_on_cnn(api, predictor)

print(stat)
print("Mood (0 good, 1 bad)", stat[1]/(stat[0] + stat[1]))

I got the following output on the day of writing this tutorial.

score:
 1.0
[751, 2455]
Mood (0 good, 1 bad) 0.765751715533375

I found that the breaking news items are quite negative in taste. Hence, it seems to predict that.