How to Create a Sentiment Analysis model to Predict the Mood of Tweets with Python – 4 Steps to Compare the Mood of Python vs Java

What will we cover in this tutorial?

  • We will learn how the supervised Machine Learning algorithm Sentiment Analysis can be used on twitter data (also, called tweets).
  • The model we use will be Naive Bayes Classifier.
  • The tutorial will help install the necessary Python libraries to get started and how to download training data.
  • Then it will give you a full script to train the model.
  • Finally, we will use the trained model to compare the “mood” of Python with Java.

Step 1: Install the Natural Language Toolkit Library and Download Collections

We will use the Natural Language Toolkit (nltk) library in this tutorial.

NLTK is a leading platform for building Python programs to work with human language data.

http://www.nltk.org

To install the library you should run the following command in a terminal or see here for other alternatives.

pip install nltk

To have the data available that you need to run the following program or see installing NLTK Data.

import nltk
nltk.download()

This will prompt you with a screen similar to this. And select all packages you want to install (I took them all).

Download all packages to NLTK (Natural Language Toolkit)
Download all packages to NLTK (Natural Language Toolkit)

After download you can use the twitter_samples as you need in the example.

Step 2: Reminder of the Sentiment Analysis learning process (Machine Learning)

On a high level you can divide Machine Learning into two phases.

  • Phase 1: Learning
  • Phase 2: Prediction

The Sentiment Analysis model is supervised learning process. The process is defined in the picture below.

The Sentiment Analysis model (Machine Learning) Learning phase
The Sentiment Analysis model (Supervised Machine Learning) Learning phase

On a high level the the learning process of Sentiment Analysis model has the following steps.

  • Training & test data
    • The Sentiment Analysis model is a supervised learning and needs data representing the data that the model should predict. We will use tweets.
    • The data should be categorized into the groups it should be able to distinguish. In our example it will be in positive tweets and negative tweets.
  • Pre-processing
    • First you need to remove “noise”. In our case we remove URL links and Twitter user names.
    • Then you Lemmatize the data to have the words in the same form.
    • Further, you remove stop words as they have no impact of the mood in the tweet.
    • The data then needs to be formatted for the algorithm.
    • Finally, you need to divide it into a training data and testing data.
  • Learning
    • This is where the algorithm builds the model using the training data.
  • Testing
    • Then we test the accuracy of the model with the categorized test data.

Step 3: Train the Sample Data

The twitter_sample contains 5000 positive and 5000 negative tweets, all ready and classified to use in for your training model.

import random
import pickle

from nltk.corpus import twitter_samples
from nltk.stem import WordNetLemmatizer
from nltk.tag import pos_tag
from nltk.corpus import stopwords
from nltk import NaiveBayesClassifier
from nltk import classify


def clean_data(token):
    return [item for item in token if not item.startswith("http") and not item.startswith("@")]


def lemmatization(token):
    lemmatizer = WordNetLemmatizer()

    result = []
    for token, tag in pos_tag(token):
        tag = tag[0].lower()
        token = token.lower()
        if tag in "nva":
            result.append(lemmatizer.lemmatize(token, pos=tag))
        else:
            result.append(lemmatizer.lemmatize(token))
    return result


def remove_stop_words(token, stop_words):
    return [item for item in token if item not in stop_words]


def transform(token):
    result = {}
    for item in token:
        result[item] = True
    return result


def main():
    # Step 1: Gather data
    positive_tweets_tokens = twitter_samples.tokenized('positive_tweets.json')
    negative_tweets_tokens = twitter_samples.tokenized('negative_tweets.json')

    # Step 2: Clean, Lemmatize, and remove Stop Words
    stop_words = stopwords.words('english')
    positive_tweets_tokens_cleaned = [remove_stop_words(lemmatization(clean_data(token)), stop_words) for token in positive_tweets_tokens]
    negative_tweets_tokens_cleaned = [remove_stop_words(lemmatization(clean_data(token)), stop_words) for token in negative_tweets_tokens]

    # Step 3: Transform data
    positive_tweets_tokens_transformed = [(transform(token), "Positive") for token in positive_tweets_tokens_cleaned]
    negative_tweets_tokens_transformed = [(transform(token), "Negative") for token in negative_tweets_tokens_cleaned]


    # Step 4: Create data set
    dataset = positive_tweets_tokens_transformed + negative_tweets_tokens_transformed
    random.shuffle(dataset)

    train_data = dataset[:7000]
    test_data = dataset[7000:]

    # Step 5: Train data
    classifier = NaiveBayesClassifier.train(train_data)

    # Step 6: Test accuracy
    print("Accuracy is:", classify.accuracy(classifier, test_data))
    print(classifier.show_most_informative_features(10))

    # Step 7: Save the pickle
    f = open('my_classifier.pickle', 'wb')
    pickle.dump(classifier, f)
    f.close()


if __name__ == "__main__":
    main()

The code is structured in steps. If you are not comfortable how a the flow of a general machine learning flow is, I can recommend to read this tutorial here or this one.

  • Step 1: Collect and categorize It reads the 5000 positive and 5000 negative twitter samples we downloaded with the nltk.download() call.
  • Step 2: The data needs to be cleaned, Lemmatized and removed for stop words.
    • The clean_data call removes links and twitter users.
    • The call to lemmatization puts words in their base form.
    • The call to remove_stop_words removes all the stop words that have no affect on the mood of the sentence.
  • Step 3: Format data This step transforms the data to the desired format for the NaiveBayesClassifier module.
  • Step 4: Divide data Creates the full data set. Makes a shuffle to take them in different order. Then takes 70% as training data and 30% test data.
    • This data is mixed different from run to run. Hence, it might happen that you will not get the same accuracy like I will in my run.
    • The training data is used to make the model to predict from.
    • The test data is used to compute the accuracy of the model to predict.
  • Step 5: Training model This is the training of the NaiveBayesClassifier model.
    • This is where all the magic happens.
  • Step 6: Accuracy This is testing the accuracy of the model.
  • Step 7: Persist To save the model for use.

I got the following output from the above program.

Accuracy is: 0.9973333333333333
Most Informative Features
                      :) = True           Positi : Negati =   1010.7 : 1.0
                     sad = True           Negati : Positi =     25.4 : 1.0
                     bam = True           Positi : Negati =     20.2 : 1.0
                  arrive = True           Positi : Negati =     18.3 : 1.0
                     x15 = True           Negati : Positi =     17.2 : 1.0
               community = True           Positi : Negati =     14.7 : 1.0
                    glad = True           Positi : Negati =     12.6 : 1.0
                   enjoy = True           Positi : Negati =     12.0 : 1.0
                    kill = True           Negati : Positi =     12.0 : 1.0
                     ugh = True           Negati : Positi =     11.3 : 1.0
None

Step 4: Use the Sentiment Analysis prediction model

Now we can determine the mood of a tweet. To have some fun let us try to figure out the mood of tweets with Python and compare it with Java.

To do that, you need to have setup your twitter developer account. If you do not have that already, then see the this tutorial on how to do that.

In the code below you need to fill out your consumer_key, consumer_secret, access_token, and access_token_secret.

import pickle
import tweepy


def get_twitter_api():
    # personal details
    consumer_key = "___INSERT YOUR DATA HERE___"
    consumer_secret = "___INSERT YOUR DATA HERE___"
    access_token = "___INSERT YOUR DATA HERE___"
    access_token_secret = "___INSERT YOUR DATA HERE___"

    # authentication of consumer key and secret
    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)

    # authentication of access token and secret
    auth.set_access_token(access_token, access_token_secret)
    api = tweepy.API(auth)
    return api


# This function uses the functions from the learner code above
def tokenize(tweet):
    return remove_noise(word_tokenize(tweet))


def get_classifier(pickle_name):
    f = open(pickle_name, 'rb')
    classifier = pickle.load(f)
    f.close()
    return classifier


def find_mood(search):
    classifier = get_classifier('my_classifier.pickle')

    api = get_twitter_api()

    stat = {
        "Positive": 0,
        "Negative": 0
    }
    for tweet in tweepy.Cursor(api.search, q=search).items(1000):
        custom_tokens = tokenize(tweet.text)

        category = classifier.classify(dict([token, True] for token in custom_tokens))
        stat[category] += 1

    print("The mood of", search)
    print(" - Positive", stat["Positive"], round(stat["Positive"]*100/(stat["Positive"] + stat["Negative"]), 1))
    print(" - Negative", stat["Negative"], round(stat["Negative"]*100/(stat["Positive"] + stat["Negative"]), 1))


if __name__ == "__main__":
    find_mood("#java")
    find_mood("#python")

That is it. Obviously the mood of Python is better. It is easier than Java.

The mood of #java
 - Positive 524 70.4
 - Negative 220 29.6
The mood of #python
 - Positive 753 75.3
 - Negative 247 24.7

If you want to learn more about Python I can encourage you to take my course here.

5 Steps to Master the Reinforcement Learning with a Q-Learning Python Example

What will we learn in this article?

The Q-Learning algorithm is a nice and easy to understand algorithm used with Reinforcement Learning paradigm in Machine Learning. It can be implemented from scratch and we will do that in this article.

After you go through this article you will know what Reinforcement Learning is, the main types of algorithm used, fully understand the Q-learning algorithm and implement an awesome example from scratch in Python.

The steps towards that are.

  • Learn and understand what reinforcement learning in machine learning?
  • What are the main algorithm in reinforcement learning?
  • Deep dive to understand the Q-learning algorithm
  • Implement a task that we want the Q-learning algorithm to learn – first we let a random choices try (1540 steps on average).
  • Then we implement the Q-learning algorithm from scratch and let it solve learn how to solve it (22 steps).

Step 1: What is Reinforcement Learning?

Reinforcement learning teaches the machine to think for itself based on past action rewards.

Reinforcement Learning (in Machine Learning) teaches the machine to think based on past action rewards.
Reinforcement Learning (in Machine Learning) teaches the machine to think based on past action rewards.

Basically, the Reinforcement Learning algorithm tries to predict actions that gives rewards and avoids punishment.

It is like training a dog. You and the dog do not talk the same language, but the dogs learns how to act based on rewards (and punishment, which I do not advise or advocate).

Hence, if a dog is rewarded for a certain action in a given situation, then next time it is exposed to a similar situation it will act the same.

Translate that to Reinforcement Learning.

  • The agent is the dog that is exposed to the environment.
  • Then the agent encounters a state.
  • The agent performs an action to transition from that state to a new state.
  • Then after the transition the agent receives a reward or penalty (punishment).
  • This forms a policy to create a strategy to choose actions in a given state.

Step 2: What are the algorithm used for Reinforcement Learning?

The most common algorithm for Reinforcement Learning are.

We will focus on the Q-learning algorithm as it is easy to understand as well as powerful.

Step 3: Understand the Q-Learning algorithm

As already noted, I just love this algorithm. It is “easy” to understand and seems very powerful.

Q-Learning algorithm (Reinforcement / Machine Learning) - exploit or explore - Update Q-table
Q-Learning algorithm (Reinforcement / Machine Learning) – exploit or explore – Update Q-table

The Q-Learning algorithm has a Q-table (a Matrix of dimension state x actions – don’t worry if you do not understand what a Matrix is, you will not need the mathematical aspects of it – it is just an indexed “container” with numbers).

  • The agent (or Q-Learning algorithm) will be in a state.
  • Then in each iteration the agent needs take an action.
  • The agent will continuously update the reward in the Q-table.
  • The learning can come from either exploiting or exploring.

This translates into the following pseudo algorithm for the Q-Learning.

The agent is in a given state and needs to choose an action.

  • Initialise the Q-table to all zeros
  • Iterate:
    • Agent is in state state.
    • With probability epsilon choose to explore, else exploit.
      • If explore, then choose a random action.
      • If exploit, then choose the best action based on the current Q-table.
    • Update the Q-table from the new reward to the previous state.
      • Q[state, action] = (1 – alpha) * Q[state, action] + alpha * (reward + gamma * max(Q[new_state]) — Q[state, action])

As you can se, we have introduced the following variables.

  • epsilon: the probability to take a random action, which is done to explore new territory.
  • alpha: is the learning rate that the algorithm should make in each iteration and should be in the interval from 0 to 1.
  • gamma: is the discount factor used to balance the immediate and future reward. This value is usually between 0.8 and 0.99
  • reward: is the feedback on the action and can be any number. Negative is penalty (or punishment) and positive is a reward.

Step 4: A task we want the Q-learning algorithm to master

We need to test and understand our the above algorithm. So far, it is quite abstract. To do that we will create a simple task to show how the Q-learning algorithm will solve it efficient by learning by rewards.

To keep it simple, we create a field of size 10×10 positions. In that field there is an item that needs to be picket up and moved to a drop-off point.

At each position there are 6 different actions that can be taken.

  • Action 0: Go south if on field.
  • Action 1: Go north if on field.
  • Action 2: Go east if on field.
  • Action 3: Go west if on field.
  • Action 4: Pickup item (it can try even if it is not there)
  • Action 5: Drop-off item (it can try even if it does not have it)

Based on these action we will make a reward system.

  • If the agent tries to go off the field, punish with -10 in reward.
  • If the agent makes a (legal) move, punish with -1 in reward, as we do not want to encourage endless walking around.
  • If the agent tries to pick up item, but it is not there or it has it already, punish with 10.
  • If the agent picks up the item correct place, reward with 20.
  • If agent tries to drop-off item in wrong place or does not have the item, punish with 10.
  • If the agent drops-off item in correct place, reward with 20.

That translates into the following code. I prefer to implement this code, as I think the standard libraries that provide similar frameworks hide some important details. As an example, and shown later, how do you map this into a state in the Q-table?

class Field:
    def __init__(self, size, item_pickup, item_drop_off, start_position):
        self.size_x = size
        self.size_y = size
        self.item_in_car = False
        self.item_position = item_pickup
        self.item_drop_off = item_drop_off
        self.position = start_position

    def move_driver(self, action):
        (x, y) = self.item_position
        if action == 0: # south
            if y == 0:
                return -10, False
            else:
                self.item_position = (x, y-1)
                return -1, False
        elif action == 1: # north
            if y == self.size_y - 1:
                return -10, False
            else:
                self.item_position = (x, y+1)
                return -1, False
        elif action == 2: # east
            if x == self.size_x - 1:
                return -10, False
            else:
                self.item_position = (x+1, y)
                return -1, False
        elif action == 3: # west
            if x == 0:
                return -10, False
            else:
                self.item_position = (x-1, y)
                return -1, False
        elif action == 4: # pickup
            if self.item_in_car:
                return -10, False
            elif self.item_position != (x, y):
                return -10, False
            else:
                self.item_in_car = True
                return 20, False
        elif action == 5: # drop-off
            if not self.item_in_car:
                return -10, False
            elif self.item_drop_off != (x, y):
                self.item_position = (x, y)
                return -20, False
            else:
                return 20, True

If you let the agent just do random actions, how long will it take for it to succeed (to be done)? Let us try that out.

import random


size = 10
item_start = (0, 0)
item_drop_off = (9, 9)
start_position = (9, 0)

field = Field(size, item_start, item_drop_off, start_position)
done = False
steps = 0
while not done:
    action = random.randrange(0, 6)
    reward, done = field.move_driver(action)
    steps += 1
print(steps)

A single run of that resulted in 2756 steps. That seems to be inefficient. I ran it 1000 times to find an average, which resulted to 1540 steps on average.

Step 5: How the Q-learning algorithm can improve that

There is a learning phase where the Q-table is updated iteratively. But before that, we need to add two helper functions to our Field.

  • We need to be able to map the current it to a state to an index in the Q-table.
  • Further, we need to a get the number of states needed in the Q-table, which we need to know when we initialise the Q-table.
import numpy as np
import random


class Field:
    def __init__(self, size, item_pickup, item_drop_off, start_position):
        self.size_x = size
        self.size_y = size
        self.item_in_car = False
        self.item_position = item_pickup
        self.item_drop_off = item_drop_off
        self.position = start_position

    def get_number_of_states(self):
        return self.size_x*self.size_y*self.size_x*self.size_y*2

    def get_state(self):
        state = self.item_position[0]*(self.size_y*self.size_x*self.size_y*2)
        state += self.item_position[1]*(self.size_x*self.size_y*2)
        state += self.position[0] * (self.size_y * 2)
        state += self.position[1] * (2)
        if self.item_in_car:
            state += 1
        return state

    def move_driver(self, action):
        (x, y) = self.item_position
        if action == 0: # south
            if y == 0:
                return -10, False
            else:
                self.item_position = (x, y-1)
                return -1, False
        elif action == 1: # north
            if y == self.size_y - 1:
                return -10, False
            else:
                self.item_position = (x, y+1)
                return -1, False
        elif action == 2: # east
            if x == self.size_x - 1:
                return -10, False
            else:
                self.item_position = (x+1, y)
                return -1, False
        elif action == 3: # west
            if x == 0:
                return -10, False
            else:
                self.item_position = (x-1, y)
                return -1, False
        elif action == 4: # pickup
            if self.item_in_car:
                return -10, False
            elif self.item_position != (x, y):
                return -10, False
            else:
                self.item_in_car = True
                return 20, False
        elif action == 5: # drop-off
            if not self.item_in_car:
                return -10, False
            elif self.item_drop_off != (x, y):
                self.item_position = (x, y)
                return -20, False
            else:
                return 20, True

Then we can generate our Q-table by iterating over the task 1000 times (it is just an arbitrary number I chose). As you see, it simply just runs over the task again and again, but updates the Q-table with the “learnings” based on the reward.

states = field.get_number_of_states()
actions = 6

q_table = np.zeros((states, actions))

alpha = 0.1
gamma = 0.6
epsilon = 0.1

for i in range(1000):
    field = Field(size, item_start, item_drop_off, start_position)
    done = False
    steps = 0
    while not done:
        state = field.get_state()
        if random.uniform(0, 1) < epsilon:
            action = random.randrange(0, 6)
        else:
            action = np.argmax(q_table[state])

        reward, done = field.move_driver(action)
        next_state = field.get_state()

        old_value = q_table[state, action]
        next_max = np.max(q_table[next_state])

        new_value = (1 - alpha) * old_value + alpha * (reward + gamma * next_max)
        q_table[state, action] = new_value

        steps += 1

After that we can use it, our Q-table is updated. To test it, we will run the same code again, just with the updated Q-table.

alpha = 0.1
gamma = 0.6
epsilon = 0.1

field = Field(size, item_start, item_drop_off, start_position)
done = False
steps = 0
while not done:
    state = field.get_state()
    if random.uniform(0, 1) < epsilon:
        action = random.randrange(0, 6)
    else:
        action = np.argmax(q_table[state])

    reward, done = field.move_driver(action)
    next_state = field.get_state()

    old_value = q_table[state, action]
    next_max = np.max(q_table[next_state])

    new_value = (1 - alpha) * old_value + alpha * (reward + gamma * next_max)
    q_table[state, action] = new_value

    steps += 1

print(steps)

This resulted in 22 steps. That is awesome.

4 Easy Steps to Understand Unsupervised Machine Learning with an Example in Python

Step 1: Learn what is unsupervised machine learning?

An unsupervised machine learning model takes unlabelled (or categorised) data and lets the algorithm determined the answer for us.

Unsupervised Machine Learning model - takes unstructured data and finds patterns itself
Unsupervised Machine Learning model – takes unstructured data and finds patterns itself

The unsupervised machine learning model data without apparent structures and tries to identify some patterns itself to create categories.

Step 2: Understand the main types of unsupervised machine learning

There are two main types of unsupervised machine learning types.

  • Clustering: Is used for grouping data into categories without knowing any labels before hand.
  • Association: Is a rule-based for discovering interesting relations between variables in large databases.

In clustering the main algorithms used are K-means, hierarchy clustering, and hidden Markov model.

And in the association the main algorithm used are Apriori and FP-growth.

Step 3: How does K-means work

The K-means works in iterative steps

The k-means algorithm starts is an NP-hard problem, which mean there is no efficient way to solve in the general case. For this problem there are heuristics algorithms that converge fast to local optimum, which means you can find some optimum fast, but it might not be the best one, but often they can do just fine.

Enough, theory.

How does the algorithm work.

  • Step 1: Start by a set of k means. These can be chosen by taking k random point from the dataset (called the Random Partition initialisation method).
  • Step 2: Group each data point into the cluster of the nearest mean. Hence, each data point will be assigned to exactly one cluster.
  • Step 3: Recalculate the the means (also called centroids) to converge towards local optimum.

Steps 2 and 3 are repeated until the grouping in Step 2 does not change any more.

Step 4: A simple Python example with the k-means algorithm

In this example we are going to start assuming you have the basic knowledge how to install the needed libraries. If not, then see the following article.

First of, you need to import the needed libraries.

import numpy as np
import matplotlib.pyplot as plt
from matplotlib import style
from sklearn.cluster import KMeans

In the first basic example we are only going to plot some points on a graph.

style.use('ggplot')

x = [1, 2, 0.3, 9.2, 2.4,  9, 12]
y = [2, 4, 2.5, 8.5, 0.3, 11, 10]
plt.scatter(x, y)
plt.show()

The first line sets a style of the graph. Then we have the coordinates in the arrays x and y. This format is used to feed the scatter.

Output of the plot from scatter plotter in Python.
Output of the plot from scatter plotter in Python.

An advantage of plotting the points before you figure out how many clusters you want to use. Here it looks like there are two “groups” of plots, which translates into using to clusters.

To continue, we want to use the k means algorithm with two clusters.

import numpy as np
import matplotlib.pyplot as plt
from matplotlib import style
from sklearn.cluster import KMeans

style.use('ggplot')

x = [1, 2, 0.3, 9.2, 2.4,  9, 12]
y = [2, 4, 2.5, 8.5, 0.3, 11, 10]

# We need to transform the input coordinates to plot use the k means algorithm
X = []
for i in range(len(x)):
    X.append([x[i], y[i]])
X = np.array(X)

# The number of clusters
kmeans = KMeans(n_clusters=2)
kmeans.fit(X)
labels = kmeans.labels_

# Then we want to have different colors for each type.
colors = ['g.', 'r.']
for i in range(len(X)):
    # And plot them one at the time
    plt.plot(X[i][0], X[i][1], colors[labels[i]], markersize=10)

# Plot the centres (or means)
plt.scatter(centroids[:, 0], centroids[:, 1], marker= "x", s=150, linewidths=5, zorder=10)
plt.show()

This results in the following result.

Example of k means algorithm used on simple dataset
Example of k means algorithm used on simple dataset

Considerations when using K-Means algorithm

We could have changed to use 3 clusters. That would have resulted in the following output.

Using 3 clusters instead of two in the k-mean algorithm
Using 3 clusters instead of two in the k-mean algorithm

This is not optimal for this dataset, but could be hard to predict without this visual representation of the dataset.

Uses of K-Means algorithm

Here are some interesting uses of the K-means algorithms:

  • Personalised marketing to users
  • Identifying fake news
  • Spam filter in your inbox

3 Easy Steps to Get Started With Machine Learning: Understand the Concept and Implement Linear Regression in Python

What will we cover in this article?

  • What is Machine Learning and how it can help you?
  • How does Machine Learning work?
  • A first example of Linear Regression in Python

Step 1: How can Machine Learning help you?

Machine Learning is a hot topic these days and it is easy to get confused when people talk about it. But what is Machine Learning and how can it you?

I found the following explanation quite good.

Classical vs modern (No machine learning vs machine learning) approach to predictions.
Classical vs modern (No machine learning vs machine learning) approach to predictions.

In the classical computing model every thing is programmed into the algorithms. This has the limitation that all decision logic need to be understood before usage. And if things change, we need to modify the program.

With the modern computing model (Machine Learning) this paradigm is changes. We feed the algorithms with data, and based on that data, we do the decisions in the program.

While this can seem abstract, this is a big change in thinking programming. Machine Learning has helped computers to have solutions to problems like:

  • Improved search engine results.
  • Voice recognition.
  • Number plate recognition.
  • Categorisation of pictures.
  • …and the list goes on.

Step 2: How does Machine Learning work?

I’m glad you asked. I was wondering about that myself.

On a high level you can divide Machine Learning into two phases.

  • Phase 1: Learning
  • Phase 2: Prediction

The Learning phase is divided into steps.

Machine Learning: The Learning Phase: Training data, Pre-processing, Learning, Testing
Machine Learning: The Learning Phase: Training data, Pre-processing, Learning, Testing

It all starts with a training set (training data). This data set should represent the type of data that the Machine Learn model should be used to predict from in Phase 2 (predction).

The pre-processing step is about cleaning up data. While the Machine Learning is awesome, it cannot figure out what good data looks like. You need to do the cleaning as well as transforming data into a desired format.

Then for the magic, the learning step. There are three main paradigms in machine learning.

  • Supervised: where you tell the algorithm what categories each data item is in. Each data item from the training set is tagged with the right answer.
  • Unsupervised: is when the learning algorithm is not told what to do with it and it should make the structure itself.
  • Reinforcement: teaches the machine to think for itself based on past action rewards.

Finally, the testing is done to see if the model is good. The training data was divided into a test set and training set. The test set is used to see if the model can predict from it. If not, a new model might be necessary.

After that the Prediction Phase begins.

How Machine Learning predicts new data.
How Machine Learning predicts new data.

When the model has been created it will be used to predict based on it from new data.

Step 3: For our first example of Linear Regression in Python

Installing the libraries

Linear regression is a linear approach to modelling the relationship between a scalar response to one or more variables. In the case we try to model, we will do it for one single variable. Said in another way, we want map points on a graph to a line (y = a*x + b).

To do that, we need to import various libraries.

# Importing matplotlib to make a plot
import matplotlib.pyplot as plt
# work with number as array
import numpy as np
# we want to use linear_model (that uses datasets)
from sklearn import linear_model

The matplotlib library is used to make a plot, but is a comprehensive library for creating static, animated, and interactive visualizations in Python. If you do not have it installed you can do that by typing in the following command in a terminal.

pip install matplotlib

The numpy is a powerful library to calculate with N-dimensional arrays. If needed, you can install it by typing the following command in a terminal.

pip install numpy

Finally, you need the linear_model from the sklearn library, which you can install by typing the following command in a terminal.

pip install scikit-learn

Training data set

This simple example will let you make a linear regression of an input of the following data set.

# data set
prices = [245, 312, 279, 308, 199, 409, 200, 400, 230]
size = [50, 60, 35, 55, 30, 65, 30, 75, 25]

Here some items are sold, but each item has a size. The first item was sold for 245 ($) and had a size of 50 (something). The next item was sold to 312 ($) and had a size of 60 (something).

The sizes needs to be reshaped before we model it.

# Importing matplotlib and numpy and sklearn
import matplotlib.pyplot as plt
# work with number as array
import numpy as np
# we want to use linear_model (that uses datasets)
from sklearn import linear_model

# data set
prices = [245, 312, 279, 308, 199, 409, 200, 400, 230]
size = [50, 60, 35, 55, 30, 65, 30, 75, 25]

# reshape the input for regression ( second argument how many items
size2 = np.array(size).reshape((-1, 1))
print(size2)

Which results in the following output.

[[50]
 [60]
 [35]
 [55]
 [30]
 [65]
 [30]
 [75]
 [25]]

Hence, the reshape((-1, 1)) transforms it from a row to a single array.

Then for the linear regression.

# Importing matplotlib and numpy and sklearn
import matplotlib.pyplot as plt
# work with number as array
import numpy as np
# we want to use linear_model (that uses datasets)
from sklearn import linear_model

# data set
prices = [245, 312, 279, 308, 199, 409, 200, 400, 230]
size = [50, 60, 35, 55, 30, 65, 30, 75, 25]

# reshape the input for regression ( second argument how many items
size2 = np.array(size).reshape((-1, 1))
print(size2)

regr = linear_model.LinearRegression()
regr.fit(size2, prices)
print("Coefficients", regr.coef_)
print("intercepts", regr.intercept_)

Which prints out the coefficient (a) and the intercept (b) of a formula y = a*x + b.

Now you can predict future prices, when given a size.

# How to predict
size_new = 60
price = size_new * regr.coef_ + regr.intercept_
print(price)
print(regr.predict([[size_new]]))

Where you both can compute it directly (2nd line) or use the regression model (4th line).

Finally, you can plot the linear regression as a graph.

# Importing matplotlib and numpy and sklearn
import matplotlib.pyplot as plt
# work with number as array
import numpy as np
# we want to use linear_model (that uses datasets)
from sklearn import linear_model

# data set
prices = [245, 312, 279, 308, 199, 409, 200, 400, 230]
size = [50, 60, 35, 55, 30, 65, 30, 75, 25]

# reshape the input for regression ( second argument how many items
size2 = np.array(size).reshape((-1, 1))
print(size2)

regr = linear_model.LinearRegression()
regr.fit(size2, prices)

# Here we plot the graph
x = np.array(range(20, 100))
y = eval('regr.coef_*x + regr.intercept_')
plt.plot(x, y)
plt.scatter(size, prices, color='black')
plt.ylabel('prices')
plt.xlabel('size')
plt.show()

Which results in the following graph.

Example of linear regression in Python
Example of linear regression in Python

Conclusion

This is obviously a simple example of linear regression, as it only has one variable. This simple example shows you how to setup the environment in Python and how to make a simple plot.

How To Get Started with a Predictive Machine Learning Program in Python in 5 Easy Steps

What will you learn?

  • How to predict from a dataset with Machine Learning
  • How to implement that in Python
  • How to get data from Twitter
  • How to install the necessary libraries to do Machine Learning in Python

Step 1: Install the necessary libraries

The sklearn library is a simple and efficient tools for predictive data analysis.

You can install it by typing in the following in your command line.

pip install sklearn

It will most likely install a couple of more needed libraries.

Collecting sklearn
  Downloading sklearn-0.0.tar.gz (1.1 kB)
Collecting scikit-learn
  Downloading scikit_learn-0.23.1-cp38-cp38-macosx_10_9_x86_64.whl (7.2 MB)
     |████████████████████████████████| 7.2 MB 5.0 MB/s 
Collecting numpy>=1.13.3
  Downloading numpy-1.18.4-cp38-cp38-macosx_10_9_x86_64.whl (15.2 MB)
     |████████████████████████████████| 15.2 MB 12.6 MB/s 
Collecting joblib>=0.11
  Downloading joblib-0.15.1-py3-none-any.whl (298 kB)
     |████████████████████████████████| 298 kB 8.1 MB/s 
Collecting threadpoolctl>=2.0.0
  Downloading threadpoolctl-2.1.0-py3-none-any.whl (12 kB)
Collecting scipy>=0.19.1
  Downloading scipy-1.4.1-cp38-cp38-macosx_10_9_x86_64.whl (28.8 MB)
     |████████████████████████████████| 28.8 MB 5.8 MB/s 
Using legacy setup.py install for sklearn, since package 'wheel' is not installed.
Installing collected packages: numpy, joblib, threadpoolctl, scipy, scikit-learn, sklearn
    Running setup.py install for sklearn ... done
Successfully installed joblib-0.15.1 numpy-1.18.4 scikit-learn-0.23.1 scipy-1.4.1 sklearn-0.0 threadpoolctl-2.1.0

As in my installation with numpy, joblib, threadpoolctl, scipy, and scikit-learn.

Step 2: The dataset

The machine learning algorithm needs a dataset to train on. To make this tutorial simple, I only used a limited set. I looked through the top tweets from CNN Breaking and categorised them in positive and negative tweets (I know it can be subjective).

negative = [
    "Protesters who were marching from Minneapolis to St. Paul were tear gassed by police as they tried to cross the Lake Street Marshall Bridge ",
    "The National Guard has been activated in Washington, D.C. to assist police handling protests around the White House",
    "Police have been firing tear gas at the protesters near the 5th Precinct in Minneapolis, where some in the crowd have responded with projectiles of their own",
    "Texas and Colorado have activated the National Guard respond to protests",
    "The mayor of Rochester, New York, has declared a state of emergency and ordered a curfew from 9 p.m. Saturday to 7 a.m. Sunday",
    "Cleveland, Ohio, has enacted a curfew that will go into effect at 8 p.m. Saturday and last through 8 a.m. Sunday",
    "A police car appears to be on fire in Los Angeles. Police officers are holding back a line of demonstrators to prevent them from getting close to the car."
            ]

positive = [
    "Two NASA astronauts make history with their successful launch into space aboard a SpaceX rocket",
    "After questionable weather, officials give the all clear for the SpaceX launch",
    "NASA astronauts Bob Behnken and Doug Hurley climb aboard SpaceX's Crew Dragon spacecraft as they prepare for a mission to the International Space Station",
    "New York Gov. Andrew Cuomo signs a bill giving death benefits to families of frontline workers who died battling the coronavirus pandemic"
]

Step 3: Train the model

The data needs to be categorised to be fed into the training algorithm. Hence, we will make the required structure of the data set.

def prepare_data(positive, negative):
    data = positive + negative
    target = [0]*len(positive) + [1]*len(negative)
    return {'data': data, 'target': target}

The actual training is done by using the sklearn library.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split

def train_data_set(data_set):
    x_train, x_test, y_train, y_test = train_test_split(data_set['data'], data_set['target'])

    transfer = TfidfVectorizer()
    x_train = transfer.fit_transform(x_train)
    x_test = transfer.transform(x_test)

    estimator = MultinomialNB()
    estimator.fit(x_train, y_train)

    score = estimator.score(x_test, y_test)
    print("score:\n", score)
    return {'transfer': transfer, 'estimator': estimator}

Step 4: Get some tweets from CNN Breaking and predict

In order for this step to work you need to set up tokens for the twitter api. You can follow this tutorial in order to do that.

When you have that you can use the following code to get it running.

import tweepy


def setup_twitter():
    consumer_key = "REPLACE WITH YOUR KEY"
    consumer_secret = "REPLACE WITH YOUR SECRET"
    access_token = "REPLACE WITH YOUR TOKEN"
    access_token_secret = "REPLACE WITH YOUR TOKEN SECRET"

    # authentication of consumer key and secret
    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)

    # authentication of access token and secret
    auth.set_access_token(access_token, access_token_secret)
    api = tweepy.API(auth)
    return api


def mood_on_cnn(api, predictor):
    stat = [0, 0]
    for status in tweepy.Cursor(api.user_timeline, screen_name='@cnnbrk', tweet_mode="extended").items():
        sentence_x = predictor['transfer'].transform([status.full_text])
        y_predict = predictor['estimator'].predict(sentence_x)

        stat[y_predict[0]] += 1

    return stat

Step 5: Putting it all together

That is it.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
import tweepy


negative = [
    "Protesters who were marching from Minneapolis to St. Paul were tear gassed by police as they tried to cross the Lake Street Marshall Bridge ",
    "The National Guard has been activated in Washington, D.C. to assist police handling protests around the White House",
    "Police have been firing tear gas at the protesters near the 5th Precinct in Minneapolis, where some in the crowd have responded with projectiles of their own",
    "Texas and Colorado have activated the National Guard respond to protests",
    "The mayor of Rochester, New York, has declared a state of emergency and ordered a curfew from 9 p.m. Saturday to 7 a.m. Sunday",
    "Cleveland, Ohio, has enacted a curfew that will go into effect at 8 p.m. Saturday and last through 8 a.m. Sunday",
    "A police car appears to be on fire in Los Angeles. Police officers are holding back a line of demonstrators to prevent them from getting close to the car."
            ]

positive = [
    "Two NASA astronauts make history with their successful launch into space aboard a SpaceX rocket",
    "After questionable weather, officials give the all clear for the SpaceX launch",
    "NASA astronauts Bob Behnken and Doug Hurley climb aboard SpaceX's Crew Dragon spacecraft as they prepare for a mission to the International Space Station",
    "New York Gov. Andrew Cuomo signs a bill giving death benefits to families of frontline workers who died battling the coronavirus pandemic"
]


def prepare_data(positive, negative):
    data = positive + negative
    target = [0]*len(positive) + [1]*len(negative)
    return {'data': data, 'target': target}


def train_data_set(data_set):
    x_train, x_test, y_train, y_test = train_test_split(data_set['data'], data_set['target'])

    transfer = TfidfVectorizer()
    x_train = transfer.fit_transform(x_train)
    x_test = transfer.transform(x_test)

    estimator = MultinomialNB()
    estimator.fit(x_train, y_train)

    score = estimator.score(x_test, y_test)
    print("score:\n", score)
    return {'transfer': transfer, 'estimator': estimator}


def setup_twitter():
    consumer_key = "REPLACE WITH YOUR KEY"
    consumer_secret = "REPLACE WITH YOUR SECRET"
    access_token = "REPLACE WITH YOUR TOKEN"
    access_token_secret = "REPLACE WITH YOUR TOKEN SECRET"

    # authentication of consumer key and secret
    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)

    # authentication of access token and secret
    auth.set_access_token(access_token, access_token_secret)
    api = tweepy.API(auth)
    return api


def mood_on_cnn(api, predictor):
    stat = [0, 0]
    for status in tweepy.Cursor(api.user_timeline, screen_name='@cnnbrk', tweet_mode="extended").items():
        sentence_x = predictor['transfer'].transform([status.full_text])
        y_predict = predictor['estimator'].predict(sentence_x)

        stat[y_predict[0]] += 1

    return stat


data_set = prepare_data(positive, negative)
predictor = train_data_set(data_set)

api = setup_twitter()
stat = mood_on_cnn(api, predictor)

print(stat)
print("Mood (0 good, 1 bad)", stat[1]/(stat[0] + stat[1]))

I got the following output on the day of writing this tutorial.

score:
 1.0
[751, 2455]
Mood (0 good, 1 bad) 0.765751715533375

I found that the breaking news items are quite negative in taste. Hence, it seems to predict that.