How to Create a Sentiment Analysis model to Predict the Mood of Tweets with Python – 4 Steps to Compare the Mood of Python vs Java

What will we cover in this tutorial?

  • We will learn how supervised Machine Learning can be used for Sentiment Analysis on Twitter data (also called tweets).
  • The model we use will be a Naive Bayes Classifier.
  • The tutorial will show how to install the necessary Python libraries and how to download the training data.
  • Then it will give you a full script to train the model.
  • Finally, we will use the trained model to compare the “mood” of Python with Java.

Step 1: Install the Natural Language Toolkit Library and Download Collections

We will use the Natural Language Toolkit (nltk) library in this tutorial.

NLTK is a leading platform for building Python programs to work with human language data.

http://www.nltk.org

To install the library, run the following command in a terminal, or see here for alternatives.

pip install nltk

To make the data you need available, run the following program, or see Installing NLTK Data.

import nltk
nltk.download()

This will prompt you with a screen similar to the one below. Select the packages you want to install (I took them all).

Download all packages to NLTK (Natural Language Toolkit)
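
If you prefer not to download everything, the loop below is a minimal sketch that fetches only the packages this tutorial relies on (the names are the standard NLTK data identifiers; depending on your NLTK version you may need additional packages).

import nltk

# Download only the NLTK data packages used in this tutorial
for package in ['twitter_samples', 'punkt', 'wordnet',
                'averaged_perceptron_tagger', 'stopwords']:
    nltk.download(package)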

After the download you can use the twitter_samples corpus that we need in this example.
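
To check that the corpus is in place, a quick sketch like this lists its files and prints the first positive tweet.

from nltk.corpus import twitter_samples

# List the files in the twitter_samples corpus and peek at the first positive tweet
print(twitter_samples.fileids())
print(twitter_samples.strings('positive_tweets.json')[0])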

Step 2: Reminder of the Sentiment Analysis learning process (Machine Learning)

On a high level you can divide Machine Learning into two phases.

  • Phase 1: Learning
  • Phase 2: Prediction

The Sentiment Analysis model is a supervised learning process. The process is shown in the picture below.

The Sentiment Analysis model (Supervised Machine Learning) Learning phase

On a high level, the learning process of the Sentiment Analysis model has the following steps.

  • Training & test data
    • Sentiment Analysis is supervised learning, so it needs data representing what the model should predict. We will use tweets.
    • The data should be categorized into the groups the model should be able to distinguish. In our example that will be positive tweets and negative tweets.
  • Pre-processing (see the small sketch after this list)
    • First you need to remove “noise”. In our case we remove URL links and Twitter user names.
    • Then you lemmatize the data to reduce words to their base form.
    • Further, you remove stop words, as they have no impact on the mood of the tweet.
    • The data then needs to be formatted for the algorithm.
    • Finally, you need to divide it into training data and testing data.
  • Learning
    • This is where the algorithm builds the model using the training data.
  • Testing
    • Then we test the accuracy of the model with the categorized test data.
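
To make the pre-processing concrete, here is a minimal sketch of the pipeline applied to one hypothetical tweet (the tweet text is made up for illustration; TweetTokenizer is NLTK's Twitter-aware tokenizer, which keeps user names and URLs as single tokens).

from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import TweetTokenizer

# Hypothetical tweet, used only to illustrate the pipeline
tweet = "@alice loving the new release :) http://example.com"
tokens = TweetTokenizer().tokenize(tweet)
# Remove noise: URL links and Twitter user names
tokens = [t for t in tokens if not t.startswith("http") and not t.startswith("@")]
# Lemmatize: e.g. "loving" becomes "love" when treated as a verb
lemmatizer = WordNetLemmatizer()
tokens = [lemmatizer.lemmatize(t.lower(), pos='v') for t in tokens]
# Remove stop words such as "the"
tokens = [t for t in tokens if t not in stopwords.words('english')]
print(tokens)  # e.g. ['love', 'new', 'release', ':)']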

Step 3: Train the Sample Data

The twitter_samples corpus contains 5000 positive and 5000 negative tweets, all categorized and ready to use for training your model.

import random
import pickle
from nltk.corpus import twitter_samples
from nltk.stem import WordNetLemmatizer
from nltk.tag import pos_tag
from nltk.corpus import stopwords
from nltk import NaiveBayesClassifier
from nltk import classify

def clean_data(token):
    return [item for item in token if not item.startswith("http") and not item.startswith("@")]

def lemmatization(tokens):
    lemmatizer = WordNetLemmatizer()
    result = []
    for token, tag in pos_tag(tokens):
        pos = tag[0].lower()
        token = token.lower()
        # Penn Treebank tags adjectives with 'J', but WordNetLemmatizer expects 'a'
        if pos == 'j':
            pos = 'a'
        if pos in "nva":
            result.append(lemmatizer.lemmatize(token, pos=pos))
        else:
            result.append(lemmatizer.lemmatize(token))
    return result

def remove_stop_words(token, stop_words):
    return [item for item in token if item not in stop_words]

def transform(token):
    result = {}
    for item in token:
        result[item] = True
    return result

def main():
    # Step 1: Gather data
    positive_tweets_tokens = twitter_samples.tokenized('positive_tweets.json')
    negative_tweets_tokens = twitter_samples.tokenized('negative_tweets.json')
    # Step 2: Clean, Lemmatize, and remove Stop Words
    stop_words = stopwords.words('english')
    positive_tweets_tokens_cleaned = [remove_stop_words(lemmatization(clean_data(token)), stop_words) for token in positive_tweets_tokens]
    negative_tweets_tokens_cleaned = [remove_stop_words(lemmatization(clean_data(token)), stop_words) for token in negative_tweets_tokens]
    # Step 3: Transform data
    positive_tweets_tokens_transformed = [(transform(token), "Positive") for token in positive_tweets_tokens_cleaned]
    negative_tweets_tokens_transformed = [(transform(token), "Negative") for token in negative_tweets_tokens_cleaned]

    # Step 4: Create data set
    dataset = positive_tweets_tokens_transformed + negative_tweets_tokens_transformed
    random.shuffle(dataset)
    train_data = dataset[:7000]
    test_data = dataset[7000:]
    # Step 5: Train data
    classifier = NaiveBayesClassifier.train(train_data)
    # Step 6: Test accuracy
    print("Accuracy is:", classify.accuracy(classifier, test_data))
    print(classifier.show_most_informative_features(10))
    # Step 7: Save the trained classifier with pickle
    with open('my_classifier.pickle', 'wb') as f:
        pickle.dump(classifier, f)

if __name__ == "__main__":
    main()

The code is structured in steps. If you are not comfortable with the general flow of a machine learning project, I can recommend reading this tutorial here or this one.

  • Step 1: Collect and categorize. It reads the 5000 positive and 5000 negative Twitter samples we downloaded with the nltk.download() call.
  • Step 2: The data needs to be cleaned, lemmatized, and stripped of stop words.
    • The clean_data call removes links and Twitter user names.
    • The call to lemmatization puts words in their base form.
    • The call to remove_stop_words removes all the stop words, which have no effect on the mood of the sentence.
  • Step 3: Format data. This step transforms the data into the format the NaiveBayesClassifier module expects.
  • Step 4: Divide data. Creates the full data set, shuffles it to get the tweets in a different order, then takes 70% as training data and 30% as test data.
    • The data is shuffled differently from run to run. Hence, you might not get the same accuracy as I do in my run.
    • The training data is used to build the model that makes the predictions.
    • The test data is used to compute the accuracy of the model's predictions.
  • Step 5: Train model. This is the training of the NaiveBayesClassifier model.
    • This is where all the magic happens.
  • Step 6: Accuracy. This is testing the accuracy of the model.
  • Step 7: Persist. This saves the model for later use (see the loading sketch after this list).
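
As a minimal sketch of how the persisted model can be used later, the following loads the pickle and classifies one hand-written sentence (the sentence is made up; for simplicity it skips the cleaning pipeline, which the full prediction script in Step 4 includes).

import pickle
from nltk.tokenize import TweetTokenizer

# Load the classifier saved in Step 7 and classify a single made-up sentence
with open('my_classifier.pickle', 'rb') as f:
    classifier = pickle.load(f)

tokens = TweetTokenizer().tokenize("I love this, it is great!")
print(classifier.classify({token: True for token in tokens}))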

I got the following output from the above program.

Accuracy is: 0.9973333333333333
Most Informative Features
                      :) = True           Positi : Negati =   1010.7 : 1.0
                     sad = True           Negati : Positi =     25.4 : 1.0
                     bam = True           Positi : Negati =     20.2 : 1.0
                  arrive = True           Positi : Negati =     18.3 : 1.0
                     x15 = True           Negati : Positi =     17.2 : 1.0
               community = True           Positi : Negati =     14.7 : 1.0
                    glad = True           Positi : Negati =     12.6 : 1.0
                   enjoy = True           Positi : Negati =     12.0 : 1.0
                    kill = True           Negati : Positi =     12.0 : 1.0
                     ugh = True           Negati : Positi =     11.3 : 1.0
None

Step 4: Use the Sentiment Analysis prediction model

Now we can determine the mood of a tweet. To have some fun, let us try to figure out the mood of tweets about Python and compare it with Java.

To do that, you need to have set up your Twitter developer account. If you do not have that already, then see this tutorial on how to do that.

In the code below you need to fill out your consumer_key, consumer_secret, access_token, and access_token_secret.

import pickle
import tweepy
from nltk.corpus import stopwords
from nltk.tokenize import TweetTokenizer

def get_twitter_api():
    # personal details
    consumer_key = "___INSERT YOUR DATA HERE___"
    consumer_secret = "___INSERT YOUR DATA HERE___"
    access_token = "___INSERT YOUR DATA HERE___"
    access_token_secret = "___INSERT YOUR DATA HERE___"
    # authentication of consumer key and secret
    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
    # authentication of access token and secret
    auth.set_access_token(access_token, access_token_secret)
    api = tweepy.API(auth)
    return api

# This function reuses clean_data, lemmatization, and remove_stop_words from the training script above
def tokenize(tweet):
    # TweetTokenizer keeps user names and URLs as single tokens, so clean_data can filter them
    return remove_stop_words(lemmatization(clean_data(TweetTokenizer().tokenize(tweet))),
                             stopwords.words('english'))

def get_classifier(pickle_name):
    with open(pickle_name, 'rb') as f:
        return pickle.load(f)

def find_mood(search):
    classifier = get_classifier('my_classifier.pickle')
    api = get_twitter_api()
    stat = {
        "Positive": 0,
        "Negative": 0
    }
    # Note: api.search was renamed to api.search_tweets in tweepy v4
    for tweet in tweepy.Cursor(api.search, q=search).items(1000):
        custom_tokens = tokenize(tweet.text)
        category = classifier.classify(dict([token, True] for token in custom_tokens))
        stat[category] += 1
    print("The mood of", search)
    print(" - Positive", stat["Positive"], round(stat["Positive"]*100/(stat["Positive"] + stat["Negative"]), 1))
    print(" - Negative", stat["Negative"], round(stat["Negative"]*100/(stat["Positive"] + stat["Negative"]), 1))

if __name__ == "__main__":
    find_mood("#java")
    find_mood("#python")

That is it. Obviously the mood of Python is better. It is easier than Java.

The mood of #java
 - Positive 524 70.4
 - Negative 220 29.6
The mood of #python
 - Positive 753 75.3
 - Negative 247 24.7

If you want to learn more about Python I can encourage you to take my course here.

5 Steps to Master Reinforcement Learning with a Q-Learning Python Example

What will we learn in this article?

The Q-Learning algorithm is a nice and easy-to-understand algorithm used within the Reinforcement Learning paradigm of Machine Learning. It can be implemented from scratch, and we will do that in this article.

After you go through this article you will know what Reinforcement Learning is, know the main types of algorithms used, fully understand the Q-learning algorithm, and implement an awesome example from scratch in Python.

The steps towards that are.

  • Learn and understand what reinforcement learning in machine learning is.
  • What are the main algorithms in reinforcement learning?
  • Deep dive to understand the Q-learning algorithm.
  • Implement a task that we want the Q-learning algorithm to learn – first we let random choices try to solve it (1540 steps on average).
  • Then we implement the Q-learning algorithm from scratch and let it learn how to solve it (22 steps).

Step 1: What is Reinforcement Learning?

Reinforcement learning teaches the machine to think for itself based on past action rewards.

Reinforcement Learning (in Machine Learning) teaches the machine to think based on past action rewards.

Basically, the Reinforcement Learning algorithm tries to choose actions that give rewards and avoid punishment.

It is like training a dog. You and the dog do not talk the same language, but the dog learns how to act based on rewards (and punishment, which I do not advise or advocate).

Hence, if a dog is rewarded for a certain action in a given situation, then next time it is exposed to a similar situation it will act the same.

Translate that to Reinforcement Learning.

  • The agent is the dog that is exposed to the environment.
  • Then the agent encounters a state.
  • The agent performs an action to transition from that state to a new state.
  • Then after the transition the agent receives a reward or penalty (punishment).
  • This forms a policy to create a strategy to choose actions in a given state.

Step 2: What are the algorithms used for Reinforcement Learning?

The most common algorithms for Reinforcement Learning include Q-learning, SARSA, and Deep Q-Networks (DQN).

We will focus on the Q-learning algorithm as it is easy to understand as well as powerful.

Step 3: Understand the Q-Learning algorithm

As already noted, I just love this algorithm. It is “easy” to understand and seems very powerful.

Q-Learning algorithm (Reinforcement / Machine Learning) - exploit or explore - Update Q-table
Q-Learning algorithm (Reinforcement / Machine Learning) – exploit or explore – Update Q-table

The Q-Learning algorithm has a Q-table (a matrix of dimensions states × actions – don’t worry if you do not know what a matrix is, you will not need the mathematical aspects of it – it is just an indexed “container” of numbers).

  • The agent (or Q-Learning algorithm) will be in a state.
  • Then in each iteration the agent needs to take an action.
  • The agent will continuously update the reward in the Q-table.
  • The learning can come from either exploiting or exploring.

This translates into the following pseudo algorithm for the Q-Learning.

The agent is in a given state and needs to choose an action.

  • Initialise the Q-table to all zeros.
  • Iterate:
    • The agent is in the current state state.
    • With probability epsilon choose to explore, else exploit.
      • If exploring, then choose a random action.
      • If exploiting, then choose the best action based on the current Q-table.
    • Update the Q-table from the new reward to the previous state.
      • Q[state, action] = (1 - alpha) * Q[state, action] + alpha * (reward + gamma * max(Q[new_state]))

As you can see, we have introduced the following variables.

  • epsilon: the probability of taking a random action, which is done to explore new territory.
  • alpha: the learning rate, i.e. how big a step the algorithm takes in each update; it should be in the interval from 0 to 1.
  • gamma: the discount factor used to balance the immediate and future reward. This value is usually between 0.8 and 0.99.
  • reward: the feedback on the action; it can be any number. Negative is a penalty (or punishment) and positive is a reward.
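
As a minimal Python sketch, one application of the update rule looks like this (the table sizes and hyperparameter values are placeholders; the full, working loop follows in Step 5).

import numpy as np

# Placeholder sizes and hyperparameters, just to illustrate the update rule
n_states, n_actions = 100, 6
alpha, gamma = 0.1, 0.6
q_table = np.zeros((n_states, n_actions))

def q_update(state, action, reward, new_state):
    # Blend the old estimate with the observed reward plus the discounted best future value
    q_table[state, action] = (1 - alpha) * q_table[state, action] \
        + alpha * (reward + gamma * np.max(q_table[new_state]))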

Step 4: A task we want the Q-learning algorithm to master

We need to test and understand the above algorithm. So far, it is quite abstract. To do that, we will create a simple task and show how the Q-learning algorithm solves it efficiently by learning from rewards.

To keep it simple, we create a field of size 10×10 positions. In that field there is an item that needs to be picked up and moved to a drop-off point.

At each position there are 6 different actions that can be taken.

  • Action 0: Go south if on field.
  • Action 1: Go north if on field.
  • Action 2: Go east if on field.
  • Action 3: Go west if on field.
  • Action 4: Pick up the item (the agent can try even if the item is not there)
  • Action 5: Drop off the item (the agent can try even if it does not have the item)

Based on these actions we will make a reward system.

  • If the agent tries to go off the field, punish with -10 in reward.
  • If the agent makes a (legal) move, punish with -1 in reward, as we do not want to encourage endless walking around.
  • If the agent tries to pick up the item, but it is not there or the agent already has it, punish with -10 in reward.
  • If the agent picks up the item at the correct place, reward with 20.
  • If the agent tries to drop off the item without having it, punish with -10 in reward; if it drops the item at the wrong place, punish with -20 (the item then has to be picked up again).
  • If the agent drops off the item at the correct place, reward with 20.

That translates into the following code. I prefer to implement this code myself, as I think the standard libraries that provide similar frameworks hide some important details. For example, as shown later: how do you map this setup into a state in the Q-table?

class Field:
    def __init__(self, size, item_pickup, item_drop_off, start_position):
        self.size_x = size
        self.size_y = size
        self.item_in_car = False
        self.item_position = item_pickup
        self.item_drop_off = item_drop_off
        self.position = start_position
    def move_driver(self, action):
        # The driver (agent) moves around the field; self.position is its location
        (x, y) = self.position
        if action == 0:  # south
            if y == 0:
                return -10, False
            else:
                self.position = (x, y-1)
                return -1, False
        elif action == 1:  # north
            if y == self.size_y - 1:
                return -10, False
            else:
                self.position = (x, y+1)
                return -1, False
        elif action == 2:  # east
            if x == self.size_x - 1:
                return -10, False
            else:
                self.position = (x+1, y)
                return -1, False
        elif action == 3:  # west
            if x == 0:
                return -10, False
            else:
                self.position = (x-1, y)
                return -1, False
        elif action == 4:  # pickup
            if self.item_in_car:
                return -10, False
            elif self.item_position != (x, y):
                return -10, False
            else:
                self.item_in_car = True
                return 20, False
        elif action == 5:  # drop-off
            if not self.item_in_car:
                return -10, False
            elif self.item_drop_off != (x, y):
                # Item is dropped at the wrong place and must be picked up again
                self.item_position = (x, y)
                self.item_in_car = False
                return -20, False
            else:
                return 20, True

If we let the agent just take random actions, how long will it take before it succeeds (is done)? Let us try that out.

import random

size = 10
item_start = (0, 0)
item_drop_off = (9, 9)
start_position = (9, 0)
field = Field(size, item_start, item_drop_off, start_position)
done = False
steps = 0
while not done:
    action = random.randrange(0, 6)
    reward, done = field.move_driver(action)
    steps += 1
print(steps)

A single run of that resulted in 2756 steps. That seems inefficient. I ran it 1000 times to find an average, which resulted in 1540 steps on average.
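
For reference, a minimal sketch of that averaging experiment could look like this (1000 runs is the same arbitrary count as in the text, and the coordinates match the setup above).

import random

# Average the number of random steps over many runs
total_steps = 0
runs = 1000
for _ in range(runs):
    field = Field(10, (0, 0), (9, 9), (9, 0))
    done = False
    while not done:
        reward, done = field.move_driver(random.randrange(0, 6))
        total_steps += 1
print(total_steps / runs)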

Step 5: How the Q-learning algorithm can improve that

There is a learning phase where the Q-table is updated iteratively. But before that, we need to add two helper functions to our Field class.

  • We need to be able to map the current state of the game to an index in the Q-table.
  • Further, we need to get the number of states needed in the Q-table, which we must know when we initialise the Q-table.

import numpy as np
import random

class Field:
    def __init__(self, size, item_pickup, item_drop_off, start_position):
        self.size_x = size
        self.size_y = size
        self.item_in_car = False
        self.item_position = item_pickup
        self.item_drop_off = item_drop_off
        self.position = start_position
    def get_number_of_states(self):
        # item position (x, y) * driver position (x, y) * item_in_car flag
        return self.size_x * self.size_y * self.size_x * self.size_y * 2
    def get_state(self):
        # Encode (item position, driver position, item_in_car) as one index,
        # like digits in a mixed-radix number system
        state = self.item_position[0] * (self.size_y * self.size_x * self.size_y * 2)
        state += self.item_position[1] * (self.size_x * self.size_y * 2)
        state += self.position[0] * (self.size_y * 2)
        state += self.position[1] * 2
        if self.item_in_car:
            state += 1
        return state
    def move_driver(self, action):
        # The driver (agent) moves around the field; self.position is its location
        (x, y) = self.position
        if action == 0:  # south
            if y == 0:
                return -10, False
            else:
                self.position = (x, y-1)
                return -1, False
        elif action == 1:  # north
            if y == self.size_y - 1:
                return -10, False
            else:
                self.position = (x, y+1)
                return -1, False
        elif action == 2:  # east
            if x == self.size_x - 1:
                return -10, False
            else:
                self.position = (x+1, y)
                return -1, False
        elif action == 3:  # west
            if x == 0:
                return -10, False
            else:
                self.position = (x-1, y)
                return -1, False
        elif action == 4:  # pickup
            if self.item_in_car:
                return -10, False
            elif self.item_position != (x, y):
                return -10, False
            else:
                self.item_in_car = True
                return 20, False
        elif action == 5:  # drop-off
            if not self.item_in_car:
                return -10, False
            elif self.item_drop_off != (x, y):
                # Item is dropped at the wrong place and must be picked up again
                self.item_position = (x, y)
                self.item_in_car = False
                return -20, False
            else:
                return 20, True
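
A quick, hypothetical sanity check of these helpers, using the same coordinates as this tutorial's setup: every distinct combination of item position, driver position, and item_in_car maps to a unique index.

field = Field(10, (0, 0), (9, 9), (9, 0))
print(field.get_number_of_states())  # 10*10*10*10*2 = 20000
print(field.get_state())             # index of the starting configuration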

Then we can generate our Q-table by iterating over the task 1000 times (just an arbitrary number I chose). As you can see, it simply runs the task again and again, updating the Q-table with the “learnings” based on the reward.

# Reuses size, item_start, item_drop_off, and field from the random-run snippet above
states = field.get_number_of_states()
actions = 6
q_table = np.zeros((states, actions))
alpha = 0.1
gamma = 0.6
epsilon = 0.1
for i in range(1000):
    field = Field(size, item_start, item_drop_off, start_position)
    done = False
    steps = 0
    while not done:
        state = field.get_state()
        if random.uniform(0, 1) < epsilon:
            action = random.randrange(0, 6)
        else:
            action = np.argmax(q_table[state])
        reward, done = field.move_driver(action)
        next_state = field.get_state()
        old_value = q_table[state, action]
        next_max = np.max(q_table[next_state])
        new_value = (1 - alpha) * old_value + alpha * (reward + gamma * next_max)
        q_table[state, action] = new_value
        steps += 1

After that our Q-table is updated and ready to use. To test it, we will run the same code again, just with the updated Q-table.

alpha = 0.1
gamma = 0.6
epsilon = 0.1
field = Field(size, item_start, item_drop_off, start_position)
done = False
steps = 0
while not done:
    state = field.get_state()
    if random.uniform(0, 1) < epsilon:
        action = random.randrange(0, 6)
    else:
        action = np.argmax(q_table[state])
    reward, done = field.move_driver(action)
    next_state = field.get_state()
    old_value = q_table[state, action]
    next_max = np.max(q_table[next_state])
    new_value = (1 - alpha) * old_value + alpha * (reward + gamma * next_max)
    q_table[state, action] = new_value
    steps += 1
print(steps)

This resulted in 22 steps. That is awesome.
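
Note that the test above still explores 10% of the time and keeps updating the table. A variant sketch that evaluates the learned policy purely greedily could look like this (assuming the table is well trained; with an untrained table this loop could run forever).

# Evaluate greedily: always take the best known action, no exploration or updates
field = Field(size, item_start, item_drop_off, start_position)
done = False
steps = 0
while not done:
    action = np.argmax(q_table[field.get_state()])
    reward, done = field.move_driver(action)
    steps += 1
print(steps)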

4 Easy Steps to Understand Unsupervised Machine Learning with an Example in Python

Step 1: Learn what unsupervised machine learning is

An unsupervised machine learning model takes unlabelled (or uncategorised) data and lets the algorithm determine the answer for us.

Unsupervised Machine Learning model – takes unstructured data and finds patterns itself

The unsupervised machine learning model takes data without apparent structure and tries to identify patterns itself to create categories.

Step 2: Understand the main types of unsupervised machine learning

There are two main types of unsupervised machine learning.

  • Clustering: Used for grouping data into categories without knowing any labels beforehand.
  • Association: A rule-based method for discovering interesting relations between variables in large databases.

In clustering the main algorithms used are K-means, hierarchical clustering, and hidden Markov models.

And in association the main algorithms used are Apriori and FP-growth.

Step 3: How does K-means work

The K-means algorithm works in iterative steps.

The k-means problem is NP-hard, which means there is no known efficient way to solve it in the general case. For this problem there are heuristic algorithms that converge fast to a local optimum. That means you can find some optimum fast, and while it might not be the best one, in practice it often does just fine.

Enough theory.

How does the algorithm work?

  • Step 1: Start with a set of k means. These can be chosen by taking k random points from the dataset (called the Forgy initialisation method).
  • Step 2: Group each data point into the cluster of the nearest mean. Hence, each data point is assigned to exactly one cluster.
  • Step 3: Recalculate the means (also called centroids) to converge towards a local optimum.

Steps 2 and 3 are repeated until the grouping in Step 2 no longer changes. A small from-scratch sketch of these two steps follows below.
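
Here is a minimal from-scratch sketch of steps 2 and 3, assuming 2-D points and k = 2 (it ignores edge cases such as a cluster becoming empty; the real example below uses scikit-learn instead).

import numpy as np

# The same seven points used in the scikit-learn example below
points = np.array([[1, 2], [2, 4], [0.3, 2.5], [9.2, 8.5], [2.4, 0.3], [9, 11], [12, 10]])
# Forgy initialisation: pick k random points as the initial means
means = points[np.random.choice(len(points), size=2, replace=False)]

for _ in range(10):
    # Step 2: assign each point to the cluster of the nearest mean
    distances = np.linalg.norm(points[:, None] - means[None, :], axis=2)
    labels = distances.argmin(axis=1)
    # Step 3: recalculate the means as the centroids of each cluster
    means = np.array([points[labels == k].mean(axis=0) for k in range(2)])

print(labels)
print(means)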

Step 4: A simple Python example with the k-means algorithm

In this example we are going to assume you have basic knowledge of how to install the needed libraries. If not, then see the following article.

First off, you need to import the needed libraries.

import numpy as np
import matplotlib.pyplot as plt
from matplotlib import style
from sklearn.cluster import KMeans

In the first basic example we are only going to plot some points on a graph.

style.use('ggplot')
x = [1, 2, 0.3, 9.2, 2.4,  9, 12]
y = [2, 4, 2.5, 8.5, 0.3, 11, 10]
plt.scatter(x, y)
plt.show()

The first line sets the style of the graph. Then we have the coordinates in the arrays x and y. This format is used to feed the scatter plotter.

Output of the plot from scatter plotter in Python.

An advantage of plotting the points first is that you can figure out how many clusters to use before running the algorithm. Here it looks like there are two “groups” of points, which translates into using two clusters.

To continue, we want to use the k means algorithm with two clusters.

import numpy as np
import matplotlib.pyplot as plt
from matplotlib import style
from sklearn.cluster import KMeans
style.use('ggplot')
x = [1, 2, 0.3, 9.2, 2.4,  9, 12]
y = [2, 4, 2.5, 8.5, 0.3, 11, 10]
# We need to transform the input coordinates into the format the k-means algorithm expects
X = []
for i in range(len(x)):
    X.append([x[i], y[i]])
X = np.array(X)
# The number of clusters
kmeans = KMeans(n_clusters=2)
kmeans.fit(X)
labels = kmeans.labels_
centroids = kmeans.cluster_centers_
# Then we want to have different colors for each type
colors = ['g.', 'r.']
for i in range(len(X)):
    # And plot them one at a time
    plt.plot(X[i][0], X[i][1], colors[labels[i]], markersize=10)
# Plot the centres (or means)
plt.scatter(centroids[:, 0], centroids[:, 1], marker="x", s=150, linewidths=5, zorder=10)
plt.show()

This gives the following result.

Example of k means algorithm used on simple dataset
Example of k means algorithm used on simple dataset

Considerations when using K-Means algorithm

We could have chosen to use 3 clusters instead. That would have resulted in the following output.

Using 3 clusters instead of two in the k-mean algorithm
Using 3 clusters instead of two in the k-mean algorithm

This is not optimal for this dataset, but that could be hard to predict without this visual representation of the dataset.
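
Once fitted, the model can also assign new, unseen points to the nearest cluster. A short sketch (the two points are made up for illustration):

import numpy as np

# Assign hypothetical new points to the clusters learned above
print(kmeans.predict(np.array([[1.5, 3.0], [10.0, 9.0]])))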

Uses of K-Means algorithm

Here are some interesting uses of the K-means algorithm:

  • Personalised marketing to users
  • Identifying fake news
  • Spam filter in your inbox