Twitter-API + Python: Mapping all Your Followers Locations on a Choropleth Map

What will we cover in this tutorial?

How to find the locations of all your followers on Twitter and create a choropleth map (a map where the color of each shape is based on the value of an associated variable) covering all countries. This will all be done using Python.

This is done out of my interest in where the followers of my Twitter account are from. Today my result looks like this.

The Choropleth map of the followers of PythonWithRune on Twitter

Step 1: How to get the followers from your Twitter account

If you are new to the Twitter API you will need to create a developer account to get your secret key. You can follow this tutorial to create your developer account and get the needed tokens.

When that is done, you can use the tweepy library to connect to the Twitter API. The library function api.followers_ids(api.me().id) will give you a list of all your followers by user-id.

import tweepy

# Used to connect to the Twitter API
def get_twitter_api():
    # You need your own keys/secret/tokens here
    consumer_key = "--- INSERT YOUR KEY HERE ---"
    consumer_secret = "--- INSERT YOUR SECRET HERE ---"
    access_token = "--- INSERT YOUR TOKEN HERE ---"
    access_token_secret = "--- INSERT YOUR TOKEN SECRET HERE ---"

    # authentication of consumer key and secret
    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)

    # authentication of access token and secret
    auth.set_access_token(access_token, access_token_secret)
    api = tweepy.API(auth, wait_on_rate_limit=True)
    return api


# This function is used to process it all
def process():
    # Connecting to the twitter api
    api = get_twitter_api()

    # Get the list of all your followers - it only gives user-id's
    # - we need to gather all user data after
    followers = api.followers_ids(api.me().id)
    print("Followers", len(followers))


if __name__ == "__main__":
    process()

This will print out the number of followers you have on your account.

Step 2: Get the location of your followers

How do we transform the Twitter user-ids into locations?

We need to look them all up. Luckily, not one-by-one. We can do it in chunks of 100 users per call.

The function api.lookup_users(…) can lookup 100 users per call with users-ids or user-names.

import tweepy

# Used to connect to the Twitter API
def get_twitter_api():
    # You need your own keys/secret/tokens here
    consumer_key = "--- INSERT YOUR KEY HERE ---"
    consumer_secret = "--- INSERT YOUR SECRET HERE ---"
    access_token = "--- INSERT YOUR TOKEN HERE ---"
    access_token_secret = "--- INSERT YOUR TOKEN SECRET HERE ---"

    # authentication of consumer key and secret
    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)

    # authentication of access token and secret
    auth.set_access_token(access_token, access_token_secret)
    api = tweepy.API(auth, wait_on_rate_limit=True)
    return api


# This function is used to process it all
def process():
    # Connecting to the twitter api
    api = get_twitter_api()

    # Get the list of all your followers - it only gives user-id's
    # - we need to gather all user data after
    followers = api.followers_ids(api.me().id)
    print("Followers", len(followers))

    # We need to chunk it up in sizes of 100 (max for api.lookup_users)
    followers_chunks = [followers[i:i + 100] for i in range(0, len(followers), 100)]
    # Process each chunk - we can call for 100 users per call
    for follower_chunk in followers_chunks:
        # Get a list of users (with location data)
        users = api.lookup_users(user_ids=follower_chunk)
        # Process each user to get location
        for user in users:
            # Print user location
            print(user.location)


if __name__ == "__main__":
    process()

Before you execute this code, you should know that it will print all the locations your followers have set.

Step 3: Map all user locations to the same format

When users write their locations, they do it in various ways, as these examples show.

India
Kenya
Temecula, CA
Atlanta, GA
Florida, United States
Hyderabad, India
Atlanta, GA
Agadir / Khouribga, Morocco
Miami, FL
Republic of the Philippines
Tampa, FL
Sammamish, WA
Coffee-machine

And as the last example shows, it might not be a real location. Hence, we need to see if we can find the location by asking a service. For this purpose, we will use the GeoPy library, which is a client for several popular geocoding web services.

Hence, for each of the user-specified locations (like the examples above) we will call GeoPy and use its result as the location. This brings everything into the same format and clarifies whether the location exists at all.
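
Before wiring GeoPy into the full script below, here is a minimal sketch of a single lookup (the location string is one of the examples above, and the exact address returned depends on the Nominatim service):

from geopy.geocoders import Nominatim

# One lookup of a user-written location string
geo_locator = Nominatim(user_agent="LearnPython")
location = geo_locator.geocode("Temecula, CA", language='en')
if location:
    print(location.address)   # e.g., Temecula, Riverside County, California, United States
    print(location.latitude, location.longitude)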

import tweepy
from geopy.exc import GeocoderTimedOut
from geopy.geocoders import Nominatim


# Used to connect to the Twitter API
def get_twitter_api():
    # You need your own keys/secret/tokens here
    consumer_key = "--- INSERT YOUR KEY HERE ---"
    consumer_secret = "--- INSERT YOUR SECRET HERE ---"
    access_token = "--- INSERT YOUR TOKEN HERE ---"
    access_token_secret = "--- INSERT YOUR TOKEN SECRET HERE ---"

    # authentication of consumer key and secret
    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)

    # authentication of access token and secret
    auth.set_access_token(access_token, access_token_secret)
    api = tweepy.API(auth, wait_on_rate_limit=True)
    return api


# Used to map the twitter user location description to a standard format
def lookup_location(location):
    geo_locator = Nominatim(user_agent="LearnPython")
    try:
        location = geo_locator.geocode(location, language='en')
    except GeocoderTimedOut:
        return None
    return location


# This function is used to process it all
def process():
    # Connecting to the twitter api
    api = get_twitter_api()

    # Get the list of all your followers - it only gives user-id's
    # - we need to gather all user data after
    followers = api.followers_ids(api.me().id)
    print("Followers", len(followers))

    # Used to store all the locations from users
    locations = {}

    # We need to chunk it up in sizes of 100 (max for api.lookup_users)
    followers_chunks = [followers[i:i + 100] for i in range(0, len(followers), 100)]
    # Process each chunk - we can call for 100 users per call
    for follower_chunk in followers_chunks:
        # Get a list of users (with location data)
        users = api.lookup_users(user_ids=follower_chunk)
        # Process each user to get location
        for user in users:
            # Call used to transform users description of location to same format
            location = lookup_location(user.location)
            # Add it to our counter
            if location:
                location = location.address
                location = location.split(',')[-1].strip()
            if location in locations:
                locations[location] += 1
            else:
                locations[location] = 1


if __name__ == "__main__":
    process()

As you see, it will count the occurrences of each location found. The split and strip are used to get the country and leave out the rest of the address, if any.
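
As an example, if GeoPy returned the (hypothetical) address below, only the last comma-separated part would end up in the counter:

# Hypothetical address string returned by GeoPy
address = "Temecula, Riverside County, California, United States"
country = address.split(',')[-1].strip()
print(country)  # United States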

Step 4: Reformat the locations into a Pandas DataFrame

We want to reformat the locations into a DataFrame to be able to join (merge) it with GeoPandas, which contains the choropleth map we want to use.

To convert the locations into a DataFrame we need to restructure them. This also helps us remove duplicates; for example, United States and United States of America both appear. To handle that we will map all country names to a 3-letter code, using the pycountry library.
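
To see what pycountry does before it appears in the full script below, here is a minimal sketch (assuming pycountry resolves both the common and the official country name):

import pycountry

# Both names resolve to the same ISO 3166-1 alpha-3 code
print(pycountry.countries.lookup('United States').alpha_3)             # USA
print(pycountry.countries.lookup('United States of America').alpha_3)  # USA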

import tweepy
import pycountry
import pandas as pd
from geopy.exc import GeocoderTimedOut
from geopy.geocoders import Nominatim


# Used to connect to the Twitter API
def get_twitter_api():
    # You need your own keys/secret/tokens here
    consumer_key = "--- INSERT YOUR KEY HERE ---"
    consumer_secret = "--- INSERT YOUR SECRET HERE ---"
    access_token = "--- INSERT YOUR TOKEN HERE ---"
    access_token_secret = "--- INSERT YOUR TOKEN SECRET HERE ---"

    # authentication of consumer key and secret
    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)

    # authentication of access token and secret
    auth.set_access_token(access_token, access_token_secret)
    api = tweepy.API(auth, wait_on_rate_limit=True)
    return api


# Helper function to map country names to alpha_3 representation
# Some are not supported - and are hard-coded in
# Function used to map country names from GeoPandas and the country names from geo_locator
def lookup_country_code(country):
    try:
        alpha_3 = pycountry.countries.lookup(country).alpha_3
        return alpha_3
    except LookupError:
        if country == 'The Netherlands':
            country = 'NLD'
        elif country == 'Democratic Republic of the Congo':
            country = 'COD'  # ISO 3166-1 alpha-3 for the Democratic Republic of the Congo
        return country


# Used to map the twitter user location description to a standard format
def lookup_location(location):
    geo_locator = Nominatim(user_agent="LearnPython")
    try:
        location = geo_locator.geocode(location, language='en')
    except GeocoderTimedOut:
        return None
    return location


# This function is used to process it all
def process():
    # Connecting to the twitter api
    api = get_twitter_api()

    # Get the list of all your followers - it only gives user-id's
    # - we need to gather all user data after
    followers = api.followers_ids(api.me().id)
    print("Followers", len(followers))

    # Used to store all the locations from users
    locations = {}

    # We need to chunk it up in sizes of 100 (max for api.lookup_users)
    followers_chunks = [followers[i:i + 100] for i in range(0, len(followers), 100)]
    # Process each chunk - we can call for 100 users per call
    for follower_chunk in followers_chunks:
        # Get a list of users (with location data)
        users = api.lookup_users(user_ids=follower_chunk)
        # Process each user to get location
        for user in users:
            # Call used to transform users description of location to same format
            location = lookup_location(user.location)
            # Add it to our counter
            if location:
                location = location.address
                location = location.split(',')[-1].strip()
            if location in locations:
                locations[location] += 1
            else:
                locations[location] = 1

    # We reformat the output of locations
    # Done for two reasons
    # - 1) Some locations have two entries (e.g., United States and United States of America)
    # - 2) To map them into a simple format to join it with GeoPandas
    reformat = {'alpha_3': [], 'followers': []}
    for location in locations:
        print(location, locations[location])
        loc = lookup_country_code(location)
        if loc in reformat['alpha_3']:
            index = reformat['alpha_3'].index(loc)
            reformat['followers'][index] += locations[location]
        else:
            reformat['alpha_3'].append(loc)
            reformat['followers'].append(locations[location])

    # Convert the reformat dictionary into a DataFrame to join (merge) with GeoPandas
    followers = pd.DataFrame.from_dict(reformat)
    pd.set_option('display.max_columns', 50)
    pd.set_option('display.width', 1000)
    pd.set_option('display.max_rows', 300)
    print(followers.sort_values(by=['followers'], ascending=False))

if __name__ == "__main__":
    process()

That makes it ready to join (merge) with GeoPandas.

Step 5: Merge it with GeoPandas and show the choropleth map

Now for the fun part. We only need to load the geo data from GeoPandas and merge our newly created DataFrame with it. Finally, plot and show it using matplotlib.pyplot.

import tweepy
import pycountry
import pandas as pd
import geopandas
import matplotlib.pyplot as plt
from geopy.exc import GeocoderTimedOut
from geopy.geocoders import Nominatim


# Used to connect to the Twitter API
def get_twitter_api():
    # You need your own keys/secret/tokens here
    consumer_key = "--- INSERT YOUR KEY HERE ---"
    consumer_secret = "--- INSERT YOUR SECRET HERE ---"
    access_token = "--- INSERT YOUR TOKEN HERE ---"
    access_token_secret = "--- INSERT YOUR TOKEN SECRET HERE ---"

    # authentication of consumer key and secret
    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)

    # authentication of access token and secret
    auth.set_access_token(access_token, access_token_secret)
    api = tweepy.API(auth, wait_on_rate_limit=True)
    return api


# Helper function to map country names to alpha_3 representation
# Some are not supported - and are hard-coded in
# Function used to map country names from GeoPandas and the country names from geo_locator
def lookup_country_code(country):
    try:
        alpha_3 = pycountry.countries.lookup(country).alpha_3
        return alpha_3
    except LookupError:
        if country == 'The Netherlands':
            country = 'NLD'
        elif country == 'Democratic Republic of the Congo':
            country = 'COD'  # ISO 3166-1 alpha-3 for the Democratic Republic of the Congo
        return country


# Used to map the twitter user location description to a standard format
def lookup_location(location):
    geo_locator = Nominatim(user_agent="LearnPython")
    try:
        location = geo_locator.geocode(location, language='en')
    except GeocoderTimedOut:
        return None
    return location


# This function is used to process it all
def process():
    # Connecting to the twitter api
    api = get_twitter_api()

    # Get the list of all your followers - it only gives user-id's
    # - we need to gather all user data after
    followers = api.followers_ids(api.me().id)
    print("Followers", len(followers))

    # Used to store all the locations from users
    locations = {}

    # We need to chunk it up in sizes of 100 (max for api.lookup_users)
    followers_chunks = [followers[i:i + 100] for i in range(0, len(followers), 100)]
    # Process each chunk - we can call for 100 users per call
    for follower_chunk in followers_chunks:
        # Get a list of users (with location data)
        users = api.lookup_users(user_ids=follower_chunk)
        # Process each user to get location
        for user in users:
            # Call used to transform users description of location to same format
            location = lookup_location(user.location)
            # Add it to our counter
            if location:
                location = location.address
                location = location.split(',')[-1].strip()
            if location in locations:
                locations[location] += 1
            else:
                locations[location] = 1

    # We reformat the output of locations
    # Done for two reasons
    # - 1) Some locations have two entries (e.g., United States and United States of America)
    # - 2) To map them into a simple format to join it with GeoPandas
    reformat = {'alpha_3': [], 'followers': []}
    for location in locations:
        print(location, locations[location])
        loc = lookup_country_code(location)
        if loc in reformat['alpha_3']:
            index = reformat['alpha_3'].index(loc)
            reformat['followers'][index] += locations[location]
        else:
            reformat['alpha_3'].append(loc)
            reformat['followers'].append(locations[location])

    # Convert the reformat dictionary into a DataFrame to join (merge) with GeoPandas
    followers = pd.DataFrame.from_dict(reformat)
    pd.set_option('display.max_columns', 50)
    pd.set_option('display.width', 1000)
    pd.set_option('display.max_rows', 300)
    print(followers.sort_values(by=['followers'], ascending=False))

    # Read the GeoPandas
    world = geopandas.read_file(geopandas.datasets.get_path('naturalearth_lowres'))
    # Remove the columns not needed
    world = world.drop(['pop_est', 'continent', 'iso_a3', 'gdp_md_est'], axis=1)
    # Map the same naming convention as followers (the above DataFrame)
    # - this step is needed, because the iso_a3 column was missing a few countries 
    world['iso_a3'] = world.apply(lambda row: lookup_country_code(row['name']), axis=1)
    # Merge the tables (DataFrames)
    table = world.merge(followers, how="left", left_on=['iso_a3'], right_on=['alpha_3'])

    # Plot the data in a graph
    table.plot(column='followers', figsize=(8, 6))
    plt.show()


if __name__ == "__main__":
    process()

Resulting in the following output (for the PythonWithRune Twitter account, not yours).

How to Create a Sentiment Analysis model to Predict the Mood of Tweets with Python – 4 Steps to Compare the Mood of Python vs Java

What will we cover in this tutorial?

  • We will learn how the supervised Machine Learning algorithm Sentiment Analysis can be used on Twitter data (also called tweets).
  • The model we use will be a Naive Bayes Classifier.
  • The tutorial will help you install the necessary Python libraries to get started and show you how to download the training data.
  • Then it will give you a full script to train the model.
  • Finally, we will use the trained model to compare the “mood” of Python with Java.

Step 1: Install the Natural Language Toolkit Library and Download Collections

We will use the Natural Language Toolkit (nltk) library in this tutorial.

NLTK is a leading platform for building Python programs to work with human language data.

http://www.nltk.org

To install the library you should run the following command in a terminal or see here for other alternatives.

pip install nltk

To have the data available, you need to run the following program, or see installing NLTK Data.

import nltk
nltk.download()

This will prompt you with a screen similar to the one below. Select all the packages you want to install (I took them all).

Download all packages to NLTK (Natural Language Toolkit)

After the download you can use the twitter_samples corpus as needed in the example.
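
If you want to check that the download worked, a minimal sketch can list the corpus files and count the positive tweets:

from nltk.corpus import twitter_samples

# The files that ship with the twitter_samples corpus
print(twitter_samples.fileids())
# e.g., ['negative_tweets.json', 'positive_tweets.json', 'tweets.20150430-223406.json']
print(len(twitter_samples.strings('positive_tweets.json')))  # 5000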

Step 2: Reminder of the Sentiment Analysis learning process (Machine Learning)

On a high level you can divide Machine Learning into two phases.

  • Phase 1: Learning
  • Phase 2: Prediction

The Sentiment Analysis model is a supervised learning process. The process is shown in the picture below.

The Sentiment Analysis model (Supervised Machine Learning) Learning phase

On a high level the learning process of the Sentiment Analysis model has the following steps.

  • Training & test data
    • The Sentiment Analysis model is supervised learning and needs data representing what the model should predict. We will use tweets.
    • The data should be categorized into the groups it should be able to distinguish. In our example it will be positive tweets and negative tweets.
  • Pre-processing (a small sketch follows after this list)
    • First you need to remove “noise”. In our case we remove URL links and Twitter user names.
    • Then you lemmatize the data to have the words in the same form.
    • Further, you remove stop words as they have no impact on the mood in the tweet.
    • The data then needs to be formatted for the algorithm.
    • Finally, you need to divide it into training data and testing data.
  • Learning
    • This is where the algorithm builds the model using the training data.
  • Testing
    • Then we test the accuracy of the model with the categorized test data.
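
To make the pre-processing steps concrete, here is a minimal sketch on a single made-up tweet (it lemmatizes everything as a verb for brevity; the full script in the next step picks the part of speech with pos_tag):

from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords

# A single made-up tweet, already split into tokens
tokens = ['@friend', 'I', 'really', 'enjoyed', 'the', 'meetup', ':)', 'https://t.co/xyz']

# 1) Remove noise: URL links and Twitter user names
tokens = [t for t in tokens if not t.startswith('http') and not t.startswith('@')]

# 2) Lemmatize (simplified: everything treated as a verb)
lemmatizer = WordNetLemmatizer()
tokens = [lemmatizer.lemmatize(t.lower(), pos='v') for t in tokens]

# 3) Remove stop words
tokens = [t for t in tokens if t not in stopwords.words('english')]

print(tokens)  # e.g., ['really', 'enjoy', 'meetup', ':)']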

Step 3: Train the Sample Data

The twitter_samples corpus contains 5000 positive and 5000 negative tweets, all classified and ready to use for training your model.

import random
import pickle

from nltk.corpus import twitter_samples
from nltk.stem import WordNetLemmatizer
from nltk.tag import pos_tag
from nltk.corpus import stopwords
from nltk import NaiveBayesClassifier
from nltk import classify


def clean_data(token):
    return [item for item in token if not item.startswith("http") and not item.startswith("@")]


def lemmatization(token):
    lemmatizer = WordNetLemmatizer()

    result = []
    for token, tag in pos_tag(token):
        tag = tag[0].lower()
        token = token.lower()
        if tag in "nva":
            result.append(lemmatizer.lemmatize(token, pos=tag))
        else:
            result.append(lemmatizer.lemmatize(token))
    return result


def remove_stop_words(token, stop_words):
    return [item for item in token if item not in stop_words]


def transform(token):
    result = {}
    for item in token:
        result[item] = True
    return result


def main():
    # Step 1: Gather data
    positive_tweets_tokens = twitter_samples.tokenized('positive_tweets.json')
    negative_tweets_tokens = twitter_samples.tokenized('negative_tweets.json')

    # Step 2: Clean, Lemmatize, and remove Stop Words
    stop_words = stopwords.words('english')
    positive_tweets_tokens_cleaned = [remove_stop_words(lemmatization(clean_data(token)), stop_words) for token in positive_tweets_tokens]
    negative_tweets_tokens_cleaned = [remove_stop_words(lemmatization(clean_data(token)), stop_words) for token in negative_tweets_tokens]

    # Step 3: Transform data
    positive_tweets_tokens_transformed = [(transform(token), "Positive") for token in positive_tweets_tokens_cleaned]
    negative_tweets_tokens_transformed = [(transform(token), "Negative") for token in negative_tweets_tokens_cleaned]


    # Step 4: Create data set
    dataset = positive_tweets_tokens_transformed + negative_tweets_tokens_transformed
    random.shuffle(dataset)

    train_data = dataset[:7000]
    test_data = dataset[7000:]

    # Step 5: Train data
    classifier = NaiveBayesClassifier.train(train_data)

    # Step 6: Test accuracy
    print("Accuracy is:", classify.accuracy(classifier, test_data))
    print(classifier.show_most_informative_features(10))

    # Step 7: Save the pickle
    f = open('my_classifier.pickle', 'wb')
    pickle.dump(classifier, f)
    f.close()


if __name__ == "__main__":
    main()

The code is structured in steps. If you are not comfortable with the general flow of a machine learning project, I can recommend reading this tutorial here or this one.

  • Step 1: Collect and categorize. It reads the 5000 positive and 5000 negative twitter samples we downloaded with the nltk.download() call.
  • Step 2: The data needs to be cleaned, lemmatized, and stripped of stop words.
    • The clean_data call removes links and twitter users.
    • The call to lemmatization puts words in their base form.
    • The call to remove_stop_words removes all the stop words that have no effect on the mood of the sentence.
  • Step 3: Format data. This step transforms the data to the desired format for the NaiveBayesClassifier module.
  • Step 4: Divide data. Creates the full data set, shuffles it, and then takes 70% as training data and 30% as test data.
    • The data is shuffled differently from run to run. Hence, you might not get the same accuracy as I do in my run.
    • The training data is used to build the model that predicts.
    • The test data is used to compute the accuracy of the model's predictions.
  • Step 5: Training model. This is the training of the NaiveBayesClassifier model.
    • This is where all the magic happens.
  • Step 6: Accuracy. This is testing the accuracy of the model.
  • Step 7: Persist. Saves the model for later use (a small loading sketch follows after the output below).

I got the following output from the above program.

Accuracy is: 0.9973333333333333
Most Informative Features
                      :) = True           Positi : Negati =   1010.7 : 1.0
                     sad = True           Negati : Positi =     25.4 : 1.0
                     bam = True           Positi : Negati =     20.2 : 1.0
                  arrive = True           Positi : Negati =     18.3 : 1.0
                     x15 = True           Negati : Positi =     17.2 : 1.0
               community = True           Positi : Negati =     14.7 : 1.0
                    glad = True           Positi : Negati =     12.6 : 1.0
                   enjoy = True           Positi : Negati =     12.0 : 1.0
                    kill = True           Negati : Positi =     12.0 : 1.0
                     ugh = True           Negati : Positi =     11.3 : 1.0
None
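
Before moving on, you can check that the persisted model works with a minimal sketch like this (my_classifier.pickle is the file saved by the script above; the tokens are made up and already pre-processed):

import pickle

# Load the classifier saved in Step 7 of the training script
with open('my_classifier.pickle', 'rb') as f:
    classifier = pickle.load(f)

# A made-up, already cleaned and lemmatized tweet
tokens = ['enjoy', 'great', 'tutorial', ':)']
print(classifier.classify({token: True for token in tokens}))  # e.g., Positive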

Step 4: Use the Sentiment Analysis prediction model

Now we can determine the mood of a tweet. To have some fun let us try to figure out the mood of tweets with Python and compare it with Java.

To do that, you need to have set up your Twitter developer account. If you do not have that already, then see this tutorial on how to do it.

In the code below you need to fill out your consumer_key, consumer_secret, access_token, and access_token_secret.

import pickle
import tweepy
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords


def get_twitter_api():
    # personal details
    consumer_key = "___INSERT YOUR DATA HERE___"
    consumer_secret = "___INSERT YOUR DATA HERE___"
    access_token = "___INSERT YOUR DATA HERE___"
    access_token_secret = "___INSERT YOUR DATA HERE___"

    # authentication of consumer key and secret
    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)

    # authentication of access token and secret
    auth.set_access_token(access_token, access_token_secret)
    api = tweepy.API(auth)
    return api


# This function re-uses clean_data, lemmatization, and remove_stop_words from
# the learner code above, so a tweet is pre-processed the same way as the training data
def tokenize(tweet):
    return remove_stop_words(lemmatization(clean_data(word_tokenize(tweet))), stopwords.words('english'))


def get_classifier(pickle_name):
    f = open(pickle_name, 'rb')
    classifier = pickle.load(f)
    f.close()
    return classifier


def find_mood(search):
    classifier = get_classifier('my_classifier.pickle')

    api = get_twitter_api()

    stat = {
        "Positive": 0,
        "Negative": 0
    }
    for tweet in tweepy.Cursor(api.search, q=search).items(1000):
        custom_tokens = tokenize(tweet.text)

        category = classifier.classify(dict([token, True] for token in custom_tokens))
        stat[category] += 1

    print("The mood of", search)
    print(" - Positive", stat["Positive"], round(stat["Positive"]*100/(stat["Positive"] + stat["Negative"]), 1))
    print(" - Negative", stat["Negative"], round(stat["Negative"]*100/(stat["Positive"] + stat["Negative"]), 1))


if __name__ == "__main__":
    find_mood("#java")
    find_mood("#python")

That is it. Obviously the mood of Python is better. It is easier than Java.

The mood of #java
 - Positive 524 70.4
 - Negative 220 29.6
The mood of #python
 - Positive 753 75.3
 - Negative 247 24.7

If you want to learn more about Python I can encourage you to take my course here.

A Simple 7 Step Guide to Implement a Prediction Model to Filter Tweets Based on Dataset Interactively Read from Twitter

What will we learn in this tutorial

  • How Machine Learning works and predicts.
  • What you need to install to implement your Prediction Model in Python
  • A simple way to implement a Prediction Model in Python with persistence
  • How to simplify the connection to the Twitter API using tweepy
  • Collect the training dataset from twitter interactively in a Python program
  • Use the persistent model to predict the tweets you like

Step 1: Quick introduction to Machine Learning

Machine Learning: Input to Learner is Features X (data set) with Targets Y. The Learner outputs a Model, which can predict (Y) future inputs (X).
  • The Learner (or Machine Learning Algorithm) is the program that creates a machine learning model from the input data.
  • The Features X is the dataset used by the Learner to generate the Model.
  • The Target Y contains the categories for each data item in the Feature X dataset.
  • The Model takes new inputs X (similar to those in Features) and predicts a target Y, from the categories in Target Y.

We will implement a simple model that can classify tweets into two categories: allow and reject.
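
As a tiny illustration of the Learner/Features/Targets vocabulary, here is a toy sketch using scikit-learn (installed in the next step). The data is made up and this is not the model we build in Step 3:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy Features X (texts) and Targets Y (0 = allow, 1 = reject)
features_x = ["python tips", "machine learning news", "buy cheap watches", "win money now"]
targets_y = [0, 0, 1, 1]

# The Learner: turn texts into numbers and fit a Naive Bayes model
vectorizer = TfidfVectorizer()
model = MultinomialNB().fit(vectorizer.fit_transform(features_x), targets_y)

# The Model: predicts a Target Y for a new input X
print(model.predict(vectorizer.transform(["cheap money"])))  # e.g., [1]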

Step 2: Install sklearn library (skip if you already have it)

The Python code will be using the sklearn library.

To install it, simply write the following in the command line (also see here).

pip install scikit-learn

Alternatively, you might want to install it locally in your user space.

pip install scikit-learn --user

Step 3: Create a simple Prediction Model in Python to Train and Predict on tweets

The implementation wraps the machine learning model in a class. The class has the following features.

  • create_dataset: Creates a dataset from a list of data representing allow and a list of data representing reject. The dataset is divided into features and targets.
  • train_dataset: When your dataset is loaded, it should be trained to create the model, consisting of the predictor (transfer and estimator).
  • predict: Is called after the model is trained. It predicts whether an input falls in the allow category.
  • persist: Is called to save the model for later use, so that we do not need to collect data and train it again. It should only be called after the dataset has been created and the model has been trained (after create_dataset and train_dataset).
  • load: This will load a saved model, ready to predict new input.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
import joblib


class PredictionModel:
    def __init__(self):
        self.predictor = {}
        self.dataset = {'features': [], 'targets': []}
        self.allow_id = 0
        self.reject_id = 1

    def create_dataset(self, allow_data, reject_data):
        features_x = allow_data + reject_data
        targets_y = [self.allow_id]*len(allow_data) + [self.reject_id]*len(reject_data)
        self.dataset = {'features': features_x, 'targets': targets_y}

    def train_dataset(self):
        x_train, x_test, y_train, y_test = train_test_split(self.dataset['features'], self.dataset['targets'])

        transfer = TfidfVectorizer()
        x_train = transfer.fit_transform(x_train)
        x_test = transfer.transform(x_test)

        estimator = MultinomialNB()
        estimator.fit(x_train, y_train)

        score = estimator.score(x_test, y_test)
        self.predictor = {'transfer': transfer, 'estimator': estimator}

    def predict(self, text):
        sentence_x = self.predictor['transfer'].transform([text])
        y_predict = self.predictor['estimator'].predict(sentence_x)
        return y_predict[0] == self.allow_id

    def persist(self, output_name):
        joblib.dump(self.predictor['transfer'], output_name+".transfer")
        joblib.dump(self.predictor['estimator'], output_name+".estimator")

    def load(self, input_name):
        self.predictor['transfer'] = joblib.load(input_name+'.transfer')
        self.predictor['estimator'] = joblib.load(input_name+'.estimator')

Step 4: Get a Twitter API access

Go to https://developer.twitter.com/en and get your consumer_key, consumer_secret, access_token, and access_token_secret.

api_key = {
    'consumer_key': "",
    'consumer_secret': "",
    'access_token': "",
    'access_token_secret': ""
}

Also see here for a deeper tutorial on how to get them if in doubt.

Step 5: Simplify your Twitter connection

If you do not already have the tweepy library, then install it by running the following.

pip install tweepy

As you will only read tweets from users, the following class will help you to simplify your code.

import tweepy


class TwitterConnection:
    def __init__(self, api_key):
        # authentication of consumer key and secret
        auth = tweepy.OAuthHandler(api_key['consumer_key'], api_key['consumer_secret'])

        # authentication of access token and secret
        auth.set_access_token(api_key['access_token'], api_key['access_token_secret'])
        self.api = tweepy.API(auth)

    def get_tweets(self, user_name, number=0):
        if number > 0:
            return tweepy.Cursor(self.api.user_timeline, screen_name=user_name, tweet_mode="extended").items(number)
        else:
            return tweepy.Cursor(self.api.user_timeline, screen_name=user_name, tweet_mode="extended").items()

  • __init__: The class sets up the Twitter API in the init-function.
  • get_tweets: Returns the tweets from a user_name (screen_name).
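
Used on its own, the class can be exercised like this (a sketch; it assumes the api_key dictionary from Step 4 is filled in, and uses @cnnbrk, the account used later, as an example):

# Read and print the five most recent tweets from a user
twitter_con = TwitterConnection(api_key)
for tweet in twitter_con.get_tweets("@cnnbrk", number=5):
    print(tweet.full_text)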

Step 6: Collect the dataset (Features X and Target Y) from Twitter

To simplify your life you will use the above TwitterConnection class and the PredictionModel class.

def get_features(auth, user_name, output_name):
    positives = []
    negatives = []
    twitter_con = TwitterConnection(auth)
    tweets = twitter_con.get_tweets(user_name)
    for tweet in tweets:
        print(tweet.full_text)
        print("a/r/e (allow/reject/end)? ", end='')
        response = input()
        if response.lower() == 'a':
            positives.append(tweet.full_text)
        elif response.lower() == 'e':
            break
        else:
            negatives.append(tweet.full_text)
    model = PredictionModel()
    model.create_dataset(positives, negatives)
    model.train_dataset()
    model.persist(output_name)

The function reads the tweets from user_name and prompts for each one of them whether it should be added to tweets you allow or reject.

When you do not feel like “training” your set any more (i.e., collecting more training data), you can press e.

Then it will create the dataset and train it to finally persist it.

Step 7: See how good it predicts your tweets based on your model

The following code will print the first number tweets from user_name that your model allows.

def fetch_tweets_prediction(auth, user_name, input_name, number):
    model = PredictionModel()
    model.load(input_name)
    twitter_con = TwitterConnection(auth)
    tweets = twitter_con.get_tweets(user_name)
    for tweet in tweets:
        if model.predict(tweet.full_text):
            print(tweet.full_text)
            number -= 1
        if number <= 0:
            break

Then your final piece is to call it. Remember to fill out your values for the api_key.

api_key = {
    'consumer_key': "",
    'consumer_secret': "",
    'access_token': "",
    'access_token_secret': ""
}

get_features(api_key, "@cnnbrk", "cnnbrk")
fetch_tweets_prediction(api_key, "@cnnbrk", "cnnbrk", 10)

Conclusion

I trained my set with 30-40 tweets using the above code. On the training set it did not have any false positives (that is, an allow which was a reject in the dataset), but it did have false rejects.

The full code is here.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
import joblib
import tweepy


class PredictionModel:
    def __init__(self):
        self.predictor = {}
        self.dataset = {'features': [], 'targets': []}
        self.allow_id = 0
        self.reject_id = 1

    def create_dataset(self, allow_data, reject_data):
        features_x = allow_data + reject_data
        targets_y = [self.allow_id]*len(allow_data) + [self.reject_id]*len(reject_data)
        self.dataset = {'features': features_x, 'targets': targets_y}

    def train_dataset(self):
        x_train, x_test, y_train, y_test = train_test_split(self.dataset['features'], self.dataset['targets'])

        transfer = TfidfVectorizer()
        x_train = transfer.fit_transform(x_train)
        x_test = transfer.transform(x_test)

        estimator = MultinomialNB()
        estimator.fit(x_train, y_train)

        score = estimator.score(x_test, y_test)
        self.predictor = {'transfer': transfer, 'estimator': estimator}

    def predict(self, text):
        sentence_x = self.predictor['transfer'].transform([text])
        y_predict = self.predictor['estimator'].predict(sentence_x)
        return y_predict[0] == self.allow_id

    def persist(self, output_name):
        joblib.dump(self.predictor['transfer'], output_name+".transfer")
        joblib.dump(self.predictor['estimator'], output_name+".estimator")

    def load(self, input_name):
        self.predictor['transfer'] = joblib.load(input_name+'.transfer')
        self.predictor['estimator'] = joblib.load(input_name+'.estimator')


class TwitterConnection:
    def __init__(self, api_key):
        # authentication of consumer key and secret
        auth = tweepy.OAuthHandler(api_key['consumer_key'], api_key['consumer_secret'])

        # authentication of access token and secret
        auth.set_access_token(api_key['access_token'], api_key['access_token_secret'])
        self.api = tweepy.API(auth)

    def get_tweets(self, user_name, number=0):
        if number > 0:
            return tweepy.Cursor(self.api.user_timeline, screen_name=user_name, tweet_mode="extended").items(number)
        else:
            return tweepy.Cursor(self.api.user_timeline, screen_name=user_name, tweet_mode="extended").items()


def get_features(auth, user_name, output_name):
    positives = []
    negatives = []
    twitter_con = TwitterConnection(auth)
    tweets = twitter_con.get_tweets(user_name)
    for tweet in tweets:
        print(tweet.full_text)
        print("a/r/e (allow/reject/end)? ", end='')
        response = input()
        if response.lower() == 'a':
            positives.append(tweet.full_text)
        elif response.lower() == 'e':
            break
        else:
            negatives.append(tweet.full_text)
    model = PredictionModel()
    model.create_dataset(positives, negatives)
    model.train_dataset()
    model.persist(output_name)


def fetch_tweets_prediction(auth, user_name, input_name, number):
    model = PredictionModel()
    model.load(input_name)
    twitter_con = TwitterConnection(auth)
    tweets = twitter_con.get_tweets(user_name)
    for tweet in tweets:
        if model.predict(tweet.full_text):
            print("POS", tweet.full_text)
            number -= 1
        else:
            pass
            # print("NEG", tweet.full_text)
        if number <= 0:
            break

api_key = {
    'consumer_key': "_",
    'consumer_secret': "_",
    'access_token': "_-_",
    'access_token_secret': "_"
}

get_features(api_key, "@cnnbrk", "cnnbrk")
fetch_tweets_prediction(api_key, "@cnnbrk", "cnnbrk", 10)