What will we learn in this tutorial
- How Machine Learning works and predicts.
- What you need to install to implement your Prediction Model in Python
- A simple way to implement a Prediction Model in Python with persistence
- How to simplify the connection to the Twitter API using tweepy
- How to collect the training dataset from Twitter interactively in a Python program
- How to use the persistent model to predict the tweets you like
Step 1: Quick introduction to Machine Learning

- The Learner (or Machine Learning Algorithm) is the program that creates a machine learning model from the input data.
- The Features X is the dataset used by the Learner to generate the Model.
- The Target Y contains the categories for each data item in the Feature X dataset.
- The Model takes new inputs X (similar to those in Features) and predicts a target Y, from the categories in Target Y.
We will implement a simple model that classifies tweets into two categories: allow and reject.
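The four terms above can be illustrated with a minimal sketch. The example texts and labels are made up for illustration, and the choice of TfidfVectorizer with MultinomialNB simply mirrors the classes used later in this tutorial.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

ALLOW, REJECT = 0, 1

# Features X: the raw texts (made-up examples).
features_x = [
    "python tutorial on machine learning",
    "new python library released",
    "celebrity gossip and rumours",
    "gossip about a celebrity wedding",
]
# Target Y: one category per item in Features X.
target_y = [ALLOW, ALLOW, REJECT, REJECT]

# The Learner: a vectorizer that turns text into numbers, plus a classifier.
vectorizer = TfidfVectorizer()
learner = MultinomialNB()

# Fitting the Learner on (X, Y) produces the Model.
learner.fit(vectorizer.fit_transform(features_x), target_y)

# The Model predicts a category for a new input.
new_x = vectorizer.transform(["a python machine learning library"])
prediction = learner.predict(new_x)[0]
print("allow" if prediction == ALLOW else "reject")
```

With only four training texts the model is a toy, but the flow (features and targets in, model out, prediction on new input) is the same one used in the rest of the tutorial.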
Step 2: Install sklearn library (skip if you already have it)
The Python code uses the sklearn library.
To install it, run the following on the command line (also see here).
pip install scikit-learn
Alternatively, you might want to install it locally in your user space.
pip install scikit-learn --user
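To confirm the installation worked, a small check like the following can help. Note that the package is installed as scikit-learn but imported as sklearn. The helper name is_installed is made up for this sketch.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be found in this environment."""
    return importlib.util.find_spec(module_name) is not None


# joblib is listed because the tutorial code imports it for persistence.
for name in ("sklearn", "joblib"):
    print(name, "installed" if is_installed(name) else "missing")
```

If either line prints "missing", rerun the pip command above in the same Python environment you use to run the tutorial code.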
Step 3: Create a simple Prediction Model in Python to Train and Predict on tweets
The implementation wraps the machine learning model in a class. The class has the following methods.
- create_dataset: Creates a dataset from a list of data representing the allow category and a list of data representing the reject category. The dataset is divided into features and targets.
- train_dataset: Once the dataset is loaded, it is used to train the model, which consists of the predictor (transfer and estimator).
- predict: Called after the model is trained. It predicts whether an input belongs to the allow category.
- persist: Called to save the model for later use, so that we do not need to collect data and train it again. It should only be called after the dataset has been created and the model has been trained (after create_dataset and train_dataset).
- load: Loads a saved model so that it is ready to predict new input.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
import joblib
class PredictionModel:
    def __init__(self):
        self.predictor = {}
        self.dataset = {'features': [], 'targets': []}
        self.allow_id = 0
        self.reject_id = 1

    def create_dataset(self, allow_data, reject_data):
        # Features X are the texts; targets Y are the matching category ids.
        features_x = allow_data + reject_data
        targets_y = [self.allow_id]*len(allow_data) + [self.reject_id]*len(reject_data)
        self.dataset = {'features': features_x, 'targets': targets_y}

    def train_dataset(self):
        # Hold back part of the data to measure accuracy on unseen items.
        x_train, x_test, y_train, y_test = train_test_split(self.dataset['features'], self.dataset['targets'])
        # The vectorizer is fitted on the training texts only.
        transfer = TfidfVectorizer()
        x_train = transfer.fit_transform(x_train)
        x_test = transfer.transform(x_test)
        estimator = MultinomialNB()
        estimator.fit(x_train, y_train)
        score = estimator.score(x_test, y_test)
        self.predictor = {'transfer': transfer, 'estimator': estimator}
        # Mean accuracy on the held-back test split.
        return score

    def predict(self, text):
        # Returns True if the text is predicted to be in the allow category.
        sentence_x = self.predictor['transfer'].transform([text])
        y_predict = self.predictor['estimator'].predict(sentence_x)
        return y_predict[0] == self.allow_id

    def persist(self, output_name):
        # The vectorizer and the estimator are saved to two separate files.
        joblib.dump(self.predictor['transfer'], output_name+".transfer")
        joblib.dump(self.predictor['estimator'], output_name+".estimator")

    def load(self, input_name):
        self.predictor['transfer'] = joblib.load(input_name+'.transfer')
        self.predictor['estimator'] = joblib.load(input_name+'.estimator')
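The persist and load steps can be checked in isolation with a small round trip. This is a sketch under stated assumptions: the texts and labels are made up, the file name demo_model is arbitrary, and the files go into a temporary directory so nothing is left behind.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up training data: 0 = allow, 1 = reject.
texts = ["good python news", "great python tutorial",
         "boring celebrity gossip", "more celebrity gossip"]
labels = [0, 0, 1, 1]

transfer = TfidfVectorizer()
estimator = MultinomialNB()
estimator.fit(transfer.fit_transform(texts), labels)

with tempfile.TemporaryDirectory() as tmp_dir:
    base = os.path.join(tmp_dir, "demo_model")
    # Save both parts, as persist does.
    joblib.dump(transfer, base + ".transfer")
    joblib.dump(estimator, base + ".estimator")
    # Load both parts again, as load does.
    loaded_transfer = joblib.load(base + ".transfer")
    loaded_estimator = joblib.load(base + ".estimator")

sample = ["python tutorial news"]
before = estimator.predict(transfer.transform(sample))[0]
after = loaded_estimator.predict(loaded_transfer.transform(sample))[0]
print("same prediction after reload:", before == after)
```

Both the vectorizer and the estimator have to be saved, because the estimator only understands inputs transformed with the vocabulary the vectorizer learned during training.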
Step 4: Get a Twitter API access
Go to https://developer.twitter.com/en and get your consumer_key, consumer_secret, access_token, and access_token_secret.
api_key = {
    'consumer_key': "",
    'consumer_secret': "",
    'access_token': "",
    'access_token_secret': ""
}
Also see here for a deeper tutorial on how to get them if in doubt.
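If you prefer not to keep the keys in your source file, one option is to read them from environment variables. This is only a sketch: the variable names (TWITTER_CONSUMER_KEY and so on) and the helper load_api_key are made up for this example, and you can pick any names you like.

```python
import os

# Hypothetical environment variable names; choose your own.
ENV_NAMES = {
    'consumer_key': 'TWITTER_CONSUMER_KEY',
    'consumer_secret': 'TWITTER_CONSUMER_SECRET',
    'access_token': 'TWITTER_ACCESS_TOKEN',
    'access_token_secret': 'TWITTER_ACCESS_TOKEN_SECRET',
}


def load_api_key(environ=None):
    """Build the api_key dict from environment variables."""
    environ = os.environ if environ is None else environ
    missing = [name for name in ENV_NAMES.values() if not environ.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {key: environ[name] for key, name in ENV_NAMES.items()}


# Demonstration with a stand-in mapping instead of the real environment.
demo_environ = {name: "dummy-value" for name in ENV_NAMES.values()}
api_key = load_api_key(demo_environ)
print(sorted(api_key))
```

In a real run you would call load_api_key() without arguments after setting the variables in your shell.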
Step 5: Simplify your Twitter connection
If you do not already have the tweepy library, install it as follows.
pip install tweepy
As you will only read tweets from users, the following class will help you to simplify your code.
import tweepy
class TwitterConnection:
    def __init__(self, api_key):
        # authentication of consumer key and secret
        auth = tweepy.OAuthHandler(api_key['consumer_key'], api_key['consumer_secret'])
        # authentication of access token and secret
        auth.set_access_token(api_key['access_token'], api_key['access_token_secret'])
        self.api = tweepy.API(auth)

    def get_tweets(self, user_name, number=0):
        if number > 0:
            return tweepy.Cursor(self.api.user_timeline, screen_name=user_name, tweet_mode="extended").items(number)
        else:
            return tweepy.Cursor(self.api.user_timeline, screen_name=user_name, tweet_mode="extended").items()
- __init__: The class sets up the Twitter API in the init-function.
- get_tweets: Returns the tweets from a user_name (screen_name).
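While developing the later steps, it can be handy to test without calling the Twitter API at all. The class below is a hypothetical offline stand-in, not part of tweepy: it only mimics the two things the rest of this tutorial relies on, namely a get_tweets method and tweet objects with a full_text attribute.

```python
from itertools import islice
from types import SimpleNamespace


class FakeTwitterConnection:
    """Offline stand-in that mimics the get_tweets interface."""

    def __init__(self, texts):
        self.texts = list(texts)

    def get_tweets(self, user_name, number=0):
        # Each fake tweet only carries the full_text attribute.
        tweets = (SimpleNamespace(full_text=text) for text in self.texts)
        # number > 0 limits the result; 0 means "all tweets".
        return islice(tweets, number) if number > 0 else tweets


fake = FakeTwitterConnection(["first tweet", "second tweet", "third tweet"])
limited = [tweet.full_text for tweet in fake.get_tweets("@someone", number=2)]
everything = [tweet.full_text for tweet in fake.get_tweets("@someone")]
print(limited)
print(len(everything))
```

Because it has the same method name and arguments, such a stand-in can be swapped in wherever a TwitterConnection is used while you experiment.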
Step 6: Collect the dataset (Features X and Target Y) from Twitter
To simplify your life you will use the above TwitterConnection class and PredictionModel class.
def get_features(auth, user_name, output_name):
    positives = []
    negatives = []
    twitter_con = TwitterConnection(auth)
    tweets = twitter_con.get_tweets(user_name)
    for tweet in tweets:
        print(tweet.full_text)
        # The prompt matches the keys checked below: y = allow, e = end, anything else = reject.
        print("y/n/e (allow/reject/end)? ", end='')
        response = input()
        if response.lower() == 'y':
            positives.append(tweet.full_text)
        elif response.lower() == 'e':
            break
        else:
            negatives.append(tweet.full_text)
    # Build the dataset from the answers, train the model and save it.
    model = PredictionModel()
    model.create_dataset(positives, negatives)
    model.train_dataset()
    model.persist(output_name)
The function reads the tweets from user_name and asks, for each one, whether it should be added to the tweets you allow or the tweets you reject.
When you have collected enough training data, press e.
The function then creates the dataset, trains the model, and finally persists it.
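The labelling logic can also be separated from the keyboard input, which makes it easy to try out without typing. The sketch below is an illustration only: the function label_tweets and its ask parameter are made-up names, and the answers are scripted instead of read with input().

```python
def label_tweets(texts, ask):
    """Sort texts into positives and negatives based on the answers from ask."""
    positives, negatives = [], []
    for text in texts:
        response = ask(text).strip().lower()
        if response == 'y':
            positives.append(text)
        elif response == 'e':
            # Stop collecting; remaining texts are left unlabelled.
            break
        else:
            negatives.append(text)
    return positives, negatives


# Scripted answers stand in for the interactive prompt.
scripted = iter(['y', 'n', 'Y', 'e'])
positives, negatives = label_tweets(
    ["tweet 1", "tweet 2", "tweet 3", "tweet 4", "tweet 5"],
    lambda text: next(scripted),
)
print(positives)
print(negatives)
```

To use it interactively, you could pass a function that prints the text and returns input() instead of the scripted answers.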
Step 7: See how well your model predicts the tweets you like
The following code prints the first number tweets by user_name that your model allows.
def fetch_tweets_prediction(auth, user_name, input_name, number):
    model = PredictionModel()
    model.load(input_name)
    twitter_con = TwitterConnection(auth)
    tweets = twitter_con.get_tweets(user_name)
    for tweet in tweets:
        if model.predict(tweet.full_text):
            print(tweet.full_text)
            number -= 1
            # Stop once the requested number of allowed tweets has been printed.
            if number <= 0:
                break
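The idea of "take the first number items that pass a test" can be written compactly with itertools.islice. This is a sketch of the same idea on plain strings: the helper first_allowed is a made-up name, and the keyword check stands in for model.predict.

```python
from itertools import islice


def first_allowed(texts, is_allowed, number):
    """Return at most `number` texts for which is_allowed(text) is true."""
    allowed = (text for text in texts if is_allowed(text))
    # islice stops pulling from the source as soon as it has enough items.
    return list(islice(allowed, number))


texts = ["python news", "gossip", "python tips", "weather", "python jobs"]
result = first_allowed(texts, lambda text: "python" in text, 2)
print(result)
```

Because the generator is lazy, no more tweets are examined than needed to find the requested number of allowed ones.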
Then your final piece is to call it. Remember to fill out your values for the api_key.
api_key = {
    'consumer_key': "",
    'consumer_secret': "",
    'access_token': "",
    'access_token_secret': ""
}
get_features(api_key, "@cnnbrk", "cnnbrk")
fetch_tweets_prediction(api_key, "@cnnbrk", "cnnbrk", 10)
Conclusion
I trained my model on 30-40 tweets with the above code. On the training set it did not produce any false positives (that is, an allow that was a reject in the dataset), but it did produce false rejects.
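One way to count the two kinds of errors is sketched below. The labels and predictions here are made up purely to show the counting; they are not my actual results. The helper count_errors is a hypothetical name.

```python
ALLOW, REJECT = 0, 1


def count_errors(true_labels, predicted_labels):
    """Return (false_allows, false_rejects) for paired labels and predictions."""
    pairs = list(zip(true_labels, predicted_labels))
    # A false allow: labelled reject, but predicted allow.
    false_allows = sum(1 for true, pred in pairs if true == REJECT and pred == ALLOW)
    # A false reject: labelled allow, but predicted reject.
    false_rejects = sum(1 for true, pred in pairs if true == ALLOW and pred == REJECT)
    return false_allows, false_rejects


# Made-up example data.
true_labels = [ALLOW, ALLOW, REJECT, REJECT, ALLOW]
predicted_labels = [ALLOW, REJECT, REJECT, REJECT, REJECT]
false_allows, false_rejects = count_errors(true_labels, predicted_labels)
print("false allows:", false_allows, "false rejects:", false_rejects)
```

To apply this to your own data, you would pass the labels you gave during collection and the values returned by the model for the same tweets.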
The full code is here.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
import joblib
import tweepy
class PredictionModel:
    def __init__(self):
        self.predictor = {}
        self.dataset = {'features': [], 'targets': []}
        self.allow_id = 0
        self.reject_id = 1

    def create_dataset(self, allow_data, reject_data):
        # Features X are the texts; targets Y are the matching category ids.
        features_x = allow_data + reject_data
        targets_y = [self.allow_id]*len(allow_data) + [self.reject_id]*len(reject_data)
        self.dataset = {'features': features_x, 'targets': targets_y}

    def train_dataset(self):
        x_train, x_test, y_train, y_test = train_test_split(self.dataset['features'], self.dataset['targets'])
        transfer = TfidfVectorizer()
        x_train = transfer.fit_transform(x_train)
        x_test = transfer.transform(x_test)
        estimator = MultinomialNB()
        estimator.fit(x_train, y_train)
        score = estimator.score(x_test, y_test)
        self.predictor = {'transfer': transfer, 'estimator': estimator}
        # Mean accuracy on the held-back test split.
        return score

    def predict(self, text):
        sentence_x = self.predictor['transfer'].transform([text])
        y_predict = self.predictor['estimator'].predict(sentence_x)
        return y_predict[0] == self.allow_id

    def persist(self, output_name):
        joblib.dump(self.predictor['transfer'], output_name+".transfer")
        joblib.dump(self.predictor['estimator'], output_name+".estimator")

    def load(self, input_name):
        self.predictor['transfer'] = joblib.load(input_name+'.transfer')
        self.predictor['estimator'] = joblib.load(input_name+'.estimator')

class TwitterConnection:
    def __init__(self, api_key):
        # authentication of consumer key and secret
        auth = tweepy.OAuthHandler(api_key['consumer_key'], api_key['consumer_secret'])
        # authentication of access token and secret
        auth.set_access_token(api_key['access_token'], api_key['access_token_secret'])
        self.api = tweepy.API(auth)

    def get_tweets(self, user_name, number=0):
        if number > 0:
            return tweepy.Cursor(self.api.user_timeline, screen_name=user_name, tweet_mode="extended").items(number)
        else:
            return tweepy.Cursor(self.api.user_timeline, screen_name=user_name, tweet_mode="extended").items()

def get_features(auth, user_name, output_name):
    positives = []
    negatives = []
    twitter_con = TwitterConnection(auth)
    tweets = twitter_con.get_tweets(user_name)
    for tweet in tweets:
        print(tweet.full_text)
        print("y/n/e (positive/negative/end)? ", end='')
        response = input()
        if response.lower() == 'y':
            positives.append(tweet.full_text)
        elif response.lower() == 'e':
            break
        else:
            negatives.append(tweet.full_text)
    model = PredictionModel()
    model.create_dataset(positives, negatives)
    model.train_dataset()
    model.persist(output_name)

def fetch_tweets_prediction(auth, user_name, input_name, number):
    model = PredictionModel()
    model.load(input_name)
    twitter_con = TwitterConnection(auth)
    tweets = twitter_con.get_tweets(user_name)
    for tweet in tweets:
        if model.predict(tweet.full_text):
            print("POS", tweet.full_text)
            number -= 1
        else:
            pass
            # print("NEG", tweet.full_text)
        # Stop once the requested number of allowed tweets has been printed.
        if number <= 0:
            break

api_key = {
    'consumer_key': "_",
    'consumer_secret': "_",
    'access_token': "_-_",
    'access_token_secret': "_"
}
get_features(api_key, "@cnnbrk", "cnnbrk")
fetch_tweets_prediction(api_key, "@cnnbrk", "cnnbrk", 10)