What will you learn?
- How to predict from a dataset with Machine Learning
- How to implement that in Python
- How to get data from Twitter
- How to install the necessary libraries to do Machine Learning in Python
Step 1: Install the necessary libraries
The sklearn library (scikit-learn) is a simple and efficient set of tools for predictive data analysis.
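You can install it by typing the following in your command line.
pip install scikit-learn
The package is published on PyPI as scikit-learn. When I wrote this tutorial I typed pip install sklearn, which was only a placeholder package that pulled in scikit-learn (as the log below shows) and has since been deprecated. Either way, pip will most likely install a few more required libraries.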
Collecting sklearn
  Downloading sklearn-0.0.tar.gz (1.1 kB)
Collecting scikit-learn
  Downloading scikit_learn-0.23.1-cp38-cp38-macosx_10_9_x86_64.whl (7.2 MB)
     |████████████████████████████████| 7.2 MB 5.0 MB/s
Collecting numpy>=1.13.3
  Downloading numpy-1.18.4-cp38-cp38-macosx_10_9_x86_64.whl (15.2 MB)
     |████████████████████████████████| 15.2 MB 12.6 MB/s
Collecting joblib>=0.11
  Downloading joblib-0.15.1-py3-none-any.whl (298 kB)
     |████████████████████████████████| 298 kB 8.1 MB/s
Collecting threadpoolctl>=2.0.0
  Downloading threadpoolctl-2.1.0-py3-none-any.whl (12 kB)
Collecting scipy>=0.19.1
  Downloading scipy-1.4.1-cp38-cp38-macosx_10_9_x86_64.whl (28.8 MB)
     |████████████████████████████████| 28.8 MB 5.8 MB/s
Using legacy setup.py install for sklearn, since package 'wheel' is not installed.
Installing collected packages: numpy, joblib, threadpoolctl, scipy, scikit-learn, sklearn
    Running setup.py install for sklearn ... done
Successfully installed joblib-0.15.1 numpy-1.18.4 scikit-learn-0.23.1 scipy-1.4.1 sklearn-0.0 threadpoolctl-2.1.0
In my installation, these were numpy, joblib, threadpoolctl, scipy, and scikit-learn.
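If you want to confirm that the libraries are actually importable before moving on, a small check like the following can help. This is only a sketch: the function name check_installed and the list of modules are my own choices, and tweepy is included because Step 4 needs it.

```python
import importlib.util

# Import names (not pip names) of the libraries this tutorial relies on.
# "tweepy" is only needed for Step 4.
REQUIRED = ["sklearn", "numpy", "scipy", "joblib", "tweepy"]


def check_installed(modules):
    """Return a dict mapping each import name to True if Python can find it."""
    return {name: importlib.util.find_spec(name) is not None for name in modules}


if __name__ == "__main__":
    for name, found in check_installed(REQUIRED).items():
        print(f"{name:10s} {'ok' if found else 'MISSING'}")
```

Anything reported as MISSING can be installed with pip before continuing.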
Step 2: The dataset
The machine learning algorithm needs a dataset to train on. To keep this tutorial simple, I only used a small set. I looked through the top tweets from CNN Breaking and categorised them into positive and negative tweets (I know it can be subjective).
negative = [
    "Protesters who were marching from Minneapolis to St. Paul were tear gassed by police as they tried to cross the Lake Street Marshall Bridge ",
    "The National Guard has been activated in Washington, D.C. to assist police handling protests around the White House",
    "Police have been firing tear gas at the protesters near the 5th Precinct in Minneapolis, where some in the crowd have responded with projectiles of their own",
    "Texas and Colorado have activated the National Guard respond to protests",
    "The mayor of Rochester, New York, has declared a state of emergency and ordered a curfew from 9 p.m. Saturday to 7 a.m. Sunday",
    "Cleveland, Ohio, has enacted a curfew that will go into effect at 8 p.m. Saturday and last through 8 a.m. Sunday",
    "A police car appears to be on fire in Los Angeles. Police officers are holding back a line of demonstrators to prevent them from getting close to the car."
]

positive = [
    "Two NASA astronauts make history with their successful launch into space aboard a SpaceX rocket",
    "After questionable weather, officials give the all clear for the SpaceX launch",
    "NASA astronauts Bob Behnken and Doug Hurley climb aboard SpaceX's Crew Dragon spacecraft as they prepare for a mission to the International Space Station",
    "New York Gov. Andrew Cuomo signs a bill giving death benefits to families of frontline workers who died battling the coronavirus pandemic"
]
Step 3: Train the model
The training algorithm needs labelled data. Hence, we combine the two lists into the structure it expects: one list with all the texts and a parallel list with the labels, where 0 marks a positive tweet and 1 a negative one.
def prepare_data(positive, negative):
    data = positive + negative
    target = [0]*len(positive) + [1]*len(negative)
    return {'data': data, 'target': target}
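To see what this structure looks like, here is a small usage sketch. The function is the one from above; the short sample sentences are made up for illustration and are not part of the tutorial's dataset.

```python
def prepare_data(positive, negative):
    data = positive + negative
    target = [0] * len(positive) + [1] * len(negative)
    return {'data': data, 'target': target}


# Made-up sample sentences, only for illustration.
sample_positive = ["Rocket launch succeeds", "Crew arrives safely"]
sample_negative = ["Curfew declared downtown", "Police fire tear gas",
                   "State of emergency declared"]

data_set = prepare_data(sample_positive, sample_negative)

# Each text is paired with its label by position: 0 = positive, 1 = negative.
for text, label in zip(data_set['data'], data_set['target']):
    print(label, text)
```

The positive texts come first with label 0, followed by the negative texts with label 1, so the two lists always have the same length.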
The actual training is done by using the sklearn library.
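from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split

def train_data_set(data_set):
    # Hold back part of the data (25% by default) to test the model on
    x_train, x_test, y_train, y_test = train_test_split(data_set['data'], data_set['target'])

    # Turn the texts into TF-IDF feature vectors; the vocabulary is learned
    # from the training texts only and then reused for the test texts
    transfer = TfidfVectorizer()
    x_train = transfer.fit_transform(x_train)
    x_test = transfer.transform(x_test)

    # Train a multinomial Naive Bayes classifier on the vectors
    estimator = MultinomialNB()
    estimator.fit(x_train, y_train)

    # Accuracy on the held-back test texts
    score = estimator.score(x_test, y_test)
    print("score:\n", score)

    return {'transfer': transfer, 'estimator': estimator}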
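To make it concrete how the trained vectorizer and estimator are used together, here is a minimal sketch that trains on a tiny made-up dataset and then classifies one new sentence. It is not the tutorial's code: the helper names train and classify, the sample texts, and the fixed random_state and stratify arguments (added so the split is reproducible) are my own assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Tiny made-up dataset (not the tutorial's tweets): 0 = positive, 1 = negative.
texts = [
    "Astronauts launch successfully into space",
    "Officials give the all clear for the launch",
    "Crew docks safely with the space station",
    "Governor signs bill giving benefits to families",
    "Police fire tear gas at protesters",
    "Mayor declares state of emergency and curfew",
    "National Guard activated to respond to protests",
    "Police car on fire as protesters gather",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]


def train(texts, labels, seed=0):
    # A fixed seed and a stratified split make the result repeatable.
    x_train, x_test, y_train, y_test = train_test_split(
        texts, labels, test_size=0.25, random_state=seed, stratify=labels)
    transfer = TfidfVectorizer()
    estimator = MultinomialNB()
    estimator.fit(transfer.fit_transform(x_train), y_train)
    score = estimator.score(transfer.transform(x_test), y_test)
    return {'transfer': transfer, 'estimator': estimator, 'score': score}


def classify(predictor, sentence):
    # The sentence must go through the same vectorizer that was fitted in training.
    features = predictor['transfer'].transform([sentence])
    return int(predictor['estimator'].predict(features)[0])


predictor = train(texts, labels)
print("score:", predictor['score'])
print("label:", classify(predictor, "Police declare a curfew after protests"))
```

The important detail is that new text is passed through transform, not fit_transform, so it is described with the vocabulary learned during training.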
Step 4: Get some tweets from CNN Breaking and predict
For this step to work, you need to set up tokens for the Twitter API. You can follow this tutorial in order to do that.
Once you have the tokens, and have installed the tweepy library (pip install tweepy), you can use the following code to get it running.
import tweepy

def setup_twitter():
    consumer_key = "REPLACE WITH YOUR KEY"
    consumer_secret = "REPLACE WITH YOUR SECRET"
    access_token = "REPLACE WITH YOUR TOKEN"
    access_token_secret = "REPLACE WITH YOUR TOKEN SECRET"

    # authentication of consumer key and secret
    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)

    # authentication of access token and secret
    auth.set_access_token(access_token, access_token_secret)

    api = tweepy.API(auth)
    return api

def mood_on_cnn(api, predictor):
    stat = [0, 0]
    for status in tweepy.Cursor(api.user_timeline, screen_name='@cnnbrk', tweet_mode="extended").items():
        sentence_x = predictor['transfer'].transform([status.full_text])
        y_predict = predictor['estimator'].predict(sentence_x)
        stat[y_predict[0]] += 1
    return stat
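If you would rather not paste your keys into the source file, one option is to read them from environment variables. The following is only a sketch of that idea: the variable names (TWITTER_CONSUMER_KEY and so on) and the helper load_credentials are hypothetical and not something tweepy or Twitter prescribes.

```python
import os

# Hypothetical environment variable names; pick whatever fits your setup.
ENV_NAMES = {
    "consumer_key": "TWITTER_CONSUMER_KEY",
    "consumer_secret": "TWITTER_CONSUMER_SECRET",
    "access_token": "TWITTER_ACCESS_TOKEN",
    "access_token_secret": "TWITTER_ACCESS_TOKEN_SECRET",
}


def load_credentials(environ=os.environ):
    """Collect the four credentials, failing early if any is missing or empty."""
    missing = [env for env in ENV_NAMES.values() if not environ.get(env)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {key: environ[env] for key, env in ENV_NAMES.items()}
```

The returned values could then replace the four placeholder strings in setup_twitter.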
Step 5: Putting it all together
That is it.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
import tweepy

negative = [
    "Protesters who were marching from Minneapolis to St. Paul were tear gassed by police as they tried to cross the Lake Street Marshall Bridge ",
    "The National Guard has been activated in Washington, D.C. to assist police handling protests around the White House",
    "Police have been firing tear gas at the protesters near the 5th Precinct in Minneapolis, where some in the crowd have responded with projectiles of their own",
    "Texas and Colorado have activated the National Guard respond to protests",
    "The mayor of Rochester, New York, has declared a state of emergency and ordered a curfew from 9 p.m. Saturday to 7 a.m. Sunday",
    "Cleveland, Ohio, has enacted a curfew that will go into effect at 8 p.m. Saturday and last through 8 a.m. Sunday",
    "A police car appears to be on fire in Los Angeles. Police officers are holding back a line of demonstrators to prevent them from getting close to the car."
]

positive = [
    "Two NASA astronauts make history with their successful launch into space aboard a SpaceX rocket",
    "After questionable weather, officials give the all clear for the SpaceX launch",
    "NASA astronauts Bob Behnken and Doug Hurley climb aboard SpaceX's Crew Dragon spacecraft as they prepare for a mission to the International Space Station",
    "New York Gov. Andrew Cuomo signs a bill giving death benefits to families of frontline workers who died battling the coronavirus pandemic"
]

def prepare_data(positive, negative):
    data = positive + negative
    target = [0]*len(positive) + [1]*len(negative)
    return {'data': data, 'target': target}

def train_data_set(data_set):
    x_train, x_test, y_train, y_test = train_test_split(data_set['data'], data_set['target'])

    transfer = TfidfVectorizer()
    x_train = transfer.fit_transform(x_train)
    x_test = transfer.transform(x_test)

    estimator = MultinomialNB()
    estimator.fit(x_train, y_train)

    score = estimator.score(x_test, y_test)
    print("score:\n", score)

    return {'transfer': transfer, 'estimator': estimator}

def setup_twitter():
    consumer_key = "REPLACE WITH YOUR KEY"
    consumer_secret = "REPLACE WITH YOUR SECRET"
    access_token = "REPLACE WITH YOUR TOKEN"
    access_token_secret = "REPLACE WITH YOUR TOKEN SECRET"

    # authentication of consumer key and secret
    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)

    # authentication of access token and secret
    auth.set_access_token(access_token, access_token_secret)

    api = tweepy.API(auth)
    return api

def mood_on_cnn(api, predictor):
    stat = [0, 0]
    for status in tweepy.Cursor(api.user_timeline, screen_name='@cnnbrk', tweet_mode="extended").items():
        sentence_x = predictor['transfer'].transform([status.full_text])
        y_predict = predictor['estimator'].predict(sentence_x)
        stat[y_predict[0]] += 1
    return stat

data_set = prepare_data(positive, negative)
predictor = train_data_set(data_set)

api = setup_twitter()
stat = mood_on_cnn(api, predictor)

print(stat)
print("Mood (0 good, 1 bad)", stat[1]/(stat[0] + stat[1]))
I got the following output on the day of writing this tutorial.
score:
 1.0
[751, 2455]
Mood (0 good, 1 bad) 0.765751715533375
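I found that the breaking news items are quite negative in tone, and the model seems to pick that up: about 77% of the tweets were classified as negative. With only 11 labelled tweets, the score of 1.0 is computed on just three held-out tweets, so treat it as a rough sanity check rather than a measure of accuracy.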
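The mood figure in the output is simply the share of tweets that were predicted as negative. The sketch below reproduces that arithmetic without Twitter or a model; the helper names tally and negative_share are my own, and the guard for an empty tally is an addition that the original program does not have.

```python
def tally(labels):
    """Count predicted labels: index 0 = positive, index 1 = negative."""
    stat = [0, 0]
    for label in labels:
        stat[label] += 1
    return stat


def negative_share(stat):
    """Fraction of negative predictions, or None if nothing was counted."""
    total = stat[0] + stat[1]
    if total == 0:
        return None
    return stat[1] / total


print(tally([1, 0, 1, 1]))          # [1, 3]
print(negative_share([751, 2455]))  # the counts from my run above
```

A value close to 0 means mostly positive predictions, and a value close to 1 means mostly negative ones.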