We will use the Natural Language Toolkit (nltk) library in this tutorial.
NLTK is a leading platform for building Python programs to work with human language data.
http://www.nltk.org
To install the library, run the following command in a terminal (or see here for other alternatives).
pip install nltk
To make the data you need available, run the following program (or see Installing NLTK Data).
import nltk
nltk.download()
This will open the NLTK downloader window, where you select the packages you want to install (I took them all). After the download you can use the twitter_samples corpus that the example below needs.
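If you prefer to skip the interactive downloader, you can fetch only the packages this tutorial needs. This is a small sketch using the standard NLTK package identifiers for the corpora and models used below.

import nltk

# Download only the resources used in this tutorial
# ('omw-1.4' may also be needed by the lemmatizer on newer NLTK versions)
for package in ['twitter_samples', 'punkt', 'wordnet',
                'averaged_perceptron_tagger', 'stopwords']:
    nltk.download(package)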
On a high level you can divide Machine Learning into two phases: a learning phase, where the model is trained on classified data, and a prediction phase, where the trained model is applied to new, unclassified data. Training the Sentiment Analysis model is a supervised learning process, because we train it on tweets that are already labeled positive or negative. On a high level, the learning process has the following steps: gather the data, clean it, lemmatize it and remove stop words, transform the tokens into features, split the result into a training set and a test set, train the classifier, test its accuracy, and finally save the model. The code below follows exactly these steps.
The twitter_samples corpus contains 5,000 positive and 5,000 negative tweets, all classified and ready to use for training your model.
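Before training, you can do a quick sanity check that the corpus is available and see what the raw tweets look like; this snippet is just illustrative and not part of the model.

from nltk.corpus import twitter_samples

# The corpus ships as three JSON files
print(twitter_samples.fileids())
# ['negative_tweets.json', 'positive_tweets.json', 'tweets.20150430-223406.json']

# Look at the first raw positive tweet
print(twitter_samples.strings('positive_tweets.json')[0])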
import random
import pickle
from nltk.corpus import twitter_samples
from nltk.stem import WordNetLemmatizer
from nltk.tag import pos_tag
from nltk.corpus import stopwords
from nltk import NaiveBayesClassifier
from nltk import classify
def clean_data(token):
    # Remove URLs and Twitter handles from the list of tokens
    return [item for item in token if not item.startswith("http") and not item.startswith("@")]
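# Illustrative example (the tokens are made up):
#   clean_data(['@friend', 'see', 'http://t.co/abc123', 'this', ':)'])
#   -> ['see', 'this', ':)']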
def lemmatization(token):
    lemmatizer = WordNetLemmatizer()
    result = []
    for word, tag in pos_tag(token):
        # Map the Penn Treebank tag to a WordNet part of speech
        # (adjectives are tagged 'JJ', which WordNet calls 'a')
        tag = tag[0].lower()
        if tag == 'j':
            tag = 'a'
        word = word.lower()
        if tag in "nva":
            result.append(lemmatizer.lemmatize(word, pos=tag))
        else:
            result.append(lemmatizer.lemmatize(word))
    return result
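# Illustrative example (exact output can vary with the tagger model):
#   lemmatization(['the', 'cats', 'are', 'running'])
#   -> ['the', 'cat', 'be', 'run']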
def remove_stop_words(token, stop_words):
    return [item for item in token if item not in stop_words]
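# Illustrative example ('this', 'is', and 'a' are English stop words):
#   remove_stop_words(['this', 'is', 'a', 'great', 'day'], stopwords.words('english'))
#   -> ['great', 'day']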
def transform(token):
    result = {}
    for item in token:
        result[item] = True
    return result
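# NLTK's NaiveBayesClassifier expects each sample as a feature dictionary,
# so every remaining token becomes a boolean feature. Illustrative example:
#   transform(['great', 'day'])
#   -> {'great': True, 'day': True}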
def main():
    # Step 1: Gather data
    positive_tweets_tokens = twitter_samples.tokenized('positive_tweets.json')
    negative_tweets_tokens = twitter_samples.tokenized('negative_tweets.json')

    # Step 2: Clean, lemmatize, and remove stop words
    stop_words = stopwords.words('english')
    positive_tweets_tokens_cleaned = [remove_stop_words(lemmatization(clean_data(token)), stop_words) for token in positive_tweets_tokens]
    negative_tweets_tokens_cleaned = [remove_stop_words(lemmatization(clean_data(token)), stop_words) for token in negative_tweets_tokens]

    # Step 3: Transform data
    positive_tweets_tokens_transformed = [(transform(token), "Positive") for token in positive_tweets_tokens_cleaned]
    negative_tweets_tokens_transformed = [(transform(token), "Negative") for token in negative_tweets_tokens_cleaned]

    # Step 4: Create data set
    dataset = positive_tweets_tokens_transformed + negative_tweets_tokens_transformed
    random.shuffle(dataset)
    train_data = dataset[:7000]
    test_data = dataset[7000:]

    # Step 5: Train data
    classifier = NaiveBayesClassifier.train(train_data)

    # Step 6: Test accuracy
    print("Accuracy is:", classify.accuracy(classifier, test_data))
    # show_most_informative_features prints directly and returns None,
    # so it should not be wrapped in print()
    classifier.show_most_informative_features(10)

    # Step 7: Save the pickle
    with open('my_classifier.pickle', 'wb') as f:
        pickle.dump(classifier, f)

if __name__ == "__main__":
    main()
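Once the model is saved, you can try it on a sentence of your own. The sketch below assumes it runs with the helper functions above in scope; the example sentence is arbitrary.

from nltk.tokenize import word_tokenize

def classify_text(classifier, text):
    # Apply the same cleaning pipeline used for the training data
    tokens = remove_stop_words(lemmatization(clean_data(word_tokenize(text))),
                               stopwords.words('english'))
    return classifier.classify(transform(tokens))

with open('my_classifier.pickle', 'rb') as f:
    classifier = pickle.load(f)
print(classify_text(classifier, "I love this, what a great day"))  # likely -> Positive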
The code is structured in steps. If you are not comfortable with the general flow of a machine learning process, I can recommend reading this tutorial here or this one.
I got the following output from the above program.
Accuracy is: 0.9973333333333333
Most Informative Features
                      :) = True           Positi : Negati = 1010.7 : 1.0
                     sad = True           Negati : Positi =   25.4 : 1.0
                     bam = True           Positi : Negati =   20.2 : 1.0
                  arrive = True           Positi : Negati =   18.3 : 1.0
                     x15 = True           Negati : Positi =   17.2 : 1.0
               community = True           Positi : Negati =   14.7 : 1.0
                    glad = True           Positi : Negati =   12.6 : 1.0
                   enjoy = True           Positi : Negati =   12.0 : 1.0
                    kill = True           Negati : Positi =   12.0 : 1.0
                     ugh = True           Negati : Positi =   11.3 : 1.0
The ratios read as odds: a tweet containing ':)' is about 1010 times more likely to be positive than negative, while 'sad' makes a tweet about 25 times more likely to be negative.
Now we can determine the mood of a tweet. To have some fun, let us try to figure out the mood of tweets about Python and compare it with the mood of tweets about Java.
To do that, you need to have set up your Twitter developer account. If you do not have one already, then see this tutorial on how to do that.
In the code below you need to fill out your consumer_key, consumer_secret, access_token, and access_token_secret.
import pickle
import tweepy
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
# clean_data, lemmatization, remove_stop_words, and transform from the
# learner code above must be in the same file (or imported from it)
def get_twitter_api():
    # personal details
    consumer_key = "___INSERT YOUR DATA HERE___"
    consumer_secret = "___INSERT YOUR DATA HERE___"
    access_token = "___INSERT YOUR DATA HERE___"
    access_token_secret = "___INSERT YOUR DATA HERE___"
    # authentication of consumer key and secret
    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
    # authentication of access token and secret
    auth.set_access_token(access_token, access_token_secret)
    api = tweepy.API(auth)
    return api
# This function reuses the cleaning pipeline from the learner code above
def tokenize(tweet):
    return remove_stop_words(lemmatization(clean_data(word_tokenize(tweet))), stopwords.words('english'))
def get_classifier(pickle_name):
    with open(pickle_name, 'rb') as f:
        classifier = pickle.load(f)
    return classifier
def find_mood(search):
    classifier = get_classifier('my_classifier.pickle')
    api = get_twitter_api()
    stat = {
        "Positive": 0,
        "Negative": 0
    }
    # Note: in tweepy 4.x this endpoint was renamed to api.search_tweets
    for tweet in tweepy.Cursor(api.search, q=search).items(1000):
        custom_tokens = tokenize(tweet.text)
        category = classifier.classify(transform(custom_tokens))
        stat[category] += 1
    print("The mood of", search)
    print(" - Positive", stat["Positive"], round(stat["Positive"]*100/(stat["Positive"] + stat["Negative"]), 1))
    print(" - Negative", stat["Negative"], round(stat["Negative"]*100/(stat["Positive"] + stat["Negative"]), 1))

if __name__ == "__main__":
    find_mood("#java")
    find_mood("#python")
That is it. Obviously the mood of #python is better; after all, Python is easier than Java.
The mood of #java
- Positive 524 70.4
- Negative 220 29.6
The mood of #python
- Positive 753 75.3
- Negative 247 24.7
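One closing note on the hardcoded credentials: they are fine for a local experiment, but risky if you share the code. A common alternative, sketched here with environment variable names of my own choosing, is to read them from the environment instead.

import os
import tweepy

def get_twitter_api_from_env():
    # Read credentials from the environment instead of the source file
    auth = tweepy.OAuthHandler(os.environ["TWITTER_CONSUMER_KEY"],
                               os.environ["TWITTER_CONSUMER_SECRET"])
    auth.set_access_token(os.environ["TWITTER_ACCESS_TOKEN"],
                          os.environ["TWITTER_ACCESS_TOKEN_SECRET"])
    return tweepy.API(auth)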
If you want to learn more about Python, I can encourage you to take my course here.