
    How to Create a Sentiment Analysis Model to Predict the Mood of Tweets with Python – 4 Steps to Compare the Mood of Python vs Java

    What will we cover in this tutorial?

    • We will learn how the supervised Machine Learning technique Sentiment Analysis can be used on Twitter data (also called tweets).
    • The model we use will be a Naive Bayes Classifier.
    • The tutorial will show how to install the necessary Python libraries to get started and how to download training data.
    • Then it will give you a full script to train the model.
    • Finally, we will use the trained model to compare the “mood” of Python with Java.

    Step 1: Install the Natural Language Toolkit Library and Download Collections

    We will use the Natural Language Toolkit (nltk) library in this tutorial.

    NLTK is a leading platform for building Python programs to work with human language data.

    http://www.nltk.org

    To install the library, run the following command in a terminal, or see here for other alternatives.

    pip install nltk
    

    To make the required data available, you need to run the following program, or see installing NLTK Data.

    import nltk
    nltk.download()
    

    This will prompt you with a screen similar to the one below. Select all the packages you want to install (I took them all).

    Download all packages to NLTK (Natural Language Toolkit)

    After the download you can use the twitter_samples corpus in the example.
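
    If you prefer to skip the GUI and only fetch what this tutorial uses, you can download the packages from a script. This is a minimal sketch; the package list is my assumption about what the code below needs.

    import nltk

    # Download only the NLTK data used in this tutorial:
    # twitter_samples (tweets), punkt (tokenizer), wordnet (lemmatizer),
    # averaged_perceptron_tagger (POS tagging), stopwords (stop word list)
    for package in ("twitter_samples", "punkt", "wordnet",
                    "averaged_perceptron_tagger", "stopwords"):
        nltk.download(package)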

    Step 2: Reminder of the Sentiment Analysis learning process (Machine Learning)

    On a high level you can divide Machine Learning into two phases.

    • Phase 1: Learning
    • Phase 2: Prediction

    The Sentiment Analysis model is built with a supervised learning process. The process is shown in the picture below.

    The Sentiment Analysis model (Supervised Machine Learning) Learning phase

    On a high level, the learning process of the Sentiment Analysis model has the following steps.

    • Training & test data
      • Sentiment Analysis uses supervised learning, so it needs data representative of what the model should predict. We will use tweets.
      • The data should be categorized into the groups the model should be able to distinguish. In our example these are positive tweets and negative tweets.
    • Pre-processing
      • First you need to remove “noise”. In our case we remove URL links and Twitter user names.
      • Then you lemmatize the data to get the words into their base form.
      • Further, you remove stop words, as they have no impact on the mood of the tweet.
      • The data then needs to be formatted for the algorithm.
      • Finally, you need to divide it into training data and testing data.
    • Learning
      • This is where the algorithm builds the model using the training data.
    • Testing
      • Then we test the accuracy of the model with the categorized test data.
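
    To make the pre-processing steps concrete, here is a minimal illustration of what they do to a single made-up tweet; the tokens and outputs are illustrative only.

    # Illustrative only: one made-up tweet going through the pre-processing steps
    tokens = ["@user", "I", "loved", "the", "talks", "http://t.co/xyz", ":)"]

    # Remove noise: URL links and Twitter user names
    cleaned = [t for t in tokens if not t.startswith("http") and not t.startswith("@")]
    # -> ["I", "loved", "the", "talks", ":)"]

    # Lemmatization then puts words in their base form ("loved" -> "love",
    # "talks" -> "talk"), and stop word removal drops words like "I" and "the":
    # -> ["love", "talk", ":)"]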

    Step 3: Train the Sample Data

    The twitter_samples corpus contains 5000 positive and 5000 negative tweets, all classified and ready to use for training your model.
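
    You can quickly check that the corpus is available and has the expected size; a small sanity check, assuming the download in Step 1 succeeded.

    from nltk.corpus import twitter_samples

    # List the corpus files and confirm the number of positive tweets
    print(twitter_samples.fileids())
    print(len(twitter_samples.strings('positive_tweets.json')))  # expect 5000

    The full training script follows.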

    import random
    import pickle
    from nltk.corpus import twitter_samples
    from nltk.stem import WordNetLemmatizer
    from nltk.tag import pos_tag
    from nltk.corpus import stopwords
    from nltk import NaiveBayesClassifier
    from nltk import classify
    
    def clean_data(tokens):
        # Remove noise: URL links and Twitter user names
        return [item for item in tokens if not item.startswith("http") and not item.startswith("@")]
    
    def lemmatization(tokens):
        # Put each word in its base form, using the POS tag to guide the lemmatizer
        lemmatizer = WordNetLemmatizer()
        result = []
        for token, tag in pos_tag(tokens):
            tag = tag[0].lower()
            token = token.lower()
            if tag == "j":
                tag = "a"  # map adjective tags (JJ, ...) to WordNet's 'a'
            if tag in "nva":  # noun, verb, adjective
                result.append(lemmatizer.lemmatize(token, pos=tag))
            else:
                result.append(lemmatizer.lemmatize(token))
        return result
    
    def remove_stop_words(tokens, stop_words):
        # Remove stop words, which carry no sentiment
        return [item for item in tokens if item not in stop_words]
    
    def transform(tokens):
        # Convert tokens into the feature dictionary NaiveBayesClassifier expects
        return {item: True for item in tokens}
    
    def main():
        # Step 1: Gather data
        positive_tweets_tokens = twitter_samples.tokenized('positive_tweets.json')
        negative_tweets_tokens = twitter_samples.tokenized('negative_tweets.json')
        # Step 2: Clean, Lemmatize, and remove Stop Words
        stop_words = stopwords.words('english')
        positive_tweets_tokens_cleaned = [remove_stop_words(lemmatization(clean_data(token)), stop_words) for token in positive_tweets_tokens]
        negative_tweets_tokens_cleaned = [remove_stop_words(lemmatization(clean_data(token)), stop_words) for token in negative_tweets_tokens]
        # Step 3: Transform data
        positive_tweets_tokens_transformed = [(transform(token), "Positive") for token in positive_tweets_tokens_cleaned]
        negative_tweets_tokens_transformed = [(transform(token), "Negative") for token in negative_tweets_tokens_cleaned]
    
        # Step 4: Create data set
        dataset = positive_tweets_tokens_transformed + negative_tweets_tokens_transformed
        random.shuffle(dataset)
        train_data = dataset[:7000]
        test_data = dataset[7000:]
        # Step 5: Train data
        classifier = NaiveBayesClassifier.train(train_data)
        # Step 6: Test accuracy
        print("Accuracy is:", classify.accuracy(classifier, test_data))
        classifier.show_most_informative_features(10)
        # Step 7: Save the pickle
        with open('my_classifier.pickle', 'wb') as f:
            pickle.dump(classifier, f)
    
    if __name__ == "__main__":
        main()
    

    The code is structured in steps. If you are not comfortable with the general flow of a machine learning process, I can recommend reading this tutorial or this one.

    • Step 1: Collect and categorize. It reads the 5000 positive and 5000 negative Twitter samples we downloaded with the nltk.download() call.
    • Step 2: The data is cleaned, lemmatized, and stripped of stop words.
      • The clean_data call removes links and Twitter user names.
      • The call to lemmatization puts words in their base form.
      • The call to remove_stop_words removes all the stop words, which have no effect on the mood of the sentence.
    • Step 3: Format data. This step transforms the data into the format the NaiveBayesClassifier module expects (see the sketch after this list).
    • Step 4: Divide data. Creates the full data set, shuffles it, then takes 70% as training data and 30% as test data.
      • The data is shuffled differently from run to run, so you might not get the same accuracy as I do in my run.
      • The training data is used to build the model.
      • The test data is used to measure the accuracy of the model's predictions.
    • Step 5: Training model. This is the training of the NaiveBayesClassifier model.
      • This is where all the magic happens.
    • Step 6: Accuracy. This tests the accuracy of the model.
    • Step 7: Persist. Saves the model to disk for later use.
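
    As a reminder of what that format looks like, here is a minimal hand-written sketch of the (feature dictionary, label) pairs that NaiveBayesClassifier.train consumes; the tokens are made up.

    # Illustrative only: the shape of the data set after Step 3
    train_data = [
        ({"love": True, "talk": True, ":)": True}, "Positive"),
        ({"sad": True, "ugh": True}, "Negative"),
    ]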

    I got the following output from the above program.

    Accuracy is: 0.9973333333333333
    Most Informative Features
                          :) = True           Positi : Negati =   1010.7 : 1.0
                         sad = True           Negati : Positi =     25.4 : 1.0
                         bam = True           Positi : Negati =     20.2 : 1.0
                      arrive = True           Positi : Negati =     18.3 : 1.0
                         x15 = True           Negati : Positi =     17.2 : 1.0
                   community = True           Positi : Negati =     14.7 : 1.0
                        glad = True           Positi : Negati =     12.6 : 1.0
                       enjoy = True           Positi : Negati =     12.0 : 1.0
                        kill = True           Negati : Positi =     12.0 : 1.0
                         ugh = True           Negati : Positi =     11.3 : 1.0
    

    Step 4: Use the Sentiment Analysis prediction model

    Now we can determine the mood of a tweet. To have some fun, let us figure out the mood of tweets about Python and compare it with tweets about Java.

    To do that, you need to have set up your Twitter developer account. If you do not have one already, see this tutorial on how to do that.

    In the code below you need to fill out your consumer_key, consumer_secret, access_token, and access_token_secret.

    import pickle
    import tweepy
    from nltk import word_tokenize
    from nltk.corpus import stopwords
    
    def get_twitter_api():
        # personal details
        consumer_key = "___INSERT YOUR DATA HERE___"
        consumer_secret = "___INSERT YOUR DATA HERE___"
        access_token = "___INSERT YOUR DATA HERE___"
        access_token_secret = "___INSERT YOUR DATA HERE___"
        # authentication of consumer key and secret
        auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
        # authentication of access token and secret
        auth.set_access_token(access_token, access_token_secret)
        api = tweepy.API(auth)
        return api
    
    # This function reuses clean_data, lemmatization, and remove_stop_words
    # from the training script above
    def tokenize(tweet):
        return remove_stop_words(lemmatization(clean_data(word_tokenize(tweet))), stopwords.words('english'))
    
    def get_classifier(pickle_name):
        # Load the trained model saved in Step 3
        with open(pickle_name, 'rb') as f:
            return pickle.load(f)
    
    def find_mood(search):
        classifier = get_classifier('my_classifier.pickle')
        api = get_twitter_api()
        stat = {
            "Positive": 0,
            "Negative": 0
        }
        # Note: in Tweepy v4+ this endpoint is named api.search_tweets
        for tweet in tweepy.Cursor(api.search, q=search).items(1000):
            custom_tokens = tokenize(tweet.text)
            category = classifier.classify({token: True for token in custom_tokens})
            stat[category] += 1
        print("The mood of", search)
        print(" - Positive", stat["Positive"], round(stat["Positive"]*100/(stat["Positive"] + stat["Negative"]), 1))
        print(" - Negative", stat["Negative"], round(stat["Negative"]*100/(stat["Positive"] + stat["Negative"]), 1))
    
    if __name__ == "__main__":
        find_mood("#java")
        find_mood("#python")
    

    That is it. Obviously, the mood of Python is better. It is easier than Java.

    The mood of #java
     - Positive 524 70.4
     - Negative 220 29.6
    The mood of #python
     - Positive 753 75.3
     - Negative 247 24.7
    

    If you want to learn more about Python, I can encourage you to take my course here.

