
    Naive Bayes’ Rule for Sentiment Classification with Full Explanation

    Why is it great to master Text Categorization?

    Mastering Text Categorization offers several advantages in the field of natural language processing and text analysis:

    1. Efficient classification: Text Categorization techniques allow for the automatic categorization and organization of large volumes of text data, making it easier to manage and retrieve information based on predefined categories.
    2. Information extraction: By accurately categorizing text documents, Text Categorization enables the extraction of valuable insights and knowledge from unstructured data, facilitating decision-making processes and data-driven insights.
    3. Personalized recommendations: Text Categorization can be used to develop recommendation systems that provide personalized content or suggestions based on the categorization of user preferences and interests.
    4. Streamlined information retrieval: Effective categorization helps in building efficient search systems, enabling users to quickly find relevant documents or information based on predefined categories.

    What will be covered in this tutorial?

    In this tutorial on Text Categorization, we will cover the following topics:

    • Understanding Text Categorization: Exploring the concept and significance of Text Categorization in organizing and classifying text data based on predefined categories.
    • The Bag-of-Words Model: Learning about the Bag-of-Words representation, a commonly used model in Text Categorization that treats each document as a collection of words without considering word order.
    • Naive Bayes’ Rule: Understanding the principles of Naive Bayes’ Rule, a probabilistic classifier used in Text Categorization to assign documents to specific categories based on the conditional probability of words appearing in each category.
    • Using Naive Bayes’ Rule for sentiment classification: Applying Naive Bayes’ Rule specifically for sentiment classification, which involves categorizing text based on positive, negative, or neutral sentiments expressed.
    • Problem smoothing: Exploring the concept of problem smoothing in Text Categorization, which helps address issues such as zero probabilities and improves the accuracy of classification models.

    By mastering these concepts and techniques, you will gain valuable skills in efficiently categorizing and organizing text data, enabling better information retrieval, knowledge extraction, and personalized recommendations based on predefined categories.

    Watch tutorial

    Step 1: What is Text Categorization?

    Text categorization (a.k.a. text classification) is the task of assigning predefined categories to free-text documents. It can provide conceptual views of document collections and has important applications in the real world.


    Examples of Text Categorization include:

    • Inbox vs Spam
    • Product review: Positive vs Negative review

    Step 2: What is the Bag-of-Words model?

    We have already learned from Context-Free Grammars, that understanding the full structure of language is not efficient or even possible for Natural Language Processing. One approach was to look at trigrams (3 consecutive words), which can be used to learn about the language and even generate sentences.

    Another approach is the Bag-of-Words model.

    The Bag-of-Words model is a simplifying representation used in natural language processing and information retrieval (IR). In this model, a text (such as a sentence or a document) is represented as the bag (multiset) of its words, disregarding grammar and even word order but keeping multiplicity.


    What does that all mean?

    • The structure of the text is not important
    • Works well for classification
    • Examples could be:
      • I love this product.
      • This product feels cheap.
      • This is the best product ever.
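    To make the idea concrete, here is a small sketch (not from the tutorial) that builds a bag-of-words representation with only the Python standard library; word order is discarded, but multiplicity is kept:

```python
from collections import Counter

def bag_of_words(text):
    # Lowercase, split on whitespace, and strip simple punctuation
    words = [w.strip('.,!?').lower() for w in text.split()]
    return Counter(w for w in words if w)

bag = bag_of_words("This product is the best product ever.")
print(bag)
# The word order is gone; only the words and their counts remain,
# e.g. 'product' appears with count 2
```

    Note that `Counter` is exactly a multiset: the "bag" the definition above refers to.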

    Step 3: What is Naive Bayes’ Classifier?

    Naive Bayes classifiers are a family of simple “probabilistic classifiers” based on applying Bayes’ theorem with strong (naïve) independence assumptions between the features (wiki).

    Bayes’ Rule Theorem

    Describes the probability of an event, based on prior knowledge of conditions that might be related to the event (wiki).

    𝑃(𝑏|π‘Ž) = 𝑃(π‘Ž|𝑏)𝑃(𝑏) / 𝑃(π‘Ž)

    Explained with Example

    What is the probability that the sentiment is positive given the sentence “I love this product”? This can be expressed as follows.

    𝑃(positive|”I love this product”)=𝑃(positive|”I”, “love”, “this”, “product”)

    Bayes’s Rule implies it is equal to

    𝑃(“I”, “love”, “this”, “product”|positive)𝑃(positive) / 𝑃(“I”, “love”, “this”, “product”)

    Or proportional to

    𝑃(“I”, “love”, “this”, “product”|positive)𝑃(positive)

    The ‘naive’ part is the assumption that the words appear independently of each other. This lets us simplify the expression to a product of per-word probabilities:

    𝑃(positive)𝑃(“I”|positive)𝑃(“love”|positive)𝑃(“this”|positive)𝑃(“product”|positive)

    Each factor can then be estimated from the training data:

    𝑃(positive) = number of positive samples / number of samples

    𝑃(“love”|positive) = number of positive samples with “love” / number of positive samples

    Let’s try a more concrete example.

    𝑃(positive)𝑃(“I”|positive)𝑃(“love”|positive)𝑃(“this”|positive)𝑃(“product”|positive) = 0.47 ∗ 0.30 ∗ 0.40 ∗ 0.28 ∗ 0.25 = 0.003948

    𝑃(negative)𝑃(“I”|negative)𝑃(“love”|negative)𝑃(“this”|negative)𝑃(“product”|negative) = 0.53 ∗ 0.20 ∗ 0.05 ∗ 0.42 ∗ 0.28 = 0.00062328

    Calculate the likelihood

    “I love this product” is positive: 0.003948 / (0.003948 + 0.00062328) = 86.4%

    “I love this product” is negative: 0.00062328 / (0.003948 + 0.00062328) = 13.6%
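    The arithmetic above can be verified with a short Python sketch; the per-word probabilities are the hypothetical values from the example, not values learned from data:

```python
# Hypothetical probabilities from the example above
p_pos = 0.47 * 0.30 * 0.40 * 0.28 * 0.25  # P(positive) * product of P(word|positive)
p_neg = 0.53 * 0.20 * 0.05 * 0.42 * 0.28  # P(negative) * product of P(word|negative)

# Normalize the two scores so they sum to 1
total = p_pos + p_neg
print(f"positive: {p_pos / total:.1%}")  # → positive: 86.4%
print(f"negative: {p_neg / total:.1%}")  # → negative: 13.6%
```

    Normalizing this way is why we only need the numerator of Bayes’ rule: the denominator 𝑃(“I”, “love”, “this”, “product”) is the same for both classes and cancels out.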

    Step 4: The Problem with Naive Bayes’ Classifier?


    If a word never shows up in the training data for a class, its estimated probability is zero. Say, in the example above, the word “product” never appeared in a positive sentence. That would imply P(“product” | positive) = 0, which in turn makes the whole product for “I love this product” being positive equal to 0, no matter how strong the other factors are.

    There are different approaches to deal with this problem.

    Additive Smoothing

    Additive smoothing adds a small value to every count in the distribution. This is straightforward, and it ensures that even if the word “product” never showed up in a positive sentence, its probability will not become 0.

    Laplace smoothing

    Laplace smoothing is additive smoothing with the value 1: add 1 to every count in the distribution.
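    A minimal sketch of Laplace (add-one) smoothing for word probabilities, using the standard formula for word counts; the counts and vocabulary size here are made up for illustration:

```python
def smoothed_prob(word_count, class_count, vocab_size, alpha=1):
    # Additive smoothing: add alpha to every count so no probability is zero.
    # The denominator grows by alpha * vocab_size so probabilities still sum to 1.
    # alpha = 1 is Laplace smoothing.
    return (word_count + alpha) / (class_count + alpha * vocab_size)

vocab_size = 1000        # hypothetical vocabulary size
positive_samples = 500   # hypothetical number of positive samples

# "product" never appeared in a positive sample: raw count = 0
p = smoothed_prob(0, positive_samples, vocab_size)
print(p)  # small, but not zero
```

    Without smoothing (alpha = 0) the same call would return 0 and wipe out the whole product of probabilities.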

    Step 5: Use NLTK to classify sentiment

    We already introduced the NLTK, which we will use here.

    import nltk
    import pandas as pd

    data = pd.read_csv('https://raw.githubusercontent.com/LearnPythonWithRune/MachineLearningWithPython/main/files/sentiment.csv')

    def extract_words(document):
        # Lowercased tokens that contain at least one letter
        return set(
            word.lower() for word in nltk.word_tokenize(document)
            if any(c.isalpha() for c in word)
        )

    # Build the vocabulary from all documents
    words = set()
    for line in data['Text'].to_list():
        words.update(extract_words(line))

    # One (feature dict, label) pair per document
    features = []
    for _, row in data.iterrows():
        features.append(({word: (word in row['Text']) for word in words}, row['Label']))

    classifier = nltk.NaiveBayesClassifier.train(features)

    This creates a classifier (based on a small dataset, don’t expect magic).

    To use it, try the following code.

    s = input()
    feature = {word: (word in extract_words(s)) for word in words}
    result = classifier.prob_classify(feature)
    for key in result.samples():
        print(key, result.prob(key))

    For example, if you input “this was great”:

    this was great
     Negative 0.10747100603951745
     Positive 0.8925289939604821

    Want to learn more?

    If you follow the video, you will also be introduced to a project where we create a sentiment classifier on a large Twitter corpus.

    In the next lesson you will learn how to Implement a Term Frequency by Inverse Document Frequency (TF-IDF) with NLTK.

    This is part of a FREE 10h Machine Learning course with Python.

    • 15 video lessons – which explain Machine Learning concepts, demonstrate models on real data, introduce projects and show a solution (YouTube playlist).
    • 30 Jupyter Notebooks – with the full code and explanation from the lectures and projects (GitHub).
    • 15 projects – with step guides to help you structure your solutions, and the solutions explained at the end of the video lessons (GitHub).

    Python for Finance: Unlock Financial Freedom and Build Your Dream Life

    Discover the key to financial freedom and secure your dream life with Python for Finance!

    Say goodbye to financial anxiety and embrace a future filled with confidence and success. If you’re tired of struggling to pay bills and longing for a life of leisure, it’s time to take action.

    Imagine breaking free from that dead-end job and opening doors to endless opportunities. With Python for Finance, you can acquire the invaluable skill of financial analysis that will revolutionize your life.

    Make informed investment decisions, unlock the secrets of business financial performance, and maximize your money like never before. Gain the knowledge sought after by companies worldwide and become an indispensable asset in today’s competitive market.

    Don’t let your dreams slip away. Master Python for Finance and pave your way to a profitable and fulfilling career. Start building the future you deserve today!

    Python for Finance is a 21-hour course that teaches investing with Python.

    Learn pandas, NumPy, Matplotlib for Financial Analysis & learn how to Automate Value Investing.

    “Excellent course for anyone trying to learn coding and investing.” – Lorenzo B.
