# Implement a Term Frequency by Inverse Document Frequency (TF-IDF) with NLTK

## What will we cover?

• Learn what Information Retrieval is
• Topic modeling documents
• How to use Term Frequency and understand the limitations
• Implement Term Frequency by Inverse Document Frequency (TF-IDF)

## Step 1: What is Information Retrieval (IR)?

Information Retrieval (IR) is the task of finding relevant documents in response to a user query. Web search engines are the most visible IR applications (wiki).

Topic modeling is a method for discovering the topics in a set of documents; it provides us with ways to organize, understand, and summarize large collections of textual information.

Topic modeling can be described as a method for finding a group of words that best represent the information.

## Step 2: Approach 1: Term Frequency

Term frequency is the number of times a term occurs in a document (wiki).

tf(𝑡,𝑑) = 𝑓(𝑡,𝑑): the number of times term 𝑡 occurs in document 𝑑.

There are other ways to define term frequency (see wiki).

Let’s try to write some code to explore this concept.

```python
import os

import nltk

corpus = {}

# Count the term frequencies for each document
for filename in os.listdir('files/holmes/'):
    with open(f'files/holmes/{filename}') as f:
        content = [word.lower() for word in nltk.word_tokenize(f.read()) if word.isalpha()]

    freq = {word: content.count(word) for word in set(content)}
    corpus[filename] = freq

# Sort the words in each document by frequency, highest first
for filename in corpus:
    corpus[filename] = sorted(corpus[filename].items(), key=lambda x: x[1], reverse=True)

# Print the five most frequent words per document
for filename in corpus:
    print(filename)
    for word, score in corpus[filename][:5]:
        print(f'  {word}: {score}')
```

This will output something like the following (sample output only):

```
speckled.txt
the: 600
and: 281
of: 276
a: 252
i: 233
face.txt
the: 326
i: 298
and: 226
to: 185
a: 173
```

We see that the words most used in each document are so-called stop words.

• words that have little meaning on their own (wiki)
• Examples: am, by, do, is, which, ….
• Student exercise: Remove stop words and see the result (HINT: NLTK has a list of stopwords)

What you will discover is that even if you remove all stop words, you will still not get anything very useful: some words are simply more common than others.
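The exercise above can be sketched as follows. To keep the example self-contained, a tiny hand-picked stop-word set is used here; in practice you would use `nltk.corpus.stopwords.words('english')` for the full list (the sentence and word counts below are illustrative only):

```python
# A minimal sketch of the student exercise: drop stop words before counting.
# In practice, use nltk.corpus.stopwords.words('english') for the full list;
# this small hand-picked set keeps the example self-contained.
stop_words = {'the', 'and', 'of', 'a', 'i', 'to', 'is', 'in', 'it', 'that'}

text = "It is the face of the man that I saw in the window"
words = [w.lower() for w in text.split() if w.isalpha()]

# Count frequencies, skipping stop words
freq = {}
for w in words:
    if w not in stop_words:
        freq[w] = freq.get(w, 0) + 1

print(freq)
```

This removes the high-frequency noise, but as noted above, frequency alone still does not tell you which words are important.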

## Step 3: Approach 2: TF-IDF

TF-IDF is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus. (wiki)

### Inverse Document Frequency

• Measure of how common or rare a word is across documents

idf(𝑡,𝐷) = log(𝑁 / |{𝑑 ∈ 𝐷 : 𝑡 ∈ 𝑑}|) = log(Total Documents / Number of Documents Containing “term”)

• 𝐷: All documents in the corpus
• 𝑁: total number of documents in the corpus 𝑁=|𝐷|
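The formula translates directly into a small helper. The `idf` function and the two toy documents below are illustrative, not part of the original code:

```python
import math

def idf(term, documents):
    # idf(t, D) = log(N / |{d in D : t in d}|)
    n_containing = sum(term in doc for doc in documents)
    return math.log(len(documents) / n_containing)

docs = [
    ['the', 'speckled', 'band'],
    ['the', 'yellow', 'face'],
]

print(idf('the', docs))       # in both documents: log(2/2) = 0.0
print(idf('speckled', docs))  # in one document: log(2/1) ≈ 0.693
```

Note that a term appearing in every document gets an IDF of 0, which is exactly what downweights common words like “the”.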

### TF-IDF

TF-IDF ranks how important a word is in a document by multiplying its Term Frequency (TF) with its Inverse Document Frequency (IDF):

tf-idf(𝑡,𝑑)=tf(𝑡,𝑑)⋅idf(𝑡,𝐷)

Let’s make a small example.

```python
import math

doc1 = "This is the sample of the day".split()
doc2 = "This is another sample of the day".split()
corpus = [doc1, doc2]

# Term frequencies for each document
tf1 = {word: doc1.count(word) for word in set(doc1)}
tf2 = {word: doc2.count(word) for word in set(doc2)}

term = 'another'
# idf(t, D) = log(N / number of documents containing the term)
idf = math.log(len(corpus) / sum(term in doc for doc in corpus))

print(tf1.get(term, 0) * idf, tf2.get(term, 0) * idf)
```

If you watch the YouTube video you will see how to do it for a bigger corpus of files.

This is part of a FREE 10h Machine Learning course with Python.

• 15 video lessons – which explain Machine Learning concepts, demonstrate models on real data, introduce projects and show a solution (YouTube playlist).
• 30 Jupyter Notebooks – with the full code and explanation from the lectures and projects (GitHub).
• 15 projects – with step guides to help you structure your solutions and solution explained in the end of video lessons (GitHub).

## Learn Python

Learn Python: A Beginner's Guide to Python

• 70 pages to get you started on your journey to master Python.
• How to install your setup with Anaconda.
• Written description and introduction to all concepts.
• Jupyter Notebooks prepared for 17 projects.

Python 101: A CRASH COURSE

1. How to get started with this 8 hours Python 101: A CRASH COURSE.
2. Best practices for learning Python.
3. A chapter for each lesson with a description, code snippets for easy reference, and links to a lesson video.

## Expert Data Science Blueprint


• Master the Data Science Workflow for actionable data insights.