# Implement a Term Frequency by Inverse Document Frequency (TF-IDF) with NLTK

## What will we cover?

• Learn what Information Retrieval is
• Topic modeling documents
• How to use Term Frequency and understand the limitations
• Implement Term Frequency by Inverse Document Frequency (TF-IDF)

## Step 1: What is Information Retrieval (IR)?

Information Retrieval (IR) is the task of finding relevant documents in response to a user query. Web search engines are the most visible IR applications (wiki).

Topic modeling is a technique for discovering the topics in a set of documents; it gives us methods to organize, understand, and summarize large collections of textual information.

Topic modeling can be described as a method for finding a group of words that best represent the information.

## Step 2: Approach 1: Term Frequency

The number of times a term occurs in a document is called its term frequency (wiki).

tf(t, d) = f(t, d): the number of times term t occurs in document d.

There are other ways to define term frequency (see wiki).
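As a minimal illustration of the raw-count definition (before scaling it up to whole files), the sentence below is a made-up toy document:

```python
# Raw term frequency: count how often each term occurs in a toy document
doc = "the quick brown fox jumps over the lazy dog the end".split()

tf = {word: doc.count(word) for word in set(doc)}

print(tf['the'])  # 'the' occurs three times
print(tf['fox'])  # 'fox' occurs once
```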

Let’s try to write some code to explore this concept.

```python
import os
import math
import nltk

corpus = {}

# Count the term frequencies for each document
for filename in os.listdir('files/holmes/'):
    with open(f'files/holmes/{filename}') as f:
        content = [word.lower() for word in nltk.word_tokenize(f.read()) if word.isalpha()]

    freq = {word: content.count(word) for word in set(content)}
    corpus[filename] = freq

# Sort each document's words by frequency, highest first
for filename in corpus:
    corpus[filename] = sorted(corpus[filename].items(), key=lambda x: x[1], reverse=True)

# Print the five most frequent words per document
for filename in corpus:
    print(filename)
    for word, score in corpus[filename][:5]:
        print(f'  {word}: {score}')
```

Running this produces output like the following (sample only):

```
speckled.txt
  the: 600
  and: 281
  of: 276
  a: 252
  i: 233
face.txt
  the: 326
  i: 298
  and: 226
  to: 185
  a: 173
```

We see that the most frequent words in each document are so-called stop words.

• words that have little meaning on their own (wiki)
• Examples: am, by, do, is, which, ….
• Student exercise: Remove the stop words and see the result (HINT: nltk has a list of stopwords)

What you will discover is that even after removing all stop words, the result is still not very useful: some words are simply more common than others.

## Step 3: Approach 2: TF-IDF

TF-IDF is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus. (wiki)

### Inverse Document Frequency

• Measure of how common or rare a word is across documents

idf(t, D) = log(N / |{d ∈ D : t ∈ d}|) = log(Total Documents / Number of Documents Containing the Term)

• D: all documents in the corpus
• N: the total number of documents in the corpus, N = |D|
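A quick sanity check of the formula on a three-document toy corpus (documents are made up for illustration):

```python
import math

# Toy corpus: three tiny tokenized "documents"
corpus = [
    "the cat sat".split(),
    "the dog ran".split(),
    "the cat ran".split(),
]

def idf(term, corpus):
    # log(total documents / number of documents containing the term)
    return math.log(len(corpus) / sum(term in doc for doc in corpus))

print(idf('the', corpus))  # log(3/3) = 0.0 -> appears everywhere, carries no signal
print(idf('dog', corpus))  # log(3/1) ≈ 1.10 -> rarer, hence more informative
```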

### TF-IDF

A ranking of which words are important in a document, obtained by multiplying the Term Frequency (TF) by the Inverse Document Frequency (IDF).

tf-idf(t, d) = tf(t, d) · idf(t, D)

Let’s make a small example.

```python
import math

doc1 = "This is the sample of the day".split()
doc2 = "This is another sample of the day".split()

corpus = [doc1, doc2]

# Term frequencies for each document
tf1 = {word: doc1.count(word) for word in set(doc1)}
tf2 = {word: doc2.count(word) for word in set(doc2)}

term = 'another'
# idf = log(total documents / number of documents containing the term)
idf = math.log(len(corpus) / sum(term in doc for doc in corpus))

# 'another' appears only in doc2, so only doc2 gets a positive score
print(tf1.get(term, 0) * idf, tf2.get(term, 0) * idf)
```