Implement Term Frequency by Inverse Document Frequency (TF-IDF) with NLTK

What will we cover?

  • Learn what Information Retrieval is
  • Topic modeling documents
  • How to use Term Frequency and understand the limitations
  • Implement Term Frequency by Inverse Document Frequency (TF-IDF)

Step 1: What is Information Retrieval (IR)?

Information Retrieval (IR) is the task of finding relevant documents in response to a user query. Web search engines are the most visible IR applications (wiki).

Topic modeling is a method for discovering the topics in a set of documents; it provides ways to organize, understand, and summarize large collections of textual information.

Topic modeling can be described as a method for finding the group of words that best represents the information.

Step 2: Approach 1: Term Frequency

The number of times a term occurs in a document is called its term frequency (wiki).

tf(t, d) = f(t, d): the number of times term t occurs in document d.

There are other ways to define term frequency (see wiki).

Let’s try to write some code to explore this concept.

To follow this code you need to download the files from here: GitHub link. You can download them as a zip file from here: Zip-download.

import os
import math
import nltk

# Requires: nltk.download('punkt') for word_tokenize
corpus = {}

# Count the term frequencies for each document
for filename in os.listdir('files/holmes/'):
    with open(f'files/holmes/{filename}') as f:
        # Tokenize, lowercase, and keep alphabetic tokens only
        content = [word.lower() for word in nltk.word_tokenize(f.read()) if word.isalpha()]

        # Map each distinct word to its number of occurrences
        freq = {word: content.count(word) for word in set(content)}

        corpus[filename] = freq

# Sort each document's terms by frequency, highest first
for filename in corpus:
    corpus[filename] = sorted(corpus[filename].items(), key=lambda x: x[1], reverse=True)

# Print the five most frequent terms per document
for filename in corpus:
    print(filename)
    for word, score in corpus[filename][:5]:
        print(f'  {word}: {score}')

This will output the following (sample output only).

speckled.txt
  the: 600
  and: 281
  of: 276
  a: 252
  i: 233
face.txt
  the: 326
  i: 298
  and: 226
  to: 185
  a: 173

We see that the most used words in each document are so-called stop words.

  • words that have little meaning on their own (wiki)
  • Examples: am, by, do, is, which, …
  • Student exercise: Remove the stop words and see the result (HINT: nltk has a list of stopwords; one possible sketch follows below)
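
If you try the exercise, one possible approach could look like the sketch below. It assumes the same files/holmes/ layout as above and that NLTK's stop-word list has been downloaded.

import os
import nltk
from nltk.corpus import stopwords

# Requires: nltk.download('stopwords') and nltk.download('punkt')
stop_words = set(stopwords.words('english'))

corpus = {}
for filename in os.listdir('files/holmes/'):
    with open(f'files/holmes/{filename}') as f:
        # Keep lowercased alphabetic tokens that are not stop words
        content = [word.lower() for word in nltk.word_tokenize(f.read())
                   if word.isalpha() and word.lower() not in stop_words]
        corpus[filename] = {word: content.count(word) for word in set(content)}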

What you will discover is that even after removing all stop words, you still do not get anything very useful: some words are simply more common than others.

Step 3: Approach 2: TF-IDF

TF-IDF is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus. (wiki)

Inverse Document Frequency

  • Measure of how common or rare a word is across documents

idf(t, D) = log(N / |{d ∈ D : t ∈ d}|) = log(Total Documents / Number of Documents Containing the term)

  • D: all documents in the corpus
  • N: total number of documents in the corpus, N = |D|

TF-IDF

Ranking of which words are important in a document, obtained by multiplying Term Frequency (TF) by Inverse Document Frequency (IDF)

tf-idf(t, d) = tf(t, d) · idf(t, D)

Let’s make a small example.

import math

doc1 = "This is the sample of the day".split()
doc2 = "This is another sample of the day".split()
corpus = [doc1, doc2]

# Term frequencies for each document
tf1 = {word: doc1.count(word) for word in set(doc1)}
tf2 = {word: doc2.count(word) for word in set(doc2)}

term = 'another'
# Inverse document frequency: log(total documents / documents containing the term)
idf = math.log(len(corpus) / sum(term in doc for doc in corpus))
tf1.get(term, 0) * idf, tf2.get(term, 0) * idf
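
Since 'another' occurs in one of the two documents, idf = log(2/1) ≈ 0.693, and the last line evaluates to approximately (0.0, 0.693): the term only carries weight in doc2, where it actually occurs.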

Want to learn more?

If you watch the YouTube video you will see how to do it for a bigger corpus of files.
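
As a preview, a minimal sketch of how it could be done for the Sherlock Holmes corpus from Step 2 is shown below (one possible approach, not necessarily the exact code from the video).

import os
import math
import nltk

# Term frequencies per document (same structure as in Step 2)
corpus = {}
for filename in os.listdir('files/holmes/'):
    with open(f'files/holmes/{filename}') as f:
        content = [word.lower() for word in nltk.word_tokenize(f.read()) if word.isalpha()]
        corpus[filename] = {word: content.count(word) for word in set(content)}

# Inverse document frequency for every word seen in the corpus
num_documents = len(corpus)
words = set(word for freq in corpus.values() for word in freq)
idf = {word: math.log(num_documents / sum(word in freq for freq in corpus.values()))
       for word in words}

# Multiply TF by IDF and print the highest-scoring words per document
for filename, freq in corpus.items():
    tf_idf = sorted(((word, tf * idf[word]) for word, tf in freq.items()),
                    key=lambda x: x[1], reverse=True)
    print(filename)
    for word, score in tf_idf[:5]:
        print(f'  {word}: {score:.2f}')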

This is part of a FREE 10h Machine Learning course with Python.

  • 15 video lessons – which explain Machine Learning concepts, demonstrate models on real data, introduce projects and show a solution (YouTube playlist).
  • 30 Jupyter Notebooks – with the full code and explanations from the lectures and projects (GitHub).
  • 15 projects – with step guides to help you structure your solutions, and the solutions explained at the end of the video lessons (GitHub).
