
    Implement Term Frequency-Inverse Document Frequency (TF-IDF) with NLTK

    Why is it great to master Information Retrieval?

    Mastering Information Retrieval offers several advantages in the field of text analysis and information management:

    1. Efficient information retrieval: Information Retrieval techniques enable efficient and accurate retrieval of relevant information from large collections of documents, saving time and effort in searching for specific data.
    2. Enhanced search capabilities: Understanding Information Retrieval allows you to develop advanced search systems with features like relevance ranking, query expansion, and personalized recommendations, improving the overall search experience.
    3. Organizing and structuring data: Information Retrieval techniques help in organizing and structuring unstructured text data, enabling better management, categorization, and clustering of documents.
    4. Domain-specific applications: Information Retrieval has diverse applications in various domains, including search engines, recommender systems, digital libraries, e-commerce, legal research, and more.

    What will be covered in this tutorial?

    In this tutorial on Information Retrieval, we will cover the following topics:

    • Understanding Information Retrieval: Exploring the concept and significance of Information Retrieval in efficiently retrieving relevant information from large document collections.
    • Topic modeling documents: Learning techniques for identifying and extracting topics within a collection of documents, enabling effective organization and understanding of the underlying themes and concepts.
    • Term Frequency and its limitations: Understanding the concept of Term Frequency, its role in measuring the importance of terms within a document, and recognizing its limitations in capturing document relevance accurately.
    • Implementing Term Frequency-Inverse Document Frequency (TF-IDF): Exploring the TF-IDF technique, which combines Term Frequency with Inverse Document Frequency to better assess the importance of terms in documents and improve retrieval accuracy.
    • Practical applications: Applying the learned techniques to real-world scenarios, such as building a search engine, developing document clustering systems, or enhancing information retrieval capabilities in specific domains.

    By mastering these concepts and techniques, you will gain valuable skills to efficiently retrieve, organize, and extract relevant information from large document collections, contributing to effective data management and knowledge discovery.

    Watch tutorial

    Step 1: What is Information Retrieval (IR)?

    Information Retrieval (IR) is the task of finding relevant documents in response to a user query. Web search engines are the most visible IR applications (wiki).

    Topic modeling is a method for discovering the topics in a set of documents. It provides us with ways to organize, understand, and summarize large collections of textual information.

    In other words, topic modeling can be described as a method for finding the group of words that best represents the information.

    Step 2: Approach 1: Term Frequency

    The number of times a term occurs in a document is called its term frequency (wiki).

    tf(๐‘ก,๐‘‘)=๐‘“๐‘ก,๐‘‘: The number of time term ๐‘ก occurs in document ๐‘‘.

    There are other ways to define term frequency (see wiki), for example boolean frequency, log-scaled frequency, or frequency normalized by document length.

    Let’s try to write some code to explore this concept.

    To follow this code, you need to download the files from here: GitHub link. You can also download them as a zip file from here: Zip-download.

    import os
    import nltk

    # NOTE: nltk.word_tokenize needs the NLTK 'punkt' tokenizer data
    # (run nltk.download('punkt') once if it is not installed)
    corpus = {}

    # Count the term frequencies in each document
    for filename in os.listdir('files/holmes/'):
        with open(f'files/holmes/{filename}') as f:
            # Tokenize, lowercase, and keep only alphabetic tokens
            content = [word.lower() for word in nltk.word_tokenize(f.read()) if word.isalpha()]

            freq = {word: content.count(word) for word in set(content)}

            corpus[filename] = freq

    # Sort each document's terms by frequency, highest first
    for filename in corpus:
        corpus[filename] = sorted(corpus[filename].items(), key=lambda x: x[1], reverse=True)

    # Print the five most frequent terms of each document
    for filename in corpus:
        print(filename)
        for word, score in corpus[filename][:5]:
            print(f'  {word}: {score}')
    

    This will output the following (sample output only).

    speckled.txt
      the: 600
      and: 281
      of: 276
      a: 252
      i: 233
    face.txt
      the: 326
      i: 298
      and: 226
      to: 185
      a: 173
    

    We see that the most used words in each document are so-called stop words:

    • words that have little meaning on their own (wiki)
    • Examples: am, by, do, is, which, …
    • Student exercise: Remove the stop words and see the result (HINT: NLTK has a list of stopwords; a sketch is given below)
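
    As a sketch of that exercise (assuming the corpus dictionary built above, and that the NLTK stop-word data has been fetched once with nltk.download('stopwords')), the stop words can be filtered out like this:

    from nltk.corpus import stopwords

    # English stop-word list from NLTK (requires nltk.download('stopwords'))
    stop_words = set(stopwords.words('english'))

    # Keep only the terms that are not stop words
    for filename in corpus:
        filtered = [(word, count) for word, count in corpus[filename] if word not in stop_words]
        print(filename)
        for word, score in filtered[:5]:
            print(f'  {word}: {score}')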

    What you will discover is that even if you remove all the stop words, you will still not get anything very useful: some words are simply more common than others.

    Step 3: Approach 2: TF-IDF

    TF-IDF is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus. (wiki)

    Inverse Document Frequency

    • Measure of how common or rare a word is across documents

    idf(๐‘ก,๐ท)=log๐‘|๐‘‘โˆˆ๐ท:๐‘กโˆˆ๐‘‘|=log(Total Documents / Number of Documents Containing “term”)

    • ๐ท: All documents in the corpus
    • ๐‘: total number of documents in the corpus ๐‘=|๐ท|
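
    As a small worked example: if the corpus contains N = 10 documents and a term appears in 2 of them, then idf = log(10/2) = log(5) ≈ 1.61 (using the natural logarithm, as Python's math.log does). A term that appears in all 10 documents gets idf = log(10/10) = log(1) = 0, which is exactly why stop words are ranked down by this measure.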

    TF-IDF

    Ranking of which words are important in a document, obtained by multiplying Term Frequency (TF) by Inverse Document Frequency (IDF):

    tf-idf(๐‘ก,๐‘‘)=tf(๐‘ก,๐‘‘)โ‹…idf(๐‘ก,๐ท)

    Let’s make a small example.

    doc1 = "This is the sample of the day".split()
    doc2 = "This is another sample of the day".split()
    corpus = [doc1, doc2]
    tf1 = {word: doc1.count(word) for word in set(doc1)}
    tf2 = {word: doc2.count(word) for word in set(doc2)}
    term = 'another'
    ids = 2/sum(term in doc for doc in corpus)
    tf1.get(term, 0)*ids, tf2.get(term, 0)*ids
    
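    The term 'another' only appears in doc2, so with N = 2 documents and 1 document containing the term, idf = log(2/1) ≈ 0.69. The printed scores are therefore 0.0 for doc1 and approximately 0.69 for doc2: the term is a good discriminator for doc2.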

    Want to learn more?

    If you watch the YouTube video, you will see how to do it for a bigger corpus of files; a sketch of that approach is given below.
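
    As a hedged sketch (assuming the corpus dictionary of sorted (word, count) lists built in Step 2 for the files/holmes/ folder), the TF-IDF scores for the whole corpus could be computed like this:

    import math

    # Number of documents in the corpus (corpus maps filename -> [(word, count), ...])
    num_documents = len(corpus)

    # Count how many documents each word appears in
    doc_frequency = {}
    for filename in corpus:
        for word, _ in corpus[filename]:
            doc_frequency[word] = doc_frequency.get(word, 0) + 1

    # Score every word in every document with tf * idf and sort by score
    tfidf = {}
    for filename in corpus:
        tfidf[filename] = [(word, count * math.log(num_documents / doc_frequency[word]))
                           for word, count in corpus[filename]]
        tfidf[filename].sort(key=lambda x: x[1], reverse=True)

    # Print the five highest-scoring words of each document
    for filename in tfidf:
        print(filename)
        for word, score in tfidf[filename][:5]:
            print(f'  {word}: {score:.2f}')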

    In the next lesson you will learn Information Extraction with Skip-Gram Architecture.

    This is part of a FREE 10h Machine Learning course with Python.

    • 15 video lessons – explaining Machine Learning concepts, demonstrating models on real data, and introducing projects and their solutions (YouTube playlist).
    • 30 Jupyter Notebooks – with the full code and explanations from the lectures and projects (GitHub).
    • 15 projects – with step-by-step guides to help you structure your solutions, and the solutions explained at the end of the video lessons (GitHub).

    Python for Finance: Unlock Financial Freedom and Build Your Dream Life

    Discover the key to financial freedom and secure your dream life with Python for Finance!

    Say goodbye to financial anxiety and embrace a future filled with confidence and success. If you’re tired of struggling to pay bills and longing for a life of leisure, it’s time to take action.

    Imagine breaking free from that dead-end job and opening doors to endless opportunities. With Python for Finance, you can acquire the invaluable skill of financial analysis that will revolutionize your life.

    Make informed investment decisions, unlock the secrets of business financial performance, and maximize your money like never before. Gain the knowledge sought after by companies worldwide and become an indispensable asset in today’s competitive market.

    Don’t let your dreams slip away. Master Python for Finance and pave your way to a profitable and fulfilling career. Start building the future you deserve today!

    Python for Finance is a 21-hour course that teaches investing with Python.

    Learn pandas, NumPy, and Matplotlib for financial analysis, and learn how to automate value investing.

    “Excellent course for anyone trying to learn coding and investing.” – Lorenzo B.
