
    How to use Natural Language Processing for Trigrams

    Why is it great to master parsing language syntax?

    Mastering the parsing of language syntax offers several advantages in the field of natural language processing and computational linguistics:

    1. Enhanced language comprehension: Understanding how the simple syntax of language can be parsed enables you to grasp the structure and meaning of sentences, facilitating more accurate language comprehension and interpretation.
    2. Advanced language processing systems: By mastering parsing techniques, you can develop sophisticated language processing systems such as machine translation, text generation, sentiment analysis, and information extraction, improving the overall quality and accuracy of language-based applications.
    3. Efficient information retrieval: Parsing language syntax allows for effective information retrieval by enabling the extraction of relevant linguistic patterns, entities, and relationships from text, leading to improved search and retrieval systems.

    What will be covered in this tutorial?

    In this tutorial on parsing language syntax, we will cover the following topics:

    • Understanding Context-Free Grammar (CFG): Exploring the concept of CFG, a formal language model used to describe the syntax of a language, and its role in parsing.
    • Using CFG to parse text: Applying CFG rules and algorithms to parse sentences and determine their syntactic structure, allowing for the identification of phrases, constituents, and grammatical relationships.
    • Analyzing text in trigrams: Investigating the use of trigrams, which are sequences of three words, to analyze and understand the statistical patterns and co-occurrences in text data, aiding in tasks such as language modeling and text prediction.
    • Brief introduction to Markov Chains: Introducing Markov Chains, a mathematical model that describes a sequence of events where the probability of each event depends only on the previous event, and its relevance in language processing and text generation.
    • Generating predictions with Markov Chains: Demonstrating how Markov Chains can be used to generate text by probabilistically predicting the next word or sequence of words based on previous observations, enabling applications such as text generation and completion.

    By mastering these concepts and techniques, you will gain valuable skills in parsing language syntax, enabling enhanced language comprehension, advanced language processing, and efficient information retrieval in various natural language processing applications.


    Step 1: What is Natural Language Processing?

    Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data. The goal is a computer capable of “understanding” the contents of documents, including the contextual nuances of the language within them.

    https://en.wikipedia.org/wiki/Natural_language_processing

    Simply said, NLP is automatic computational processing of human language.

    This includes:

    • Algorithms that take human written language as input
    • Algorithms that produce natural text

    And some examples include:

    • Automatic summarization
    • Language identification
    • Translation

    Step 2: What is Context-Free Grammar (CFG)?

    What is Syntax?

    One basic description of a language’s syntax is the sequence in which the subject, verb, and object usually appear in sentences.

    What is a Formal Grammar?

    A formal grammar is a system of rules for generating sentences in a language; a grammar is usually thought of as a language generator (wiki).

    What is a Context-Free Grammar (CFG)?

    A formal grammar is “context free” if its production rules can be applied regardless of the context of a nonterminal (wiki).
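To make the “language generator” view concrete, here is a minimal sketch in plain Python (using a toy grammar invented for illustration, with the same rules as the NLTK example below) that generates a sentence by repeatedly expanding nonterminals:

```python
import random

# Toy grammar as a dict: nonterminal -> list of possible productions
grammar = {
    "S":  [["NP", "VP"]],
    "NP": [["D", "N"], ["N"]],
    "VP": [["V"], ["V", "NP"]],
    "D":  [["the"], ["a"]],
    "N":  [["she"], ["city"], ["car"]],
    "V":  [["saw"], ["walked"]],
}

def generate(symbol="S"):
    """Expand a symbol: terminals are returned as-is, nonterminals
    expand via a randomly chosen production (context-free: the choice
    never depends on the surrounding symbols)."""
    if symbol not in grammar:
        return [symbol]
    words = []
    for sym in random.choice(grammar[symbol]):
        words.extend(generate(sym))
    return words

print(" ".join(generate()))
```

Every sentence this produces is grammatical according to the toy rules, which is exactly what “a grammar is a language generator” means.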

    Step 3: How to use NLTK and see the Challenge with CFG

    What is NLTK?

    NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources.

    You can install it with the following command.

    pip install nltk
    

    Note that you can run this from inside your Jupyter Notebook with this command.

    !pip install nltk
    

    Let’s write a CFG and understand the challenge working with language like that.

    import nltk
    grammar = nltk.CFG.fromstring("""
        S -> NP VP
        NP -> D N | N
        VP -> V | V NP
        D -> "the" | "a"
        N -> "she" | "city" | "car"
        V -> "saw" | "walked"    
    """)
    parser = nltk.ChartParser(grammar)
    sentence = input().split()
    for tree in parser.parse(sentence):
        tree.pretty_print()
    

    If you run that code and type: she saw a car, then the parser will print the corresponding parse tree.

    Think about CFGs this way: if you are a computer, yes, you can generate all these trees representing the CFG, but there is a challenge.

    You need to encode all possibilities. That is, the above grammar only understands the words explicitly encoded in it.

    To have a full language grammar, it becomes very complex – or should we say – impossible.

    What to do then?

    Step 4: Use N-grams to understand language

    The idea behind n-grams is to understand a small subset of the language. Not to focus on the bigger picture, but just a small subset of it.

    You could set up as follows.

    • n-gram
      • a contiguous sequence of n items from a sample of text
    • Word n-gram
      • a contiguous sequence of n words from a sample of text
    • unigram
      • 1 item in sequence
    • bigram
      • 2 items in sequence
    • trigram
      • 3 items in sequence

    We will focus on 3-grams – and the reason for that is that 4-grams and above require a lot of text to be useful.

    Again, a trigram takes 3-word contexts and looks at them in isolation.
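As a quick illustration (plain Python, no NLTK needed – the example sentence is made up), n-grams are just a sliding window over the token list:

```python
def ngrams(tokens, n):
    """Return all contiguous n-item sequences from a list of tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "she saw a car in the city".split()
print(ngrams(tokens, 3))
# [('she', 'saw', 'a'), ('saw', 'a', 'car'), ('a', 'car', 'in'), ('car', 'in', 'the'), ('in', 'the', 'city')]
```

NLTK’s `nltk.ngrams`, used later in this tutorial, does the same thing (returning a generator instead of a list).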

    Let’s try to work with that.

    Step 5: Word Tokenization

    Word Tokenization is the task of splitting a sequence of words into tokens. This makes further processing easier.

    Notice that we need to consider commas, punctuation, etc.
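To see what tokenization has to deal with, here is a deliberately naive sketch (just a regular expression; `nltk.word_tokenize`, used below, handles many more cases, such as contractions and abbreviations):

```python
import re

def simple_tokenize(text):
    """Naive tokenizer: lowercase the text and keep only runs of
    letters, so commas, quotes, and other punctuation are dropped."""
    return re.findall(r"[a-z]+", text.lower())

print(simple_tokenize("Holmes said, 'Elementary!'"))
# ['holmes', 'said', 'elementary']
```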

    To follow this code, you need to download the files from here: GitHub link. You can download them as a zip file from here: Zip-download.

    Here we read all the content and tokenize it.

    import os
    import nltk
    from collections import Counter
    # You need to download the tokenizer models
    nltk.download('punkt')
    content = []
    for filename in os.listdir('files/holmes/'):
        with open(f'files/holmes/{filename}') as f:
            content.append(f.read())
    corpus = []
    for item in content:
        corpus.extend([word.lower() for word in nltk.word_tokenize(item) if any(c.isalpha() for c in word)])
    

    Now we have all the tokens in the corpus.

    Step 6: Generating trigrams from the corpus

    Now it is straightforward to generate trigrams from the corpus.

    ngrams = Counter(nltk.ngrams(corpus, 3))
    

    What to use it for?

    Well, you can look for which 3 words are most likely to appear in a sequence.

    for ngram, freq in ngrams.most_common(10):
        print(f'{freq}: {ngram}')
    

    Giving the following output.

    80: ('it', 'was', 'a')
    71: ('one', 'of', 'the')
    65: ('i', 'think', 'that')
    59: ('out', 'of', 'the')
    55: ('that', 'it', 'was')
    55: ('that', 'he', 'had')
    55: ('there', 'was', 'a')
    55: ('that', 'he', 'was')
    52: ('it', 'is', 'a')
    49: ('i', 'can', 'not')
    

    The first time I saw that, I don’t think I really appreciated the full potential of it. But actually, you can learn a lot from it. If you look into the project (see the YouTube video), you will see that you can predict who the person behind a Twitter account is.

    Yes, that is right. You will be surprised.

    Step 7: What are Markov Models?

    What is the next step?

    A Markov chain is a stochastic model describing a sequence of possible events in which the probability of each event depends only on the state attained in the previous event (wiki).

    That is exactly the next step of what we did before.

    Given any two words, we have the probabilities of the next word.
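A minimal sketch of that idea in plain Python (the example sentence is made up) builds a table mapping each word pair to a frequency count of the words that follow it, then samples the next word in proportion to those counts:

```python
import random
from collections import Counter, defaultdict

def build_chain(tokens):
    """Map each pair of consecutive words to a Counter of the words
    observed to follow that pair (i.e. trigram counts, reorganized)."""
    chain = defaultdict(Counter)
    for a, b, c in zip(tokens, tokens[1:], tokens[2:]):
        chain[(a, b)][c] += 1
    return chain

def next_word(chain, pair):
    """Sample the next word in proportion to observed frequencies."""
    counts = chain[pair]
    return random.choices(list(counts), weights=counts.values())[0]

tokens = "the cat sat on the mat and the cat ran off".split()
chain = build_chain(tokens)
print(chain[("the", "cat")])             # Counter({'sat': 1, 'ran': 1})
print(next_word(chain, ("the", "cat")))  # 'sat' or 'ran', at random
```

Repeatedly feeding the last two generated words back into `next_word` produces text one word at a time, which is essentially what markovify does for us below.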

    This can be done by using the markovify library. Install it as follows.

    pip install markovify
    

    Then you can create an example like this.

    import markovify
    import urllib.request
    # open() cannot read URLs directly - download the file first
    url = 'https://raw.githubusercontent.com/LearnPythonWithRune/MachineLearningWithPython/main/files/shakespeare.txt'
    with urllib.request.urlopen(url) as f:
        text = f.read().decode()
    model = markovify.Text(text)
    print(model.make_sentence())
    

    This will generate a random sentence from that idea.

    'In the wars; defeat thy favor with an ordinary pitch, Who else but I, his forlorn duchess, Was made much poorer by it; but first, how get hence.'

    Maybe not that good.

    Want to learn more?

    In the next lesson you will learn the Naive Bayes’ Rule for Sentiment Classification.

    This is part of a FREE 10h Machine Learning course with Python.

    • 15 video lessons – which explain Machine Learning concepts, demonstrate models on real data, introduce projects and show a solution (YouTube playlist).
    • 30 Jupyter Notebooks – with the full code and explanation from the lectures and projects (GitHub).
    • 15 projects – with step guides to help you structure your solutions, and solutions explained at the end of the video lessons (GitHub).
