How to use Natural Language Processing for Trigrams

What will we cover?

  • How the simple syntax of language can be parsed
  • What Context-Free Grammar (CFG) is
  • How to use it to parse text
  • How to understand text with trigrams
  • A brief look at Markov Chains
  • How they can be used to generate predictions

Step 1: What is Natural Language Processing?

Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data. The goal is a computer capable of “understanding” the contents of documents, including the contextual nuances of the language within them.

https://en.wikipedia.org/wiki/Natural_language_processing

Simply put, NLP is the automatic computational processing of human language.

This includes:

  • Algorithms that take human written language as input
  • Algorithms that produce natural text

And some examples include:

  • Automatic summarization
  • Language identification
  • Translation

Step 2: What is Context-Free Grammar (CFG)?

What is Syntax?

One basic description of a language’s syntax is the sequence in which the subject, verb, and object usually appear in sentences.

What is a Formal Grammar?

A formal grammar is a system of rules for generating sentences in a language, and a grammar is usually thought of as a language generator (wiki).

What is a Context-Free Grammar (CFG)?

A formal grammar is “context free” if its production rules can be applied regardless of the context of a nonterminal (wiki).

Step 3: How to use NLTK and see the Challenge with CFG

What is NLTK?

NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources.

You can install it with the following command.

pip install nltk

Notice that you can do that from inside your Jupyter Notebook with this command.

!pip install nltk

Let’s write a CFG and understand the challenge of working with language like that.

import nltk

# A small grammar: S(entence), NP (noun phrase), VP (verb phrase),
# D(eterminer), N(oun), V(erb)
grammar = nltk.CFG.fromstring("""
    S -> NP VP
    NP -> D N | N
    VP -> V | V NP
    D -> "the" | "a"
    N -> "she" | "city" | "car"
    V -> "saw" | "walked"
""")

# Parse a sentence typed by the user and print every possible parse tree
parser = nltk.ChartParser(grammar)
sentence = input().split()
for tree in parser.parse(sentence):
    tree.pretty_print()

If you run that code and type: she saw a car, the parser will pretty-print the parse tree, which corresponds to the structure (S (NP (N she)) (VP (V saw) (NP (D a) (N car)))).

Think about CFGs this way: if you are a computer, yes, you can generate all these trees representing the CFG, but there is a challenge.

You need to encode all possibilities. That is, the above grammar only understands the words it has encoded.

To cover a full language, the grammar becomes very complex, or should we say, impossible.

What to do then?

Step 4: Use N-grams to understand language

The idea behind n-grams is to understand a small subset of the language: not to focus on the bigger picture, but only on small local contexts.

The terminology is as follows.

  • 𝑛-gram
    • a contiguous sequence of 𝑛 items from a sample of text
  • Word 𝑛-gram
    • a contiguous sequence of 𝑛 words from a sample of text
  • unigram
    • 1 item in sequence
  • bigram
    • 2 items in sequence
  • trigram
    • 3 items in sequence

We will focus on 3-grams. The reason is that for 4-grams or above, you need a lot of text to make them useful.

Again, a trigram takes a 3-word context and looks at it in isolation.
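To make this concrete, here is a minimal sketch (not from the original post) of what the trigrams of a short sentence look like, using nltk.ngrams, which we will also use below.

import nltk

# Each trigram is a sliding 3-word window over the sentence
words = "she saw a car in the city".split()
for trigram in nltk.ngrams(words, 3):
    print(trigram)
# ('she', 'saw', 'a')
# ('saw', 'a', 'car')
# ('a', 'car', 'in')
# ('car', 'in', 'the')
# ('in', 'the', 'city')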

Let’s try to work with that.

Step 5: Word Tokenization

Word Tokenization is the task of splitting a sequence of words into tokens. This makes further processing easier.

Notice that we need to consider commas, punctuation, etc.
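As a small illustration (not from the original post), nltk.word_tokenize treats punctuation marks as separate tokens.

import nltk

nltk.download('punkt')  # tokenizer models, only needed once

# Punctuation becomes separate tokens
print(nltk.word_tokenize("Holmes sat, reading the paper."))
# ['Holmes', 'sat', ',', 'reading', 'the', 'paper', '.']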

To follow this code, you need to download the files from here: GitHub link. You can download them as a zip file from here: Zip-download.

Here we read all the content and tokenize it.

import os
from collections import Counter
import nltk

# Download the tokenizer models (only needed once)
nltk.download('punkt')

# Read the content of all the Holmes text files
content = []
for filename in os.listdir('files/holmes/'):
    with open(f'files/holmes/{filename}') as f:
        content.append(f.read())

# Tokenize into lowercase words, keeping only tokens that contain letters
corpus = []
for item in content:
    corpus.extend([word.lower() for word in nltk.word_tokenize(item) if any(c.isalpha() for c in word)])

Now we have all the tokens in the corpus.
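As a quick sanity check (not part of the original post), you can inspect the size of the corpus and the first few tokens.

# How many tokens did we get, and what do they look like?
print(len(corpus))
print(corpus[:10])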

Step 6: Generating trigrams from the corpus

Now it is straightforward to generate trigrams from the corpus.

# Count all 3-word sequences in the corpus
ngrams = Counter(nltk.ngrams(corpus, 3))

What to use it for?

Well, you can look for which 3 words are most likely to appear in a sequence.

for ngram, freq in ngrams.most_common(10):
    print(f'{freq}: {ngram}')

This gives the following output.

80: ('it', 'was', 'a')
71: ('one', 'of', 'the')
65: ('i', 'think', 'that')
59: ('out', 'of', 'the')
55: ('that', 'it', 'was')
55: ('that', 'he', 'had')
55: ('there', 'was', 'a')
55: ('that', 'he', 'was')
52: ('it', 'is', 'a')
49: ('i', 'can', 'not')

The first time I saw that, I don’t think I really appreciated the full potential of it. But actually, you can learn a lot from it. If you look into the project (see the YouTube video), you will see that you can predict who is behind a Twitter account.

Yes, that is right. You will be surprised.

Step 7: What are Markov Models?

What is the next step?

A Markov chain is a stochastic model describing a sequence of possible events in which the probability of each event depends only on the state attained in the previous event (wiki).

That is exactly the next step of what we did before.

Given any two words, we have created the probabilities of the next word.
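Before reaching for a library, here is a minimal sketch of that idea (not from the original post), reusing the ngrams counter from Step 6: for every pair of words, count how often each third word follows, and sample the next word from those counts.

import random
from collections import Counter, defaultdict

# Transition table: (word1, word2) -> Counter of the words that follow them,
# built from the trigram counts (ngrams) computed in Step 6
transitions = defaultdict(Counter)
for (w1, w2, w3), freq in ngrams.items():
    transitions[(w1, w2)][w3] += freq

def next_word(w1, w2):
    # Sample the next word given the previous two words
    # (assumes the pair occurs somewhere in the corpus)
    candidates = transitions[(w1, w2)]
    return random.choices(list(candidates), weights=list(candidates.values()))[0]

print(next_word('it', 'was'))  # often 'a', judging by the counts above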

Rather than building this by hand, you can use the markovify library. Install it as follows.

pip install markovify

Then you can create an example like this.

import urllib.request
import markovify

# The file is hosted online, so fetch it with urllib instead of open()
url = 'https://raw.githubusercontent.com/LearnPythonWithRune/MachineLearningWithPython/main/files/shakespeare.txt'
with urllib.request.urlopen(url) as f:
    text = f.read().decode()

model = markovify.Text(text)
model.make_sentence()

This will generate a random sentence based on the model.

'In the wars; defeat thy favor with an ordinary pitch, Who else but I, his forlorn duchess, Was made much poorer by it; but first, how get hence.'

Maybe not that good.
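If the sentences run long, markovify also provides make_short_sentence, which caps the output length (a usage note, not from the original post).

# Generate a sentence of at most 140 characters
model.make_short_sentence(140)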

Want to learn more?

This is part of a FREE 10h Machine Learning course with Python.

  • 15 video lessons – which explain Machine Learning concepts, demonstrate models on real data, and introduce projects and show a solution (YouTube playlist).
  • 30 Jupyter Notebooks – with the full code and explanations from the lectures and projects (GitHub).
  • 15 projects – with step guides to help you structure your solutions, and the solutions explained at the end of the video lessons (GitHub).
