Learn Information Extraction with Skip-Gram Architecture

What will we cover?

  • What is Information Extraction
  • Extract knowledge from patterns
  • Word representation
  • Skip-Gram architecture
  • See how words relate to each other (this is surprising)

What is Information Extraction?

Information Extraction is the task of automatically extracting structured information from unstructured and/or semi-structured machine-readable documents (wiki).

Let’s try some different approaches.

Approach 1: Extract Knowledge from Patterns

Start with pieces of knowledge that you already know belong together, then look for the patterns in which they appear.

This is actually a powerful approach. Assume you know that Amazon was founded in 1994 and Facebook was founded in 2004.

A pattern could be “When {company} was founded in {year},”
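As a minimal sketch of the idea, such a template can be written as a regular expression with one capture group per slot. The two sentences below are made up for illustration; a real corpus would be much noisier and need more patterns.

```python
import re

# Made-up sentences for illustration only.
text = (
    "When Facebook was founded in 2004, social networks were still new. "
    "When Google was founded in 1998, web search looked very different."
)

# The template "When {company} was founded in {year}," as a regex:
# one capture group for the company name, one for a four-digit year.
pattern = re.compile(r"When (.+?) was founded in (\d{4}),")

facts = pattern.findall(text)
print(facts)  # [('Facebook', '2004'), ('Google', '1998')]
```

Every match gives a new (company, year) pair without anyone typing it in by hand.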

Let’s try this in real life.

import pandas as pd
import re
from urllib.request import urlopen

# Reading a knowledge base (here only a couple of entries in the csv file)
books = pd.read_csv('https://raw.githubusercontent.com/LearnPythonWithRune/MachineLearningWithPython/main/files/books.csv', header=None)

# Convert to a list
book_list = books.values.tolist()

# Read some content (here a web page); open() only reads local files, so use urlopen for a URL
with urlopen('https://raw.githubusercontent.com/LearnPythonWithRune/MachineLearningWithPython/main/files/penguin.html') as f:
    corpus = f.read().decode('utf-8')

corpus = corpus.replace('\n', ' ').replace('\t', ' ')

# Look for the places where our knowledge appears, to find patterns
for val1, val2 in book_list:
    print(val1, '-', val2)
    for i in range(0, len(corpus) - 100, 20):
        pattern = corpus[i:i + 100]
        if val1 in pattern and val2 in pattern:
            print('-:', pattern)

This gives the following.

1984 - George Orwell
-: ge-orwell-with-a-foreword-by-thomas-pynchon/">1984</a></h2>   <h2 class="author">by George Orwell</h
-: eword-by-thomas-pynchon/">1984</a></h2>   <h2 class="author">by George Orwell</h2>    <div class="de
-: hon/">1984</a></h2>   <h2 class="author">by George Orwell</h2>    <div class="desc">We were pretty c
The Help - Kathryn Stockett
-: /the-help-by-kathryn-stockett/">The Help</a></h2>   <h2 class="author">by Kathryn Stockett</h2>    <
-: -stockett/">The Help</a></h2>   <h2 class="author">by Kathryn Stockett</h2>    <div class="desc">Thi

This gives you an idea of some patterns.

prefix = re.escape('/">')
middle = re.escape('</a></h2>   <h2 class="author">by ')
suffix = re.escape('</h2>    <div class="desc">')

regex = f"{prefix}(.{{0,50}}?){middle}(.{{0,50}}?){suffix}"
results = re.findall(regex, corpus)

for result in results:
    print(result)

This gives the following pattern matches, which is new knowledge.

[('War and Peace', 'Leo Tolstoy'),
 ('Song of Solomon', 'Toni Morrison'),
 ('Ulysses', 'James Joyce'),
 ('The Shadow of the Wind', 'Carlos Ruiz Zafon'),
 ('The Lord of the Rings', 'J.R.R. Tolkien'),
 ('The Satanic Verses', 'Salman Rushdie'),
 ('Don Quixote', 'Miguel de Cervantes'),
 ('The Golden Compass', 'Philip Pullman'),
 ('Catch-22', 'Joseph Heller'),
 ('1984', 'George Orwell'),
 ('The Kite Runner', 'Khaled Hosseini'),
 ('Little Women', 'Louisa May Alcott'),
 ('The Cloud Atlas', 'David Mitchell'),
 ('The Fountainhead', 'Ayn Rand'),
 ('The Picture of Dorian Gray', 'Oscar Wilde'),
 ('Lolita', 'Vladimir Nabokov'),
 ('The Help', 'Kathryn Stockett'),
 ("The Liar's Club", 'Mary Karr'),
 ('Moby-Dick', 'Herman Melville'),
 ("Gravity's Rainbow", 'Thomas Pynchon'),
 ("The Handmaid's Tale", 'Margaret Atwood')]

Approach 2: Skip-Gram Architecture

One-Hot Representation

  • Represents a word as a vector with a single 1, and with all other values as 0
  • Maybe not that useful to work with, since the vector says nothing about the meaning of the word
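A small sketch of one-hot vectors over a made-up four-word vocabulary. It also hints at the limitation: the cosine similarity between any two different words is 0, so no pair of words looks more related than any other.

```python
import math

# Made-up vocabulary for illustration.
vocabulary = ["he", "wrote", "a", "book"]

def one_hot(word):
    """Return a vector with a single 1 at the word's vocabulary index."""
    return [1.0 if w == word else 0.0 for w in vocabulary]

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(one_hot("wrote"))                                      # [0.0, 1.0, 0.0, 0.0]
print(cosine_similarity(one_hot("wrote"), one_hot("book")))  # 0.0
print(cosine_similarity(one_hot("a"), one_hot("book")))      # 0.0
```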

Distributed Representation

  • representation of meaning distributed across multiple values

How to define words as vectors

  • A word is defined by what words surround it
  • Based on the context
  • What words happen to show up around it


word2vec

  • model for generating word vectors

Skip-Gram Architecture

  • Neural network architecture for predicting context words given a target word
    • Given a word – what words show up around it in a context
  • Example
    • Given a target word (the input word), train the network to predict which context words show up around it (the output side)
    • Then the weights from the input node (target word) to the hidden layer (5 weights) give a representation
    • Hence, the word will be represented by a vector
    • The number of hidden nodes determines how big the vector should be (here 5)
  • Idea is as follows
    • Each input word will get weights to the hidden layer
    • The hidden layer will be trained
    • Then each word will be represented by its weights to the hidden layer
  • Intuition
    • If two words have similar contexts (they show up in the same places), then they must be similar, and their representations have a small distance from each other
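To make "target word" and "context words" concrete, here is a minimal sketch of how training pairs could be collected from a sentence. It only builds the (target, context) pairs; the neural network itself is not shown. The function name, the window size and the example sentence are assumptions made for illustration.

```python
def skip_gram_pairs(tokens, window=2):
    """Return (target, context) pairs for every token and its neighbours."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs

tokens = "the king rules the land".split()
pairs = skip_gram_pairs(tokens, window=1)
print(pairs)
```

Words that keep getting paired with the same context words are the ones that end up with similar vectors.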
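The following code reads word vectors from a file and defines two helper functions to compare them.

import numpy as np
from scipy.spatial.distance import cosine
from urllib.request import urlopen

# open() only reads local files, so fetch the word vectors with urlopen
words = {}
with urlopen('https://raw.githubusercontent.com/LearnPythonWithRune/MachineLearningWithPython/main/files/words.txt') as f:
    for line in f.read().decode('utf-8').splitlines():
        row = line.split()
        word = row[0]
        vector = np.array([float(x) for x in row[1:]])
        words[word] = vector

def distance(vector1, vector2):
    # Cosine distance: 0 means same direction, larger means less similar
    return cosine(vector1, vector2)

def closest_words(vector):
    # Rank all known words by their distance to the given vector
    distances = {w: distance(vector, words[w]) for w in words}
    return sorted(distances, key=lambda w: distances[w])[:10]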

This will amaze you. But first let’s see what it does.

distance(words['king'], words['queen'])

Gives 0.19707422881543946. This is the cosine distance between the two vectors; on its own the number does not tell us much.

distance(words['king'], words['pope'])

Giving 0.42088794105426874. Compared with the first number, this says that king is closer to queen than to pope.

closest_words(words['king'] - words['man'] + words['woman'])




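Why is this amazing?

Well, king – man + woman becomes queen: the word closest to the resulting vector is queen.

The vectors were learned only from which words show up around each other, and still they capture this relationship.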
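The real vectors have many dimensions, which makes the arithmetic hard to picture. The sketch below repeats it with hand-made three-dimensional vectors (royalty, male, female). The numbers are invented for illustration and are not taken from words.txt.

```python
import math

# Hand-made toy vectors with dimensions (royalty, male, female).
# The numbers are invented for illustration, not learned from data.
toy = {
    "king":  [0.9, 0.9, 0.1],
    "queen": [0.9, 0.1, 0.9],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
}

def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1 - dot / (norm_u * norm_v)

# king - man + woman, computed one dimension at a time
target = [k - m + w for k, m, w in zip(toy["king"], toy["man"], toy["woman"])]

ranked = sorted(toy, key=lambda word: cosine_distance(target, toy[word]))
print(ranked[0])  # queen
```

Subtracting man removes the "male" part of king, adding woman puts the "female" part in, and the royalty part is left untouched.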

Want to learn more?

This is part of a FREE 10h Machine Learning course with Python.

  • 15 video lessons – which explain Machine Learning concepts, demonstrate models on real data, introduce projects and show a solution (YouTube playlist).
  • 30 Jupyter Notebooks – with the full code and explanation from the lectures and projects (GitHub).
  • 15 projects – with step guides to help you structure your solutions, and solutions explained at the end of the video lessons (GitHub).

Implement a Term Frequency by Inverse Document Frequency (TF-IDF) with NLTK

What will we cover?

  • Learn what Information Retrieval is
  • Topic modeling documents
  • How to use Term Frequency and understand the limitations
  • Implement Term Frequency by Inverse Document Frequency (TF-IDF)

Step 1: What is Information Retrieval (IR)?

The task of finding relevant documents in response to a user query. Web search engines are the most visible IR applications (wiki).

Topic modeling is a model for discovering the topics for a set of documents, e.g., it can provide us with methods to organize, understand and summarize large collections of textual information.

Topic modeling can be described as a method for finding a group of words that best represent the information.

Step 2: Approach 1: Term Frequency

The number of times a term occurs in a document is called its term frequency (wiki).

tf(𝑡, 𝑑) = 𝑓_{𝑡,𝑑}: the number of times term 𝑡 occurs in document 𝑑.

There are other ways to define term frequency (see wiki).

Let’s try to write some code to explore this concept.

To follow this code you need to download the files from here: GitHub link. You can also download them as a zip file from here: Zip-download.

import os
import nltk
import math

corpus = {}

# Count the term frequencies
for filename in os.listdir('files/holmes/'):
    with open(f'files/holmes/{filename}') as f:
        content = [word.lower() for word in nltk.word_tokenize(f.read()) if word.isalpha()]
        freq = {word: content.count(word) for word in set(content)}
        corpus[filename] = freq

for filename in corpus:
    corpus[filename] = sorted(corpus[filename].items(), key=lambda x: x[1], reverse=True)

for filename in corpus:
    for word, score in corpus[filename][:5]:
        print(f'  {word}: {score}')

This will output (only sample output).

  the: 600
  and: 281
  of: 276
  a: 252
  i: 233
  the: 326
  i: 298
  and: 226
  to: 185
  a: 173

We see that the most used words in each document are so-called stop-words.

  • words that have little meaning on their own (wiki)
  • Examples: am, by, do, is, which, ….
  • Student exercise: Remove function words and see result (HINT: nltk has a list of stopwords)

What you will discover is that even if you remove all stop-words, you will still not get anything very useful. Some words are just more common than others.
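One possible way to approach the exercise, as a sketch: count the words with and without a stop-word filter. The stop-word set and the sample text are made up to keep the example self-contained; nltk.corpus.stopwords offers a much fuller list.

```python
from collections import Counter

# A tiny hand-written stop-word list, used only to keep the sketch
# self-contained; it is an assumption, not NLTK's list.
STOP_WORDS = {"the", "a", "of", "and", "to", "in", "is", "was", "i", "it"}

def term_frequencies(text, stop_words=frozenset()):
    """Count alphabetic, lower-cased words, skipping any stop-words."""
    tokens = [w.lower() for w in text.split() if w.isalpha()]
    return Counter(w for w in tokens if w not in stop_words)

# Made-up sample text.
text = "the hound of the moor was the terror of the village and the hound was seen"

print(term_frequencies(text).most_common(1))              # [('the', 5)]
print(term_frequencies(text, STOP_WORDS).most_common(1))  # [('hound', 2)]
```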

Step 3: Approach 2: TF-IDF

TF-IDF is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus. (wiki)

Inverse Document Frequency

  • Measure of how common or rare a word is across documents

idf(𝑡, 𝐷) = log(𝑁 / |{𝑑 ∈ 𝐷 : 𝑡 ∈ 𝑑}|) = log(Total Documents / Number of Documents Containing “term”)

  • 𝐷: All documents in the corpus
  • 𝑁: total number of documents in the corpus 𝑁=|𝐷|
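The formula can be sketched in a few lines of Python. The function name and the three small documents (given as sets of words) are assumptions for illustration, and the natural logarithm is used; another base only rescales the values.

```python
import math

def idf(term, documents):
    """log(N / number of documents that contain the term)."""
    containing = sum(1 for doc in documents if term in doc)
    if containing == 0:
        raise ValueError(f"{term!r} does not occur in any document")
    return math.log(len(documents) / containing)

# Made-up documents, each represented by the set of its words.
documents = [
    {"holmes", "watson", "the"},
    {"holmes", "hound", "the"},
    {"holmes", "moriarty", "the"},
]

print(idf("the", documents))    # 0.0, the word is in every document
print(idf("hound", documents))  # log(3), about 1.0986
```

A word that appears in every document gets an IDF of 0, which is what pushes very common words down in the ranking.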


TF-IDF ranks which words are important in a document by multiplying Term Frequency (TF) by Inverse Document Frequency (IDF): tf-idf(𝑡, 𝑑, 𝐷) = tf(𝑡, 𝑑) · idf(𝑡, 𝐷)


Let’s make a small example.

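import math

doc1 = "This is the sample of the day".split()
doc2 = "This is another sample of the day".split()

corpus = [doc1, doc2]

# Term frequencies per document
tf1 = {word: doc1.count(word) for word in set(doc1)}
tf2 = {word: doc2.count(word) for word in set(doc2)}

# IDF as defined above: log(total documents / documents containing the term)
term = 'another'
idf = math.log(len(corpus) / sum(term in doc for doc in corpus))

# TF-IDF of the term in each document
tf1.get(term, 0)*idf, tf2.get(term, 0)*idf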

Want to learn more?

If you watch the YouTube video you will see how to do it for a bigger corpus of files.


Naive Bayes’ Rule for Sentiment Classification with Full Explanation

What will we cover?

  • What is Text Categorization
  • Learn about the Bag-of-Words Model
  • Understand Naive Bayes’ Rule
  • How to use Naive Bayes’ Rule for sentiment classification (text categorization)
  • What problem smoothing solves

Step 1: What is Text Categorization?

Text categorization (a.k.a. text classification) is the task of assigning predefined categories to free-text documents. It can provide conceptual views of document collections and has important applications in the real world.


Examples of Text Categorization include.

  • Inbox vs Spam
  • Product review: Positive vs Negative review

Step 2: What is the Bag-of-Words model?

We have already learned from Context-Free Grammars, that understanding the full structure of language is not efficient or even possible for Natural Language Processing. One approach was to look at trigrams (3 consecutive words), which can be used to learn about the language and even generate sentences.

Another approach is the Bag-of-Words model.

The Bag-of-Words model is a simplifying representation used in natural language processing and information retrieval (IR). In this model, a text (such as a sentence or a document) is represented as the bag (multiset) of its words, disregarding grammar and even word order but keeping multiplicity.


What does that all mean?

  • The structure is not important
  • Works well to classify
  • Example could be
    • I love this product.
    • This product feels cheap.
    • This is the best product ever.
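A bag of words can be sketched with a Counter: it keeps how often each word occurs and forgets the order. The helper name and its simple punctuation handling are assumptions for illustration.

```python
from collections import Counter

def bag_of_words(sentence):
    """Lower-case the sentence, strip simple punctuation, count the words."""
    words = [w.strip(".,!?").lower() for w in sentence.split()]
    return Counter(w for w in words if w)

a = bag_of_words("This product feels cheap.")
b = bag_of_words("Cheap feels this product!")

print(a)
print(a == b)  # True: word order is ignored
```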

Step 3: What is Naive Bayes’ Classifier?

Naive Bayes classifiers are a family of simple “probabilistic classifiers” based on applying Bayes’ theorem with strong (naïve) independence assumptions between the features (wiki).

Bayes’ Rule (Theorem)

Describes the probability of an event, based on prior knowledge of conditions that might be related to the event (wiki).

𝑃(𝑏|𝑎) = 𝑃(𝑎|𝑏)𝑃(𝑏) / 𝑃(𝑎)

Explained with Example

What is the probability that the sentiment is positive given the sentence “I love this product”? This can be expressed as follows.

𝑃(positive|”I love this product”)=𝑃(positive|”I”, “love”, “this”, “product”)

Bayes’ Rule implies it is equal to

𝑃(“I”, “love”, “this”, “product”|positive)𝑃(positive) / 𝑃(“I”, “love”, “this”, “product”)

Or proportional to

𝑃(“I”, “love”, “this”, “product”|positive)𝑃(positive)

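The ‘Naive’ part is that we assume the words are independent of each other given the sentiment, and use this to simplify the expression to

𝑃(positive)𝑃(“I”|positive)𝑃(“love”|positive)𝑃(“this”|positive)𝑃(“product”|positive)

And then we have that

𝑃(positive) = number of positive samples / number of samples.

𝑃(“love”|positive) = number of positive samples with “love” / number of positive samples.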

Let’s try a more concrete example.

𝑃(positive)𝑃(“I”|positive)𝑃(“love”|positive)𝑃(“this”|positive)𝑃(“product”|positive) = 0.47∗0.30∗0.40∗0.28∗0.25=0.003948

𝑃(negative)𝑃(“I”|negative)𝑃(“love”|negative)𝑃(“this”|negative)𝑃(“product”|negative)=0.53∗0.20∗0.05∗0.42∗0.28 = 0.00062328

Calculate the likelihood

“I love this product” is positive: 0.003948 / (0.003948 + 0.00062328) = 86.4%

“I love this product” is negative: 0.00062328 / (0.003948 + 0.00062328) = 13.6%
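The same calculation as a short sketch in Python. The probabilities are copied from the worked example; the dictionary layout and the function name are assumptions for illustration.

```python
from math import prod

# Probabilities copied from the worked example above.
prior = {"positive": 0.47, "negative": 0.53}
likelihood = {
    "positive": {"i": 0.30, "love": 0.40, "this": 0.28, "product": 0.25},
    "negative": {"i": 0.20, "love": 0.05, "this": 0.42, "product": 0.28},
}

def classify(words):
    """Return normalized class probabilities under the naive assumption."""
    scores = {
        label: prior[label] * prod(likelihood[label][w] for w in words)
        for label in prior
    }
    total = sum(scores.values())
    return {label: score / total for label, score in scores.items()}

result = classify(["i", "love", "this", "product"])
print(result)
```

Dividing by the total is what turns the two small products into probabilities that sum to 1, here roughly 86% positive.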

Step 4: The Problem with Naive Bayes’ Classifier


If a word never showed up in the sentences of one category, this will result in a probability of zero. Say, in the above example, that the word “product” never appeared in a positive sentence. This would imply that P(“product” | positive) = 0, which in turn would make the whole product for “I love this product” being positive equal to 0, no matter what the other words say.

There are different approaches to deal with this problem.

Additive Smoothing

Adding a value to each value in the distribution to smooth the data. This is straightforward: it ensures that even if the word “product” never showed up, it will not create a 0 value.

Laplace smoothing

Adding 1 to each value in the distribution. This is just additive smoothing with the value 1.
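A minimal sketch of additive smoothing for a single word. The counts are hypothetical, and `outcomes=2` assumes the feature has two possible values (the word is present or absent in a sample).

```python
def smoothed_probability(count, total, alpha=1, outcomes=2):
    """Additive smoothing: (count + alpha) / (total + alpha * outcomes)."""
    return (count + alpha) / (total + alpha * outcomes)

# Hypothetical counts: "product" appears in 0 of 40 positive samples.
print(smoothed_probability(0, 40, alpha=0))  # 0.0, the problem case
print(smoothed_probability(0, 40, alpha=1))  # 1/42, about 0.0238 (Laplace)
```

With alpha set to 1 this is Laplace smoothing; the unseen word gets a small probability instead of wiping out the whole product.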

Step 5: Use NLTK to classify sentiment

We already introduced the NLTK, which we will use here.

import nltk
import pandas as pd

data = pd.read_csv('https://raw.githubusercontent.com/LearnPythonWithRune/MachineLearningWithPython/main/files/sentiment.csv')

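def extract_words(document):
    # Lower-cased tokens that contain at least one letter
    return set(
        word.lower() for word in nltk.word_tokenize(document)
        if any(c.isalpha() for c in word)
    )

# Collect the vocabulary across all texts
words = set()

for line in data['Text'].to_list():
    words.update(extract_words(line))

# One feature per vocabulary word: is the word present in the text?
features = []
for _, row in data.iterrows():
    row_words = extract_words(row['Text'])
    features.append(({word: (word in row_words) for word in words}, row['Label']))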

classifier = nltk.NaiveBayesClassifier.train(features)

This creates a classifier (based on a small dataset, don’t expect magic).

To use it, try the following code.

s = input()

feature = {word: (word in extract_words(s)) for word in words}

result = classifier.prob_classify(feature)

for key in result.samples():
    print(key, result.prob(key))

For example, if you input “this was great”, you could get the following.

this was great
 Negative 0.10747100603951745
 Positive 0.8925289939604821

Want to learn more?

If you follow the video, you will also be introduced to a project where we create a sentiment classifier on a big Twitter corpus.


How to use Natural Language Processing for Trigrams

What will we cover?

  • How the simple syntax of language can be parsed
  • What Context-Free Grammar (CFG) is
  • Use it to parse text
  • Understand text in trigrams
  • A brief look at Markov Chains
  • See how it can be used to generate predictions

Step 1: What is Natural Language Processing?

Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data. The goal is a computer capable of “understanding” the contents of documents, including the contextual nuances of the language within them.


Simply said, NLP is automatic computational processing of human language.

This includes.

  • Algorithms that take human written language as input
  • Algorithms that produce natural text

And some examples include.

  • Automatic summarization
  • Language identification
  • Translation

Step 2: What is Context-Free Grammar (CFG)?

What is Syntax?

One basic description of a language’s syntax is the sequence in which the subject, verb, and object usually appear in sentences.

What is a Formal Grammar?

A system of rules for generating sentences in a language. A grammar is usually thought of as a language generator (wiki).

What is a Context-Free Grammar (CFG)?

A formal grammar is “context free” if its production rules can be applied regardless of the context of a nonterminal (wiki).
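To see a grammar as a language generator, here is a small sketch in plain Python. The toy grammar is made up for illustration: each nonterminal maps to its possible productions, and a production is picked without looking at the surrounding symbols, which is the "context free" part.

```python
import random

# Hypothetical toy grammar: each nonterminal maps to its alternatives.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["D", "N"], ["N"]],
    "VP": [["V"], ["V", "NP"]],
    "D": [["the"], ["a"]],
    "N": [["she"], ["city"], ["car"]],
    "V": [["saw"], ["walked"]],
}

def generate(symbol, rng):
    """Expand a symbol; the rule chosen never depends on its neighbours."""
    if symbol not in GRAMMAR:  # terminal: an actual word
        return [symbol]
    production = rng.choice(GRAMMAR[symbol])
    words = []
    for part in production:
        words.extend(generate(part, rng))
    return words

rng = random.Random(0)  # fixed seed so repeated runs give the same sentence
sentence = generate("S", rng)
print(" ".join(sentence))
```

The generated sentences are grammatical according to the rules, but not necessarily meaningful.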

Step 3: How to use NLTK and see the Challenge with CFG

What is NLTK?

NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources.

You can install it by the following command.

pip install nltk

Notice that you can also do that from inside your Jupyter Notebook with this command.

!pip install nltk

Let’s write a CFG and understand the challenge working with language like that.

import nltk

grammar = nltk.CFG.fromstring("""
    S -> NP VP

    NP -> D N | N
    VP -> V | V NP

    D -> "the" | "a"
    N -> "she" | "city" | "car"
    V -> "saw" | "walked"
""")

parser = nltk.ChartParser(grammar)

sentence = input().split()

# Print every parse tree the grammar allows for the sentence
for tree in parser.parse(sentence):
    tree.pretty_print()

If you run that code and type: she saw a car then you will get the parse tree of the sentence.

Think about CFGs this way: if you are a computer, yes, you can generate all these trees representing the CFG, but there is a challenge.

You need to encode all possibilities. That is, the above grammar only understands the encoded words.

To have a full language grammar, it becomes very complex – or should we say – impossible.

What to do then?

Step 4: Use N-grams to understand language

The idea behind n-grams is to understand a small subset of the language. Not to focus on the bigger picture, but just a small subset of it.

You could set up as follows.

  • 𝑛-gram
    • a contiguous sequence of 𝑛 items from a sample text
  • Word 𝑛-gram
    • a contiguous sequence of 𝑛 words from a sample text
  • unigram
    • 1 item in sequence
  • bigram
    • 2 items in sequence
  • trigram
    • 3 items in sequence

We will focus on 3-grams – and the reason for that is if you need 4-grams or above, then you need a lot of text to make it useful.

Again, a trigram takes a 3-word context and looks at it in isolation.
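The definitions above can be sketched with plain list slicing. The function name and the example sentence are assumptions for illustration; NLTK has its own nltk.ngrams helper.

```python
def ngrams(tokens, n):
    """Return every contiguous run of n tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "it was a dark and stormy night".split()

print(ngrams(tokens, 1)[:3])  # unigrams
print(ngrams(tokens, 2)[:3])  # bigrams
print(ngrams(tokens, 3))      # trigrams
```

A text with 7 words contains 5 trigrams, since the window can start at 5 different positions.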

Let’s try to work with that.

Step 5: Word Tokenization

Word Tokenization is the task of splitting a sequence of characters into tokens (here: words). This makes further processing easier.

Notice that we need to consider commas, periods and other punctuation.
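To get a feeling for the problem, here is a very simple regex-based tokenizer. It is only a sketch under the assumption of plain English text, and it is not a replacement for nltk.word_tokenize, which handles many more cases and splits some words differently.

```python
import re

def simple_tokenize(text):
    """Split text into lower-cased word tokens and punctuation tokens."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?|[.,!?;]", text.lower())

tokens = simple_tokenize("Holmes, who didn't move, said nothing.")
print(tokens)
# ['holmes', ',', 'who', "didn't", 'move', ',', 'said', 'nothing', '.']

# Keep only the word tokens
words_only = [t for t in tokens if t[0].isalpha()]
print(words_only)
```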

To follow this code you need to download the files from here: GitHub link. You can also download them as a zip file from here: Zip-download.

Here we read all the content and tokenize it.

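import os
import nltk
from collections import Counter

# You need to download this (newer NLTK releases may ask for 'punkt_tab' instead)
nltk.download('punkt')

# Read the content of every file
content = []
for filename in os.listdir('files/holmes/'):
    with open(f'files/holmes/{filename}') as f:
        content.append(f.read())

# Tokenize, lower-case, and keep only tokens containing a letter
corpus = []
for item in content:
    corpus.extend([word.lower() for word in nltk.word_tokenize(item) if any(c.isalpha() for c in word)])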

Now we have all the tokens in the corpus.

Step 6: Generating trigrams from the corpus

Now it is straight forward to generate trigrams from the corpus.

ngrams = Counter(nltk.ngrams(corpus, 3))

What to use it for?

Well, you can look for which 3 words are most likely to appear in a sequence.

for ngram, freq in ngrams.most_common(10):
    print(f'{freq}: {ngram}')

Giving the following output.

80: ('it', 'was', 'a')
71: ('one', 'of', 'the')
65: ('i', 'think', 'that')
59: ('out', 'of', 'the')
55: ('that', 'it', 'was')
55: ('that', 'he', 'had')
55: ('there', 'was', 'a')
55: ('that', 'he', 'was')
52: ('it', 'is', 'a')
49: ('i', 'can', 'not')

The first time I saw that, I don’t think I really appreciated the full potential of it. But actually, you can learn a lot from it. If you look into the project (see YouTube video), then you will see that you can predict who is the person behind a Twitter account.

Yes, that is right. You will be surprised.

Step 7: What are Markov Models?

What is the next step?

A Markov chain is a stochastic model describing a sequence of possible events in which the probability of each event depends only on the state attained in the previous event (wiki).

That is exactly the next step of what we did before.

Given any two words, then we have created probabilities of the next word.
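How trigram counts turn into next-word probabilities can be sketched in plain Python. The function names and the tiny made-up text are assumptions for illustration; this is not how the markovify library is implemented internally.

```python
from collections import Counter, defaultdict

def build_model(tokens):
    """Map each pair of consecutive words to a Counter of following words."""
    model = defaultdict(Counter)
    for w1, w2, w3 in zip(tokens, tokens[1:], tokens[2:]):
        model[(w1, w2)][w3] += 1
    return model

def next_word_probabilities(model, w1, w2):
    """Turn the raw counts after (w1, w2) into probabilities."""
    counts = model[(w1, w2)]
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

# Made-up text for illustration.
tokens = "it was a dog and it was a cat and it was late".split()
model = build_model(tokens)

probs = next_word_probabilities(model, "it", "was")
print(probs)  # "a" follows in 2 of 3 cases, "late" in 1 of 3
```

Generating text is then a matter of repeatedly sampling the next word from these probabilities and sliding the two-word state one step forward.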

This can be done by using the markovify library. Install it as follows.

pip install markovify

Then you can create an example like this.

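import markovify
from urllib.request import urlopen

# open() only reads local files, so fetch the text with urlopen
with urlopen('https://raw.githubusercontent.com/LearnPythonWithRune/MachineLearningWithPython/main/files/shakespeare.txt') as f:
    text = f.read().decode('utf-8')

model = markovify.Text(text)

# make_sentence() can return None if no suitable sentence is found
print(model.make_sentence())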

This will generate a random sentence from that idea.

'In the wars; defeat thy favor with an ordinary pitch, Who else but I, his forlorn duchess, Was made much poorer by it; but first, how get hence.'

Maybe not that good.
