Learn Information Extraction with Skip-Gram Architecture

What will we cover?

  • What is Information Extraction?
  • Extract knowledge from patterns
  • Word representation
  • Skip-Gram architecture
  • See how words relate to each other (this is surprising)

What is Information Extraction?

Information Extraction is the task of automatically extracting structured information from unstructured and/or semi-structured machine-readable documents (Wikipedia).

Let’s try some different approaches.

Approach 1: Extract Knowledge from Patterns

Given known facts that fit together, try to find the patterns in which they occur.

This is actually a powerful approach. Assume you know that Amazon was founded in 1994 and Facebook was founded in 2004.

A pattern could be “{company} was founded in {year}”.
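As a minimal sketch of the idea (with made-up example sentences, not the data used below), a known fact like “Amazon / 1994” suggests a regex template that then extracts new company–year pairs:

```python
import re

# Made-up sentences standing in for a real text corpus
text = ("Amazon was founded in 1994 by Jeff Bezos. "
        "Facebook was founded in 2004 in Cambridge. "
        "Spotify was founded in 2006 in Stockholm.")

# The known fact suggests this template: {company} was founded in {year}
pattern = r"([A-Z]\w+) was founded in (\d{4})"

print(re.findall(pattern, text))
# [('Amazon', '1994'), ('Facebook', '2004'), ('Spotify', '2006')]
```

One known fact validated the template, and the template then yields two new facts.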

Let’s try this in real life.

import pandas as pd
import re

# Reading a knowledge base (here only a few entries in the csv file)
books = pd.read_csv('https://raw.githubusercontent.com/LearnPythonWithRune/MachineLearningWithPython/main/files/books.csv', header=None)

# Convert to a list
book_list = books.values.tolist()

# Read some content (here a web page) – note open() cannot fetch URLs, so use urlopen
from urllib.request import urlopen

with urlopen('https://raw.githubusercontent.com/LearnPythonWithRune/MachineLearningWithPython/main/files/penguin.html') as f:
    corpus = f.read().decode()

corpus = corpus.replace('\n', ' ').replace('\t', ' ')

# Look at where our knowledge appears, to find patterns
for val1, val2 in book_list:
    print(val1, '-', val2)
    for i in range(0, len(corpus) - 100, 20):
        pattern = corpus[i:i + 100]
        if val1 in pattern and val2 in pattern:
            print('-:', pattern)

This gives the following.

1984 - George Orwell
-: ge-orwell-with-a-foreword-by-thomas-pynchon/">1984</a></h2>   <h2 class="author">by George Orwell</h
-: eword-by-thomas-pynchon/">1984</a></h2>   <h2 class="author">by George Orwell</h2>    <div class="de
-: hon/">1984</a></h2>   <h2 class="author">by George Orwell</h2>    <div class="desc">We were pretty c
The Help - Kathryn Stockett
-: /the-help-by-kathryn-stockett/">The Help</a></h2>   <h2 class="author">by Kathryn Stockett</h2>    <
-: -stockett/">The Help</a></h2>   <h2 class="author">by Kathryn Stockett</h2>    <div class="desc">Thi

This gives you an idea of some patterns.

prefix = re.escape('/">')
middle = re.escape('</a></h2>   <h2 class="author">by ')
suffix = re.escape('</h2>    <div class="desc">')

regex = f"{prefix}(.{{0,50}}?){middle}(.{{0,50}}?){suffix}"
results = re.findall(regex, corpus)

for result in results:
    print(result)

This gives the following pattern matches – new knowledge extracted from the page.

[('War and Peace', 'Leo Tolstoy'),
 ('Song of Solomon', 'Toni Morrison'),
 ('Ulysses', 'James Joyce'),
 ('The Shadow of the Wind', 'Carlos Ruiz Zafon'),
 ('The Lord of the Rings', 'J.R.R. Tolkien'),
 ('The Satanic Verses', 'Salman Rushdie'),
 ('Don Quixote', 'Miguel de Cervantes'),
 ('The Golden Compass', 'Philip Pullman'),
 ('Catch-22', 'Joseph Heller'),
 ('1984', 'George Orwell'),
 ('The Kite Runner', 'Khaled Hosseini'),
 ('Little Women', 'Louisa May Alcott'),
 ('The Cloud Atlas', 'David Mitchell'),
 ('The Fountainhead', 'Ayn Rand'),
 ('The Picture of Dorian Gray', 'Oscar Wilde'),
 ('Lolita', 'Vladimir Nabokov'),
 ('The Help', 'Kathryn Stockett'),
 ("The Liar's Club", 'Mary Karr'),
 ('Moby-Dick', 'Herman Melville'),
 ("Gravity's Rainbow", 'Thomas Pynchon'),
 ("The Handmaid's Tale", 'Margaret Atwood')]

Approach 2: Skip-Gram Architecture

One-Hot Representation

  • Represent a word as a vector with a single 1, and all other values 0
  • Not very useful – the vector is as long as the vocabulary, and all words are equally far apart
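As a small illustration (with a toy vocabulary of my own choosing), a one-hot encoder – and why one-hot vectors say nothing about similarity:

```python
import numpy as np

# Toy vocabulary – a one-hot vector is as long as the vocabulary
vocab = ['king', 'queen', 'man', 'woman']

def one_hot(word):
    vec = np.zeros(len(vocab))
    vec[vocab.index(word)] = 1.0
    return vec

print(one_hot('queen'))  # [0. 1. 0. 0.]

# Every pair of distinct words has dot product 0,
# so one-hot vectors carry no notion of similarity
print(np.dot(one_hot('king'), one_hot('queen')))  # 0.0
```

With a real vocabulary of, say, 100,000 words, each vector would have 100,000 entries – one 1 and 99,999 zeros.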

Distributed Representation

  • Representation of meaning distributed across multiple values

How to define words as vectors

  • A word is defined by the words that surround it
  • Based on the context
  • What words happen to show up around it
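word2vec learns such context-based vectors with a neural network, but a simple count of surrounding words (a toy sketch of my own, not what word2vec actually does) illustrates the idea that context defines a word:

```python
from collections import Counter

# A toy corpus; in practice this would be millions of sentences
corpus = "the king rules the castle . the queen rules the castle".split()

def context_counts(target, window=1):
    # Count the words appearing within `window` positions of the target
    counts = Counter()
    for i, word in enumerate(corpus):
        if word == target:
            counts.update(corpus[max(0, i - window):i] + corpus[i + 1:i + 1 + window])
    return counts

# 'king' and 'queen' appear in identical contexts,
# so their context-count vectors are identical
print(context_counts('king'))   # Counter({'the': 1, 'rules': 1})
print(context_counts('queen'))  # Counter({'the': 1, 'rules': 1})
```

Words with similar contexts get similar vectors – exactly the intuition skip-gram exploits.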

word2vec

  • A model for generating word vectors

Skip-Gram Architecture

  • Neural network architecture for predicting context words given a target word
    • Given a word – what words show up around it in a context
  • Example
    • Given a target word (input word) – train the network to predict which context words appear around it
    • The weights from the input node (target word) to the hidden layer (5 weights) then give a representation of the word
    • Hence – the word will be represented by a vector
    • The number of hidden nodes determines how big the vector should be (here 5)
  • Idea is as follows
    • Each input word gets weights to the hidden layer
    • The hidden layer will be trained
    • Each word is then represented by its weights to the hidden layer
  • Intuition
    • If two words have similar contexts (they show up in the same places) – then they must be similar – and their representations will have a small distance between them
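Training starts from (target, context) pairs. A minimal sketch (a helper of my own, not from the course code) of how such pairs are generated from a sentence:

```python
def skip_gram_pairs(tokens, window=2):
    # For each target word, pair it with every word within the window
    pairs = []
    for i, target in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

tokens = "the quick brown fox".split()
for target, context in skip_gram_pairs(tokens, window=1):
    print(target, '->', context)
# the -> quick
# quick -> the
# quick -> brown
# brown -> quick
# brown -> fox
# fox -> brown
```

Each pair is one training example: the target word is the input, and the network is trained to predict the context word.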
import numpy as np
from scipy.spatial.distance import cosine
from urllib.request import urlopen

# Read pre-trained word vectors – note open() cannot fetch URLs, so use urlopen
with urlopen('https://raw.githubusercontent.com/LearnPythonWithRune/MachineLearningWithPython/main/files/words.txt') as f:
    words = {}
    for line in f.read().decode().splitlines():
        row = line.split()
        word = row[0]
        vector = np.array([float(x) for x in row[1:]])
        words[word] = vector

def distance(word1, word2):
    return cosine(word1, word2)

def closest_words(word):
    distances = {w: distance(word, words[w]) for w in words}
    return sorted(distances, key=lambda w: distances[w])[:10]

This will amaze you. But first let’s see what it does.

distance(words['king'], words['queen'])

Gives 0.19707422881543946 – a number that does not mean much on its own.

distance(words['king'], words['pope'])

Giving 0.42088794105426874. Again, not much on its own – but notice that king is closer to queen than to pope.
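The numbers above are cosine distances. A minimal sketch (with toy vectors of my own, not the real word vectors) of what scipy’s cosine computes:

```python
import numpy as np
from scipy.spatial.distance import cosine

# Toy 3-dimensional vectors (the real word vectors have many more dimensions)
a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])   # same direction as a
c = np.array([-1.0, 0.0, 1.0])

def cosine_distance(u, v):
    # 1 - (u . v) / (|u| |v|): 0 for same direction, up to 2 for opposite
    return 1 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

print(np.isclose(cosine_distance(a, b), 0.0))           # True: same direction
print(np.isclose(cosine_distance(a, c), cosine(a, c)))  # True: matches scipy
```

Because it measures angle rather than length, cosine distance only cares about the direction the vectors point in.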

closest_words(words['king'] - words['man'] + words['woman'])

This gives the following.

['queen',
 'king',
 'empress',
 'prince',
 'duchess',
 'princess',
 'consort',
 'monarch',
 'dowager',
 'throne']

Wow!

Why do I say wow?

Well, king – man + woman becomes queen.

If that is not amazing, what is?

Want to learn more?

This is part of a FREE 10h Machine Learning course with Python.

  • 15 video lessons – which explain Machine Learning concepts, demonstrate models on real data, introduce projects and show a solution (YouTube playlist).
  • 30 Jupyter Notebooks – with the full code and explanations from the lectures and projects (GitHub).
  • 15 projects – with step-by-step guides to help you structure your solutions, and solutions explained at the end of the video lessons (GitHub).
