Mastering Information Extraction offers several advantages in the field of natural language processing and text analysis:
In this tutorial on Information Extraction, we will cover the following topics:
Information Extraction is the task of automatically extracting structured information from unstructured and/or semi-structured machine-readable documents (wiki).
Let’s try some different approaches.
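The first approach starts from knowledge you already have: given pairs of values that belong together, look for where they appear together in a text and try to find patterns.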
This is actually a powerful approach. Assume you know that Amazon was founded in 1994 and Facebook was founded in 2004.
A pattern could be “When {company} was founded in {year},”
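One way to make such a template executable is to compile it into a regular expression with one named group per placeholder. The sketch below is only an illustration: the helper name template_to_regex and the sample sentence are made up for this example and are not part of the lesson's code.

```python
import re

def template_to_regex(template):
    """Compile a template like 'When {company} was founded in {year},' into a regex.

    Hypothetical helper: literal text is escaped and each {name} placeholder
    becomes a lazy named capture group.
    """
    # With one capture group, re.split alternates: literal, name, literal, name, ...
    parts = re.split(r'\{(\w+)\}', template)
    regex = ''
    for index, part in enumerate(parts):
        if index % 2 == 0:
            regex += re.escape(part)
        else:
            regex += f'(?P<{part}>.+?)'
    return re.compile(regex)

pattern = template_to_regex('When {company} was founded in {year},')

# Made-up sample text, for illustration only
text = ('When Amazon was founded in 1994, it sold books. '
        'When Facebook was founded in 2004, it was a student site.')

facts = [(m.group('company'), m.group('year')) for m in pattern.finditer(text)]
print(facts)  # [('Amazon', '1994'), ('Facebook', '2004')]
```

The lazy groups (.+?) keep each match as short as possible, so the company name stops at " was founded in " and the year stops at the comma.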
Let’s try this in real life.
import re
from urllib.request import urlopen

import pandas as pd

# Reading a knowledge base (a small csv file with known title/author pairs)
books = pd.read_csv('https://raw.githubusercontent.com/LearnPythonWithRune/MachineLearningWithPython/main/files/books.csv', header=None, dtype=str)
# Convert to a list
book_list = books.values.tolist()
# Read some content (here a web page); open() cannot read a URL, so use urlopen
with urlopen('https://raw.githubusercontent.com/LearnPythonWithRune/MachineLearningWithPython/main/files/penguin.html') as f:
    corpus = f.read().decode('utf-8')
corpus = corpus.replace('\n', ' ').replace('\t', ' ')
# Slide a 100-character window over the corpus and print every window
# that contains both known values, to help us spot patterns
for val1, val2 in book_list:
    print(val1, '-', val2)
    for i in range(0, len(corpus) - 100, 20):
        pattern = corpus[i:i + 100]
        if val1 in pattern and val2 in pattern:
            print('-:', pattern)
This gives the following.
1984 - George Orwell
-: ge-orwell-with-a-foreword-by-thomas-pynchon/">1984</a></h2> <h2 class="author">by George Orwell</h
-: eword-by-thomas-pynchon/">1984</a></h2> <h2 class="author">by George Orwell</h2> <div class="de
-: hon/">1984</a></h2> <h2 class="author">by George Orwell</h2> <div class="desc">We were pretty c
The Help - Kathryn Stockett
-: /the-help-by-kathryn-stockett/">The Help</a></h2> <h2 class="author">by Kathryn Stockett</h2> <
-: -stockett/">The Help</a></h2> <h2 class="author">by Kathryn Stockett</h2> <div class="desc">Thi
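This gives you an idea of a pattern: in every match, the same HTML surrounds the title and the author. We can capture that with a regular expression.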
prefix = re.escape('/">')
middle = re.escape('</a></h2> <h2 class="author">by ')
suffix = re.escape('</h2> <div class="desc">')
regex = f"{prefix}(.{{0,50}}?){middle}(.{{0,50}}?){suffix}"
results = re.findall(regex, corpus)
for result in results:
    print(result)
This gives the following matches, most of which are new knowledge.
[('War and Peace', 'Leo Tolstoy'),
('Song of Solomon', 'Toni Morrison'),
('Ulysses', 'James Joyce'),
('The Shadow of the Wind', 'Carlos Ruiz Zafon'),
('The Lord of the Rings', 'J.R.R. Tolkien'),
('The Satanic Verses', 'Salman Rushdie'),
('Don Quixote', 'Miguel de Cervantes'),
('The Golden Compass', 'Philip Pullman'),
('Catch-22', 'Joseph Heller'),
('1984', 'George Orwell'),
('The Kite Runner', 'Khaled Hosseini'),
('Little Women', 'Louisa May Alcott'),
('The Cloud Atlas', 'David Mitchell'),
('The Fountainhead', 'Ayn Rand'),
('The Picture of Dorian Gray', 'Oscar Wilde'),
('Lolita', 'Vladimir Nabokov'),
('The Help', 'Kathryn Stockett'),
("The Liar's Club", 'Mary Karr'),
('Moby-Dick', 'Herman Melville'),
("Gravity's Rainbow", 'Thomas Pynchon'),
("The Handmaid's Tale", 'Margaret Atwood')]
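In the example above, the prefix, middle, and suffix were read off the output by eye. As a rough sketch of how that step could be automated, the code below derives the three strings from the known pairs and then reuses the same kind of regular expression. The function induce_pattern, the context parameter, and the small HTML snippet are assumptions made for this illustration; the sketch also assumes that the first occurrence of each known title is the one followed by its author, which does not hold for every real page.

```python
import os
import re

def induce_pattern(corpus, known_pairs, context=5):
    """Derive (prefix, middle, suffix) strings shared by all known pairs.

    Hypothetical helper. Assumes the first occurrence of each first value
    is followed, after some middle text, by its second value.
    """
    befores, middles, afters = [], [], []
    for first, second in known_pairs:
        start = corpus.index(first)
        end = corpus.index(second, start + len(first))
        befores.append(corpus[:start])
        middles.append(corpus[start + len(first):end])
        afters.append(corpus[end + len(second):])
    if len(set(middles)) != 1:
        raise ValueError('known pairs are not separated by the same text')
    # Longest common ending of the text before each pair (compare reversed strings)
    common_before = os.path.commonprefix([b[::-1] for b in befores])[::-1]
    # Longest common beginning of the text after each pair
    common_after = os.path.commonprefix(afters)
    # Keep only a few characters of context so the pattern does not overfit
    return common_before[-context:], middles[0], common_after[:context]

# Made-up page content, for illustration only
html = (
    '<li><b>Dune</b> by <i>Frank Herbert</i></li>'
    '<li><b>Emma</b> by <i>Jane Austen</i></li>'
    '<li><b>Ulysses</b> by <i>James Joyce</i></li>'
)
known = [('Dune', 'Frank Herbert'), ('Emma', 'Jane Austen')]

prefix, middle, suffix = induce_pattern(html, known)
regex = f"{re.escape(prefix)}(.{{0,50}}?){re.escape(middle)}(.{{0,50}}?){re.escape(suffix)}"
pairs = re.findall(regex, html)
print(pairs)
```

Two known pairs are enough here to recover a third pair that was not in the knowledge base. Limiting the prefix and suffix to a few characters is a design choice: the full common context would include the start of the next list item, so the last entry on the page would no longer match.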
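A second approach uses word vectors: every word is represented by a vector of numbers, and words with related meanings have vectors that point in similar directions.
from urllib.request import urlopen

import numpy as np
from scipy.spatial.distance import cosine

# Load the word vectors: each line holds a word followed by its vector components.
# open() cannot read a URL, so use urlopen and decode each line.
words = {}
with urlopen('https://raw.githubusercontent.com/LearnPythonWithRune/MachineLearningWithPython/main/files/words.txt') as f:
    for line in f:
        row = line.decode('utf-8').split()
        word = row[0]
        vector = np.array([float(x) for x in row[1:]])
        words[word] = vector

def distance(word1, word2):
    # Cosine distance between two word vectors
    return cosine(word1, word2)

def closest_words(word):
    # The 10 words whose vectors are closest to the given vector
    distances = {w: distance(word, words[w]) for w in words}
    return sorted(distances, key=lambda w: distances[w])[:10]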
This will amaze you. But first let’s see what it does.
distance(words['king'], words['queen'])
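Gives 0.19707422881543946. This is the cosine distance between the two vectors: 0 means they point in the same direction, and larger values mean the words are less related. On its own, the number does not tell us much.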
distance(words['king'], words['pope'])
Giving 0.42088794105426874. So king is closer to queen than to pope, but again, a single number like this is not of much value.
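To see what these numbers measure, here is a minimal sketch of the cosine distance written out in plain Python, using the same definition as SciPy's cosine function (1 minus the cosine of the angle between the vectors). The function name cosine_distance and the tiny example vectors are made up for illustration.

```python
import math

def cosine_distance(u, v):
    """Return 1 - (u . v) / (|u| * |v|) for two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1 - dot / (norm_u * norm_v)

# Same direction: distance is (up to rounding) 0
same_direction = cosine_distance([1, 2], [2, 4])
# Perpendicular vectors: distance is 1
orthogonal = cosine_distance([1, 0], [0, 1])
# Opposite directions: distance is 2
opposite = cosine_distance([1, 0], [-1, 0])
print(same_direction, orthogonal, opposite)
```

Only the direction of the vectors matters, not their length, which is why [1, 2] and [2, 4] end up at distance 0.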
closest_words(words['king'] - words['man'] + words['woman'])
This gives:
['queen',
'king',
'empress',
'prince',
'duchess',
'princess',
'consort',
'monarch',
'dowager',
'throne']
Wow!
Why do I say wow?
Well, king - man + woman becomes queen: simple arithmetic on the word vectors captures the relationship between the words.
Isn't that amazing?
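The same arithmetic can be followed by hand on a toy example. The three-dimensional vectors below are invented for illustration (roughly: royalty, male, female) and have nothing to do with the real words.txt file; real word vectors are learned from text and have many more dimensions.

```python
import math

# Hand-made toy vectors (royalty, male, female); illustration only
toy = {
    'king': (1.0, 1.0, 0.0),
    'queen': (1.0, 0.0, 1.0),
    'man': (0.0, 1.0, 0.0),
    'woman': (0.0, 0.0, 1.0),
    'prince': (0.5, 1.0, 0.0),
    'princess': (0.5, 0.0, 1.0),
}

def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1 - dot / (norm_u * norm_v)

# king - man + woman, computed component by component
target = tuple(k - m + w for k, m, w in zip(toy['king'], toy['man'], toy['woman']))

# Rank all toy words by their distance to the target vector
closest = sorted(toy, key=lambda w: cosine_distance(target, toy[w]))[:3]
print(target)   # (1.0, 0.0, 1.0)
print(closest)  # ['queen', 'princess', 'woman']
```

Subtracting man removes the "male" component from king and adding woman adds the "female" one, which leaves the vector for queen.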
This was the last lesson of the 15 machine learning projects.
This is part of a FREE 10h Machine Learning course with Python.