Master the Data Science Workflow Blueprint to Get Measurable Data Driven Impact

What will we cover?

In this tutorial we will cover the Data Science Workflow and…

  • Why Data Science?
  • Understand the Problem as a Data Scientist.
  • The Data Science Workflow
  • Explore it with a Student Grade Prediction problem.

We will use Python and pandas with our initial Data Science problem.

Part 1: Why Data Science?

Did you know you check your phone 58 times per day?

Let’s say you are awake 16 hours – that is, you check your phone every 17 minutes during all your waking hours.

Estimates suggest that 66% of all smartphone users are addicted to their phones.

Does that surprise you?

How do we know that?

Data.

We live in a world where statements like the ones above are not just wild guesses – there is data to confirm them.

This tutorial is not about helping your phone addiction – it is about Data Science.

With a world full of data you can learn just about anything, make your own analyses, and understand things better. You can help make data-driven decisions and avoid blind guesses.

This is one reason to love Data Science.

How did Data Science start?

Part 2: Understanding the problem in Data Science

The key to success in Data Science is understanding the problem and asking the right question.

What is the problem we are trying to solve? This forms the Data Science problem.

Examples

  • Sales figures and call center logs: evaluate a new product
  • Sensor data from multiple sensors: detect equipment failure
  • Customer data + marketing data: better targeted marketing

Part of understanding the problem includes assessing the situation – this will help you understand your context and your problem better.

In the end, it is all about defining the objective of your Data Science research. What are the success criteria?

The key to a successful Data Science project is to understand the objective and the success criteria; they will guide you throughout your research.

Part 3: Data Science Workflow

Most get Data Science wrong!

At least, at first.

Deadly wrong!

They assume – not to blame them – that Data Science is about knowing the most tools to solve the problem.

This series of tutorials will teach you something different.

The key to a successful Data Scientist is to understand the Data Science Workflow.

Data Science Workflow

Looking at the above flow, you will realize that most beginners only focus on a narrow aspect of it.

That is a big mistake – the real value is in step 5, where you use the insights to set measurable, data-driven goals.

Let's take an example of what a simple Data Science Workflow could look like.

  • Step 1
    • Problem: Predict weather tomorrow
    • Data: Time series on Temperature, Air pressure, Humidity, Rain, Wind speed, Wind direction, etc.
    • Import: Collect data from sources
  • Step 2
    • Explore: Data quality
    • Visualize: A great way to understand data
    • Cleaning: Handle missing or faulty data
  • Step 3
    • Features: Select the measurements that matter for the forecast
    • Model: Choose and fit a forecasting model
    • Analyze: Evaluate how well it predicts tomorrow's weather
  • Step 4
    • Present: Weather forecast
    • Visualize: Charts, maps, etc.
    • Credibility: Avoid inaccurate results, overstated confidence, and presenting only part of the findings
  • Step 5
    • Insights: What to wear, impact on outside events, etc.
    • Impact: Sales and weather forecast (umbrella, ice cream, etc.)
    • Main goal: This is what makes Data Science valuable

Now, while this looks straightforward, there can be many iterations back to a previous step. Even at step 5, you can consult the client and realize you need more data, and start another iteration from step 1 to enrich the process again.

Part 4: Student Grade Prediction

To get started with a simple project, we will explore the Portuguese high school student dataset from Kaggle.

It consists of features and targets.

The features are column data for each student. That is, each student is a row in the dataset, and each row has a value for each of the features.

Features

The target is what we want to predict from the student data.

That is, given a row of features, can we predict the target?

Target
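As a small sketch of what that looks like in pandas (assuming the DataFrame data that we load in Step 1 below, and treating the final grade G3 as the target), the features and the target can be separated like this:

# Assumes `data` is the student DataFrame loaded in Step 1 below
features = data.drop(columns=['G1', 'G2', 'G3'])  # the feature columns (G1 and G2 are earlier grades)
target = data['G3']                               # the final grade we want to predict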

Here we will look at a smaller problem.

Problem: Propose activities to improve G3 grades.

Our Goal

  • To guide the school in how to help students get higher grades

Yes – we need to explore the data and get ideas on how to help the students to get higher grades.

Now, let’s explore our Data Science Workflow.

Step 1: Acquire

  • Explore problem
  • Identify data
  • Import data

Get the right questions

  • This forms the data science problem
  • What is the problem

We need to understand a bit about the context.

Understand context

  • Student age?
  • What is possible?
  • What is the budget?

We have an idea about these things – not exact figures, but we know the age group (high school students). This tells us what kind of activities we should propose. If they were kids aged 8-10 years, we would propose something different.

What is possible? Well, let your imagination guide you, together with your rational mind. And what is the budget? We cannot propose ideas that are too expensive for a normal high school budget.

Let’s get started with some code, to get acquainted with the data.

import pandas as pd
# Load the student dataset directly from GitHub
data = pd.read_csv('https://raw.githubusercontent.com/LearnPythonWithRune/DataScienceWithPython/main/files/student-mat.csv')
# Number of rows = number of students
print(len(data))

We will see it has 395 students in the dataset.

print(data.head())
print(data.columns)

This will show the first 5 lines of the dataset as well as the columns. The columns contain the features and targets.

Step 2: Prepare

  • Explore data
  • Visualize ideas
  • Cleaning data

This step is also about understanding whether the data quality is as expected. We will learn a lot more about this later.

For now, let's explore the data types of the columns.

print(data.dtypes)

This will print out the data types. We see that some are integers (int64) while others are objects (that is, strings/text in this case).

school        object
sex           object
age            int64
address       object
famsize       object
Pstatus       object
Medu           int64
Fedu           int64
Mjob          object
Fjob          object
reason        object
guardian      object
traveltime     int64
studytime      int64
failures       int64
schoolsup     object
famsup        object
paid          object
activities    object
nursery       object
higher        object
internet      object
romantic      object
famrel         int64
freetime       int64
goout          int64
Dalc           int64
Walc           int64
health         int64
absences       int64
G1             int64
G2             int64
G3             int64
dtype: object

And let's check whether there are any missing values.

print(data.isnull().any())

The output below tells us (all the False values) that there is no missing data.

school        False
sex           False
age           False
address       False
famsize       False
Pstatus       False
Medu          False
Fedu          False
Mjob          False
Fjob          False
reason        False
guardian      False
traveltime    False
studytime     False
failures      False
schoolsup     False
famsup        False
paid          False
activities    False
nursery       False
higher        False
internet      False
romantic      False
famrel        False
freetime      False
goout         False
Dalc          False
Walc          False
health        False
absences      False
G1            False
G2            False
G3            False
dtype: bool

Step 3: Analyze

  • Feature selection
  • Model selection
  • Analyze data

We are interested in seeing what has an impact on the final grade (G3). We can use correlation for that.

For now, think of correlation simply as a number telling us whether two quantities are related.

A correlation coefficient lies between -1 and 1 (both included). If it is close to -1 or 1 (that is, not close to 0), the two quantities are correlated.

print(data.corr(numeric_only=True)['G3'])

This gives the correlations of the numeric columns with the final grade G3:

age          -0.161579
Medu          0.217147
Fedu          0.152457
traveltime   -0.117142
studytime     0.097820
failures     -0.360415
famrel        0.051363
freetime      0.011307
goout        -0.132791
Dalc         -0.054660
Walc         -0.051939
health       -0.061335
absences      0.034247
G1            0.801468
G2            0.904868
G3            1.000000
Name: G3, dtype: float64

This shows us two learnings.

First of all, the grades G1, G2, and G3 are highly correlated, while almost none of the others are.

Second, it only considers the numeric features.

But how can we use non-numeric features, you might ask?

Let’s consider the feature higher (wants to take higher education (binary: yes or no)).

print(data.groupby('higher')['G3'].mean())

This gives:

higher
no      6.800
yes    10.608
Name: G3, dtype: float64

This shows that it is a good indicator of whether a student gets good or bad grades. That is, assuming the question was asked at the beginning of high school, students answering no get 6.8 on average, while students answering yes get 10.6 on average (grades are in the range 0-20).

That is a big indicator.

But how many are in each group?

You can get that by:

print(data.groupby('higher')['G3'].count())

Resulting in:

higher
no      20
yes    375
Name: G3, dtype: int64

Now, that is not many. But maybe it is good enough: we have found 20 students whom we can really help to improve their grades.

Later we will learn more about standard deviation, but for now we leave our analysis at this.
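If you are curious already, the spread within each group can be checked with the same groupby pattern – just a quick peek, since we will cover standard deviation properly later:

# Standard deviation of the final grade within each 'higher' group
print(data.groupby('higher')['G3'].std())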

Step 4: Report

  • Present findings
  • Visualize results
  • Credibility counts

This is about how to present our results. We have not covered any visualization yet, so we will keep it simple.

We cannot do much more than present the findings.

higher    mean G3
no        6.800
yes       10.608

higher    count
no        20
yes       375
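If you want to build this summary directly from the data, a small sketch using pandas' groupby and agg on the same data DataFrame does the job:

# Mean final grade and number of students per answer to 'higher'
print(data.groupby('higher')['G3'].agg(['mean', 'count']))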

I am sure you can make a nicer PowerPoint presentation than this.

Step 5: Actions

  • Use insights
  • Measure impact
  • Main goal

Now this is where we need to find ideas. We have identified 20 students, and now we need to find activities that the high school can offer to help them improve.

This is where I leave it to your ideas.

How can you measure?

Well, one way is to collect the same data each year and see if the activities have an impact.
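As a sketch of what such a measurement could look like, assume we collect the same columns again next year into a hypothetical file student-mat-next-year.csv (this file does not exist in the course material – it is only an illustration):

import pandas as pd

# Hypothetical follow-up data collected one year after the activities started
data_before = pd.read_csv('student-mat.csv')            # this year's survey
data_after = pd.read_csv('student-mat-next-year.csv')   # next year's survey (hypothetical file)

# Compare the mean final grade of the group we targeted (students answering 'no' to higher)
mean_before = data_before[data_before['higher'] == 'no']['G3'].mean()
mean_after = data_after[data_after['higher'] == 'no']['G3'].mean()
print(f"Mean G3 for the target group: {mean_before:.2f} -> {mean_after:.2f}")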

Now, you can probably do better than I did. Hence, I encourage you to play around with the dataset and find better indicators to get ideas for awesome activities.

Want to learn more?

Want to learn more about Data Science to become a successful Data Scientist?

This is one lesson of a 15 part Expert Data Science Blueprint course with the following resources.

  • 15 video lessons – covering the Data Science Workflow and concepts, demonstrating everything on real data, introducing projects and showing a solution (YouTube video).
  • 30 Jupyter Notebooks – with the full code and explanations from the lectures and projects (GitHub).
  • 15 projects – structured with the Data Science Workflow and a solution explained at the end of the video lessons (GitHub).

Learn Information Extraction with Skip-Gram Architecture

What will we cover?

  • What is Information Extraction
  • Extract knowledge from patterns
  • Word representation
  • Skip-Gram architecture
  • To see how words relate to each other (this is surprising)

What is Information Extraction?

Information Extraction is the task of automatically extracting structured information from unstructured and/or semi-structured machine-readable documents (wiki).

Let’s try some different approaches.

Approach 1: Extract Knowledge from Patterns

Given knowledge about data that fits together, try to find the patterns it appears in.

This is actually a powerful approach. Assume you know that Amazon was founded in 1994 and Facebook was founded in 2004.

A pattern could be "When {company} was founded in {year},".

Let’s try this in real life.

import pandas as pd
import re
from urllib.request import urlopen

# Read a knowledge base (here a CSV file with a few known title-author pairs)
books = pd.read_csv('https://raw.githubusercontent.com/LearnPythonWithRune/MachineLearningWithPython/main/files/books.csv', header=None)
# Convert to a list of [title, author] pairs
book_list = books.values.tolist()
# Read some content (here a web page); open() cannot read URLs, so we use urlopen
with urlopen('https://raw.githubusercontent.com/LearnPythonWithRune/MachineLearningWithPython/main/files/penguin.html') as f:
    corpus = f.read().decode('utf-8')
corpus = corpus.replace('\n', ' ').replace('\t', ' ')
# Slide a window over the corpus and look where our known pairs appear together
for val1, val2 in book_list:
    print(val1, '-', val2)
    for i in range(0, len(corpus) - 100, 20):
        pattern = corpus[i:i + 100]
        if val1 in pattern and val2 in pattern:
            print('-:', pattern)

This gives the following.

1984 - George Orwell
-: ge-orwell-with-a-foreword-by-thomas-pynchon/">1984</a></h2>   <h2 class="author">by George Orwell</h
-: eword-by-thomas-pynchon/">1984</a></h2>   <h2 class="author">by George Orwell</h2>    <div class="de
-: hon/">1984</a></h2>   <h2 class="author">by George Orwell</h2>    <div class="desc">We were pretty c
The Help - Kathryn Stockett
-: /the-help-by-kathryn-stockett/">The Help</a></h2>   <h2 class="author">by Kathryn Stockett</h2>    <
-: -stockett/">The Help</a></h2>   <h2 class="author">by Kathryn Stockett</h2>    <div class="desc">Thi

This gives you an idea of some patterns.

from pprint import pprint

# Turn the observed pattern into a regular expression with two capture groups:
# one for the title and one for the author (each at most 50 characters)
prefix = re.escape('/">')
middle = re.escape('</a></h2>   <h2 class="author">by ')
suffix = re.escape('</h2>    <div class="desc">')
regex = f"{prefix}(.{{0,50}}?){middle}(.{{0,50}}?){suffix}"

results = re.findall(regex, corpus)
pprint(results)

Giving the following pattern matches with new knowledge.

[('War and Peace', 'Leo Tolstoy'),
 ('Song of Solomon', 'Toni Morrison'),
 ('Ulysses', 'James Joyce'),
 ('The Shadow of the Wind', 'Carlos Ruiz Zafon'),
 ('The Lord of the Rings', 'J.R.R. Tolkien'),
 ('The Satanic Verses', 'Salman Rushdie'),
 ('Don Quixote', 'Miguel de Cervantes'),
 ('The Golden Compass', 'Philip Pullman'),
 ('Catch-22', 'Joseph Heller'),
 ('1984', 'George Orwell'),
 ('The Kite Runner', 'Khaled Hosseini'),
 ('Little Women', 'Louisa May Alcott'),
 ('The Cloud Atlas', 'David Mitchell'),
 ('The Fountainhead', 'Ayn Rand'),
 ('The Picture of Dorian Gray', 'Oscar Wilde'),
 ('Lolita', 'Vladimir Nabokov'),
 ('The Help', 'Kathryn Stockett'),
 ("The Liar's Club", 'Mary Karr'),
 ('Moby-Dick', 'Herman Melville'),
 ("Gravity's Rainbow", 'Thomas Pynchon'),
 ("The Handmaid's Tale", 'Margaret Atwood')]

Approach 2: Skip-Gram Architecture

One-Hot Representation

  • Represent a word as a vector with a single 1 and all other values 0
  • Not very useful on its own: the vectors grow with the vocabulary and say nothing about how words relate to each other

Distributed Representation

  • Representation of meaning distributed across multiple values (see the sketch below)
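To make the difference concrete, here is a tiny illustrative sketch – the vocabulary and the distributed values are made up, not trained:

import numpy as np

# One-hot: one dimension per word in the vocabulary, a single 1, all other values 0
vocabulary = ['king', 'queen', 'apple']
one_hot = {word: np.eye(len(vocabulary))[i] for i, word in enumerate(vocabulary)}
print(one_hot['king'])  # [1. 0. 0.] - every word is equally far from every other word

# Distributed: meaning spread across several dimensions (values here are made up)
distributed = {
    'king':  np.array([0.8, 0.3, 0.1]),
    'queen': np.array([0.7, 0.4, 0.1]),
    'apple': np.array([0.1, 0.0, 0.9]),
}
# Similar words ('king' and 'queen') end up close to each other in this space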

How to define words as vectors

  • A word is defined by the words that surround it
  • Based on the context
  • What words happen to show up around it

word2vec

  • model for generating word vectors

Skip-Gram Architecture

  • Neural network architecture for predicting context words given a target word
    • Given a word – what words show up around it in a context
  • Example
    • Given a target word (input word) – train the network to predict which context words appear around it
    • Then the weights from input node (target word) to hidden layer (5 weights) give a representation
    • Hence – the word will be represented by a vector
    • The number of hidden nodes represent how big the vector should be (here 5)
  • Idea is as follows
    • Each input word will get weights to the hidden layers
    • The hidden layers will be trained
    • Then each word will be represented as the weights of hidden layers
  • Intuition
    • If two words have similar contexts (they show up in the same places), they must be similar – and their vector representations will have a small distance from each other (see the sketch below)
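Before looking at pre-trained vectors, here is a minimal sketch of the data a Skip-Gram model trains on: (target, context) pairs taken from a sliding window over a sentence. The sentence and the window size of 2 are arbitrary choices for illustration:

# Generate (target, context) training pairs from a toy sentence
sentence = "the king spoke to the queen in the castle".split()
window = 2  # how many words on each side count as context
pairs = []
for i, target in enumerate(sentence):
    for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
        if i != j:
            pairs.append((target, sentence[j]))
print(pairs[:6])
# [('the', 'king'), ('the', 'spoke'), ('king', 'the'), ('king', 'spoke'), ('king', 'to'), ('spoke', 'the')]

Training a network on such pairs is what produces the word vectors. Below we instead load a set of pre-trained word vectors and define helpers for measuring the distance between them.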
import numpy as np
from scipy.spatial.distance import cosine
from urllib.request import urlopen

# Load pre-trained word vectors: one word per line followed by its vector values
# (open() cannot read URLs, so we use urlopen)
words = {}
with urlopen('https://raw.githubusercontent.com/LearnPythonWithRune/MachineLearningWithPython/main/files/words.txt') as f:
    for line in f.read().decode('utf-8').splitlines():
        row = line.split()
        word = row[0]
        vector = np.array([float(x) for x in row[1:]])
        words[word] = vector

def distance(word1, word2):
    # Cosine distance between two word vectors (0 = identical direction)
    return cosine(word1, word2)

def closest_words(word):
    # The 10 words whose vectors are closest to the given vector
    distances = {w: distance(word, words[w]) for w in words}
    return sorted(distances, key=lambda w: distances[w])[:10]

This will amaze you. But first let’s see what it does.

distance(words['king'], words['queen'])

Gives 0.19707422881543946 – a number that does not mean much on its own.

distance(words['king'], words['pope'])

Giving 0.42088794105426874. Again, not much value on its own.

closest_words(words['king'] - words['man'] + words['woman'])

Giving:

['queen',
 'king',
 'empress',
 'prince',
 'duchess',
 'princess',
 'consort',
 'monarch',
 'dowager',
 'throne']

Wow!

Why do I say wow?

Well, king – man + woman becomes queen.

If that is not amazing, what is?

Want to learn more?

This is part of a FREE 10h Machine Learning course with Python.

  • 15 video lessons – which explain Machine Learning concepts, demonstrate models on real data, introduce projects and show a solution (YouTube playlist).
  • 30 Jupyter Notebooks – with the full code and explanations from the lectures and projects (GitHub).
  • 15 projects – with step guides to help you structure your solutions, and a solution explained at the end of the video lessons (GitHub).

Implement a Term Frequency by Inverse Document Frequency (TF-IDF) with NLTK

What will we cover?

  • Learn what Information Retrieval is
  • Topic modeling documents
  • How to use Term Frequency and understand the limitations
  • Implement Term Frequency by Inverse Document Frequency (TF-IDF)

Step 1: What is Information Retrieval (IR)?

The task of finding relevant documents in response to a user query. Web search engines are the most visible IR applications (wiki).

Topic modeling is a model for discovering the topics for a set of documents, e.g., it can provide us with methods to organize, understand and summarize large collections of textual information.

Topic modeling can be described as a method for finding a group of words that best represent the information.

Step 2: Approach 1: Term Frequency

The number of times a term occurs in a document is called its term frequency (wiki).

tf(t, d) = f(t, d): the number of times the term t occurs in the document d.

There are other ways to define term frequency (see wiki).

Let’s try to write some code to explore this concept.

To follow this code you need to download the files from here: GitHub link. You can also download them as a zip file here: Zip-download.

import os
import nltk
import math
corpus = {}
# Count the term frequencies
for filename in os.listdir('files/holmes/'):
    with open(f'files/holmes/{filename}') as f:
        content = [word.lower() for word in nltk.word_tokenize(f.read()) if word.isalpha()]
        
        freq = {word: content.count(word) for word in set(content)}
        
        corpus[filename] = freq
# Sort the terms of each document by frequency (most frequent first)
for filename in corpus:
    corpus[filename] = sorted(corpus[filename].items(), key=lambda x: x[1], reverse=True)
# Print the five most frequent terms per document
for filename in corpus:
    print(filename)
    for word, score in corpus[filename][:5]:
        print(f'  {word}: {score}')

This will output (only sample output).

speckled.txt
  the: 600
  and: 281
  of: 276
  a: 252
  i: 233
face.txt
  the: 326
  i: 298
  and: 226
  to: 185
  a: 173

We see that the words used most in each document are so-called stop-words.

  • words that have little meaning on their own (wiki)
  • Examples: am, by, do, is, which, ….
  • Student exercise: Remove the stop-words and see the result (HINT: NLTK has a list of stop-words)

What you will discover is that even if you remove all the stop-words, you still will not get anything very useful. Some words are simply more common than others.
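A minimal sketch of that filtering on a toy sentence could look like this – it assumes you have downloaded NLTK's resources once with nltk.download('punkt') and nltk.download('stopwords'):

import nltk
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))
text = "The quick brown fox jumps over the lazy dog"
tokens = [word.lower() for word in nltk.word_tokenize(text) if word.isalpha()]

# Keep only the tokens that are not stop-words
content_words = [word for word in tokens if word not in stop_words]
print(content_words)  # ['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog']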

Step 3: Approach 2: TF-IDF

TF-IDF is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus. (wiki)

Inverse Document Frequency

  • Measure of how common or rare a word is across documents

idf(t, D) = log( N / |{d ∈ D : t ∈ d}| ) = log(Total Documents / Number of Documents Containing the term)

  • D: all documents in the corpus
  • N: the total number of documents in the corpus, N = |D|

TF-IDF

Ranking of which words are important in a document, obtained by multiplying the Term Frequency (TF) by the Inverse Document Frequency (IDF).

tf-idf(t, d) = tf(t, d) ⋅ idf(t, D)

Let’s make a small example.

doc1 = "This is the sample of the day".split()
doc2 = "This is another sample of the day".split()
corpus = [doc1, doc2]
tf1 = {word: doc1.count(word) for word in set(doc1)}
tf2 = {word: doc2.count(word) for word in set(doc2)}
term = 'another'
ids = 2/sum(term in doc for doc in corpus)
tf1.get(term, 0)*ids, tf2.get(term, 0)*ids

Want to learn more?

If you watch the YouTube video you will see how to do it for a bigger corpus of files.

This is part of a FREE 10h Machine Learning course with Python.

  • 15 video lessons – which explain Machine Learning concepts, demonstrate models on real data, introduce projects and show a solution (YouTube playlist).
  • 30 Jupyter Notebooks – with the full code and explanations from the lectures and projects (GitHub).
  • 15 projects – with step guides to help you structure your solutions, and a solution explained at the end of the video lessons (GitHub).