Why It’s Great to Master Data Science:
Unlock the potential of data science and embark on a rewarding journey as a skilled data scientist. Master the data science workflow and unleash the power of data-driven insights.
Did you know you check your phone 58 times per day?
Let’s say you are awake 16 hours – that is, you check your phone every 17 minutes during all your waking hours.
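The arithmetic behind that figure can be checked in a few lines. This is only a sketch using the two numbers quoted above; the variable names are made up for illustration.

```python
# Figures quoted in the text above
checks_per_day = 58
waking_hours = 16

# Total minutes awake, then the average gap between two phone checks
waking_minutes = waking_hours * 60
minutes_between_checks = waking_minutes / checks_per_day

print(round(minutes_between_checks, 1))
```

The result is a little over 16.5 minutes, which rounds to the 17 minutes mentioned above.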
Estimates suggest that about 66% of all smartphone users are addicted to their phones.
Does that surprise you?
How do we know that?
Data.
We live in a world where statements like the ones above need not be wild guesses: there is data to back them up.
This tutorial is not about helping your phone addiction – it is about Data Science.
In a world full of data you can learn about almost anything, make your own analysis, and understand a subject better. You can help make data-driven decisions instead of blind guesses.
This is one reason to love Data Science.
The key to success in Data Science is understanding the problem. Get the right question.
What is the problem we try to solve? This will form the Data Science Problem.
Examples
Part of understanding the problem is to assess the situation. This will help you understand your context and your problem better.
In the end, it is all about defining the object of your Data Science research. What are the success criteria?
The key to a successful Data Science project is to understand the object and the success criteria. This will guide you in your search to understand the research better.
Most get Data Science wrong!
At least, at first.
Deadly wrong!
They assume, and you cannot blame them, that Data Science is about knowing the most tools to solve the problem.
This series of tutorials will teach you something different.
The key to being a successful Data Scientist is to understand the Data Science Workflow.
Looking at the above flow, you will realize that most beginners only focus on a narrow aspect of it.
That is a big mistake. The real value is in step 5, where you use the data-driven insights to set measurable goals.
Let’s take an example of what a simple Data Science Workflow could look like.
Now, while this looks straightforward, there can be many iterations back into a previous step. Even at step 5, you can consult the client, realize you need more data, and start another iteration from step 1 to enrich the process again.
To get started with a simple project, we will explore the Portuguese high school student dataset from Kaggle.
It consists of features and targets.
The features are column data for each student. That is, each student is a row in the dataset, and each row has data for each of the features.
The target is what we want to predict from student data.
That is, given a row of features, can we predict the targets?
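The split between features and target can be sketched on a tiny table. The rows below are made up for illustration and are not taken from the real dataset, and treating the final grade G3 as the only target is an assumption made for this sketch.

```python
import pandas as pd

# Tiny made-up sample, NOT rows from the real dataset
students = pd.DataFrame({
    "age": [15, 16, 17, 18],
    "studytime": [2, 1, 3, 2],
    "higher": ["yes", "yes", "no", "yes"],
    "G3": [12, 10, 6, 14],
})

# Assumption for this sketch: the final grade is the target
target_column = "G3"

# Features: every column except the target, one row per student
features = students.drop(columns=[target_column])
# Target: the value we would like to predict for each row
target = students[target_column]

print(features.shape)
print(target.tolist())
```

Each row of `features` lines up with one value in `target`, which is the "given a row of features, predict the target" setup described above.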
Here we will look at a smaller problem.
Yes – we need to explore the data and get ideas on how to help the students to get higher grades.
Now, let’s explore our Data Science Workflow.
We need to understand a bit about the context.
We have an idea about these things, not exact figures, but we have an idea about the age (high school students). This tells us what kind of activities we should propose. If they were kids aged 8-10, we should propose something different.
What is possible? Here your imagination must guide you, together with your rational mind. Also, what is the budget? We cannot propose ideas which are too expensive for a normal high school budget.
Let’s get started with some code, to get acquainted with the data.
import pandas as pd
data = pd.read_csv('https://raw.githubusercontent.com/LearnPythonWithRune/DataScienceWithPython/main/files/student-mat.csv')
print(len(data))
This shows that the dataset has 395 students.
print(data.head())
print(data.columns)
This will show the first 5 rows of the dataset as well as the columns. The columns contain the features and targets.
This step is also about understanding whether the data quality is as expected. We will learn a lot more about this later.
For now, explore the data types of the columns.
print(data.dtypes)
This will print out the data types. We see that some are integers (int64) and others are objects (that is, strings/text in this case).
school object
sex object
age int64
address object
famsize object
Pstatus object
Medu int64
Fedu int64
Mjob object
Fjob object
reason object
guardian object
traveltime int64
studytime int64
failures int64
schoolsup object
famsup object
paid object
activities object
nursery object
higher object
internet object
romantic object
famrel int64
freetime int64
goout int64
Dalc int64
Walc int64
health int64
absences int64
G1 int64
G2 int64
G3 int64
dtype: object
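If you want the numeric and the text columns as two separate lists, pandas offers `select_dtypes`. The snippet below is a minimal sketch on a made-up table (the rows and the `sample` name are illustrative only); the same two calls should work on `data`.

```python
import pandas as pd

# Made-up sample that mixes text and integer columns
sample = pd.DataFrame({
    "school": ["A", "B", "A"],
    "age": [15, 17, 16],
    "higher": ["yes", "no", "yes"],
    "G3": [11, 7, 13],
})

# Columns holding numbers (int64 here)
numeric_columns = sample.select_dtypes(include="number").columns.tolist()
# Everything else, which is text in this sample
text_columns = sample.select_dtypes(exclude="number").columns.tolist()

print(numeric_columns)
print(text_columns)
```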
We can also check whether there are any missing values.
print(data.isnull().any())
The output below is False for every column, which tells us that there is no missing data.
school False
sex False
age False
address False
famsize False
Pstatus False
Medu False
Fedu False
Mjob False
Fjob False
reason False
guardian False
traveltime False
studytime False
failures False
schoolsup False
famsup False
paid False
activities False
nursery False
higher False
internet False
romantic False
famrel False
freetime False
goout False
Dalc False
Walc False
health False
absences False
G1 False
G2 False
G3 False
dtype: bool
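Since this dataset has no gaps, here is a small sketch of what the check looks like when values are missing. The table is made up for illustration, and `isnull().sum()` is used instead of `any()` to count the gaps per column.

```python
import numpy as np
import pandas as pd

# Made-up sample with a few deliberately missing values
sample = pd.DataFrame({
    "age": [15, 16, np.nan, 18],
    "G3": [10, np.nan, np.nan, 14],
    "higher": ["yes", "no", "yes", "yes"],
})

# True/False per cell, summed per column, gives the number of gaps
missing_per_column = sample.isnull().sum()
total_missing = int(missing_per_column.sum())

print(missing_per_column)
print(total_missing)
```

A count per column tells you not only whether data is missing, but also where to look first.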
We are interested in what has an impact on the end grades (G3). We can use correlation for that.
For now, correlation is just a number saying how strongly two columns move together.
A correlation number is between -1 and 1 (both included). If it is close to -1 or 1 (that is, not close to 0), the two columns are correlated.
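To get a feeling for these numbers, here is a minimal sketch on made-up columns: one that rises in lockstep with `hours`, one that falls in lockstep, and one that has no linear relation to it. None of these values come from the student dataset.

```python
import pandas as pd

# Made-up columns chosen to hit the extreme cases
sample = pd.DataFrame({
    "hours": [1, 2, 3, 4, 5],
    "grade_up": [2, 4, 6, 8, 10],    # rises in lockstep with hours
    "grade_down": [10, 8, 6, 4, 2],  # falls in lockstep with hours
    "unrelated": [3, 1, 4, 1, 3],    # no linear relation to hours
})

# Correlation of every column with the hours column
corr_with_hours = sample.corr()["hours"]

print(corr_with_hours.round(2))
```

The lockstep columns give a correlation of 1 and -1, while the unrelated column lands at 0.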
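# Correlation of each numeric column with the end grade G3 (text columns are skipped)
print(data.corr(numeric_only=True)['G3'])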
age -0.161579
Medu 0.217147
Fedu 0.152457
traveltime -0.117142
studytime 0.097820
failures -0.360415
famrel 0.051363
freetime 0.011307
goout -0.132791
Dalc -0.054660
Walc -0.051939
health -0.061335
absences 0.034247
G1 0.801468
G2 0.904868
G3 1.000000
Name: G3, dtype: float64
This shows us two learnings.
First of all, the grades G1, G2, and G3 are highly correlated, while almost none of the others are.
Second, it only considers the numeric features.
But how can we use non-numeric features, you might ask.
Let’s consider the feature higher (wants to take higher education (binary: yes or no)).
print(data.groupby('higher')['G3'].mean())
This gives.
higher
no 6.800
yes 10.608
Name: G3, dtype: float64
This shows that this is a good indicator of whether a student gets good or bad grades. That is, if we assume the question was asked at the beginning of high school, you can say that students answering no will get 6.8, while students answering yes will get 10.6 on average (grades are in the range 0 to 20).
That is a big indicator.
But how many are in each group?
You can get that by.
print(data.groupby('higher')['G3'].count())
Resulting in.
higher
no 20
yes 375
Name: G3, dtype: int64
Now, that is not many. But maybe this is good enough: we have found 20 students whose grades we can really help improve.
Later we will learn more about standard deviation, but for now we leave our analysis at this.
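The mean and the count above can also be produced in one table with `agg`. The sketch below runs on a made-up sample rather than the real dataset, and it adds the standard deviation only as a preview of what comes later.

```python
import pandas as pd

# Made-up sample, NOT the real student data
sample = pd.DataFrame({
    "higher": ["no", "no", "yes", "yes", "yes", "yes"],
    "G3": [6, 8, 10, 12, 9, 13],
})

# One row per group with the mean grade, group size and spread
summary = sample.groupby("higher")["G3"].agg(["mean", "count", "std"])

print(summary)
```

On the real data, the same call would be `data.groupby('higher')['G3'].agg(['mean', 'count', 'std'])`.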
This is about how to present our results. We have learned nothing visual yet, so we will keep it simple.
We cannot do much more than present the findings.
higher mean grades
no 6.800
yes 10.608
higher count
no 20
yes 375
I am sure you can make a nicer PowerPoint presentation than this.
Now this is where we need to find ideas. We have identified 20 students, and now we need to find activities that the high school can use to improve their grades.
This is where I will leave it to your ideas.
How can you measure?
Well, one way is to collect the same data each year and see if the activities have impact.
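A year-over-year comparison could look something like the sketch below. Everything in it is hypothetical: the `year` column, the two years and the grades are invented for illustration, since the dataset used here covers only one collection.

```python
import pandas as pd

# Hypothetical yearly snapshots, invented for illustration
surveys = pd.DataFrame({
    "year": [2023, 2023, 2023, 2024, 2024, 2024],
    "higher": ["no", "no", "yes", "no", "no", "yes"],
    "G3": [6, 8, 11, 8, 10, 11],
})

# Mean end grade per group, with one column per year
mean_by_year = surveys.groupby(["higher", "year"])["G3"].mean().unstack("year")

# Change in the mean grade from the first year to the second
change = mean_by_year[2024] - mean_by_year[2023]

print(mean_by_year)
print(change)
```

A rising mean for the group the activities were aimed at, next to a flat mean for the other group, would be one simple sign that the activities have an impact.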
Now, you can probably do better than I did. Hence, I encourage you to play around with the dataset and find better indicators to get ideas for awesome activities.
Want to learn more about Data Science to become a successful Data Scientist?
In the next lesson you will learn how to Master Data Visualization for 3 Purposes as Data Scientist in this Data Science course.
This is one lesson of a 15 part Expert Data Science Blueprint course with the following resources.