Pie charts are one of the most powerful visualizations when presenting them. With a few tricks you can make them look professional with a free tool like Matplotlib.
In the end of this tutorial you will know how to make pie charts and customize it even further.
Basic Pie Chart
First you need to make a basic Pie chart with matplotlib.
Actually Data Visualization is an important skill to understand and present data.
This is a key skill in Data Science. If you like to learn more then check my free Expert Data Science Blueprint course with the following resources.
15 video lessons – covers the Data Science Workflow and concepts, demonstrates everything on real data, introduce projects and shows a solution (YouTube video).
30 JuPyter Notebooks – with the full code and explanation from the lectures and projects (GitHub).
15 projects – structured with the Data Science Workflow and a solution explained in the end of video lessons (GitHub).
You see that data is not totally scattered all over, but is not fully correlated either. This means, that there is come weak correlation of the data and it is not fully independent of each other.
#4 Box Plot
One way to understand data better is by a box plot. It might need a bit of understanding of simple statistics.
Let’s first take a look at it.
data.plot.box()
The box plot shows the following.
To understand what outliers, min, median, max, and so forth means, I would suggest you read this simple statistic guide.
An area plot can show you the data in a great way to see how the values follow each other in a visual easy way to get an understanding of values, correlation, and missing data.
data.plot.area(figsize=(12,4), subplots=True)
#6 Bar plots
Bar plots can be useful, but often when the data is more limited.
Here you see a bar plot of the first 15 rows of data.
data.iloc[:15].plot.bar()
#7 Histograms for single column
Histograms will show you what data is most common. It shows the frequencies of data divided into bins. By default there are 10 bins of data.
It is an amazing tool to get a fast view of the number of occurrences of each data range.
Here first for an isolated station.
data['station_paris'].plot.hist()
#8 Histograms for multiple columns
Then for all three stations, where you see it with transparency (alpha).
data.plot.hist(alpha=.5)
#9 Pie
Pie charts are very powerful, when you want to show a division of data.
How many percentage belong to each category.
Here you see the mean value of each station.
data.mean().plot.pie()
#10 Scatter Matrix Plot
This is a great tool for showing data for combined in all possible ways. This will show you correlations and how data is distributed.
You need to import an additional library, but it gives you fast understanding of data.
from pandas.plotting import scatter_matrix
scatter_matrix(data, alpha=0.2, figsize=(6, 6))
#11 Secondary y-axis
Finally, sometimes you want two plots on the same chart. The problem can be, that the two plots have very different ranges. hence, you would like to have two different y-axes, with different ranges.
This will enable you to have plots on the same chart with different ranges.
Want to learn more about Data Science to become a successful Data Scientist?
Then check my free Expert Data Science Blueprint course with the following resources.
15 video lessons – covers the Data Science Workflow and concepts, demonstrates everything on real data, introduce projects and shows a solution (YouTube video).
30 JuPyter Notebooks – with the full code and explanation from the lectures and projects (GitHub).
15 projects – structured with the Data Science Workflow and a solution explained in the end of video lessons (GitHub).
Everybody likes the pandas data structure DataFrame, but most miss out on what powerful methods it provides.
pandas is a huge module, which makes it difficult to master. Most just use the data structure (DataFrame), without utilizing the power of the methods. In this tutorial you will learn the 15 most useful shortcut that will help you when working with data in pandas data structures.
#1 groupby
The groupby method involves some combination of splitting the object, applying a function, and combining the result.
Wow. That sounds complex. But it is not. It can be used to group large amounts of data and compute operations on these groups.
Best way to learn is to see some example.
import pandas as pd
data = {'Items': ['Apple','Orange', 'Pear', 'Orange', 'Apple'],
'Price': [12, 5, 3, 7, 24]}
df = pd.DataFrame(data)
This results in this DataFrame.
The groupby method can group the items together, and apply a function. Let’s try it here.
df.groupby(['Items']).mean()
This will result in this output.
As you see, it has grouped the Apples, Oranges, and the Pears together and for the price column, it has applied the mean() function on the values.
Hence, the Apple has value 18, as it is the mean of 12 and 24 ((12 + 24)/2). Similar, for Orange and Pear.
#2 memory_usage()
We get more and more data and our project get bigger and bigger. At one point you will need to analyze how much memory your data is using.
What memory_usage does, is, it returns the memory usage of each column in the DataFrame. Sometimes, the data type of a column is object, what that means is, that it is pointing to another object. To get the data usage of these objects included, you need to use the deep=True argument.
Let’s try both, to see the difference.
import pandas as pd
dtypes = ['int64', 'float64', 'complex128', 'object', 'bool']
data = dict([(t, np.ones(shape=1000, dtype=int).astype(t)) for t in
dtypes])
df = pd.DataFrame(data)
print(df.head())
Then we apply the clip, which will ensure the values below -2 are replaced with -2, and values above 5, are replaced with 5. It clips the values.
print(df.clip(-2, 5))
#4 corr()
The correlation between the values in a column can be calculate with corr(). There are different methods to use: Pearson, Kendall, and Spearman. By default it uses the Pearson method, which will do fine giving you an idea if columns are correlated.
I often use it also in a combination with sum(), which will then tell how many rows in each column are missing. This is interesting to get an idea about the quality of the dataset.
print(df.isna().sum())
age 1
born 1
name 0
toy 1
dtype: int64
#10 interpolation()
On the subject of missing values, what to do? Well, there are many options, but one simple can be to interpolate the values.
import pandas as pd
import numpy as np
s = pd.Series([0, 1, np.nan, 3])
This gives the following series.
0 0.0
1 1.0
2 NaN
3 3.0
dtype: float64
Then you can interpolate and get the value between them.
print(s.interpolate())
0 0.0
1 1.0
2 2.0
3 3.0
dtype: float64
This is just one way to deal with it. Dealing with missing values is a big subject. To learn more read this tutorial on the subject.
Here we see we have two rows with 4 and 0 (in that order), while the other rows have unique values.
print(df.value_counts())
num_legs num_wings
4 0 2
2 2 1
6 0 1
dtype: int64
We see there are two rows with 4 and 0, and one of the other rows.
Bonus: unique()
Wanted the unique elements in your Series?
Here you go.
import pandas as pd
s = pd.Series([2, 1, 3, 3], name='A')
print(s.unique())
This will give the unique elements.
array([2, 1, 3])
Want to learn more?
Want to learn more about Data Science to become a successful Data Scientist?
Then check my free Expert Data Science Blueprint course with the following resources.
15 video lessons – covers the Data Science Workflow and concepts, demonstrates everything on real data, introduce projects and shows a solution (YouTube video).
30 JuPyter Notebooks – with the full code and explanation from the lectures and projects (GitHub).
15 projects – structured with the Data Science Workflow and a solution explained in the end of video lessons (GitHub).
When it comes to creating a good Data Science Project you will need to ensure you cover a great deal of aspects. This template will show you what to cover and where to find more information on a specific topic.
The common pitfall for most junior Data Scientist is to focus on the very technical part of the Data Science Workflow. To add real value to the clients you need to focus on more steps, which are often neglected.
This guide will walk you through all steps and elaborate and link to in-depth content if you need more explanations.
Step 1: Acquire
Explore problem
Identify data
Import data
Step 1.a: Define Problem
If you are making a hoppy project, there might not be a definition of what you are trying to solve. But it is always good practice to start with it. Otherwise, you will most likely just do what you usually do and feel comfortable about. Try to sit down and figure it out.
It should be clear, that this step is before you have the data. That said, it often happens that a company has data and doesn’t know what to use it for.
Still, it all starts by defining a problem.
Here are some guidelines.
When defining a problem, don’t be too ambitious
Examples:
A green energy windmill producer need to optimize distribution and need better prediction on production based on weather forecasts
An online news media is interested in a story with how CO2 per capita around the world has evolved over the years
Both projects are difficult
For the windmill we would need data on production, maintenance periods, detailed weather data, just to get started.
The data for CO2 per capita is available on World Bank, but creating a visual story is difficult with our current capabilities
Hence, make a better research problem
You can start by considering a dataset and get inspiration
read_csv(): read a comma-separated values (csv) file into pandas DataFrame.import pandas as pd data = pd.read_csv('files/aapl.csv', parse_dates=True, index_col=0)
read_html() Read HTML tables into a list of DataFrame objects.url = "https://en.wikipedia.org/wiki/Wikipedia:Fundraising_statistics" data = pd.read_html(url)
read_sql() Read SQL query or database table into a DataFrame.
The sqlite3 is an interface for SQLite databases.import sqlite3 import pandas as pd conn = sqlite3.connect('files/dallas-ois.sqlite') data = pd.read_sql('SELECT * FROM officers', conn)
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
data = pd.read_csv('files/WorldBank-ATM.CO2E.PC_DS2.csv', index_col=0)
data['USA'].plot()
Adding title and labels
title='Tilte' adds the title
xlabel='X label' adds or changes the X-label
ylabel='X label' adds or changes the Y-labeldata['USA'].plot(title='US CO2 per capita', ylabel='CO2 (metric tons per capita)')
Adding ranges
xlim=(min, max) or xlim=min Sets the x-axis range
ylim=(min, max) or ylim=min Sets the y-axis rangedata['USA'].plot(title='US CO2 per capita', ylabel='CO2 (metric tons per capita)', ylim=0)
Comparing datadata[['USA', 'WLD']].plot(ylim=0)
Scatter Plot
Good to see any connectiondata = pd.read_csv('files/sample_corr.csv') data.plot.scatter(x='x', y='y')
Divide into training and test setsfrom sklearn.model_selection import train_test_split X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
MinMaxScaler Transform features by scaling each feature to a given range.
MinMaxScaler().fit(X_train) is used to create a scaler.
Notice: We only do it on training datafrom sklearn.preprocessing import MinMaxScaler norm = MinMaxScaler().fit(X_train) X_train_norm = norm.transform(X_train) X_test_norm = norm.transform(X_test)
Standarization
StandardScaler Standardize features by removing the mean and scaling to unit variance.from sklearn.preprocessing import StandardScaler scale = StandardScaler().fit(X_train) X_train_stand = scale.transform(X_train) X_test_stand = scale.transform(X_test)
VarianceThreshold Feature selector that removes all low-variance features.from sklearn.feature_selection import VarianceThreshold sel = VarianceThreshold() sel.fit_transform(data)Remove correlated features
The goal is to find and remove correlated features
Calcualte correlation matrix (assign it to corr_matrix)
A feature is correlated to any previous features if the following is true
Notice that we use correlation 0.8corr_features = [feature for feature in corr_matrix.columns if (corr_matrix[feature].iloc[:corr_matrix.columns.get_loc(feature)] > 0.8).any()]
Review the Problem and Data Science problem you started with.
The analysis should add value to the Data Science Problem
Sometimes our focus drifts – we need to ensure alignment with original Problem.
Go back to the Exploration of the Problem – does the result add value to the Data Science Problem and the initial Problem (which formed the Data Science Problem)
Example: As Data Scientist we often find the research itself valuable, but a business is often interested in increasing revenue, customer satisfaction, brand value, or similar business metrics.
Did we learn anything?
Does the Data-Driven Insights add value?
Example: Does it add value to have evidence for: Wealthy people buy more expensive cars.
This might add you value to confirm this hypothesis, but does it add any value for car manufacturer?
Can we make any valuable insights from our analysis?
Do we need more/better/different data?
Can we give any Actionable Data Driven Insights?
It is always easy to want better and more accurate high quality data.
Do we have the right features?
Do we need eliminate features?
Is the data cleaning appropriate?
Is data quality as expected?
Do we need to try different models?
Data Analysis is an iterative process
Simpler models are more powerful
Can result be inconclusive?
Can we still give recommendations?
Quote
“It is a capital mistake to theorize before one has data. Insensibly one begins to twist facts to suit theories, instead of theories to suit facts.”
Sherlock Holmes
Iterative Research Process
Observation/Question: Starting point (could be iterative)
Hypothesis/Claim/Assumption: Something we believe could be true
Test/Data collection: We need to gether relevant data
Analyze/Evidence: Based on data collection did we get evidence?
Can our model predict? (a model is first useful when it can predict)
Conclude: Warning! E.g.: We can conclude a correlation (this does not mean A causes B)
Example: Based on the collected data we can see a correlation between A and B
Step 4: Report
Present findings
Visualize results
Credibility counts
Step 4.a: Present Findings
You need to sell or tell a story with the findings.
Who is your audience?
Focus on technical level and interest of your audience
Speak their language
Story should make sense to audience
Examples
Team manager: Might be technical, but often busy and only interested in high-level status and key findings.
Data engineer/science team: Technical exploration and similar interest as you
Business stakeholders: This might be end-customers or collaboration in other business units.
When presenting
Goal: Communicate actionable insights to key stakeholders
Outline (inspiration):
TL;DR (Too-long; Didn’t read) – clear and concise summary of the content (often one line) that frames key insights in the context of impact on key business metrics.
Start with your understanding of the business problem
How does it transform into a Data Science Problem
How will to measure impact – what business metrics are indicators of results
What data is available and used
Presenting hypthosis of reseach
A visual presentation of the insights (model/analysis/key findings)
This is where you present the evidence for the insights
How to use insight and create actions
Followup and continuous learning increasing value
Step 4.b: Visualize Results
Telling a story with the data
This is where you convince that the findings/insights are correct
The right visualization is important
Example: A correlation matrix might give a Data Engineer insights in how findings where discovered, but confuse business partners.
Resources for visualization
Seaborn Seaborn is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.
Folium makes it easy to visualize data that’s been manipulated in Python on an interactive leaflet map.
Step 4.c: Credibility Counts
This is the check point if your research is valid
Are you hiding findings you did not like (not supporting your hypothesis)?
Remember it is the long-term relationship that counts
Don’t leave out results
We learn from data and find hidden patterns, to make data-driven decisions, with a long-term perspective
Step 5: Actions
Use insights
Measure impact
Main goal
Step 5.a: Use Insights
How do we follow up on the presented Insights?
No one-size-fits-all: It depends on the Insights and Problem
Examples:
Problem: What customers are most likely to cancel subscription?
Say, we have insufficient knowledge of customers, and need to get more, hence we have given recommendations to gather more insights
But you should still try to add value
Problem: Here is our data – find valuable insights!
This is a challenge as there is no given focus
An iterative process involving the customer can leave you with no surprises
Step 5.b: Measure Impact
If customer cannot measure impact of your work – they do not know what they pay for.
If you cannot measure it – you cannot know if hypothesis are correct.
A model is first valuable when it can be used to predict with some certainty
There should be identified metrics/indicators to evaluate in the report
This can evolve – we learn along the way – or we could be wrong.
How long before we expect to see impact on identified business metrics?
What if we do not see expected impact?
Understanding of metrics
The metrics we measure are indicators that our hypthesis is correct
Other aspects can have impact on the result – but you need to identify that
Main Goal
Your success of a Data Scientist is to create valuable actionable insights
A great way to think
Any business/organisation can be thought of as a complex system
Nobody understands it perfectly and it evolves organically
Data describes some aspect of it
It can be thought of as a black-box
Any insights you can bring is like a window that sheds light on what happens inside
Want to learn more?
Want to learn more about Data Science to become a successful Data Scientist?
This is one lesson of a 15 part Expert Data Science Blueprint course with the following resources.
15 video lessons – covers the Data Science Workflow and concepts, demonstrates everything on real data, introduce projects and shows a solution (YouTube video).
30 JuPyter Notebooks – with the full code and explanation from the lectures and projects (GitHub).
15 projects – structured with the Data Science Workflow and a solution explained in the end of video lessons (GitHub).
This guide will help you to choose the right Machine Learning model for your project. It will also teach you that there is no best model, as all models have predictive error. This means, that you should seek a model that is good enough.
You will learn about Model Selection Techniques like Probabilistic Measures and Resampling Methods.
Step 1: Problem type
The process of selecting the a model for your Machine Learning starts with the type of Problem you work with.
There are 3 high level types of problems.
What kind of problem are you looking into?
Classification: Predict labels on data with predefined classes
Supervised Machine Learning
Clustering: Identify similarieties between objects and group them in clusters
Unsupervised Machine Learning
Regression: Predict continuous values
Supervised Machine Learning
A great guide is the Sklearn cheat sheet, which helps you to narrow down using the problem types.
Step 2: Model Selection Techniques
As said, all models have predictive errors and the goal isn’t to fit a model 100% on your training-test datasets. Your goal is to have create a simple model, which can predict future values.
This means, that you should seek a model that is good enough for the task.
But how do you do that?
You should use a model selection technique to find a good enough model.
Model Selection Techniques
Probabilistic Measures: Scoring by performance and complexity of model.
Resampling Methods: Splitting in sub-train and sub-test datasets and scoring by mean values of repeated runs.
Step 3: Example of testing a model
We will look at a dataset and run a few tests. It will not cover in-depth example of the above methods. But it will tweak it and convert a problem type into another category of type. This can actually sometimes be a good approach.
Hence, we take a Regression problem and turn it into a classification problem.
Even though the data is of a regression type of problem, maybe what you are looking for is not the specific values, and you can turn the problem into a classification problem, and get more valuable results from your model.
Let’s try it.
import pandas as pd
data = pd.read_parquet('https://github.com/LearnPythonWithRune/DataScienceWithPython/raw/main/files/house_sales.parquet')
data['SalePrice'].plot.hist(bins=20)
See more experiments in the video at the top of the page.
Want to learn more?
Want to learn more about Data Science to become a successful Data Scientist?
This is one lesson of a 15 part Expert Data Science Blueprint course with the following resources.
15 video lessons – covers the Data Science Workflow and concepts, demonstrates everything on real data, introduce projects and shows a solution (YouTube video).
30 JuPyter Notebooks – with the full code and explanation from the lectures and projects (GitHub).
15 projects – structured with the Data Science Workflow and a solution explained in the end of video lessons (GitHub).
It should also be clear that feature selection should only be done on training data, as you should assume no knowledge of the testing data.
Step 4: Filter Method – Quasi-constant features
Let’s try an example by removing quasi-constant features. Those a features that are almost constant. It should be clear that features that are constant all the time do not provide any value. Features that are almost the same value all the time, also provide little value.
To do that we use the following.
Using Sklearn
Remove constant and quasi constant features
VarianceThreshold Feature selector that removes all low-variance features.
import pandas as pd
from sklearn.feature_selection import VarianceThreshold
data = pd.read_parquet('https://github.com/LearnPythonWithRune/DataScienceWithPython/raw/main/files/customer_satisfaction.parquet')
sel = VarianceThreshold(threshold=0.01)
sel.fit_transform(data)
quasi_constant = [col for col in data.columns if col not in sel.get_feature_names_out()]
len(quasi_constant)
This reveals that actually 97 of the features are more than 99% constant.
Step 5: Filter Method – Correlated features
The goal is to find and remove correlated features as they give the same value for the most part. Hence, they do not contribute much.
Calculate correlation matrix (assign it to corr_matrix)
A feature is correlated to any previous features if the following is true
Notice that we use correlation 0.8feature = 'imp_op_var39_comer_ult1' (corr_matrix[feature].iloc[:corr_matrix.columns.get_loc(feature)] > 0.8).any()
Get all the correlated features by using list comprehension
train = data[sel.get_feature_names_out()]
corr_matrix = train.corr()
corr_features = [feature for feature in corr_matrix.columns if (corr_matrix[feature].iloc[:corr_matrix.columns.get_loc(feature)] > 0.8).any()]
This will get the correlated features that are more than 0.8 correlated.
First install it by running the following in a terminal pip install mlxtend
For preparation remove all quasi-constant features and correlated featuresX = data.drop(['TARGET'] + quasi_features + corr_features, axis=1) y = data['TARGET']
To demonstrate this we create a small training setX_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.75, random_state=42)
We will use the SVC model with the SequentialFeatureSelector.
For two features
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
X = data.drop(['TARGET'] + quasi_constant + corr_features, axis=1)
y = data['TARGET']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.9, random_state=42)
sfs = SFS(SVC(), k_features=2, verbose=2, cv=2, n_jobs=8)
sfs.fit(X_train, y_train)
Now that shows a few simple ways to make feature selection.
Want to learn more?
Want to learn more about Data Science to become a successful Data Scientist?
This is one lesson of a 15 part Expert Data Science Blueprint course with the following resources.
15 video lessons – covers the Data Science Workflow and concepts, demonstrates everything on real data, introduce projects and shows a solution (YouTube video).
30 JuPyter Notebooks – with the full code and explanation from the lectures and projects (GitHub).
15 projects – structured with the Data Science Workflow and a solution explained in the end of video lessons (GitHub).
You will be working with a weather dataset and try to predict the weather tomorrow.
import pandas as pd
data = pd.read_csv('https://raw.githubusercontent.com/LearnPythonWithRune/DataScienceWithPython/main/files/weather.csv', index_col=0, parse_dates=True)
data.describe()
A subset of the description here.
You will first clean the data in a simple way. If you want to learn about cleaning data check this guide out.
Then we will split the data into train and test. If you want to learn about that – then check out this guide.
from sklearn.model_selection import train_test_split
import numpy as np
data_clean = data.drop(['RISK_MM'], axis=1)
data_clean = data_clean.dropna()
X = data_clean.select_dtypes(include='number')
y = data_clean['RainTomorrow']
y = np.array([0 if value == 'No' else 1 for value in y])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=42)
Then let’s make a box plot to see the problem with the data.
X_train.plot.box(figsize=(20,5), rot=90)
The problem is that the data is in the same ranges – which makes it difficult for distance based Machine Learning models.
We need to deal with that.
Step 3: Normalization
Normalization transforms data into the same range.
MinMaxScaler Transform features by scaling each feature to a given range.
MinMaxScaler().fit(X_train) is used to create a scaler.
As we see here then all the data is put into the same range form 0 to 1. This has the challenge that you see how the outliers might dominate the picture.
If you want to learn more about box plots and statistics – then see this introduction.
Step 4: Standardization
StandardScaler Standardize features by removing the mean and scaling to unit variance.
This gives that the mean value is 0 and the standard deviation is 1. This can be a great way to deal with data that has a lot of outliers – like this one.
Step 4: Testing it on a Machine Learning model
Let’s test the different approaches on a Machine Learning model.
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
score = []
trainX = [X_train, X_train_norm, X_train_stand]
testX = [X_test, X_test_norm, X_test_stand]
for train, test in zip(trainX, testX):
svc = SVC()
svc.fit(train, y_train)
y_pred = svc.predict(test)
score.append(accuracy_score(y_test, y_pred))
df_svr = pd.DataFrame({'Accuracy score': score}, index=['Original', 'Normalized', 'Standardized'])
df_svr
As you can see that both approaches do better than just leaving the data as it is.
Want to learn more?
Want to learn more about Data Science to become a successful Data Scientist?
This is one lesson of a 15 part Expert Data Science Blueprint course with the following resources.
15 video lessons – covers the Data Science Workflow and concepts, demonstrates everything on real data, introduce projects and shows a solution (YouTube video).
30 JuPyter Notebooks – with the full code and explanation from the lectures and projects (GitHub).
15 projects – structured with the Data Science Workflow and a solution explained in the end of video lessons (GitHub).
15 video lessons – which explain Machine Learning concepts, demonstrate models on real data, introduce projects and show a solution.
30 JuPyter Notebooks – with the full code and explanation from the lectures and projects (GitHub).
15 projects – with step guides to help you structure your solutions and solution explained in the end of video lessons.
Step 2: How Machine Learning works
Machine learning is divided into two phases.
Phase 1: Learning
Get Data: Identify relevant data for the problem you want to solve. This data set should represent the type of data that the Machine Learn model will use to predict from in Phase 2 (predction).
Pre-processing: This step is about cleaning up data. While the Machine Learning is awesome, it cannot figure out what good data looks like. You need to do the cleaning as well as transforming data into a desired format.
Train model: This is where the magic happens, the learning step (Train model). There are three main paradigms in machine learning.
Supervised: where you tell the algorithm what categories each data item is in. Each data item from the training set is tagged with the right answer.
Unsupervised: is when the learning algorithm is not told what to do with it and it should make the structure itself.
Reinforcement: teaches the machine to think for itself based on past action rewards.
Test model: Finally, the testing is done to see if the model is good. The training data was divided into a test set and training set. The test set is used to see if the model can predict from it. If not, a new model might be necessary.
Phase 2: Prediction
Step 3: What is Supervised Learning
Supervised Learning
Given a dataset of input-output pairs, learn a function to map inputs to outputs
There are different tasks – but we start to focus on Classification
Classification
Supervised learning: the task of learning a function mapping an input point to a descrete category
Step 4: Example with Iris Flower Dataset
The Iris Flower dataset is one of the datasets everyone has to work with.
Consists of three classes: Iris-setosa, Iris-versicolor, and Iris-virginica
Given depedent features can we predict class
import pandas as pd
data = pd.read_csv('https://raw.githubusercontent.com/LearnPythonWithRune/DataScienceWithPython/main/files/iris.csv', index_col=0)
print(data.head())
The Machine Learning is divided into a few steps – including dividing it into train and test dataset. The train dataset is used to train the model, while the test dataset is used to check the accuracy of the model.
Steps
Step 1: Assign independent features (those predicting) to X
Step 2: Assign classes (labels/dependent features) to y
Step 3: Divide into training and test setsX_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
Step 4: Create the modelsvc = SVC()
Step 5: Fit the modelsvc.fit(X_train, y_train)
Step 6: Predict with the modely_pred = svc.predict(X_test)
Step 7: Test the accuracyaccuracy_score(y_test, y_pred)
Code example here.
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
X = data.drop('Species', axis=1)
y = data['Species']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
svc = SVC()
svc.fit(X_train, y_train)
y_pred = svc.predict(X_test)
accuracy_score(y_test, y_pred)
Want to learn more about Data Science to become a successful Data Scientist?
This is one lesson of a 15 part Expert Data Science Blueprint course with the following resources.
15 video lessons – covers the Data Science Workflow and concepts, demonstrates everything on real data, introduce projects and shows a solution (YouTube video).
30 JuPyter Notebooks – with the full code and explanation from the lectures and projects (GitHub).
15 projects – structured with the Data Science Workflow and a solution explained in the end of video lessons (GitHub).
What cleaning data is and how it relates to data quality. This guide will show you how to deal with missing data by replacing and interpolate data. How to deal with data outliers and removing duplicates.
Step 1: What is Clearning Data?
Clearning Data requires domain knowledge of the data.
Data Quality is often a measure of how good data is for further analysis or how solid conclusions we can make. Cleaning data can improve the data quality.
If we understand what is meant by Data Quality – for the data we work with, it becomes easier to clean it. The goal of cleaning is to improve the Data Quality and hence, give better results of our data analysis.
Improve the quality (if possible)
Dealing with missing data (both rows in single entries)
Examples include
Replacing missing values/entries with mean values
Interpolation of values (in time series)
Dealing with data outliers
Examples include
Default missing values in system: sometimes as 0-values
Wrong values
Removing duplicates
Common problem to have duplicate entries
Process requires domain knowledge
Step 2: Missing Data
A common issue of Data Quality is missing data. This can be fields that are missing and are often easy to detect. In pandas DataFrames they are often represented by NA.
The dataset contains a training and testing dataset.
The goal is to predict prices on the testing dataset.
We will explore how dealing with missing values impacts the prediction of a linear regression model
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
data = pd.read_csv('https://raw.githubusercontent.com/LearnPythonWithRune/DataScienceWithPython/main/files/home-data/train.csv', index_col=0)
data.head()
We can remove non-numeric values in this example as follows and check for missing values afterwards.
data = data.select_dtypes(include='number')
The missing values are listed as follows.
data.info()
(output not given here).
Let’s make a helper function to calculate the r-square score of a linear regression model. This way we can see how the model will behave differently.
One way to find missing rows of data in a time series is as followsidx = pd.Series(data=pd.date_range(start=df.index.min(), end=df.index.max(), freq="H")) mask = idx.isin(df.index) idx[~mask]
Want to learn more about Data Science to become a successful Data Scientist?
This is one lesson of a 15 part Expert Data Science Blueprint course with the following resources.
15 video lessons – covers the Data Science Workflow and concepts, demonstrates everything on real data, introduce projects and shows a solution (YouTube video).
30 JuPyter Notebooks – with the full code and explanation from the lectures and projects (GitHub).
15 projects – structured with the Data Science Workflow and a solution explained in the end of video lessons (GitHub).
Show what Linear Regression is visually and demonstrate it on data.
Step 1: What is Linear Regression
Simply said, you can describe Linear Regression as follows.
Given data input (independent variables) can we predict output (dependent variable)
It is the mapping from input point to a continuous value
I like to show it visually.
The goal of Linear Regression is to find the best fitting line. Hence, some data will be fitted better as it will be closer to the line.
The predictions will be on the line. That is, when you have fitted your Linear Regression model, it will predict new values to be on the line.
While this sounds simple, the model is one of the most used models and creates high value.
Step 2: Correlation and Linear Regression
Often there is a bit confusing between Linear Regression and Correlation. But they do different things.
Correlation is one number describing a relationship between tow variables. While Linear Regression is an equation used to predict values.
Correlation
Single measure of relationship between two variables.
Linear Regression
An equation used for prediction.
Similarities
Describes relationship between variables
Step 3: Example
Let’s try an example.
import pandas as pd
data = pd.read_csv('https://raw.githubusercontent.com/LearnPythonWithRune/DataScienceWithPython/main/files/weight-height.csv')
data.plot.scatter(x='Height', y='Weight', alpha=.1)
This data looks correlated. How would a Linear Regression prediction of it look like?
The Linear Regression model takes a collection of observations
Each observation has featuers (or variables).
The features the model takes as input are called independent (often denoted with X)
The feature the model outputs is called dependent (often denoted with y)
from sklearn.linear_model import LinearRegression
# Creating a Linear Regression model on our data
lin = LinearRegression()
lin.fit(data[['Height']], data['Weight'])
# Creating a plot
ax = data.plot.scatter(x='Height', y='Weight', alpha=.1)
ax.plot(data['Height'], lin.predict(data[['Height']]), c='r')
To measure the accuracy of the prediction the r-squared function is often used, which you can access directly on the model by using the following code.
lin.score(data[['Height']], data['Weight'])
This will give 0.855, which is just a number you can use to compare to other samples.
Want to learn more?
Want to learn more about Data Science to become a successful Data Scientist?
This is one lesson of a 15 part Expert Data Science Blueprint course with the following resources.
15 video lessons – covers the Data Science Workflow and concepts, demonstrates everything on real data, introduce projects and shows a solution (YouTube video).
30 JuPyter Notebooks – with the full code and explanation from the lectures and projects (GitHub).
15 projects – structured with the Data Science Workflow and a solution explained in the end of video lessons (GitHub).