The Ultimate Pic Chart Guide for Matplotlib

What will you learn?

Pie charts are one of the most powerful visualizations when presenting them. With a few tricks you can make them look professional with a free tool like Matplotlib.

In the end of this tutorial you will know how to make pie charts and customize it even further.

Basic Pie Chart

First you need to make a basic Pie chart with matplotlib.

import matplotlib.pyplot as plt
v = [2, 5, 3, 1, 4]
labels = ["A", "B", "C", "D", "E"]
plt.pie(v, labels=labels)
plt.show()

This will create a chart based on the values in v with the labels in labels.

Based on the above Pie Chart we can continue to build further understanding of how to create more advanced charts.

Exploding Segment in Pie Charts

An exploding segment in a pie chart is simply moving segments of the pie chart out.

The following example will demonstrate it.

import matplotlib.pyplot as plt
v = [2, 5, 3, 1, 4]
labels = ["A", "B", "C", "D", "E"]
explode = [0, 0.1, 0, 0.2, 0]
plt.pie(v, labels=labels, explode=explode)
plt.show()

Though not very pretty, it shows you how to control each segment.

Now let’s learn a bit more about how to style it.

Styling Pie Charts

The following list sets the most used parameters for the pie chart.

  • labels The labels.
  • colors The colors.
  • explode Indicates offset of each segment.
  • startangle Angle to start from.
  • counterclock Default True and sets direction.
  • shadow Enables shadow effect.
  • wedgeprops Example {"edgecolor":"k",'linewidth': 1}.
  • autopct Format indicating percentage labels "%1.1f%%".
  • pctdistance Controls the position of percentage labels.

We already know the labels from above. But let’s add some more to see the effect.

import matplotlib.pyplot as plt
v = [2, 5, 3, 1, 4]
labels = ["A", "B", "C", "D", "E"]
colors = ["blue", "red", "orange", "purple", "brown"]
explode = [0, 0, 0.1, 0, 0]
wedge_properties = {"edgecolor":"k",'linewidth': 1}
plt.pie(v, labels=labels, explode=explode, colors=colors, startangle=30,
           counterclock=False, shadow=True, wedgeprops=wedge_properties,
           autopct="%1.1f%%", pctdistance=0.7)
plt.title("Color pie chart")
plt.show()

This does a decent job.

Donut Chart

A great chart to play with is the Donut chart.

Actually, pretty simple by setting wedgeprops as this example shows.

import matplotlib.pyplot as plt
v1 = [2, 5, 3, 1, 4]
labels1 = ["A", "B", "C", "D", "E"]
width = 0.3
wedge_properties = {"width":width}
plt.pie(v1, labels=labels1, wedgeprops=wedge_properties)
plt.show()

The width is taken from outside and in.

Legends on Pie Chart

You can add a legend, which uses the labels. Also, notice that you can set the placement (loc) of the legend.

import matplotlib.pyplot as plt
labels = 'Dalmatians', 'Beagles', 'Labradors', 'German Shepherds'
sizes = [6, 5, 20, 9]
fig, ax = plt.subplots()
ax.pie(sizes, labels=labels, autopct='%.1f%%')
ax.legend(labels, loc='lower left')
plt.show()

Nested Donut Pie Chart

This one is needed in any situation to show a bit off.

import matplotlib.pyplot as plt
v1 = [2, 5, 3, 1, 4]
labels1 = ["A", "B", "C", "D", "E"]
v2 = [4, 1, 3, 4, 1]
labels2 = ["V", "W", "X", "Y", "Z"]
width = 0.3
wedge_properties = {"width":width, "edgecolor":"w",'linewidth': 2}
plt.pie(v1, labels=labels1, labeldistance=0.85,
        wedgeprops=wedge_properties)
plt.pie(v2, labels=labels2, labeldistance=0.75,
        radius=1-width, wedgeprops=wedge_properties)
plt.show()

Want to learn more?

Actually Data Visualization is an important skill to understand and present data.

This is a key skill in Data Science. If you like to learn more then check my free Expert Data Science Blueprint course with the following resources.

  • 15 video lessons – covers the Data Science Workflow and concepts, demonstrates everything on real data, introduce projects and shows a solution (YouTube video).
  • 30 JuPyter Notebooks – with the full code and explanation from the lectures and projects (GitHub).
  • 15 projects – structured with the Data Science Workflow and a solution explained in the end of video lessons (GitHub).

11 Useful pandas Charts with One Line of Code

What will you learn?

When trying to understand data, visualization is the key to fast understand it!

Data visualization has 3 purposes.

  1. Data Quality: Finding outliers and missing data.
  2. Data Exploration: Understand the data.
  3. Data Presentation: Present the result.

Here you will learn 11 useful charts to understand your data and they are done in one line of code.

The data we will work with

We need some data to work with.

You can either download the Notebook and csv-file (GitHub repo) or read it directly from repository as follows.

import pandas as pd
import matplotlib.pyplot as plt
file_url = 'https://raw.githubusercontent.com/LearnPythonWithRune/pandas_charts/main/air_quality.csv'
data = pd.read_csv(file_url, index_col=0, parse_dates=True)
print(data)

This will output the first 5 lines of the data.

Now let’s use the data we have in the DataFrame data.

If you are new to pandas, I suggest you get an understanding of them from this guide.

#1 Simple plot

A simple plot is the default to use unless you know what you want. It will demonstrate the nature of the data.

Let’s try to do it here.

data.plot()

As you notice, there are three columns of data for the 3 stations: Antwerp, Paris, and London.

The data is a datetime series, meaning, that each data point is part of a time series (the x-axis).

It is a bit difficult to see if station Antwerp has a full dataset.

Let’s try to figure that out.

#2 Isolated plot

This leads us to making an isolated plot of only one column. This is handy to understand each individual column of data better.

Here we were a bit curious about if the data of station Antwerp was given for all dates.

data['station_antwerp'].plot()

This shows that our suspicion was correct. The time series is not covering the full range for station Antwerp.

This tells us about the data quality, which might be crucial for further analysis.

You can do the same for the other two columns.

#3 Scatter Plot

A great way to see if there is a correlation of data, is to make a scatter plot.

Let’s demonstrate how that looks like.

data.plot.scatter(x='station_london', y='station_paris', alpha=.25)

You see that data is not totally scattered all over, but is not fully correlated either. This means, that there is come weak correlation of the data and it is not fully independent of each other.

#4 Box Plot

One way to understand data better is by a box plot. It might need a bit of understanding of simple statistics.

Let’s first take a look at it.

data.plot.box()

The box plot shows the following.

To understand what outliers, min, median, max, and so forth means, I would suggest you read this simple statistic guide.

#5 Area Plot

An area plot can show you the data in a great way to see how the values follow each other in a visual easy way to get an understanding of values, correlation, and missing data.

data.plot.area(figsize=(12,4), subplots=True)

#6 Bar plots

Bar plots can be useful, but often when the data is more limited.

Here you see a bar plot of the first 15 rows of data.

data.iloc[:15].plot.bar()

#7 Histograms for single column

Histograms will show you what data is most common. It shows the frequencies of data divided into bins. By default there are 10 bins of data.

It is an amazing tool to get a fast view of the number of occurrences of each data range.

Here first for an isolated station.

data['station_paris'].plot.hist()

#8 Histograms for multiple columns

Then for all three stations, where you see it with transparency (alpha).

data.plot.hist(alpha=.5)

#9 Pie

Pie charts are very powerful, when you want to show a division of data.

How many percentage belong to each category.

Here you see the mean value of each station.

data.mean().plot.pie()

#10 Scatter Matrix Plot

This is a great tool for showing data for combined in all possible ways. This will show you correlations and how data is distributed.

You need to import an additional library, but it gives you fast understanding of data.

from pandas.plotting import scatter_matrix
scatter_matrix(data, alpha=0.2, figsize=(6, 6))

#11 Secondary y-axis

Finally, sometimes you want two plots on the same chart. The problem can be, that the two plots have very different ranges. hence, you would like to have two different y-axes, with different ranges.

This will enable you to have plots on the same chart with different ranges.

data['station_london'].plot()
data['station_paris'].cumsum().plot(secondary_y=True)

Want to learn more?

Want to learn more about Data Science to become a successful Data Scientist?

Then check my free Expert Data Science Blueprint course with the following resources.

  • 15 video lessons – covers the Data Science Workflow and concepts, demonstrates everything on real data, introduce projects and shows a solution (YouTube video).
  • 30 JuPyter Notebooks – with the full code and explanation from the lectures and projects (GitHub).
  • 15 projects – structured with the Data Science Workflow and a solution explained in the end of video lessons (GitHub).

15 Most Useful pandas Shortcut Methods

What will you learn?

Everybody likes the pandas data structure DataFrame, but most miss out on what powerful methods it provides.

pandas is a huge module, which makes it difficult to master. Most just use the data structure (DataFrame), without utilizing the power of the methods. In this tutorial you will learn the 15 most useful shortcut that will help you when working with data in pandas data structures.

#1 groupby

The groupby method involves some combination of splitting the object, applying a function, and combining the result.

Wow. That sounds complex. But it is not. It can be used to group large amounts of data and compute operations on these groups.

Best way to learn is to see some example.

import pandas as pd
data = {'Items': ['Apple','Orange', 'Pear', 'Orange', 'Apple'], 
        'Price': [12, 5, 3, 7, 24]}
df = pd.DataFrame(data)

This results in this DataFrame.

The groupby method can group the items together, and apply a function. Let’s try it here.

df.groupby(['Items']).mean()

This will result in this output.

As you see, it has grouped the Apples, Oranges, and the Pears together and for the price column, it has applied the mean() function on the values.

Hence, the Apple has value 18, as it is the mean of 12 and 24 ((12 + 24)/2). Similar, for Orange and Pear.

#2 memory_usage()

We get more and more data and our project get bigger and bigger. At one point you will need to analyze how much memory your data is using.

What memory_usage does, is, it returns the memory usage of each column in the DataFrame. Sometimes, the data type of a column is object, what that means is, that it is pointing to another object. To get the data usage of these objects included, you need to use the deep=True argument.

Let’s try both, to see the difference.

import pandas as pd
dtypes = ['int64', 'float64', 'complex128', 'object', 'bool']
data = dict([(t, np.ones(shape=1000, dtype=int).astype(t)) for t in
                                                          dtypes])
df = pd.DataFrame(data)
print(df.head())

Then we can get the memory usage as follows.

print(df.memory_usage())

Giving the following.

Index           128
int64          8000
float64        8000
complex128    16000
object         8000
bool           1000
dtype: int64

Also, with deep=True.

df.memory_usage(deep=True)

Giving the following where you see the object column is uses more space.

Index           128
int64          8000
float64        8000
complex128    16000
object        36000
bool           1000
dtype: int64

#3 clip()

clip() can trim values at the input threshold.

I find this is easiest to understand by inspecting an example.

import pandas as pd
data = {'col_0': [9, -3, 0, -1, 5], 'col_1': [-2, -7, 6, 8, -5]}
df = pd.DataFrame(data)
print(df)

Then we apply the clip, which will ensure the values below -2 are replaced with -2, and values above 5, are replaced with 5. It clips the values.

print(df.clip(-2, 5))

#4 corr()

The correlation between the values in a column can be calculate with corr(). There are different methods to use: Pearson, Kendall, and Spearman. By default it uses the Pearson method, which will do fine giving you an idea if columns are correlated.

Let’s try an example.

import pandas as pd
df = pd.DataFrame([(.2, .3), (.0, .6), (.6, .0), (.2, .1)],
                  columns=['dogs', 'cats'])

The correlation is given by.

print(df.corr())

The value 1.0 is saying it is perfect correlation, which are shown in the diagonal. This makes sense, as the diagonal is the column with itself.

To learn more about correlation and statistics, be sure to check this tutorial out, which also explains the correlation value and how to interpret it.

#5 argmin()

The name argmin is a bit strange. What it does, it returns the position (the index) of the smallest value in a Series (column of a DataFrame).

import pandas as pd
s = pd.Series({'Corn Flakes': 100.0, 'Almond Delight': 110.0,
               'Cinnamon Toast Crunch': 120.0, 'Cocoa Puff': 110.0})
print(s)

Gives.

Corn Flakes              100.0
Almond Delight           110.0
Cinnamon Toast Crunch    120.0
Cocoa Puff               110.0
dtype: float64

And to get the position of the smallest value, just apply the method.

print(s.argmin())

Which will give 0. Remember that it is zero-index, meaning that the first element has index 0.

#6 argmax()

Just like argmin, then argmax() returns the largest element in a Series.

Continue with the example from above.

print(s.argmax())

This will give 2, as it is the largest element in the series.

#7 compare()

Want to know the differences between DataFrames? Then compare does a great job at that.

import pandas as pd
import numpy as np
df = pd.DataFrame(
     {
         "col1": [1.0, 2.0, 3.0, np.nan, 5.0],
         "col2": [1.0, 2.0, 3.0, 4.0, 5.0]
     },
     columns=["col1", "col2"],
)

We can compare the columns here.

df['col1'].compare(df['col2'])

As you see, the only row that differ is the above.

#8 replace()

Did you ever need to replace a value in a DataFrame? Well, it also has a method for that and it is called replace().

df = pd.DataFrame({'A': [0, 1, 2, 3, 4],
                   'B': [5, 6, 7, 8, 9],
                   'C': ['a', 'b', 'c', 'd', 'e']})

Let’s try to replace 5 with -10 and see what happens.

print(df.replace(5, -10))

#9 isna()

Wanted to find missing values? Then isna can do that for you.

Let’s try it.

import pandas as pd
import numpy as np
df = pd.DataFrame(dict(age=[5, 6, np.NaN],
                  born=[pd.NaT, pd.Timestamp('1939-05-27'),
                        pd.Timestamp('1940-04-25')],
                  name=['Alfred', 'Batman', ''],
                  toy=[None, 'Batmobile', 'Joker']))

Then you get the values as follows.

print(df.isna())

I often use it also in a combination with sum(), which will then tell how many rows in each column are missing. This is interesting to get an idea about the quality of the dataset.

print(df.isna().sum())
age     1
born    1
name    0
toy     1
dtype: int64

#10 interpolation()

On the subject of missing values, what to do? Well, there are many options, but one simple can be to interpolate the values.

import pandas as pd
import numpy as np
s = pd.Series([0, 1, np.nan, 3])

This gives the following series.

0    0.0
1    1.0
2    NaN
3    3.0
dtype: float64

Then you can interpolate and get the value between them.

print(s.interpolate())
0    0.0
1    1.0
2    2.0
3    3.0
dtype: float64

This is just one way to deal with it. Dealing with missing values is a big subject. To learn more read this tutorial on the subject.

#11 drop()

Ever needed to remove a column in a DataFrame? Well, again they made a method for that.

Let’s try the drop() method to remove a column.

import pandas as pd
data = {'Age': [-44,0,5, 15, 10, -3], 
        'Salary': [0,5,-2, -14, 19, 24]}
df = pd.DataFrame(data)

Then let’s remove the Age column.

df2 = df.drop('Age', axis='columns')
print(df2)

Notice, that it returns a new DataFrame.

#12 drop_duplicates()

Dealing with data that has duplicate rows? Well, it is a common problem and pandas made a method to easily remove them from your DataFrame.

It is called drop_duplicates and does what it says.

Let’s try it.

import pandas as pd
df = pd.DataFrame({
    'brand': ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie'],
    'style': ['cup', 'cup', 'cup', 'pack', 'pack'],
    'rating': [4, 4, 3.5, 15, 5]
})

This DataFrame as duplicate rows. Let’s see how they can be removed.

df2 = df.drop_duplicates()
print(df2)

#13 sum()

Ever needed to sum a column? Even with multi index?

Let’s try.

import pandas as pd
idx = pd.MultiIndex.from_arrays([
    ['warm', 'warm', 'cold', 'cold'],
    ['dog', 'falcon', 'fish', 'spider']],
    names=['blooded', 'animal'])
s = pd.Series([4, 2, 0, 8], name='legs', index=idx)
print(s)

This will output.

blooded  animal
warm     dog       4
         falcon    2
cold     fish      0
         spider    8
Name: legs, dtype: int64
In [29]:

Then this will sum the column.

print(s.sum())

And it will output 14, as expected.

#14 cumsum()

Wanted to make a cumulative sum? Then cumsum() does the job for you, even with missing numbers.

import pandas as pd
s = pd.Series([2, np.nan, 5, -1, 0])
print(s)

This will give.

0    2.0
1    NaN
2    5.0
3   -1.0
4    0.0
dtype: float64

And then.

print(s.cumsum())

Gives.

0    2.0
1    NaN
2    7.0
3    6.0
4    6.0
dtype: float64

Where it makes a cumulative sum down the column.

#15 value_counst()

The value_counts() method returns the number of unique rows in a DataFrame.

This requires an example to really understand.

df = pd.DataFrame({'num_legs': [2, 4, 4, 6],
                   'num_wings': [2, 0, 0, 0]},
                  index=['falcon', 'dog', 'cat', 'ant'])

Here we see we have two rows with 4 and 0 (in that order), while the other rows have unique values.

print(df.value_counts())
num_legs  num_wings
4         0            2
2         2            1
6         0            1
dtype: int64

We see there are two rows with 4 and 0, and one of the other rows.

Bonus: unique()

Wanted the unique elements in your Series?

Here you go.

import pandas as pd
s = pd.Series([2, 1, 3, 3], name='A')
print(s.unique())

This will give the unique elements.

array([2, 1, 3])

Want to learn more?

Want to learn more about Data Science to become a successful Data Scientist?

Then check my free Expert Data Science Blueprint course with the following resources.

  • 15 video lessons – covers the Data Science Workflow and concepts, demonstrates everything on real data, introduce projects and shows a solution (YouTube video).
  • 30 JuPyter Notebooks – with the full code and explanation from the lectures and projects (GitHub).
  • 15 projects – structured with the Data Science Workflow and a solution explained in the end of video lessons (GitHub).

The Ultimate Data Science Workflow Template

What will you learn?

Data Science Workflow

When it comes to creating a good Data Science Project you will need to ensure you cover a great deal of aspects. This template will show you what to cover and where to find more information on a specific topic.

The common pitfall for most junior Data Scientist is to focus on the very technical part of the Data Science Workflow. To add real value to the clients you need to focus on more steps, which are often neglected.

This guide will walk you through all steps and elaborate and link to in-depth content if you need more explanations.

Step 1: Acquire

  • Explore problem
  • Identify data
  • Import data

Step 1.a: Define Problem

If you are making a hoppy project, there might not be a definition of what you are trying to solve. But it is always good practice to start with it. Otherwise, you will most likely just do what you usually do and feel comfortable about. Try to sit down and figure it out.

It should be clear, that this step is before you have the data. That said, it often happens that a company has data and doesn’t know what to use it for.

Still, it all starts by defining a problem.

Here are some guidelines.

  • When defining a problem, don’t be too ambitious
    • Examples:
      • A green energy windmill producer need to optimize distribution and need better prediction on production based on weather forecasts
      • An online news media is interested in a story with how CO2 per capita around the world has evolved over the years
    • Both projects are difficult
      • For the windmill we would need data on production, maintenance periods, detailed weather data, just to get started.
      • The data for CO2 per capita is available on World Bank, but creating a visual story is difficult with our current capabilities
  • Hence, make a better research problem
    • You can start by considering a dataset and get inspiration
    • Examples of datasets
    • Example of Problem
      • What is the highest rated movie genre?

Data Science: Understanding the Problem

  • Get the right question:
    • What is the problem we try to solve?
    • This forms the Data Science problem
    • Examples
      • Sales figure and call center logs: evaluate a new product
      • Sensor data from multiple sensors: detect equipment failure
      • Customer data + marketing data: better targeted marketing
  • Assess situation
    • Risks, Benefits, Contingencies, Regulations, Resources, Requirement
  • Define goal
    • What is the objective?
    • What is the success criteria?
  • Conclusion
    • Defining the problem is key to successful Data Science projects

Step 1.b: Import libraries

When you work on project, you need somewhere have the data. A great place to start is by using pandas.

If you work in a JuPyter notebook you can run this in a cell to get started and follow this guide.

import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

Step 1.c: Identify the Data

Great Places to Find Data

Step 1.d: Import Data

Read CSV files (Learn more here)

Excel files  (Learn more here)

  • Most videly used spreadsheet
  • Learn more about Excel processing in this lecture
  • read_excel() Read an Excel file into a pandas DataFrame.data = pd.read_excel('files/aapl.xlsx', index_col='Date')

Parquet files  (Learn more here)

  • Parquet is a free open source format
  • Compressed format
  • read_parquet() Load a parquet object from the file path, returning a DataFrame.data = pd.read_parquet('files/aapl.parquet')

Web Scraping (Learn more here)

  • Extracting data from websites
  • Leagal issues: wikipedia.org
  • read_html() Read HTML tables into a list of DataFrame objects.url = "https://en.wikipedia.org/wiki/Wikipedia:Fundraising_statistics" data = pd.read_html(url)

Databases (Learn more here)

  • read_sql() Read SQL query or database table into a DataFrame.
  • The sqlite3 is an interface for SQLite databases.import sqlite3 import pandas as pd conn = sqlite3.connect('files/dallas-ois.sqlite') data = pd.read_sql('SELECT * FROM officers', conn)

Step 1.e: Combine data

Also see guide here.

  • Often you need to combine data
  • Often we need to combine data from different sources

pandas DataFrames

  • pandas DataFrames can combine data (pandas cheat sheet)
  • concat([df1, df2], axis=0)concat Concatenate pandas objects along a particular axis 
  • df.join(other.set_index('key'), on='key')join Join columns of another DataFrame.
  • df1.merge(df2, how='inner', on='a') merge Merge DataFrame or named Series objects with a database-style join

Step 2: Prepare

  • Explore data
  • Visualize ideas
  • Cleaning data

Step 2.a: Explore data

  • head() Return the first n rows.
  • .shape Return a tuple representing the dimensionality of the DataFrame.
  • .dtypes Return the dtypes in the DataFrame.
  • info() Print a concise summary of a DataFrame.
  • describe() Generate descriptive statistics.
  • isna().any() Returns if any element is missing.

Step 2.b: Groupby, Counts and Statistics

Read the guide on statistics here.

  • Count groups to see the significance across resultsdata.groupby('Gender').count()
  • Return the mean of the values over the requested axis.data.groupby('Gender').mean()
  • Standard Deviation
    • Standard deviation is a measure of how dispersed (spread) the data is in relation to the mean.
    • Low standard deviation means data is close to the mean.
    • High standard deviation means data is spread out.
  • data.groupby('Gender').std()
  • Box plots
    • Box plots is a great way to visualize descriptive statistics
    • Notice that Q1: 25%, Q2: 50%, Q3: 75%
  • Make a box plot of the DataFrame columns plot.box()
data.boxplot()

Step 2.c: Visualize data

Read the guide on visualization for data science here.

Simple Plot

import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
data = pd.read_csv('files/WorldBank-ATM.CO2E.PC_DS2.csv', index_col=0)
data['USA'].plot()
  • Adding title and labels
    • title='Tilte' adds the title
    • xlabel='X label' adds or changes the X-label
    • ylabel='X label' adds or changes the Y-labeldata['USA'].plot(title='US CO2 per capita', ylabel='CO2 (metric tons per capita)')
  • Adding ranges
    • xlim=(min, max) or xlim=min Sets the x-axis range
    • ylim=(min, max) or ylim=min Sets the y-axis rangedata['USA'].plot(title='US CO2 per capita', ylabel='CO2 (metric tons per capita)', ylim=0)
  • Comparing datadata[['USA', 'WLD']].plot(ylim=0)

Scatter Plot

  • Good to see any connectiondata = pd.read_csv('files/sample_corr.csv') data.plot.scatter(x='x', y='y')

Histogram

  • Identifying qualitydata = pd.read_csv('files/sample_height.csv') data.plot.hist()
  • Identifying outliersdata = pd.read_csv('files/sample_age.csv') data.plot.hist()
  • Setting bins and figsizedata = pd.read_csv('files/WorldBank-ATM.CO2E.PC_DS2.csv', index_col=0) data['USA'].plot.hist(figsize=(20,6), bins=10)

Bar Plot

  • Normal plotdata = pd.read_csv('files/WorldBank-ATM.CO2E.PC_DS2.csv', index_col=0) data['USA'].plot.bar()
  • Range and columns, figsize and labeldata[['USA', 'DNK']].loc[2000:].plot.bar(figsize=(20,6), ylabel='CO emmission per capita')

Pie Chart

  • Presentingdf = pd.Series(data=[3, 5, 7], index=['Data1', 'Data2', 'Data3']) df.plot.pie()
  • Value counts in Pie Charts
    • colors=<list of colors>
    • labels=<list of labels>
    • title='<title>'
    • ylabel='<label>'
    • autopct='%1.1f%%' sets percentages on chart(data['USA'] < 17.5).value_counts().plot.pie(colors=['r', 'g'], labels=['>= 17.5', '< 17.5'], title='CO2', autopct='%1.1f%%')

Step 2.d: Clean data

Read the data cleaning guide here.

  • dropna() Remove missing values.
  • fillna() Fill NA/NaN values using the specified method.
    • Example: Fill missing values with mean.data = data.fillna(data.mean())
  • drop_duplicates() Return DataFrame with duplicate rows removed.
  • Working with time series
    • reindex() Conform Series/DataFrame to new index with optional filling logic.
    • interpolate() Fill NaN values using an interpolation method.
  • Resources

Step 3: Analyze

  • Feature selection
  • Model selection
  • Analyze data

Step 3.a: Split into Train and Test

For an introduction to Machine Learning read this guide.

  • Assign dependent features (those predicting) to X
  • Assign classes (labels/independent features) to y
  • Divide into training and test setsfrom sklearn.model_selection import train_test_split X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Step 3.b: Feature Scaling

Learn about Feature Scaling in this guide.

  • Feature Scaling transforms values in the similar range for machine learning algorithms to behave optimal.
  • Feature Scaling can be a problems for Machine Learing algorithms on multiple features spanning in different magnitudes.
  • Feature Scaling can also make it is easier to compare resultsFeature Scaling Techniques
  • Normalization is a special case of MinMaxScaler
    • Normalization: Converts values between 0-1(values - values.min())/(values.max() - values.min())
    • MinMaxScaler: Between any values
  • Standardization (StandardSclaer from sklearn)
    • Mean: 0, StdDev: 1(values - values.mean())/values.std()
    • Less sensitive to outliers

Normalization

  • MinMaxScaler Transform features by scaling each feature to a given range.
  • MinMaxScaler().fit(X_train) is used to create a scaler.
    • Notice: We only do it on training datafrom sklearn.preprocessing import MinMaxScaler norm = MinMaxScaler().fit(X_train) X_train_norm = norm.transform(X_train) X_test_norm = norm.transform(X_test)

Standarization

  • StandardScaler Standardize features by removing the mean and scaling to unit variance.from sklearn.preprocessing import StandardScaler scale = StandardScaler().fit(X_train) X_train_stand = scale.transform(X_train) X_test_stand = scale.transform(X_test)

Step 3.c: Feature Selection

Learn about Feature Selection in this guide.

  • Feature selection is about selecting attributes that have the greatest impact towards the problem you are solving.

Why Feature Selection?

  • Higher accuracy
  • Simpler models
  • Reducing overfitting risk

Feature Selection Techniques

Filter methods
  • Independent of Model
  • Based on scores of statistical
  • Easy to understand
  • Good for early feature removal
  • Low computational requirements
Examples
Wrapper methods
  • Compare different subsets of features and run the model on them
  • Basically a search problem
Examples

See more on wikipedia

Embedded methods
  • Find features that contribute most to the accuracy of the model while it is created
  • Regularization is the most common method – it penalizes higher complexity
Examples

Remove constant and quasi constant features

  • VarianceThreshold Feature selector that removes all low-variance features.from sklearn.feature_selection import VarianceThreshold sel = VarianceThreshold() sel.fit_transform(data)Remove correlated features
  • The goal is to find and remove correlated features
  • Calcualte correlation matrix (assign it to corr_matrix)
  • A feature is correlated to any previous features if the following is true
    • Notice that we use correlation 0.8corr_features = [feature for feature in corr_matrix.columns if (corr_matrix[feature].iloc[:corr_matrix.columns.get_loc(feature)] > 0.8).any()]

Step 3.d: Model Selection

Learn about Model Selection in this guide.

  • The process of selecting the model among a collection of candidates machine learning models

Problem type

  • What kind of problem are you looking into?
    • ClassificationPredict labels on data with predefined classes
      • Supervised Machine Learning
    • ClusteringIdentify similarieties between objects and group them in clusters
      • Unsupervised Machine Learning
    • RegressionPredict continuous values
      • Supervised Machine Learning
  • Resource: Sklearn cheat sheet

Model Selection Techniques

  • Probabilistic Measures: Scoring by performance and complexity of model.
  • Resampling Methods: Splitting in sub-train and sub-test datasets and scoring by mean values of repeated runs.

A few models

  • LinearRegression Ordinary least squares Linear Regression (Lesson 08).from sklearn.linear_model import LinearRegression from sklearn.metrics import r2_score lin = LinearRegression() lin.fit(X_train, y_train) y_pred = lin.predict(X_test) r2_score(y_test, y_pred)
  • SVC C-Support Vector Classification (Lesson 10).from sklearn.svm import SVC, LinearSVC from sklearn.metrics import accuracy_score svc = LinearSVC() svc.fit(X_train, y_train) y_pred = svc.predict(X_test) accuracy_score(y_test, y_pred)
  • KNeighborsClassifier Classifier implementing the k-nearest neighbors vote (Lesson 10).from sklearn.neighbors import KNeighborsClassifier from sklearn.metrics import accuracy_score neigh = KNeighborsClassifier() neigh.fit(X_train.fillna(-1), y_train) y_pred = neigh.predict(X_test.fillna(-1)) accuracy_score(y_test, y_pred)

Step 3.e: Analyze Result

This is the main check-point of your analysis.

  • Review the Problem and Data Science problem you started with.
    • The analysis should add value to the Data Science Problem
    • Sometimes our focus drifts – we need to ensure alignment with original Problem.
    • Go back to the Exploration of the Problem – does the result add value to the Data Science Problem and the initial Problem (which formed the Data Science Problem)
    • Example: As Data Scientist we often find the research itself valuable, but a business is often interested in increasing revenue, customer satisfaction, brand value, or similar business metrics.
  • Did we learn anything?
    • Does the Data-Driven Insights add value?
    • Example: Does it add value to have evidence for: Wealthy people buy more expensive cars.
      • This might add you value to confirm this hypothesis, but does it add any value for car manufacturer?
  • Can we make any valuable insights from our analysis?
    • Do we need more/better/different data?
    • Can we give any Actionable Data Driven Insights?
    • It is always easy to want better and more accurate high quality data.
  • Do we have the right features?
    • Do we need eliminate features?
    • Is the data cleaning appropriate?
    • Is data quality as expected?
  • Do we need to try different models?
    • Data Analysis is an iterative process
    • Simpler models are more powerful
  • Can result be inconclusive?
    • Can we still give recommendations?

Quote

“It is a capital mistake to theorize before one has data. Insensibly one begins to twist facts to suit theories, instead of theories to suit facts.”

  • Sherlock Holmes

Iterative Research Process

  • Observation/Question: Starting point (could be iterative)
  • Hypothesis/Claim/Assumption: Something we believe could be true
  • Test/Data collection: We need to gether relevant data
  • Analyze/Evidence: Based on data collection did we get evidence?
    • Can our model predict? (a model is first useful when it can predict)
  • ConcludeWarning! E.g.: We can conclude a correlation (this does not mean A causes B)
    • Example: Based on the collected data we can see a correlation between A and B

Step 4: Report

  • Present findings
  • Visualize results
  • Credibility counts

Step 4.a: Present Findings

  • You need to sell or tell a story with the findings.
  • Who is your audience?
    • Focus on technical level and interest of your audience
    • Speak their language
    • Story should make sense to audience
    • Examples
      • Team manager: Might be technical, but often busy and only interested in high-level status and key findings.
      • Data engineer/science team: Technical exploration and similar interest as you
      • Business stakeholders: This might be end-customers or collaboration in other business units.
  • When presenting
    • Goal: Communicate actionable insights to key stakeholders
    • Outline (inspiration):
      • TL;DR (Too-long; Didn’t read) – clear and concise summary of the content (often one line) that frames key insights in the context of impact on key business metrics.
      • Start with your understanding of the business problem
      • How does it transform into a Data Science Problem
      • How will to measure impact – what business metrics are indicators of results
      • What data is available and used
      • Presenting hypthosis of reseach
      • A visual presentation of the insights (model/analysis/key findings)
        • This is where you present the evidence for the insights
      • How to use insight and create actions
      • Followup and continuous learning increasing value

Step 4.b: Visualize Results

  • Telling a story with the data
  • This is where you convince that the findings/insights are correct
  • The right visualization is important
    • Example: A correlation matrix might give a Data Engineer insights in how findings where discovered, but confuse business partners.

Resources for visualization

  • Seaborn Seaborn is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.
  • Plotly open-source for analytic apps in Python
  • Folium makes it easy to visualize data that’s been manipulated in Python on an interactive leaflet map.

Step 4.c: Credibility Counts

  • This is the check point if your research is valid
    • Are you hiding findings you did not like (not supporting your hypothesis)?
    • Remember it is the long-term relationship that counts
  • Don’t leave out results
    • We learn from data and find hidden patterns, to make data-driven decisions, with a long-term perspective

Step 5: Actions

  • Use insights
  • Measure impact
  • Main goal

Step 5.a: Use Insights

  • How do we follow up on the presented Insights?
  • No one-size-fits-all: It depends on the Insights and Problem
  • Examples:
    1. Problem: What customers are most likely to cancel subscription?
      • Say, we have insufficient knowledge of customers, and need to get more, hence we have given recommendations to gather more insights
      • But you should still try to add value
    2. Problem: Here is our data – find valuable insights!
      • This is a challenge as there is no given focus
      • An iterative process involving the customer can leave you with no surprises

Step 5.b: Measure Impact

  • If customer cannot measure impact of your work – they do not know what they pay for.
    • If you cannot measure it – you cannot know if hypothesis are correct.
    • A model is first valuable when it can be used to predict with some certainty
  • There should be identified metrics/indicators to evaluate in the report
  • This can evolve – we learn along the way – or we could be wrong.
  • How long before we expect to see impact on identified business metrics?
  • What if we do not see expected impact?
  • Understanding of metrics
    • The metrics we measure are indicators that our hypthesis is correct
    • Other aspects can have impact on the result – but you need to identify that

Main Goal

  • Your success of a Data Scientist is to create valuable actionable insights

A great way to think

  • Any business/organisation can be thought of as a complex system
    • Nobody understands it perfectly and it evolves organically
  • Data describes some aspect of it
  • It can be thought of as a black-box
  • Any insights you can bring is like a window that sheds light on what happens inside

Want to learn more?

Want to learn more about Data Science to become a successful Data Scientist?

This is one lesson of a 15 part Expert Data Science Blueprint course with the following resources.

  • 15 video lessons – covers the Data Science Workflow and concepts, demonstrates everything on real data, introduce projects and shows a solution (YouTube video).
  • 30 JuPyter Notebooks – with the full code and explanation from the lectures and projects (GitHub).
  • 15 projects – structured with the Data Science Workflow and a solution explained in the end of video lessons (GitHub).

How to Choose the Best Machine Learning Model

What will you learn?

This guide will help you to choose the right Machine Learning model for your project. It will also teach you that there is no best model, as all models have predictive error. This means, that you should seek a model that is good enough.

You will learn about Model Selection Techniques like Probabilistic Measures and Resampling Methods.

Step 1: Problem type

The process of selecting the a model for your Machine Learning starts with the type of Problem you work with.

There are 3 high level types of problems.

  • What kind of problem are you looking into?
    • ClassificationPredict labels on data with predefined classes
      • Supervised Machine Learning
    • ClusteringIdentify similarieties between objects and group them in clusters
      • Unsupervised Machine Learning
    • RegressionPredict continuous values
      • Supervised Machine Learning

A great guide is the Sklearn cheat sheet, which helps you to narrow down using the problem types.

Step 2: Model Selection Techniques

As said, all models have predictive errors and the goal isn’t to fit a model 100% on your training-test datasets. Your goal is to have create a simple model, which can predict future values.

This means, that you should seek a model that is good enough for the task.

But how do you do that?

You should use a model selection technique to find a good enough model.

Model Selection Techniques

  • Probabilistic Measures: Scoring by performance and complexity of model.
  • Resampling Methods: Splitting in sub-train and sub-test datasets and scoring by mean values of repeated runs.

Step 3: Example of testing a model

We will look at a dataset and run a few tests. It will not cover in-depth example of the above methods. But it will tweak it and convert a problem type into another category of type. This can actually sometimes be a good approach.

Hence, we take a Regression problem and turn it into a classification problem.

Even though the data is of a regression type of problem, maybe what you are looking for is not the specific values, and you can turn the problem into a classification problem, and get more valuable results from your model.

Let’s try it.

import pandas as pd
data = pd.read_parquet('https://github.com/LearnPythonWithRune/DataScienceWithPython/raw/main/files/house_sales.parquet')
data['SalePrice'].plot.hist(bins=20)

Now – let’s convert it into categories.

  • cut() Bin values into discrete intervals.
    • Data in bins based on data distribution.
  • qcut() Quantile-based discretization function.
    • Data in equal size bins

In this case the qcut is more appropriate, as the data is lying skewed as the diagram shows above.

from sklearn.model_selection import train_test_split
from sklearn.svm import SVC, LinearSVC
from sklearn.metrics import accuracy_score
data['Target'] = pd.qcut(data['SalePrice'], q=3, labels=[1, 2, 3])
data['Target'].value_counts()/len(data)
X = data.drop(['SalePrice', 'Target'], axis=1).fillna(-1)
y = data['Target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=42)
svc = LinearSVC()
svc.fit(X_train, y_train)
y_pred = svc.predict(X_test)
accuracy_score(y_test, y_pred)

This makes 3 target groups of equal size and runs a Linear SVC model on it. The accuracy score is around 0.73.

To see if that is good, we will need to experiment a bit.

Also, notice that the group division like we did might not be perfect, as it is done by assigning 33% in each group.

from sklearn.neighbors import KNeighborsClassifier
neigh = KNeighborsClassifier()
neigh.fit(X_train, y_train)
y_pred = neigh.predict(X_test)
accuracy_score(y_test, y_pred)

This gives 0.72.

See more experiments in the video at the top of the page.

Want to learn more?

Want to learn more about Data Science to become a successful Data Scientist?

This is one lesson of a 15 part Expert Data Science Blueprint course with the following resources.

  • 15 video lessons – covers the Data Science Workflow and concepts, demonstrates everything on real data, introduce projects and shows a solution (YouTube video).
  • 30 JuPyter Notebooks – with the full code and explanation from the lectures and projects (GitHub).
  • 15 projects – structured with the Data Science Workflow and a solution explained in the end of video lessons (GitHub).

How to make Feature Selection with pandas DataFrames

What will we cover?

You will learn what Feature Selection is and why it matters. It will be demonstrated on a dataset.

  • How Feature Selection gives you higher accuracy.
  • That Feature Selection gives simpler models.
  • It minimized risk of overfitting the models.
  • Learn the main Feature Selection Techniques.
  • That Filter Methods are independent of the model.
  • This includes removing Quasi-constant features.
  • How removing correlated features improves the model.
  • That Wrapper Methods are similar to a search problem.
  • Forward Selection works for Classification and Regression.

Step 1: What is Feature Selection?

Feature Selection can be explained as follows.

Feature Selection

  • Feature selection is about selecting attributes that have the greatest impact towards the problem you are solving.
  • Notice: It should be clear that all steps are interconnected.

Why Feature Selection?

  • Higher accuracy
  • Simpler models
  • Reducing overfitting risk

See more details on wikipedia

Step 2: Feature Selection Techniques

On a high level there are 3 types of Feature Selection Techniques.

Filter methods

  • Independent of Model
  • Based on scores of statistical
  • Easy to understand
  • Good for early feature removal
  • Low computational requirements

Examples

Wrapper methods

  • Compare different subsets of features and run the model on them
  • Basically a search problem

Examples

See more on wikipedia

Embedded methods

  • Find features that contribute most to the accuracy of the model while it is created
  • Regularization is the most common method – it penalizes higher complexity

Examples

Feature Selection Resources

Step 3: Preparation for Feature Selection

It should be noted that there are some steps before Feature Selection.

It should also be clear that feature selection should only be done on training data, as you should assume no knowledge of the testing data.

Step 4: Filter Method – Quasi-constant features

Let’s try an example by removing quasi-constant features. Those a features that are almost constant. It should be clear that features that are constant all the time do not provide any value. Features that are almost the same value all the time, also provide little value.

To do that we use the following.

Using Sklearn

  • Remove constant and quasi constant features
  • VarianceThreshold Feature selector that removes all low-variance features.
import pandas as pd
from sklearn.feature_selection import VarianceThreshold
data = pd.read_parquet('https://github.com/LearnPythonWithRune/DataScienceWithPython/raw/main/files/customer_satisfaction.parquet')
sel = VarianceThreshold(threshold=0.01)
sel.fit_transform(data)
quasi_constant = [col for col in data.columns if col not in sel.get_feature_names_out()]
len(quasi_constant)

This reveals that actually 97 of the features are more than 99% constant.

Step 5: Filter Method – Correlated features

The goal is to find and remove correlated features as they give the same value for the most part. Hence, they do not contribute much.

  • Calculate correlation matrix (assign it to corr_matrix)
  • A feature is correlated to any previous features if the following is true
    • Notice that we use correlation 0.8feature = 'imp_op_var39_comer_ult1' (corr_matrix[feature].iloc[:corr_matrix.columns.get_loc(feature)] > 0.8).any()
  • Get all the correlated features by using list comprehension
train = data[sel.get_feature_names_out()]
corr_matrix = train.corr()
corr_features = [feature for feature in corr_matrix.columns if (corr_matrix[feature].iloc[:corr_matrix.columns.get_loc(feature)] > 0.8).any()]

This will get the correlated features that are more than 0.8 correlated.

Step 6: Wrapper Method – Forward Selection

  • SequentialFeatureSelector Sequential Feature Selection for Classification and Regression.
  • First install it by running the following in a terminal pip install mlxtend
  • For preparation remove all quasi-constant features and correlated featuresX = data.drop(['TARGET'] + quasi_features + corr_features, axis=1) y = data['TARGET']
  • To demonstrate this we create a small training setX_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.75, random_state=42)
  • We will use the SVC model with the SequentialFeatureSelector.
    • For two features
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
X = data.drop(['TARGET'] + quasi_constant + corr_features, axis=1)
y = data['TARGET']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.9, random_state=42)
sfs = SFS(SVC(), k_features=2, verbose=2, cv=2, n_jobs=8)
sfs.fit(X_train, y_train)

Now that shows a few simple ways to make feature selection.

Want to learn more?

Want to learn more about Data Science to become a successful Data Scientist?

This is one lesson of a 15 part Expert Data Science Blueprint course with the following resources.

  • 15 video lessons – covers the Data Science Workflow and concepts, demonstrates everything on real data, introduce projects and shows a solution (YouTube video).
  • 30 JuPyter Notebooks – with the full code and explanation from the lectures and projects (GitHub).
  • 15 projects – structured with the Data Science Workflow and a solution explained in the end of video lessons (GitHub).

How to make Feature Scaling with pandas DataFrames

What will we cover?

In this guide you will learn what Feature Scaling is and how to do it using pandas DataFrames. This will be demonstrated on a weather dataset.

Step 1: What is Feature Scaling

  • Feature Scaling transforms values in the similar range for machine learning algorithms to behave optimal.
  • Feature Scaling can be a problems for Machine Learing algorithms on multiple features spanning in different magnitudes.
  • Feature Scaling can also make it is easier to compare results

Feature Scaling Techniques

  • Normalization is a special case of MinMaxScaler
    • Normalization: Converts values between 0-1(values - values.min())/(values.max() - values.min())
    • MinMaxScaler: Between any values
  • Standardization (StandardSclaer from sklearn)
    • Mean: 0, StdDev: 1(values - values.mean())/values.std()
    • Less sensitive to outliers

Machine Learning algorithms

  • Some algorithms are more sensitive than others
  • Distance-based algorithms are most effected by the range of features.

Step 2: Example of Feature Scaling

You will be working with a weather dataset and try to predict the weather tomorrow.

import pandas as pd
data = pd.read_csv('https://raw.githubusercontent.com/LearnPythonWithRune/DataScienceWithPython/main/files/weather.csv', index_col=0, parse_dates=True)
data.describe()

A subset of the description here.

You will first clean the data in a simple way. If you want to learn about cleaning data check this guide out.

Then we will split the data into train and test. If you want to learn about that – then check out this guide.

from sklearn.model_selection import train_test_split
import numpy as np
data_clean = data.drop(['RISK_MM'], axis=1)
data_clean = data_clean.dropna()
X = data_clean.select_dtypes(include='number')
y = data_clean['RainTomorrow']
y = np.array([0 if value == 'No' else 1 for value in y])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=42)

Then let’s make a box plot to see the problem with the data.

X_train.plot.box(figsize=(20,5), rot=90)

The problem is that the data is in the same ranges – which makes it difficult for distance based Machine Learning models.

We need to deal with that.

Step 3: Normalization

Normalization transforms data into the same range.

  • MinMaxScaler Transform features by scaling each feature to a given range.
  • MinMaxScaler().fit(X_train) is used to create a scaler.
    • Notice: We only do it on training data
from sklearn.preprocessing import MinMaxScaler
norm = MinMaxScaler().fit(X_train)
X_train_norm = norm.transform(X_train)
X_test_norm = norm.transform(X_test)
pd.DataFrame(X_train_norm, columns=X_train.columns).plot.box(figsize=(20,5), rot=90)

As we see here then all the data is put into the same range form 0 to 1. This has the challenge that you see how the outliers might dominate the picture.

If you want to learn more about box plots and statistics – then see this introduction.

Step 4: Standardization

StandardScaler Standardize features by removing the mean and scaling to unit variance.

from sklearn.preprocessing import StandardScaler
scale = StandardScaler().fit(X_train)
X_train_stand = scale.transform(X_train)
X_test_stand = scale.transform(X_test)
pd.DataFrame(X_train_stand, columns=X_train.columns).plot.box(figsize=(20,5), rot=90)

This gives that the mean value is 0 and the standard deviation is 1. This can be a great way to deal with data that has a lot of outliers – like this one.

Step 4: Testing it on a Machine Learning model

Let’s test the different approaches on a Machine Learning model.

from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

score = []
trainX = [X_train, X_train_norm, X_train_stand]
testX = [X_test, X_test_norm, X_test_stand]
for train, test in zip(trainX, testX):
    svc = SVC()
    
    svc.fit(train, y_train)
    y_pred = svc.predict(test)
    score.append(accuracy_score(y_test, y_pred))
df_svr = pd.DataFrame({'Accuracy score': score}, index=['Original', 'Normalized', 'Standardized'])
df_svr

As you can see that both approaches do better than just leaving the data as it is.

Want to learn more?

Want to learn more about Data Science to become a successful Data Scientist?

This is one lesson of a 15 part Expert Data Science Blueprint course with the following resources.

  • 15 video lessons – covers the Data Science Workflow and concepts, demonstrates everything on real data, introduce projects and shows a solution (YouTube video).
  • 30 JuPyter Notebooks – with the full code and explanation from the lectures and projects (GitHub).
  • 15 projects – structured with the Data Science Workflow and a solution explained in the end of video lessons (GitHub).

What is Classification – an Introduction to Machine Learning with pandas

What will we cover?

An introduction to what Machine Learning is and what Classification is. This will be demonstrated on examples using pandas and Sklearn.

Classification is a Machine Learning algorithm that tries to classify rows of data into categories.

Step 1: What is Machine Learning?

  • In the classical computing model every thing is programmed into the algorithms. 
    • This has the limitation that all decision logic need to be understood before usage. 
    • And if things change, we need to modify the program.
  • With the modern computing model (Machine Learning) this paradigm is changes. 
    • We feed the algorithms (models) with data.
    • Based on that data, the algorithms (models) make decisions in the program.

Machine Learning with Python – for Beginners

Machine Learning with Python is a 10+ hours FREE course – a journey from zero to mastery.

  • The course consist of the following content.
    • 15 video lessons – which explain Machine Learning concepts, demonstrate models on real data, introduce projects and show a solution.
    • 30 JuPyter Notebooks – with the full code and explanation from the lectures and projects (GitHub).
    • 15 projects – with step guides to help you structure your solutions and solution explained in the end of video lessons.

Step 2: How Machine Learning works

Machine learning is divided into two phases.

Phase 1: Learning

  • Get Data: Identify relevant data for the problem you want to solve. This data set should represent the type of data that the Machine Learn model will use to predict from in Phase 2 (predction).
  • Pre-processing: This step is about cleaning up data. While the Machine Learning is awesome, it cannot figure out what good data looks like. You need to do the cleaning as well as transforming data into a desired format.
  • Train model: This is where the magic happens, the learning step (Train model). There are three main paradigms in machine learning.
    • Supervised: where you tell the algorithm what categories each data item is in. Each data item from the training set is tagged with the right answer.
    • Unsupervised: is when the learning algorithm is not told what to do with it and it should make the structure itself.
    • Reinforcement: teaches the machine to think for itself based on past action rewards.
  • Test model: Finally, the testing is done to see if the model is good. The training data was divided into a test set and training set. The test set is used to see if the model can predict from it. If not, a new model might be necessary.

Phase 2: Prediction

Step 3: What is Supervised Learning

Supervised Learning

  • Given a dataset of input-output pairs, learn a function to map inputs to outputs
  • There are different tasks – but we start to focus on Classification

Classification

  • Supervised learning: the task of learning a function mapping an input point to a descrete category

Step 4: Example with Iris Flower Dataset

The Iris Flower dataset is one of the datasets everyone has to work with.

  • Kaggle Iris Flower Dataset
  • Consists of three classes: Iris-setosaIris-versicolor, and Iris-virginica
  • Given depedent features can we predict class
import pandas as pd
data = pd.read_csv('https://raw.githubusercontent.com/LearnPythonWithRune/DataScienceWithPython/main/files/iris.csv', index_col=0)
print(data.head())

Step 5: Create a Machine Learning Model

  • A Few Machine Learning Models

The Machine Learning is divided into a few steps – including dividing it into train and test dataset. The train dataset is used to train the model, while the test dataset is used to check the accuracy of the model.

  • Steps
    • Step 1: Assign independent features (those predicting) to X
    • Step 2: Assign classes (labels/dependent features) to y
    • Step 3: Divide into training and test setsX_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    • Step 4: Create the modelsvc = SVC()
    • Step 5: Fit the modelsvc.fit(X_train, y_train)
    • Step 6: Predict with the modely_pred = svc.predict(X_test)
    • Step 7: Test the accuracyaccuracy_score(y_test, y_pred)

Code example here.

from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
X = data.drop('Species', axis=1)
y = data['Species']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
svc = SVC()
svc.fit(X_train, y_train)
y_pred = svc.predict(X_test)
accuracy_score(y_test, y_pred)

This gives an accurate model.

You can do the same with KNeighborsClassifier.

from sklearn.neighbors import KNeighborsClassifier
kn = KNeighborsClassifier()
kn.fit(X_train, y_train)
y_pred = kn.predict(X_test)
accuracy_score(y_test, y_pred)

Step 6: Find the most important features

  • permutation_importance Permutation importance for feature evaluation.
  • Use the permutation_importance to calculate it.perm_importance = permutation_importance(svc, X_test, y_test)
  • The results will be found in perm_importance.importances_mean
from sklearn.inspection import permutation_importance
perm_importance = permutation_importance(svc, X_test, y_test)
perm_importance.importances_mean

Visualize the features by importance

  • The most important features are given by perm_importance.importances_mean.argsort()
    • HINT: assign it to sorted_idx
  • To visualize it we can create a DataFramepd.DataFrame(perm_importance.importances_mean[sorted_idx], X_test.columns[sorted_idx], columns=['Value'])
  • Then make a barh plot (use figsize)
sorted_idx = perm_importance.importances_mean.argsort()
df = pd.DataFrame(perm_importance.importances_mean[sorted_idx], X_test.columns[sorted_idx], columns=['Value'])
df.plot.barh()
color_map = {'Iris-setosa': 'b', 'Iris-versicolor': 'r', 'Iris-virginica': 'y'}
colors = data['Species'].apply(lambda x: color_map[x])
data.plot.scatter(x='PetalLengthCm', y='PetalWidthCm', c=colors)

Want to learn more?

Want to learn more about Data Science to become a successful Data Scientist?

This is one lesson of a 15 part Expert Data Science Blueprint course with the following resources.

  • 15 video lessons – covers the Data Science Workflow and concepts, demonstrates everything on real data, introduce projects and shows a solution (YouTube video).
  • 30 JuPyter Notebooks – with the full code and explanation from the lectures and projects (GitHub).
  • 15 projects – structured with the Data Science Workflow and a solution explained in the end of video lessons (GitHub).

How to Clean Data using pandas DataFrames

What will we cover?

What cleaning data is and how it relates to data quality. This guide will show you how to deal with missing data by replacing and interpolate data. How to deal with data outliers and removing duplicates.

Step 1: What is Clearning Data?

Clearning Data requires domain knowledge of the data.

Data Quality is often a measure of how good data is for further analysis or how solid conclusions we can make. Cleaning data can improve the data quality.

If we understand what is meant by Data Quality – for the data we work with, it becomes easier to clean it. The goal of cleaning is to improve the Data Quality and hence, give better results of our data analysis.

  • Improve the quality (if possible)
  • Dealing with missing data (both rows in single entries)
    • Examples include 
      • Replacing missing values/entries with mean values
      • Interpolation of values (in time series)
  • Dealing with data outliers
    • Examples include 
      • Default missing values in system: sometimes as 0-values
      • Wrong values
  • Removing duplicates
    • Common problem to have duplicate entries
  • Process requires domain knowledge

Step 2: Missing Data

A common issue of Data Quality is missing data. This can be fields that are missing and are often easy to detect. In pandas DataFrames they are often represented by NA.

  • A great source to learn about is here.
  • Two types of missing data we consider
    1. NaN data
    2. Rows in time series data

Type 1 is data with NA or NaN.

import pandas as pd
import numpy as np
df = pd.DataFrame({'a': [np.nan, 2, 3], 'b': [4, 5, np.nan]})
df

Type two is missing rows of data.

df = pd.DataFrame([i for i in range(10)], columns=['Data'], index=pd.date_range("2021-01-01", periods=10))
df = df.drop(['2021-01-03', '2021-01-05', '2021-01-06'])
df

You see we are missing obvious data here (missing date).

Step 3: Outliers

Outliers require deeper domain knowledge to spot.

But let’s take an example here.

df = pd.DataFrame({'Weight (kg)': [86, 83, 0, 76, 109, 95, 0]})
df

Here we know that you cannot weigh 0 kg, hence there must be an error in the data.

Step 4: Demonstrating how it affects the Machine Learning models

Let’s dig a bit deeper into it and see if data quality makes any difference.

  • Housing Prices Competition for Kaggle Learn Users
  • The dataset contains a training and testing dataset.
    • The goal is to predict prices on the testing dataset.
  • We will explore how dealing with missing values impacts the prediction of a linear regression model
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
data = pd.read_csv('https://raw.githubusercontent.com/LearnPythonWithRune/DataScienceWithPython/main/files/home-data/train.csv', index_col=0)
data.head()

We can remove non-numeric values in this example as follows and check for missing values afterwards.

data = data.select_dtypes(include='number')

The missing values are listed as follows.

data.info()

(output not given here).

Let’s make a helper function to calculate the r-square score of a linear regression model. This way we can see how the model will behave differently.

def regression_score(X, y):
    X_train, X_test, y_train, y_test = train_test_split(X, y)
    lin = LinearRegression()
    lin.fit(X_train, y_train)
    y_pred = lin.predict(X_test)
    return r2_score(y_pred, y_test)

Let’s try some different approaches.

Calculations

  • Try first to calcualte the r-square by using data.dropna()
    • This serves as the ussual way we have done it
  • Then with data.fillna(data.mean())
    • fillna() Fill NA/NaN values using the specified method.
  • Then with data.fillna(data.mode().iloc[0])

Just delete rows with missing data.

test_base = data.dropna()
regression_score(test_base.drop('SalePrice', axis=1), test_base[['SalePrice']])

This gives around 0.65 in score.

Then fill with the mean value.

test_base = data.fillna(data.mean())
regression_score(test_base.drop('SalePrice', axis=1), test_base[['SalePrice']])

This gives 0.74, which is a great improvement.

Try with the mode (the most common value).

test_base = data.fillna(data.mode().iloc[0])
regression_score(test_base.drop('SalePrice', axis=1), test_base[['SalePrice']])

This gives 0.75 a bit better.

Feel free to experiment more, but this should demonstrate that just removing rows with missing data is not a great idea.

Step 5: Dealing with Time Series data

If you work time series data you can often do better.

weather = pd.read_parquet('https://raw.githubusercontent.com/LearnPythonWithRune/DataScienceWithPython/main/files/weather.csv')
weather.head()

Missing time series rows

  • One way to find missing rows of data in a time series is as followsidx = pd.Series(data=pd.date_range(start=df.index.min(), end=df.index.max(), freq="H")) mask = idx.isin(df.index) idx[~mask]

This can be done as follows.

idx = pd.Series(data=pd.date_range(start=weather.index.min(), end=weather.index.max(), freq="H"))
w_idx = weather.reindex(idx)
w_idx.interpolate()[w_idx['Summary'].isna()]

This will interpolate values with this.

  • To insert missing datetimes we can use reindex()
  • To interploate values that are missing interpolate

Outliers

  • If we focus on Pressure (millibars) for `2006′
  • One way to handle 0-values is with replace().replace(0, np.nan)
  • Then we can apply interploate()
p_2006 = weather['Pressure (millibars)'].loc['2006']
p_2016.plot()

Here we see that the data is there, but it is zero.

What to do then?

Again interpolate can be used.

p_2016.replace(0, np.nan).interpolate().plot()

Step 6: Dealing with duplicates

Sometimes your data has duplicates. This is a big issue for your model.

Luckily this can be dealt with quite easy.

drop_duplicates() Return DataFrame with duplicate rows removed.

df = pd.DataFrame({'a': [1, 2, 3, 2], 'b': [11, 2, 21, 2], 'c': [21, 2, 31, 2]})
df
df.drop_duplicates()

Want to learn more?

Want to learn more about Data Science to become a successful Data Scientist?

This is one lesson of a 15 part Expert Data Science Blueprint course with the following resources.

  • 15 video lessons – covers the Data Science Workflow and concepts, demonstrates everything on real data, introduce projects and shows a solution (YouTube video).
  • 30 JuPyter Notebooks – with the full code and explanation from the lectures and projects (GitHub).
  • 15 projects – structured with the Data Science Workflow and a solution explained in the end of video lessons (GitHub).

A Smooth Introduction to Linear Regression using pandas

What will we cover?

Show what Linear Regression is visually and demonstrate it on data.

Step 1: What is Linear Regression

Simply said, you can describe Linear Regression as follows.

  • Given data input (independent variables) can we predict output (dependent variable)
  • It is the mapping from input point to a continuous value

I like to show it visually.

The goal of Linear Regression is to find the best fitting line. Hence, some data will be fitted better as it will be closer to the line.

The predictions will be on the line. That is, when you have fitted your Linear Regression model, it will predict new values to be on the line.

While this sounds simple, the model is one of the most used models and creates high value.

Step 2: Correlation and Linear Regression

Often there is a bit confusing between Linear Regression and Correlation. But they do different things.

Correlation is one number describing a relationship between tow variables. While Linear Regression is an equation used to predict values.

  • Correlation
    • Single measure of relationship between two variables.
  • Linear Regression
    • An equation used for prediction.
  • Similarities
    • Describes relationship between variables

Step 3: Example

Let’s try an example.

import pandas as pd
data = pd.read_csv('https://raw.githubusercontent.com/LearnPythonWithRune/DataScienceWithPython/main/files/weight-height.csv')
data.plot.scatter(x='Height', y='Weight', alpha=.1)

This data looks correlated. How would a Linear Regression prediction of it look like?

We can use Sklearn.

Sklearn

Linear Regression

  • The Linear Regression model takes a collection of observations
  • Each observation has featuers (or variables).
  • The features the model takes as input are called independent (often denoted with X)
  • The feature the model outputs is called dependent (often denoted with y)
from sklearn.linear_model import LinearRegression
# Creating a Linear Regression model on our data
lin = LinearRegression()
lin.fit(data[['Height']], data['Weight'])
# Creating a plot
ax = data.plot.scatter(x='Height', y='Weight', alpha=.1)
ax.plot(data['Height'], lin.predict(data[['Height']]), c='r')

To measure the accuracy of the prediction the r-squared function is often used, which you can access directly on the model by using the following code.

lin.score(data[['Height']], data['Weight'])

This will give 0.855, which is just a number you can use to compare to other samples.

Want to learn more?

Want to learn more about Data Science to become a successful Data Scientist?

This is one lesson of a 15 part Expert Data Science Blueprint course with the following resources.

  • 15 video lessons – covers the Data Science Workflow and concepts, demonstrates everything on real data, introduce projects and shows a solution (YouTube video).
  • 30 JuPyter Notebooks – with the full code and explanation from the lectures and projects (GitHub).
  • 15 projects – structured with the Data Science Workflow and a solution explained in the end of video lessons (GitHub).