How to Choose the Best Machine Learning Model

What will you learn?

This guide will help you choose the right Machine Learning model for your project. It will also show you that there is no single best model, as all models have predictive error. This means you should seek a model that is good enough.

You will learn about Model Selection Techniques like Probabilistic Measures and Resampling Methods.

Step 1: Problem type

The process of selecting a model for your Machine Learning project starts with the type of problem you work with.

There are 3 high level types of problems.

  • What kind of problem are you looking into?
    • Classification: Predict labels on data with predefined classes
      • Supervised Machine Learning
    • Clustering: Identify similarities between objects and group them in clusters
      • Unsupervised Machine Learning
    • Regression: Predict continuous values
      • Supervised Machine Learning

A great guide is the Sklearn cheat sheet, which helps you narrow down the choice of model based on the problem type.

Step 2: Model Selection Techniques

As said, all models have predictive error, and the goal isn’t to fit a model 100% to your training and test datasets. Your goal is to create a simple model that can predict future values.

This means, that you should seek a model that is good enough for the task.

But how do you do that?

You should use a model selection technique to find a good enough model.

Model Selection Techniques

  • Probabilistic Measures: Score a model by its performance and complexity.
  • Resampling Methods: Split the data into sub-train and sub-test datasets and score by the mean value of repeated runs (a small sketch follows below).
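
To give a feel for a resampling method, here is a minimal sketch of k-fold cross-validation with scikit-learn. The dataset and model are placeholders (synthetic data and a KNeighborsClassifier), not the house-price data used below.

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic placeholder data - any (X, y) pair would work here.
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

# 5-fold cross-validation: the data is repeatedly split into sub-train
# and sub-test sets, and we look at the mean score over the runs.
scores = cross_val_score(KNeighborsClassifier(), X, y, cv=5)
print(scores.mean(), scores.std())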

Step 3: Example of testing a model

We will look at a dataset and run a few tests. This will not be an in-depth example of the above methods. Instead, we will tweak the problem and convert it from one problem type into another. This can actually be a good approach sometimes.

Hence, we take a Regression problem and turn it into a classification problem.

Even though the data is naturally a regression problem, maybe you are not looking for the specific values. In that case you can turn the problem into a classification problem and get more valuable results from your model.

Let’s try it.

import pandas as pd

data = pd.read_parquet('https://github.com/LearnPythonWithRune/DataScienceWithPython/raw/main/files/house_sales.parquet')

data['SalePrice'].plot.hist(bins=20)

Now – let’s convert it into categories.

  • cut(): Bin values into discrete intervals.
    • Bins are based on the value ranges (interval edges), not on how many observations fall in each.
  • qcut(): Quantile-based discretization function.
    • Bins have (roughly) the same number of observations.

In this case qcut is more appropriate, as the data is skewed, as the histogram above shows.
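
To see the difference between the two, here is a small sketch on the SalePrice column; the exact bin edges and counts depend on the data, so treat it as illustrative.

# cut(): equal-width bins - with skewed data most rows end up in a few bins.
pd.cut(data['SalePrice'], bins=3).value_counts()

# qcut(): quantile-based bins - roughly a third of the rows in each bin.
pd.qcut(data['SalePrice'], q=3).value_counts()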

from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

data['Target'] = pd.qcut(data['SalePrice'], q=3, labels=[1, 2, 3])
data['Target'].value_counts()/len(data)

X = data.drop(['SalePrice', 'Target'], axis=1).fillna(-1)
y = data['Target']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=42)

svc = LinearSVC()
svc.fit(X_train, y_train)
y_pred = svc.predict(X_test)

accuracy_score(y_test, y_pred)

This makes 3 target groups of equal size and runs a Linear SVC model on it. The accuracy score is around 0.73.

To see if that is good, we will need to experiment a bit.

Also, notice that the group division we made might not be perfect, as it simply assigns 33% of the rows to each group.

from sklearn.neighbors import KNeighborsClassifier

neigh = KNeighborsClassifier()
neigh.fit(X_train, y_train)
y_pred = neigh.predict(X_test)

accuracy_score(y_test, y_pred)

This gives 0.72.

See more experiments in the video at the top of the page.

Want to learn more?

Want to learn more about Data Science to become a successful Data Scientist?

This is one lesson of a 15 part Expert Data Science Blueprint course with the following resources.

  • 15 video lessons – covering the Data Science Workflow and concepts, demonstrating everything on real data, introducing projects and showing a solution (YouTube video).
  • 30 Jupyter Notebooks – with the full code and explanation from the lectures and projects (GitHub).
  • 15 projects – structured with the Data Science Workflow and with a solution explained at the end of the video lessons (GitHub).

How to make Feature Selection with pandas DataFrames

What will we cover?

You will learn what Feature Selection is and why it matters. It will be demonstrated on a dataset.

  • How Feature Selection gives you higher accuracy.
  • That Feature Selection gives simpler models.
  • It minimizes the risk of overfitting the models.
  • Learn the main Feature Selection Techniques.
  • That Filter Methods are independent of the model.
  • This includes removing Quasi-constant features.
  • How removing correlated features improves the model.
  • That Wrapper Methods are similar to a search problem.
  • Forward Selection works for Classification and Regression.

Step 1: What is Feature Selection?

Feature Selection can be explained as follows.

Feature Selection

  • Feature selection is about selecting the attributes that have the greatest impact on the problem you are solving.
  • Notice: It should be clear that all steps are interconnected.

Why Feature Selection?

  • Higher accuracy
  • Simpler models
  • Reducing overfitting risk

See more details on wikipedia

Step 2: Feature Selection Techniques

On a high level there are 3 types of Feature Selection Techniques.

Filter methods

  • Independent of Model
  • Based on statistical scores
  • Easy to understand
  • Good for early feature removal
  • Low computational requirements

Examples of filter methods are shown in Step 4 and Step 5 below.

Wrapper methods

  • Compare different subsets of features and run the model on them
  • Basically a search problem

An example of a wrapper method (Forward Selection) is shown in Step 6 below.

See more on wikipedia

Embedded methods

  • Find features that contribute most to the accuracy of the model while it is created
  • Regularization is the most common method – it penalizes higher complexity

A common example is regularization with L1 (Lasso), sketched below.
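
The post does not include code for an embedded method, so here is a minimal sketch using Lasso with SelectFromModel, assuming a numeric feature DataFrame X and a target y are already defined.

from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

# Lasso penalizes model complexity and drives some coefficients to exactly zero.
# SelectFromModel keeps only the features with non-zero coefficients.
selector = SelectFromModel(Lasso(alpha=0.1))
selector.fit(X, y)  # X (numeric features) and y (target) are assumed to exist

selected_features = X.columns[selector.get_support()]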


Step 3: Preparation for Feature Selection

It should be noted that there are some steps before Feature Selection.

It should also be clear that feature selection should only be done on training data, as you should assume no knowledge of the testing data.

Step 4: Filter Method – Quasi-constant features

Let’s try an example by removing quasi-constant features. Those are features that are almost constant. It should be clear that features that are constant all the time do not provide any value, and features that have almost the same value all the time also provide little value.

To do that we use the following.

Using Sklearn

  • Remove constant and quasi-constant features
  • VarianceThreshold: Feature selector that removes all low-variance features.
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

data = pd.read_parquet('https://github.com/LearnPythonWithRune/DataScienceWithPython/raw/main/files/customer_satisfaction.parquet')

sel = VarianceThreshold(threshold=0.01)
sel.fit_transform(data)

quasi_constant = [col for col in data.columns if col not in sel.get_feature_names_out()]
len(quasi_constant)

This reveals that actually 97 of the features are more than 99% constant.

Step 5: Filter Method – Correlated features

The goal is to find and remove correlated features, as they carry mostly the same information. Hence, they do not contribute much additional value.

  • Calculate the correlation matrix (assign it to corr_matrix)
  • A feature is correlated with any previous feature if the following is true
    • Notice that we use a correlation threshold of 0.8. For a single feature this can be checked with feature = 'imp_op_var39_comer_ult1' followed by (corr_matrix[feature].iloc[:corr_matrix.columns.get_loc(feature)] > 0.8).any()
  • Get all the correlated features by using a list comprehension
train = data[sel.get_feature_names_out()]

corr_matrix = train.corr()

corr_features = [feature for feature in corr_matrix.columns if (corr_matrix[feature].iloc[:corr_matrix.columns.get_loc(feature)] > 0.8).any()]

This gives all the features that are more than 0.8 correlated with an earlier feature.

Step 6: Wrapper Method – Forward Selection

  • SequentialFeatureSelector: Sequential Feature Selection for Classification and Regression.
  • First install it by running the following in a terminal: pip install mlxtend
  • For preparation, remove all quasi-constant features and correlated features: X = data.drop(['TARGET'] + quasi_constant + corr_features, axis=1) and y = data['TARGET']
  • To demonstrate this, we create a small training set: X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.9, random_state=42)
  • We will use the SVC model with the SequentialFeatureSelector.
    • For two features
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from mlxtend.feature_selection import SequentialFeatureSelector as SFS

X = data.drop(['TARGET'] + quasi_constant + corr_features, axis=1)
y = data['TARGET']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.9, random_state=42)

sfs = SFS(SVC(), k_features=2, verbose=2, cv=2, n_jobs=8)

sfs.fit(X_train, y_train)
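
If you want to see which two features were picked, the fitted mlxtend selector exposes them (attribute names as documented by mlxtend; treat this as a quick inspection sketch).

# The names of the two selected columns and the cross-validated score.
print(sfs.k_feature_names_)
print(sfs.k_score_)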

That shows a few simple ways to do feature selection.


How to make Feature Scaling with pandas DataFrames

What will we cover?

In this guide you will learn what Feature Scaling is and how to do it using pandas DataFrames. This will be demonstrated on a weather dataset.

Step 1: What is Feature Scaling

  • Feature Scaling transforms values into a similar range so that machine learning algorithms behave optimally.
  • Without scaling, Machine Learning algorithms can struggle when features span different orders of magnitude.
  • Feature Scaling also makes it easier to compare results.

Feature Scaling Techniques

  • Normalization is a special case of MinMaxScaler
    • Normalization: Converts values to the range 0-1: (values - values.min())/(values.max() - values.min())
    • MinMaxScaler: Scales values to any given range
  • Standardization (StandardScaler from sklearn)
    • Mean: 0, StdDev: 1: (values - values.mean())/values.std()
    • Less sensitive to outliers
Both formulas are sketched below on a small pandas Series.
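
The values in the sketch are placeholders. Note that pandas std() uses the sample standard deviation (ddof=1), while sklearn's StandardScaler uses the population version (ddof=0), so the results differ slightly.

import pandas as pd

values = pd.Series([1.0, 2.0, 5.0, 10.0])  # placeholder data

# Normalization: scale to the range 0-1
normalized = (values - values.min()) / (values.max() - values.min())

# Standardization: mean 0, standard deviation 1
standardized = (values - values.mean()) / values.std()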

Machine Learning algorithms

  • Some algorithms are more sensitive than others
  • Distance-based algorithms are most affected by the range of features.

Step 2: Example of Feature Scaling

You will be working with a weather dataset and try to predict if it will rain tomorrow.

import pandas as pd

data = pd.read_csv('https://raw.githubusercontent.com/LearnPythonWithRune/DataScienceWithPython/main/files/weather.csv', index_col=0, parse_dates=True)

data.describe()

A subset of the description here.

You will first clean the data in a simple way. If you want to learn about cleaning data check this guide out.

Then we will split the data into train and test. If you want to learn about that – then check out this guide.

from sklearn.model_selection import train_test_split
import numpy as np

data_clean = data.drop(['RISK_MM'], axis=1)
data_clean = data_clean.dropna()

X = data_clean.select_dtypes(include='number')
y = data_clean['RainTomorrow']
y = np.array([0 if value == 'No' else 1 for value in y])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=42)

Then let’s make a box plot to see the problem with the data.

X_train.plot.box(figsize=(20,5), rot=90)

The problem is that the features are in very different ranges – which makes it difficult for distance-based Machine Learning models.

We need to deal with that.

Step 3: Normalization

Normalization transforms data into the same range.

  • MinMaxScaler: Transform features by scaling each feature to a given range.
  • MinMaxScaler().fit(X_train) is used to create a scaler.
    • Notice: We only fit it on the training data
from sklearn.preprocessing import MinMaxScaler

norm = MinMaxScaler().fit(X_train)

X_train_norm = norm.transform(X_train)
X_test_norm = norm.transform(X_test)

pd.DataFrame(X_train_norm, columns=X_train.columns).plot.box(figsize=(20,5), rot=90)

As we see here, all the data is now in the same range, from 0 to 1. The challenge is that the outliers might dominate the picture.

If you want to learn more about box plots and statistics – then see this introduction.

Step 4: Standardization

StandardScaler: Standardize features by removing the mean and scaling to unit variance.

from sklearn.preprocessing import StandardScaler

scale = StandardScaler().fit(X_train)

X_train_stand = scale.transform(X_train)
X_test_stand = scale.transform(X_test)

pd.DataFrame(X_train_stand, columns=X_train.columns).plot.box(figsize=(20,5), rot=90)

This gives a mean value of 0 and a standard deviation of 1. This can be a great way to deal with data that has a lot of outliers – like this one.

Step 5: Testing it on a Machine Learning model

Let’s test the different approaches on a Machine Learning model.

from sklearn.svm import SVC
from sklearn.metrics import accuracy_score


score = []

trainX = [X_train, X_train_norm, X_train_stand]
testX = [X_test, X_test_norm, X_test_stand]

for train, test in zip(trainX, testX):
    svc = SVC()
    
    svc.fit(train, y_train)
    y_pred = svc.predict(test)

    score.append(accuracy_score(y_test, y_pred))

df_svr = pd.DataFrame({'Accuracy score': score}, index=['Original', 'Normalized', 'Standardized'])
df_svr

As you can see, both approaches do better than just leaving the data as it is.


What is Classification – an Introduction to Machine Learning with pandas

What will we cover?

An introduction to what Machine Learning is and what Classification is. This will be demonstrated on examples using pandas and Sklearn.

Classification is a Machine Learning algorithm that tries to classify rows of data into categories.

Step 1: What is Machine Learning?

  • In the classical computing model everything is programmed into the algorithms.
    • This has the limitation that all decision logic needs to be understood before usage.
    • And if things change, we need to modify the program.
  • With the modern computing model (Machine Learning) this paradigm changes.
    • We feed the algorithms (models) with data.
    • Based on that data, the algorithms (models) make decisions in the program.

Machine Learning with Python – for Beginners

Machine Learning with Python is a 10+ hours FREE course – a journey from zero to mastery.

  • The course consists of the following content.
    • 15 video lessons – which explain Machine Learning concepts, demonstrate models on real data, introduce projects and show a solution.
    • 30 Jupyter Notebooks – with the full code and explanation from the lectures and projects (GitHub).
    • 15 projects – with step guides to help you structure your solutions, and a solution explained at the end of the video lessons.

Step 2: How Machine Learning works

Machine learning is divided into two phases.

Phase 1: Learning

  • Get Data: Identify relevant data for the problem you want to solve. This dataset should represent the type of data that the Machine Learning model will use to predict from in Phase 2 (prediction).
  • Pre-processing: This step is about cleaning up the data. While Machine Learning is awesome, it cannot figure out what good data looks like on its own. You need to do the cleaning as well as transform the data into a desired format.
  • Train model: This is where the magic happens, the learning step (Train model). There are three main paradigms in machine learning.
    • Supervised: where you tell the algorithm what categories each data item is in. Each data item from the training set is tagged with the right answer.
    • Unsupervised: the learning algorithm is not given labels and must find the structure in the data itself.
    • Reinforcement: teaches the machine to think for itself based on past action rewards.
  • Test model: Finally, the testing is done to see if the model is good. The training data was divided into a test set and training set. The test set is used to see if the model can predict from it. If not, a new model might be necessary.

Phase 2: Prediction

Step 3: What is Supervised Learning

Supervised Learning

  • Given a dataset of input-output pairs, learn a function to map inputs to outputs
  • There are different tasks – but we start to focus on Classification

Classification

  • Supervised learning: the task of learning a function mapping an input point to a discrete category

Step 4: Example with Iris Flower Dataset

The Iris Flower dataset is one of the datasets everyone has to work with.

  • Kaggle Iris Flower Dataset
  • Consists of three classes: Iris-setosa, Iris-versicolor, and Iris-virginica
  • Given the independent features, can we predict the class?
import pandas as pd

data = pd.read_csv('https://raw.githubusercontent.com/LearnPythonWithRune/DataScienceWithPython/main/files/iris.csv', index_col=0)

print(data.head())

Step 5: Create a Machine Learning Model

  • A Few Machine Learning Models

The Machine Learning process is divided into a few steps – including dividing the data into train and test datasets. The train dataset is used to train the model, while the test dataset is used to check the accuracy of the model.

  • Steps
    • Step 1: Assign independent features (those predicting) to X
    • Step 2: Assign classes (labels/dependent features) to y
    • Step 3: Divide into training and test sets: X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    • Step 4: Create the model: svc = SVC()
    • Step 5: Fit the model: svc.fit(X_train, y_train)
    • Step 6: Predict with the model: y_pred = svc.predict(X_test)
    • Step 7: Test the accuracy: accuracy_score(y_test, y_pred)

Code example here.

from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

X = data.drop('Species', axis=1)
y = data['Species']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

svc = SVC()
svc.fit(X_train, y_train)
y_pred = svc.predict(X_test)
accuracy_score(y_test, y_pred)

This gives an accurate model.

You can do the same with KNeighborsClassifier.

from sklearn.neighbors import KNeighborsClassifier

kn = KNeighborsClassifier()
kn.fit(X_train, y_train)
y_pred = kn.predict(X_test)
accuracy_score(y_test, y_pred)

Step 6: Find the most important features

  • permutation_importance: Permutation importance for feature evaluation.
  • Use permutation_importance to calculate it: perm_importance = permutation_importance(svc, X_test, y_test)
  • The results will be found in perm_importance.importances_mean
from sklearn.inspection import permutation_importance

perm_importance = permutation_importance(svc, X_test, y_test)
perm_importance.importances_mean

Visualize the features by importance

  • The most important features are given by perm_importance.importances_mean.argsort()
    • HINT: assign it to sorted_idx
  • To visualize it we can create a DataFrame: pd.DataFrame(perm_importance.importances_mean[sorted_idx], X_test.columns[sorted_idx], columns=['Value'])
  • Then make a barh plot (use figsize)
sorted_idx = perm_importance.importances_mean.argsort()
df = pd.DataFrame(perm_importance.importances_mean[sorted_idx], X_test.columns[sorted_idx], columns=['Value'])
df.plot.barh()

You can also plot the petal length against the petal width in a scatter plot, colored by species.

color_map = {'Iris-setosa': 'b', 'Iris-versicolor': 'r', 'Iris-virginica': 'y'}

colors = data['Species'].apply(lambda x: color_map[x])

data.plot.scatter(x='PetalLengthCm', y='PetalWidthCm', c=colors)


How to Clean Data using pandas DataFrames

What will we cover?

What cleaning data is and how it relates to data quality. This guide will show you how to deal with missing data by replacing and interpolating values, how to deal with data outliers, and how to remove duplicates.

Step 1: What is Cleaning Data?

Cleaning Data requires domain knowledge of the data.

Data Quality is often a measure of how good the data is for further analysis, or how solid the conclusions are that we can draw from it. Cleaning data can improve the data quality.

If we understand what is meant by Data Quality – for the data we work with, it becomes easier to clean it. The goal of cleaning is to improve the Data Quality and hence, give better results of our data analysis.

  • Improve the quality (if possible)
  • Dealing with missing data (both missing rows and missing single entries)
    • Examples include 
      • Replacing missing values/entries with mean values
      • Interpolation of values (in time series)
  • Dealing with data outliers
    • Examples include 
      • Default missing values in system: sometimes as 0-values
      • Wrong values
  • Removing duplicates
    • Common problem to have duplicate entries
  • Process requires domain knowledge

Step 2: Missing Data

A common issue of Data Quality is missing data. This can be fields that are missing and are often easy to detect. In pandas DataFrames they are often represented by NA.

  • A great source to learn about is here.
  • Two types of missing data we consider
    1. NaN data
    2. Rows in time series data

Type 1 is data with NA or NaN.

import pandas as pd
import numpy as np

df = pd.DataFrame({'a': [np.nan, 2, 3], 'b': [4, 5, np.nan]})
df

Type 2 is missing rows of data.

df = pd.DataFrame([i for i in range(10)], columns=['Data'], index=pd.date_range("2021-01-01", periods=10))
df = df.drop(['2021-01-03', '2021-01-05', '2021-01-06'])
df

You can see we are missing obvious data here (missing dates).

Step 3: Outliers

Outliers require deeper domain knowledge to spot.

But let’s take an example here.

df = pd.DataFrame({'Weight (kg)': [86, 83, 0, 76, 109, 95, 0]})
df

Here we know that you cannot weigh 0 kg, hence there must be an error in the data.
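
One way to handle it, sketched here, is to mark the impossible 0-values as missing and then fill them, for example with the mean of the remaining values (whether that is appropriate depends on the data).

import numpy as np

# Treat 0 kg as missing, then fill with the mean of the valid weights.
df['Weight (kg)'] = df['Weight (kg)'].replace(0, np.nan)
df['Weight (kg)'] = df['Weight (kg)'].fillna(df['Weight (kg)'].mean())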

Step 4: Demonstrating how it affects the Machine Learning models

Let’s dig a bit deeper into it and see if data quality makes any difference.

  • Housing Prices Competition for Kaggle Learn Users
  • The dataset contains a training and testing dataset.
    • The goal is to predict prices on the testing dataset.
  • We will explore how dealing with missing values impacts the prediction of a linear regression model
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

data = pd.read_csv('https://raw.githubusercontent.com/LearnPythonWithRune/DataScienceWithPython/main/files/home-data/train.csv', index_col=0)
data.head()

We can remove non-numeric values in this example as follows and check for missing values afterwards.

data = data.select_dtypes(include='number')

The missing values are listed as follows.

data.info()

(output not given here).

Let’s make a helper function to calculate the r-square score of a linear regression model. This way we can see how the model will behave differently.

def regression_score(X, y):
    X_train, X_test, y_train, y_test = train_test_split(X, y)
    lin = LinearRegression()
    lin.fit(X_train, y_train)
    y_pred = lin.predict(X_test)
    # r2_score expects the true values first, then the predictions
    return r2_score(y_test, y_pred)

Let’s try some different approaches.

Calculations

  • First try to calculate the r-squared score by using data.dropna()
    • This serves as the usual way we have done it
  • Then with data.fillna(data.mean())
    • fillna(): Fill NA/NaN values using the specified method.
  • Then with data.fillna(data.mode().iloc[0])

Just delete rows with missing data.

test_base = data.dropna()

regression_score(test_base.drop('SalePrice', axis=1), test_base[['SalePrice']])

This gives around 0.65 in score.

Then fill with the mean value.

test_base = data.fillna(data.mean())

regression_score(test_base.drop('SalePrice', axis=1), test_base[['SalePrice']])

This gives around 0.74, which is a great improvement.

Try with the mode (the most common value).

test_base = data.fillna(data.mode().iloc[0])

regression_score(test_base.drop('SalePrice', axis=1), test_base[['SalePrice']])

This gives around 0.75, a bit better.

Feel free to experiment more, but this should demonstrate that just removing rows with missing data is not a great idea.

Step 5: Dealing with Time Series data

If you work with time series data, you can often do better.

weather = pd.read_csv('https://raw.githubusercontent.com/LearnPythonWithRune/DataScienceWithPython/main/files/weather.csv', index_col=0, parse_dates=True)
weather.head()

Missing time series rows

  • One way to find missing rows of data in a time series is as follows: first build the full index with idx = pd.Series(data=pd.date_range(start=df.index.min(), end=df.index.max(), freq="H")), then mask = idx.isin(df.index) and idx[~mask] gives the missing datetimes

This can be done as follows.

idx = pd.Series(data=pd.date_range(start=weather.index.min(), end=weather.index.max(), freq="H"))

w_idx = weather.reindex(idx)

w_idx.interpolate()[w_idx['Summary'].isna()]

This will interpolate the missing values.

  • To insert missing datetimes we can use reindex()
  • To interpolate the missing values we can use interpolate()

Outliers

  • If we focus on Pressure (millibars) for 2006
  • One way to handle 0-values is with replace(): .replace(0, np.nan)
  • Then we can apply interpolate()
p_2006 = weather['Pressure (millibars)'].loc['2006']
p_2006.plot()

Here we see that the data is there, but it is zero.

What to do then?

Again interpolate can be used.

p_2006.replace(0, np.nan).interpolate().plot()

Step 6: Dealing with duplicates

Sometimes your data has duplicates. This is a big issue for your model.

Luckily this can be dealt with quite easily.

drop_duplicates(): Return DataFrame with duplicate rows removed.

df = pd.DataFrame({'a': [1, 2, 3, 2], 'b': [11, 2, 21, 2], 'c': [21, 2, 31, 2]})
df
df.drop_duplicates()


A Smooth Introduction to Linear Regression using pandas

What will we cover?

Show what Linear Regression is visually and demonstrate it on data.

Step 1: What is Linear Regression

Simply said, you can describe Linear Regression as follows.

  • Given data input (independent variables) can we predict output (dependent variable)
  • It is the mapping from input point to a continuous value

I like to show it visually.

The goal of Linear Regression is to find the best fitting line. Hence, some data points will be fitted better than others, as they lie closer to the line.

The predictions will be on the line. That is, when you have fitted your Linear Regression model, it will predict new values to be on the line.

While this sounds simple, the model is one of the most used models and creates high value.

Step 2: Correlation and Linear Regression

Often there is a bit of confusion between Linear Regression and Correlation, but they do different things.

Correlation is one number describing the relationship between two variables, while Linear Regression is an equation used to predict values.

  • Correlation
    • Single measure of relationship between two variables.
  • Linear Regression
    • An equation used for prediction.
  • Similarities
    • Describes relationship between variables

Step 3: Example

Let’s try an example.

import pandas as pd

data = pd.read_csv('https://raw.githubusercontent.com/LearnPythonWithRune/DataScienceWithPython/main/files/weight-height.csv')

data.plot.scatter(x='Height', y='Weight', alpha=.1)

This data looks correlated. How would a Linear Regression prediction of it look?

We can use Sklearn.

Sklearn

Linear Regression

  • The Linear Regression model takes a collection of observations
  • Each observation has features (or variables).
  • The features the model takes as input are called independent (often denoted with X)
  • The feature the model outputs is called dependent (often denoted with y)
from sklearn.linear_model import LinearRegression

# Creating a Linear Regression model on our data
lin = LinearRegression()
lin.fit(data[['Height']], data['Weight'])

# Creating a plot
ax = data.plot.scatter(x='Height', y='Weight', alpha=.1)
ax.plot(data['Height'], lin.predict(data[['Height']]), c='r')

To measure the accuracy of the prediction the r-squared function is often used, which you can access directly on the model by using the following code.

lin.score(data[['Height']], data['Weight'])

This will give 0.855, which is just a number you can use to compare with other models or fits.
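
For simple linear regression with a single input feature, this score equals the squared Pearson correlation between the two columns, which you can check yourself:

# The squared correlation matches lin.score(...) for one-feature linear regression.
print(data['Height'].corr(data['Weight']) ** 2)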


The Ultimate Statistical Guide for Data Science using pandas

What will we cover?

In this tutorial you will learn all the statistics you need to get started with Data Science.

  • What is statistics?
    • An analysis and interpretation of data.
    • A way to communicate findings.
  • Why do you need statistics?
    • Statistics presents information in an easy way.
    • Gives you an understanding of the data.

Step 1: Example of statistics – the most important statistics you need

Most get surprised by what the most important statistical number is.

But let’s dive into an example.

import pandas as pd

data = pd.read_csv('https://raw.githubusercontent.com/LearnPythonWithRune/DataScienceWithPython/main/files/weight-height.csv')

print(data.head())

Count

  • Count is a descriptive statistic that counts observations.
  • Count is the most used statistic and is important for evaluating findings.
    • Example: Making conclusions on childhood weights when the study only had 12 children (observations). Is that trustworthy?
    • The count says something about the quality of the study

As pointed out, count is the most important statistic in any study. If you made a study based on 3 samples, could you make any general conclusions? Say you study male height and you have 3 samples. Can you conclude what the average height is from that study?

No, you need more samples. Hence, count is the most important statistic you need.

You can get the count of samples by using groupby.

data.groupby('Gender').count()

This shows the number of samples in each group.

Step 2: Mean

Most know what the average value is. This is also called the mean value. Hence, if the mean value of height in the samples is 69, then this is the average value.

You can also get that with groupby.

data.groupby('Gender').mean()

Step 3: Standard Deviation

What the mean doesn’t tell you is the spread of the data.

Let’s try to visualize what I mean.

data[data['Gender'] == 'Male']['Height'].plot.hist(bins=20)

The data could be more spread out, meaning the samples could be further from the mean than you see in this picture. On the other hand, they could also be closer together.

What the standard deviation tells you is how data is distributed away from the mean value.

  • Standard deviation is a measure of how dispersed (spread) the data is in relation to the mean.
  • Low standard deviation means data is close to the mean.
  • High standard deviation means data is spread out.
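
As a small sketch of what the number means: the standard deviation is the square root of the average squared distance from the mean (pandas uses the sample version, dividing by n - 1).

heights = data[data['Gender'] == 'Male']['Height']

# Manual calculation of the sample standard deviation (ddof=1).
manual_std = (((heights - heights.mean()) ** 2).sum() / (len(heights) - 1)) ** 0.5

print(manual_std, heights.std())  # the two values agree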

You can get the values with your DataFrame as well.

data.groupby('Gender').std()

Step 4: Describe

The method describe in a pandas DataFrame gives you a lot of useful information.

  • Descriptive statistics include those that summarize the central tendency, dispersion and shape of a dataset’s distribution, excluding NaN values.
  • See docs
data.describe()

It gives the count, mean and standard deviation, as well as the min and max and the 25%, 50% and 75% percentiles.

This is a detailed description of the data.

Step 5: Box Plots

Understanding the describe statistics will make it easy to understand box plots.

  • Box plots are a great way to visualize descriptive statistics
  • Notice that Q1: 25%, Q2: 50%, Q3: 75%

You can get that from your DataFrame as well.

data['Weight'].plot.box(vert=False)

You can get a more convenient overview by using this box plot instead.

data.boxplot(column=['Height', 'Weight'])

And even by gender like this.

data.boxplot(column=['Height', 'Weight'], by='Gender')

Step 6: Correlation

Correlation is a great way to find out whether two variables are related.

Remember the saying: Correlation is not causation.

Correlation measures the relationship between two variables and ranges from -1 to 1.

A great way to understand the numbers is by using scatter plots.

Let’s check our data.

data.plot.scatter(x='Height', y='Weight', alpha=.1)

And try to calculate the correlation.

data.corr()

You can also group by Gender.

data.groupby('Gender').corr()

This basically covers the statistics you need to know and how you can easily do them with pandas DataFrames.


How to Combine pandas DataFrames

What will we cover?

How to combine data from different DataFrames into one DataFrame.

There are a few methods that are normally used.

  • concat([df1, df2], axis=0) – concat: Concatenate pandas objects along a particular axis (a small sketch follows below)
  • df.join(other.set_index('key'), on='key') – join: Join columns of another DataFrame.
  • df1.merge(df2, how='inner', on='a') – merge: Merge DataFrame or named Series objects with a database-style join.

Also see the pandas cheat sheet for details (pandas cheat sheet).
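
Since only merge is demonstrated below, here is a tiny sketch of concat and join on two made-up DataFrames (the column names are placeholders).

import pandas as pd

df1 = pd.DataFrame({'key': ['A', 'B'], 'value': [1, 2]})
df2 = pd.DataFrame({'key': ['B', 'C'], 'other': [3, 4]})

# concat: stack the rows of the two DataFrames (missing columns become NaN).
stacked = pd.concat([df1, df2], axis=0)

# join: combine on the index, so we set the key column as index first.
joined = df1.set_index('key').join(df2.set_index('key'))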

Step 1: Getting some demonstration data

In this tutorial we will use some data and meta data and combine them into one DataFrame. The data is from the World Bank database.

The data can be downloaded directly from the World Bank or from my GitHub. Here we just access it directly from the GitHub.

import pandas as pd

data_file = 'https://raw.githubusercontent.com/LearnPythonWithRune/DataScienceWithPython/main/files/API_SP/API_SP.POP.TOTL_DS2_en_csv_v2_3158886.csv'

data = pd.read_csv(data_file, skiprows=4)

meta_file = 'https://raw.githubusercontent.com/LearnPythonWithRune/DataScienceWithPython/main/files/API_SP/Metadata_Country_API_SP.POP.TOTL_DS2_en_csv_v2_3158886.csv'
meta = pd.read_csv(meta_file)

A snippet of the DataFrames should be similar to this.

And the meta data.

Step 2: Combining the DataFrames with Merge

One good way to combine data is by using merge, which is easy to use if both DataFrames have the same column name to combine on.

dataset = data.merge(meta, how='inner', on='Country Code')

This shows the last columns of the new DataFrame from dataset.

Step 3: Showing the new enriched data

Now we can use the new dataset with the new columns.

One thing you can do, is using the groupby.

dataset.groupby('Region').sum()['2020'].plot.bar()


Load files with pandas: CSV and Excel and Parquet files

What will we cover?

In this tutorial you will learn to load CSV, Excel and Parquet files into a pandas DataFrame.

Step 1: Load CSV file into a pandas DataFrame

A CSV file is a Comma-Separated Values file.

comma-separated values (CSV) file is a delimited text file that uses a comma to separate values. Each line of the file is a data record. Each record consists of one or more fields, separated by commas.  (wiki).

Wikipedia

A CSV file is a typical way to store tabular data in plain text. Each line will have the same number of fields.

This resembles a table in a database.

Because of their simplicity, CSV files are a common exchange format. They are easy to use.

Also, see this lesson to learn about CSV files.

To use pandas to read a CSV file, see the following example.

import pandas as pd

file = 'https://raw.githubusercontent.com/LearnPythonWithRune/DataScienceWithPython/main/files/aapl.csv'
data = pd.read_csv(file, parse_dates=True, index_col='Date')

print(data.head())

You can change file to any CSV file on your storage. Here we use one from GitHub.

A few good parameters.

  • parse_dates=True: will parse dates and keep them as datetime objects in the DataFrame. This is convenient if you want to take advantage of using the DataFrame with, for example, a DatetimeIndex.
  • index_col='Date': You can use an integer or the column name to select the index column. This sets the index.
  • sep=';': This sets a different separator than the default comma (see the sketch below).
  • Full documentation of read_csv(): read a comma-separated values (csv) file into a pandas DataFrame.
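
The sketch uses a small made-up CSV string, only to show the idea of a different separator.

from io import StringIO
import pandas as pd

csv_text = 'Date;Close\n2021-01-04;129.41\n2021-01-05;131.01'

# Semicolon-separated values instead of commas.
data = pd.read_csv(StringIO(csv_text), sep=';', parse_dates=True, index_col='Date')
print(data)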

Step 2: Load Excel files into a pandas DataFrame

Do we need to introduce what an Excel file is?

import pandas as pd

file = 'https://github.com/LearnPythonWithRune/DataScienceWithPython/blob/main/files/aapl.xlsx?raw=true'

data = pd.read_excel(file, index_col='Date')

print(data.head())

You can change file to point at any Excel file on your computer.

  • read_excel(): Read an Excel file into a pandas DataFrame.
  • index_col='Date': Works the same way as for read_csv().

Step 3: Load Parquet file into a pandas DataFrame

Many do not know Parquet files. Parquet is a free, open-source, column-oriented data format.

The advantage of a Parquet file is that it is compressed, and that you can select columns and filter rows while reading from the file (a small sketch follows further below).

import pandas as pd

file = 'https://github.com/LearnPythonWithRune/DataScienceWithPython/blob/main/files/aapl.parquet?raw=true'
data = pd.read_parquet(file)

print(data.head())

Notice that here the index is set by the Parquet file. Hence, you do not need to set it.

  • read_parquet(): Load a parquet object from the file path, returning a DataFrame.
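
For example, you can load only some of the columns instead of the whole file. The column names below are assumed to exist in the aapl.parquet file; adjust them to your data.

# Only read the listed columns from the Parquet file.
subset = pd.read_parquet(file, columns=['Close', 'Volume'])
print(subset.head())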

Great Places to Find Data

Now you are hooked and want to find a lot of great datasets to work with.


How to use SQLite Database with Pandas

What will we cover?

In this tutorial you will learn how to connect to a SQLite Database using pandas.

  • What is a SQLite database?
  • A few SQLite datasets to play with.
  • How to use sqlite3 connector with pandas.
  • A few useful SQL statements.

Step 1: What is a SQLite database?

There are different types of databases, but here we will only learn about the relational database model.

Simply said, a relational database is like a collection of DataFrames with rows of data over the same columns. Each column has a datatype, just like a DataFrame.

What makes a database relational is that there are pre-defined relationships between the tables. The data is organized in one or more tables (or relations) of columns and rows, with a unique key identifying each row.

Later you will see how these relationships can be used to combine tables.

SQLite is a software library that provides a relational database management system. It is lightweight to set up and administrate, and it requires few resources.

Therefore SQLite is a popular way to have databases on smartphones and small devices, or just for sharing data in projects.

Step 2: Get a SQLite dataset

To start working with a SQLite database you need a dataset to work with.

You can download the Dallas Officer-Involved Shootings SQLite database here: Download. We will use this dataset as our example to demonstrate it.

The database has three tables: incidents, officers, and subjects.

Other SQLite datasets

Step 3: Database connector

You need a connector to interact with the database.

We will use sqlite3, which is an interface for SQLite databases. No installation is needed, as it ships with Python, which also makes it nice to work with SQLite databases.

If you work with other databases there are other connectors.

Let’s try some code.

import sqlite3
conn = sqlite3.connect('files/dallas-ois.sqlite')

cursor = conn.cursor()
cursor.execute("SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")
print(cursor.fetchall())

Here we assume that the downloaded file from Step 2 is put in a folder files.

  • First, we create a connection to it.
  • Then make a cursor.
  • Then we execute an SQL statement.
  • Then we fetch all the results and print them out.
[('incidents',), ('officers',), ('subjects',)]

You can get a description of the table in SQLite as follows.

print(cursor.execute("PRAGMA table_info(officers)").fetchall())

Notice, we combine execute and fetchall() here.

[(0, 'case_number', 'TEXT', 0, None, 0),
 (1, 'race', 'TEXT', 0, None, 0),
 (2, 'gender', 'TEXT', 0, None, 0),
 (3, 'last_name', 'TEXT', 0, None, 0),
 (4, 'first_name', 'TEXT', 0, None, 0),
 (5, 'full_name', 'TEXT', 0, None, 0)]

Here we see the column names of the table and the types. All types are TEXT.

Step 4: A small introduction to SQL syntax.

When you use pandas with databases, often you are only interested in getting the tables into DataFrames.

This, luckily, limits the number of SQL queries you need to master.

Get all data from a table.

SELECT * FROM table_name

Sometimes you want to limit the number of rows you get, because the dataset might be huge and take time to load. Hence, while working on the model you want to create, you can limit the number of rows from the table.

Here we limit it to the first 100 rows of the table.

SELECT * FROM table_name LIMIT 100

Sometimes you are only interested in specific data, and there is no use to extract all the data from the database. Then you can filter with a WHERE clause in your SQL syntax.

SELECT * FROM table_name WHERE column_name > 1
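
Combined with pandas (introduced properly in Step 5 below), such a query could look like this. The filter value 'M' is only an assumed example, so check the actual values in your table first.

import pandas as pd

# Fetch at most 100 rows from the officers table where gender is 'M' (assumed value).
query = "SELECT * FROM officers WHERE gender = 'M' LIMIT 100"
subset = pd.read_sql(query, conn)
print(subset.head())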

Step 5: Import data into a DataFrame

Reading the data into a DataFrame can be done by using the connection we created.

import pandas as pd

officers = pd.read_sql('SELECT * FROM officers', conn)

Then you have all the data in the DataFrame officers.

print(officers.head())

Step 6: SQL join syntax to combine tables

It can be convenient to combine data directly from the Database.

This can be done by using JOIN syntax.

(INNER) JOIN: returns records that have matching values in both tables

SELECT * FROM table_1 JOIN table_2 ON table_1.column_name_1=table_2.column_name_2

LEFT JOIN: returns all records from the left table, and the matched records from the right table

SELECT * FROM table_1 LEFT JOIN table_2 ON table_1.column_name_1=table_2.column_name_2

Let’s try.

officers = pd.read_sql('SELECT * FROM officers JOIN incidents ON officers.case_number=incidents.case_number', conn)

print(officers.head())
