How to Do Feature Scaling with pandas DataFrames

What will we cover?

In this guide you will learn what Feature Scaling is and how to do it using pandas DataFrames. This will be demonstrated on a weather dataset.

Step 1: What is Feature Scaling

  • Feature Scaling transforms values into a similar range so that Machine Learning algorithms behave optimally.
  • Features spanning different orders of magnitude can be a problem for Machine Learning algorithms.
  • Feature Scaling can also make it easier to compare results.

Feature Scaling Techniques

  • Normalization is a special case of MinMaxScaler
    • Normalization: converts values into the range 0 to 1: (values - values.min()) / (values.max() - values.min()) (see the sketch after this list)
    • MinMaxScaler: scales values into any given range
  • Standardization (StandardScaler from sklearn)
    • Mean 0, standard deviation 1: (values - values.mean()) / values.std()
    • Less sensitive to outliers
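
Both formulas can be written directly with pandas. Here is a minimal sketch on a made-up toy Series (the numbers are only for illustration):

import pandas as pd

values = pd.Series([10.0, 20.0, 25.0, 40.0, 100.0])

# Normalization (min-max): rescale into the range 0 to 1
normalized = (values - values.min()) / (values.max() - values.min())
print(normalized.round(2).tolist())  # [0.0, 0.11, 0.17, 0.33, 1.0]

# Standardization: subtract the mean, divide by the standard deviation
# (the result has mean 0 and standard deviation 1)
standardized = (values - values.mean()) / values.std()
print(standardized.round(2).tolist())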

Machine Learning algorithms

  • Some algorithms are more sensitive than others
  • Distance-based algorithms are most affected by the range of features (see the sketch below).
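
To see why, compare the Euclidean distance between two points when one feature spans a much larger range than the other; a minimal sketch with made-up numbers:

import numpy as np

# Feature 1 differs by 1, feature 2 differs by 100:
# the large-scale feature dominates the distance almost completely
a = np.array([0.5, 1000.0])
b = np.array([1.5, 1100.0])
print(np.linalg.norm(a - b))  # ~100.005, essentially only feature 2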

Step 2: Example of Feature Scaling

You will be working with a weather dataset and try to predict whether it will rain tomorrow.

import pandas as pd

# Read the weather dataset; use the first column as a DatetimeIndex
data = pd.read_csv('https://raw.githubusercontent.com/LearnPythonWithRune/DataScienceWithPython/main/files/weather.csv', index_col=0, parse_dates=True)

data.describe()

The output shows descriptive statistics for each numeric column.

You will first clean the data in a simple way. If you want to learn about cleaning data, check out this guide.

Then we will split the data into train and test. If you want to learn about that, check out this guide.

from sklearn.model_selection import train_test_split
import numpy as np

# Drop RISK_MM (the amount of rain tomorrow, which would leak the target)
# and remove rows with missing values
data_clean = data.drop(['RISK_MM'], axis=1)
data_clean = data_clean.dropna()

# Use all numeric columns as features and encode RainTomorrow as 0/1
X = data_clean.select_dtypes(include='number')
y = data_clean['RainTomorrow']
y = np.array([0 if value == 'No' else 1 for value in y])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Then let’s make a box plot to see the problem with the data.

X_train.plot.box(figsize=(20,5), rot=90)

The problem is that the features are in very different ranges, which makes it difficult for distance-based Machine Learning models.

We need to deal with that.

Step 3: Normalization

Normalization transforms data into the same range.

  • MinMaxScaler transforms features by scaling each feature to a given range.
  • MinMaxScaler().fit(X_train) is used to create a scaler.
    • Notice: We only fit it on the training data, so no information from the test data leaks into the scaler.

from sklearn.preprocessing import MinMaxScaler

norm = MinMaxScaler().fit(X_train)

X_train_norm = norm.transform(X_train)
X_test_norm = norm.transform(X_test)

pd.DataFrame(X_train_norm, columns=X_train.columns).plot.box(figsize=(20,5), rot=90)

As we see here, all the data is now in the same range from 0 to 1. The challenge is that the outliers can dominate the picture.
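
To see how a single outlier can dominate min-max scaling, here is a minimal sketch on a made-up Series with one extreme value:

import pandas as pd

# One extreme outlier squeezes all other scaled values towards 0
s = pd.Series([1.0, 2.0, 3.0, 4.0, 1000.0])
print(((s - s.min()) / (s.max() - s.min())).round(3).tolist())
# [0.0, 0.001, 0.002, 0.003, 1.0]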

If you want to learn more about box plots and statistics, see this introduction.

Step 4: Standardization

StandardScaler standardizes features by removing the mean and scaling to unit variance.

from sklearn.preprocessing import StandardScaler

scale = StandardScaler().fit(X_train)

X_train_stand = scale.transform(X_train)
X_test_stand = scale.transform(X_test)

pd.DataFrame(X_train_stand, columns=X_train.columns).plot.box(figsize=(20,5), rot=90)

This results in each feature having mean 0 and standard deviation 1. This can be a great way to deal with data that has a lot of outliers, like this one.
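
You can verify this directly; a quick sanity check, reusing X_train_stand from above:

import numpy as np

# Every standardized training column should have mean ~0 and
# standard deviation ~1 (up to floating-point noise)
print(np.allclose(X_train_stand.mean(axis=0), 0))  # True
print(np.allclose(X_train_stand.std(axis=0), 1))   # True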

Step 5: Testing it on a Machine Learning model

Let’s test the different approaches on a Machine Learning model.

from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

score = []

# Fit and evaluate the same model on the original, normalized,
# and standardized features
trainX = [X_train, X_train_norm, X_train_stand]
testX = [X_test, X_test_norm, X_test_stand]

for train, test in zip(trainX, testX):
    svc = SVC()

    svc.fit(train, y_train)
    y_pred = svc.predict(test)

    score.append(accuracy_score(y_test, y_pred))

df_svr = pd.DataFrame({'Accuracy score': score}, index=['Original', 'Normalized', 'Standardized'])
df_svr

As you can see, both scaling approaches do better than leaving the data as it is.
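
As a side note, scikit-learn's Pipeline can bundle the scaler and the model into a single estimator, which guarantees the scaler is only ever fitted on training data; a minimal sketch, reusing X_train, y_train, X_test, and y_test from above:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# fit() scales on the training data only; score() then reuses
# the same transformation on the test data
model = make_pipeline(StandardScaler(), SVC())
model.fit(X_train, y_train)
print(model.score(X_test, y_test))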

Want to learn more?

Want to learn more about Data Science to become a successful Data Scientist?

This is one lesson in a 15-part Expert Data Science Blueprint course with the following resources.

  • 15 video lessons – covering the Data Science Workflow and concepts, demonstrating everything on real data, introducing projects and showing a solution (YouTube video).
  • 30 Jupyter Notebooks – with the full code and explanation from the lectures and projects (GitHub).
  • 15 projects – structured with the Data Science Workflow and a solution explained at the end of the video lessons (GitHub).