How to do Feature Selection with pandas DataFrames

What will we cover?

You will learn what Feature Selection is and why it matters. It will be demonstrated on a dataset.

  • How Feature Selection gives you higher accuracy.
  • How Feature Selection gives simpler models.
  • How it minimizes the risk of overfitting the models.
  • The main Feature Selection Techniques.
  • That Filter Methods are independent of the model.
  • How to remove quasi-constant features.
  • How removing correlated features improves the model.
  • That Wrapper Methods are similar to a search problem.
  • That Forward Selection works for both Classification and Regression.

Step 1: What is Feature Selection?

Feature Selection can be explained as follows.

Feature Selection

  • Feature selection is about selecting the attributes that have the greatest impact on the problem you are solving.
  • Notice: It should be clear that all steps are interconnected.

Why Feature Selection?

  • Higher accuracy
  • Simpler models
  • Reducing overfitting risk

See more details on Wikipedia.

Step 2: Feature Selection Techniques

At a high level there are three types of Feature Selection Techniques.

Filter methods

  • Independent of Model
  • Based on scores of statistical tests
  • Easy to understand
  • Good for early feature removal
  • Low computational requirements


Wrapper methods

  • Compare different subsets of features and run the model on them
  • Basically a search problem
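To make the "search problem" view concrete, here is a minimal sketch of a greedy forward search. It assumes a synthetic dataset from make_classification and a LogisticRegression model, and the helper name forward_select is made up for this illustration; it is not taken from any library.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score


def forward_select(X, y, n_features, model):
    """Greedily add the feature that gives the best cross-validated score."""
    selected = []
    remaining = list(range(X.shape[1]))
    while len(selected) < n_features:
        scores = {}
        for idx in remaining:
            # Score the current subset extended with one candidate feature
            candidate = selected + [idx]
            scores[idx] = cross_val_score(model, X[:, candidate], y, cv=3).mean()
        best = max(scores, key=scores.get)
        selected.append(best)
        remaining.remove(best)
    return selected


# Synthetic data, used here only for illustration
X, y = make_classification(n_samples=200, n_features=6, n_informative=2,
                           n_redundant=0, random_state=0)
chosen = forward_select(X, y, n_features=2, model=LogisticRegression(max_iter=1000))
print(chosen)  # column indices of the two selected features
```

Each round trains the model once per remaining feature, which is why wrapper methods cost far more than filter methods.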


See more on Wikipedia.

Embedded methods

  • Find features that contribute most to the accuracy of the model while it is created
  • Regularization is the most common method – it penalizes higher complexity
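As a rough sketch of an embedded method, the example below uses L1 regularization (Lasso) together with scikit-learn's SelectFromModel. The synthetic regression data, the column names f0 to f7 and the value alpha=1.0 are all assumptions chosen for illustration.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

# Synthetic data: 8 features, of which only 3 influence the target
X_arr, y = make_regression(n_samples=300, n_features=8, n_informative=3,
                           noise=1.0, random_state=0)
X = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(8)])

# The L1 penalty shrinks the coefficients of weak features to zero
# while the model is being fitted
selector = SelectFromModel(Lasso(alpha=1.0)).fit(X, y)

kept = list(X.columns[selector.get_support()])
dropped = [col for col in X.columns if col not in kept]
print("kept:", kept)
print("dropped:", dropped)
```

The selection happens as a side effect of training: features whose coefficient ends up at (almost) zero are dropped.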


Step 3: Preparation for Feature Selection

It should be noted that there are some steps before Feature Selection.

It should also be clear that feature selection should only be done on training data, as you should assume no knowledge of the testing data.

Step 4: Filter Method – Quasi-constant features

Let’s try an example by removing quasi-constant features. Those are features that are almost constant. Features that are constant all the time do not provide any value, and features that have almost the same value all the time provide little value.

To do that we use the following.

Using Sklearn

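  • Remove constant and quasi-constant features
  • VarianceThreshold: feature selector that removes all low-variance features.
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

data = pd.read_parquet('')

# Fit the selector so it learns which features have variance above the threshold
sel = VarianceThreshold(threshold=0.01)
sel.fit(data)

# Every column the selector did not keep is constant or quasi-constant
kept = set(sel.get_feature_names_out())
quasi_constant = [col for col in data.columns if col not in kept]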

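On this dataset, this reveals that 97 of the features are quasi-constant: their variance is below 0.01, which for a binary feature means that roughly 99% of the rows share one value.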

Step 5: Filter Method – Correlated features

The goal is to find and remove correlated features. Two highly correlated features carry mostly the same information, so keeping both contributes little.

  • Calculate correlation matrix (assign it to corr_matrix)
  • A feature is correlated to any previous features if the following is true
    • Notice that we use correlation 0.8
      feature = 'imp_op_var39_comer_ult1'
      (corr_matrix[feature].iloc[:corr_matrix.columns.get_loc(feature)] > 0.8).any()
  • Get all the correlated features by using list comprehension
train = data[sel.get_feature_names_out()]

corr_matrix = train.corr()

corr_features = [feature for feature in corr_matrix.columns if (corr_matrix[feature].iloc[:corr_matrix.columns.get_loc(feature)] > 0.8).any()]

This collects every feature whose correlation with at least one earlier feature is above 0.8.
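The same list comprehension can be tried on a toy DataFrame to see which column gets flagged. The data and column names here are assumptions made up for the illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=500)
toy = pd.DataFrame({
    "a": a,
    "a_copy": a + rng.normal(scale=0.1, size=500),  # nearly identical to "a"
    "b": rng.normal(size=500),                      # independent of "a"
})

corr_matrix = toy.corr()

# A feature is flagged if it correlates above 0.8 with any earlier column
corr_features = [
    feature for feature in corr_matrix.columns
    if (corr_matrix[feature].iloc[:corr_matrix.columns.get_loc(feature)] > 0.8).any()
]
print(corr_features)

# Drop the flagged columns, keeping the first of each correlated group
reduced = toy.drop(columns=corr_features)
```

Only the later column of a correlated pair is flagged, so one representative always stays. The check uses the signed correlation; to also catch strong negative correlation, one option is to compare against corr_matrix.abs() instead.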

Step 6: Wrapper Method – Forward Selection

  • SequentialFeatureSelector: Sequential Feature Selection for Classification and Regression.
  • First install it by running the following in a terminal
      pip install mlxtend
  • For preparation remove all quasi-constant features and correlated features
      X = data.drop(['TARGET'] + quasi_constant + corr_features, axis=1)
      y = data['TARGET']
  • To demonstrate this we create a small training set
      X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.9, random_state=42)
  • We will use the SVC model with the SequentialFeatureSelector.
    • For two features
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from mlxtend.feature_selection import SequentialFeatureSelector as SFS

X = data.drop(['TARGET'] + quasi_constant + corr_features, axis=1)
y = data['TARGET']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.9, random_state=42)

# Forward selection of the two best features, scored with 2-fold cross-validation
sfs = SFS(SVC(), k_features=2, verbose=2, cv=2, n_jobs=8).fit(X_train, y_train)

That shows a few simple ways to do feature selection.

Want to learn more?

Want to learn more about Data Science to become a successful Data Scientist?

This is one lesson of a 15 part Expert Data Science Blueprint course with the following resources.

  • 15 video lessons – covers the Data Science Workflow and concepts, demonstrates everything on real data, introduces projects and shows a solution (YouTube video).
  • 30 Jupyter Notebooks – with the full code and explanation from the lectures and projects (GitHub).
  • 15 projects – structured with the Data Science Workflow and a solution explained in the end of video lessons (GitHub).