What will we cover?
You will learn what Feature Selection is and why it matters. It will be demonstrated on a dataset.
- How Feature Selection can give you higher accuracy.
- How it leads to simpler models.
- How it minimizes the risk of overfitting your models.
- Learn the main Feature Selection Techniques.
- That Filter Methods are independent of the model.
- This includes removing Quasi-constant features.
- How removing correlated features improves the model.
- That Wrapper Methods are similar to a search problem.
- Forward Selection works for Classification and Regression.
Step 1: What is Feature Selection?
Feature Selection can be explained as follows.
- Feature selection is about selecting the attributes that have the greatest impact on the problem you are solving.
- Notice that all of these steps are interconnected.
- Higher accuracy
- Simpler models
- Reducing overfitting risk
See more details on Wikipedia
Step 2: Feature Selection Techniques
On a high level there are 3 types of Feature Selection Techniques.
- Filter Methods
  - Independent of the model
  - Based on statistical scores
  - Easy to understand
  - Good for early feature removal
  - Low computational requirements
- Wrapper Methods
  - Compare different subsets of features and run the model on them
  - Basically a search problem
  - See more on Wikipedia
- Embedded Methods
  - Find the features that contribute most to the accuracy of the model while it is created
  - Regularization is the most common method – it penalizes higher complexity
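The embedded idea can be sketched with L1 regularization: scikit-learn's Lasso drives the coefficients of unhelpful features to exactly zero while the model is trained, so the surviving features are selected as a side effect of fitting. A minimal sketch on synthetic data (the data and feature indices below are illustrative, not from the lesson's dataset):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
# Only the first two features actually drive the target
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

# The L1 penalty shrinks the coefficients of irrelevant features to exactly zero
model = Lasso(alpha=0.1)
model.fit(X, y)

selected = [i for i, c in enumerate(model.coef_) if abs(c) > 1e-6]
print(selected)  # indices of the features with non-zero coefficients
```

Increasing `alpha` makes the penalty stronger and removes more features; it is a knob you tune, not a fixed rule.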
Step 3: Preparation for Feature Selection
It should be noted that there are some steps before Feature Selection.
- Clean data (Learn about it here)
- Divide into training and test set (Learn about it here)
- Feature scaling (Learn about it here)
It should also be clear that feature selection should only be done on training data, as you should assume no knowledge of the testing data.
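To make that concrete, here is a minimal sketch of the split-first principle on synthetic data, using SelectKBest (a filter method based on statistical scores) as an example selector. The data and the choice of `k=3` are illustrative, not from the lesson:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 10))
y = (X[:, 0] > 0).astype(int)  # only feature 0 is informative

# Split first, so the selector never sees the test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

sel = SelectKBest(f_classif, k=3)
sel.fit(X_train, y_train)             # fit on training data only
X_train_sel = sel.transform(X_train)  # the same columns are applied to both sets
X_test_sel = sel.transform(X_test)
```

The key point is that `fit` only ever sees `X_train`; the test set is transformed with the columns chosen from the training data.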
Step 4: Filter Method – Quasi-constant features
Let’s try an example by removing quasi-constant features. Those are features that are almost constant. It should be clear that features that are constant all the time do not provide any value. Features that have almost the same value all the time also provide little value.
To do that we use the following.
- Remove constant and quasi-constant features with VarianceThreshold, a feature selector that removes all low-variance features.
```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

data = pd.read_parquet('https://github.com/LearnPythonWithRune/DataScienceWithPython/raw/main/files/customer_satisfaction.parquet')

sel = VarianceThreshold(threshold=0.01)
sel.fit_transform(data)

quasi_constant = [col for col in data.columns if col not in sel.get_feature_names_out()]
len(quasi_constant)
```
This reveals that actually 97 of the features are more than 99% constant.
Step 5: Filter Method – Correlated features
The goal is to find and remove correlated features, as they mostly carry the same information. Hence, they do not contribute much.
- Calculate the correlation matrix (assign it to corr_matrix)
- A feature is correlated to any previous features if the following is true
- Notice that we use a correlation threshold of 0.8
```python
feature = 'imp_op_var39_comer_ult1'
(corr_matrix[feature].iloc[:corr_matrix.columns.get_loc(feature)] > 0.8).any()
```
- Get all the correlated features by using list comprehension
```python
train = data[sel.get_feature_names_out()]
corr_matrix = train.corr()

corr_features = [feature for feature in corr_matrix.columns
                 if (corr_matrix[feature].iloc[:corr_matrix.columns.get_loc(feature)] > 0.8).any()]
```
This collects every feature with a correlation above 0.8 to an earlier feature, so one feature of each correlated pair is kept.
Step 6: Wrapper Method – Forward Selection
We will use SequentialFeatureSelector from mlxtend, which implements Sequential Feature Selection for Classification and Regression.
- First install it by running the following in a terminal
pip install mlxtend
- For preparation remove all quasi-constant features and correlated features
```python
X = data.drop(['TARGET'] + quasi_constant + corr_features, axis=1)
y = data['TARGET']
```
- To demonstrate this we create a small training set
```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.75, random_state=42)
```
- We will use the SVC model with the SequentialFeatureSelector
- We run it for two features (k_features=2)
```python
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from mlxtend.feature_selection import SequentialFeatureSelector as SFS

X = data.drop(['TARGET'] + quasi_constant + corr_features, axis=1)
y = data['TARGET']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.9, random_state=42)

sfs = SFS(SVC(), k_features=2, verbose=2, cv=2, n_jobs=8)
sfs.fit(X_train, y_train)
```
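mlxtend is not the only option: scikit-learn (since version 0.24) ships its own SequentialFeatureSelector with a similar interface, if you prefer to avoid the extra dependency. A minimal sketch on the built-in iris dataset (not the lesson's dataset):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Greedy forward selection: start with no features and repeatedly add
# the one that improves cross-validated accuracy the most
sfs = SequentialFeatureSelector(SVC(), n_features_to_select=2,
                                direction='forward', cv=2)
sfs.fit(X, y)
print(sfs.get_support())  # boolean mask of the two selected features
```

The `direction='backward'` option runs the search the other way, starting from all features and removing one at a time.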
That shows a few simple ways to do feature selection.
Want to learn more?
Want to learn more about Data Science to become a successful Data Scientist?
This is one lesson of a 15 part Expert Data Science Blueprint course with the following resources.
- 15 video lessons – cover the Data Science Workflow and concepts, demonstrate everything on real data, introduce projects and show solutions (YouTube video).
- 30 Jupyter Notebooks – with the full code and explanation from the lectures and projects (GitHub).
- 15 projects – structured with the Data Science Workflow and a solution explained at the end of the video lessons (GitHub).