How to Clean Data using pandas DataFrames

What will we cover?

What cleaning data is and how it relates to data quality. This guide will show you how to deal with missing data by replacing or interpolating values, how to deal with data outliers, and how to remove duplicates.

Step 1: What is Cleaning Data?

Cleaning data requires domain knowledge of the data.

Data Quality is a measure of how well the data supports further analysis and how solid the conclusions we draw from it can be. Cleaning data can improve the Data Quality.

If we understand what Data Quality means for the data we work with, it becomes easier to clean it. The goal of cleaning is to improve the Data Quality and, hence, get better results from our data analysis.

  • Improve the quality (if possible)
  • Dealing with missing data (both whole rows and single entries)
    • Examples include
      • Replacing missing values/entries with mean values
      • Interpolation of values (in time series)
  • Dealing with data outliers
    • Examples include
      • Default missing values in the system: sometimes represented as 0-values
      • Wrong values
  • Removing duplicates
    • Common problem to have duplicate entries
  • The process requires domain knowledge

Step 2: Missing Data

A common Data Quality issue is missing data. These can be fields that are missing, which are often easy to detect. In pandas DataFrames they are typically represented as NA or NaN values.

  • The pandas documentation on working with missing data is a great resource to learn more.
  • Two types of missing data we consider
    1. NaN data
    2. Rows in time series data

Type 1 is data with NA or NaN.

import pandas as pd
import numpy as np
df = pd.DataFrame({'a': [np.nan, 2, 3], 'b': [4, 5, np.nan]})
df
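A quick way to locate the missing entries is isna(); a minimal sketch using the DataFrame above.

# Boolean mask of the missing entries and the count of missing values per column
df.isna()
df.isna().sum()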

Type 2 is missing rows of data.

# Create a time series with a daily index and drop some dates to simulate missing rows
df = pd.DataFrame([i for i in range(10)], columns=['Data'], index=pd.date_range("2021-01-01", periods=10))
df = df.drop(pd.to_datetime(['2021-01-03', '2021-01-05', '2021-01-06']))
df

You can see that we are obviously missing data here (the dropped dates are gone from the index).
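One way to see exactly which dates are missing is to compare the index with the full date range; a minimal sketch (the daily frequency matches the example above).

# Dates that should be in the index but are not
pd.date_range("2021-01-01", periods=10, freq="D").difference(df.index)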

Step 3: Outliers

Outliers require deeper domain knowledge to spot.

But let’s take an example here.

df = pd.DataFrame({'Weight (kg)': [86, 83, 0, 76, 109, 95, 0]})
df

Here we know that you cannot weigh 0 kg, hence there must be an error in the data.
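One way to handle it, sketched here, is to treat the impossible 0-values as missing and fill them with the mean of the remaining values (whether that is appropriate depends on the domain).

# Treat 0 kg as missing and fill with the mean of the valid weights
weights = df['Weight (kg)'].replace(0, np.nan)
df['Weight (kg)'] = weights.fillna(weights.mean())
df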

Step 4: Demonstrating how it affects Machine Learning models

Let’s dig a bit deeper into it and see if data quality makes any difference.

  • Housing Prices Competition for Kaggle Learn Users
  • The dataset contains a training and testing dataset.
    • The goal is to predict prices on the testing dataset.
  • We will explore how dealing with missing values impacts the prediction of a linear regression model

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
data = pd.read_csv('https://raw.githubusercontent.com/LearnPythonWithRune/DataScienceWithPython/main/files/home-data/train.csv', index_col=0)
data.head()

For this example we can remove the non-numeric columns as follows and check for missing values afterwards.

data = data.select_dtypes(include='number')

The missing values can be spotted from the non-null counts listed by info().

data.info()

(output not given here).
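If you only want to see the columns that actually have missing values, a small sketch like this works as well.

# Count missing values per column and keep only the columns with at least one
missing = data.isna().sum()
missing[missing > 0]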

Let’s make a helper function to calculate the r-squared score of a linear regression model. This way we can see how the model behaves with the different approaches.

def regression_score(X, y):
    # Split the data, fit a linear regression, and return the r-squared score on the test set
    X_train, X_test, y_train, y_test = train_test_split(X, y)
    lin = LinearRegression()
    lin.fit(X_train, y_train)
    y_pred = lin.predict(X_test)
    return r2_score(y_test, y_pred)

Let’s try some different approaches.

Calculations

  • First, try to calculate the r-squared score by using data.dropna()
    • This serves as the usual way we have done it
  • Then with data.fillna(data.mean())
    • fillna() fills NA/NaN values using the specified method.
  • Then with data.fillna(data.mode().iloc[0])

Just delete rows with missing data.

test_base = data.dropna()
regression_score(test_base.drop('SalePrice', axis=1), test_base[['SalePrice']])

This gives a score of around 0.65.

Then fill with the mean value.

test_base = data.fillna(data.mean())
regression_score(test_base.drop('SalePrice', axis=1), test_base[['SalePrice']])

This gives around 0.74, which is a great improvement.

Try with the mode (the most common value).

test_base = data.fillna(data.mode().iloc[0])
regression_score(test_base.drop('SalePrice', axis=1), test_base[['SalePrice']])

This gives around 0.75, which is a bit better.

Feel free to experiment more, but this should demonstrate that just removing rows with missing data is not a great idea.
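As a sketch, the three approaches can also be compared in one loop (the dictionary below is just for illustration; scores will vary between runs because train_test_split shuffles the data).

# Compare the three strategies side by side
strategies = {
    'dropna': data.dropna(),
    'mean': data.fillna(data.mean()),
    'mode': data.fillna(data.mode().iloc[0]),
}
for name, frame in strategies.items():
    score = regression_score(frame.drop('SalePrice', axis=1), frame[['SalePrice']])
    print(name, round(score, 3))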

Step 5: Dealing with Time Series data

If you work with time series data, you can often do better.

# The URL points to a CSV file, so read it with read_csv
# Assumption: the first column holds the timestamps and should be parsed as the DatetimeIndex
weather = pd.read_csv('https://raw.githubusercontent.com/LearnPythonWithRune/DataScienceWithPython/main/files/weather.csv', index_col=0, parse_dates=True)
weather.head()

Missing time series rows

  • One way to find missing rows of data in a time series is as follows.

idx = pd.Series(data=pd.date_range(start=df.index.min(), end=df.index.max(), freq="H"))
mask = idx.isin(df.index)
idx[~mask]

For the weather data, we can reindex to the full hourly range and interpolate the missing rows.

# Build the full hourly range, reindex the weather data to it, and inspect the interpolated rows
idx = pd.Series(data=pd.date_range(start=weather.index.min(), end=weather.index.max(), freq="H"))
w_idx = weather.reindex(idx)
w_idx.interpolate()[w_idx['Summary'].isna()]

This shows the interpolated values for the rows that were missing.

  • To insert missing datetimes we can use reindex()
  • To interpolate missing values we can use interpolate() (see the sketch below)
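By default interpolate() interpolates linearly between the surrounding values. On a DatetimeIndex you can also interpolate with respect to time, which matters when the gaps are unevenly spaced; a minimal sketch, assuming the index has been parsed as datetimes and applied only to the numeric columns.

# Time-aware interpolation of the numeric columns after reindexing
w_idx.select_dtypes(include='number').interpolate(method='time').head()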

Outliers

  • If we focus on Pressure (millibars) for 2006
  • One way to handle 0-values is with replace(0, np.nan)
  • Then we can apply interpolate()

p_2006 = weather['Pressure (millibars)'].loc['2006']
p_2006.plot()

Here we see that the data is there, but it is zero.

What to do then?

Again interpolate can be used.

p_2006.replace(0, np.nan).interpolate().plot()
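If you want to keep the cleaned values rather than just plot them, you can assign the result back to the DataFrame; a small sketch that cleans the whole column.

# Replace 0-values with NaN, interpolate, and store the cleaned pressure column
weather['Pressure (millibars)'] = weather['Pressure (millibars)'].replace(0, np.nan).interpolate()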

Step 6: Dealing with duplicates

Sometimes your data has duplicate rows. This is a big issue for your model, since the duplicated observations get extra weight.

Luckily, this can be dealt with quite easily.

drop_duplicates() returns a DataFrame with duplicate rows removed.

df = pd.DataFrame({'a': [1, 2, 3, 2], 'b': [11, 2, 21, 2], 'c': [21, 2, 31, 2]})
df
df.drop_duplicates()
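drop_duplicates() also accepts a subset of columns to compare and a keep argument ('first', 'last', or False to drop all duplicates); a small sketch.

# Only consider column 'a' when looking for duplicates and keep the last occurrence
df.drop_duplicates(subset=['a'], keep='last')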

Want to learn more?

Want to learn more about Data Science to become a successful Data Scientist?

This is one lesson of a 15 part Expert Data Science Blueprint course with the following resources.

  • 15 video lessons – covering the Data Science Workflow and concepts, demonstrating everything on real data, introducing projects and showing a solution (YouTube videos).
  • 30 Jupyter Notebooks – with the full code and explanation from the lectures and projects (GitHub).
  • 15 projects – structured with the Data Science Workflow and a solution explained at the end of the video lessons (GitHub).
