How to use Multiple Linear Regression to Predict House Prices

What will we cover?

  • Learn about Multiple Linear Regression
  • Understand how it differs from a discrete classifier
  • Understand that it is a Supervised learning task
  • Get insight into how similar linear regression is to a discrete classifier
  • Hands-on experience with multiple linear regression

Step 1: What is Multiple Linear Regression?

Multiple Linear Regression is a Supervised learning task: learning a mapping from an input point to a continuous value.

Wow. What does that mean?

This might not help everyone, but it is the case of Linear Regression where there are multiple explanatory variables.

Let’s start simple. Simple Linear Regression is the case most people are shown first. It takes one input variable (the explanatory variable) and produces one output value (the response value).

An example could be: if the temperature is X degrees, we expect to sell Y ice creams. That is, the model tries to predict how many ice creams we sell given a temperature.

Now, we know that factors other than the temperature can have a high impact on ice cream sales. Say, whether it is rainy or sunny, or what time of year it is, such as whether it is tourist season or not.

Hence, a simple model like that might not give a very accurate estimate.

We would therefore like a model with more input variables (explanatory variables). When there is more than one, it is called Multiple Linear Regression.
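To make the difference concrete, here is a minimal sketch of the ice cream example. All numbers are invented for illustration (the temperatures, the sunny flag, and the sales formula are assumptions, not real data). It fits one model on temperature alone and one on temperature plus a sunny/rainy indicator, using scikit-learn's LinearRegression.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Invented toy data: temperature in degrees and a sunny flag (1 = sunny, 0 = rainy)
temperature = np.array([15, 18, 20, 22, 25, 27, 30, 32], dtype=float)
sunny = np.array([0, 1, 0, 1, 1, 0, 1, 0], dtype=float)

# Invented sales that depend linearly on both inputs
ice_creams = 10 + 4 * temperature + 30 * sunny

# Simple Linear Regression: one explanatory variable
X_simple = temperature.reshape(-1, 1)
simple = LinearRegression().fit(X_simple, ice_creams)

# Multiple Linear Regression: two explanatory variables
X_multi = np.column_stack([temperature, sunny])
multiple = LinearRegression().fit(X_multi, ice_creams)

# R^2 on the same data the models were fitted on (fine for a toy comparison)
simple_r2 = simple.score(X_simple, ice_creams)
multiple_r2 = multiple.score(X_multi, ice_creams)

print("Simple R^2:  ", round(simple_r2, 3))
print("Multiple R^2:", round(multiple_r2, 3))
print("Coefficients:", multiple.coef_, "Intercept:", multiple.intercept_)
```

Because the made-up sales depend on both inputs, the model that only sees the temperature cannot explain the jump between sunny and rainy days, while the model with both variables can.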

Step 2: Get Example Data

Let’s take a look at some house price data.

import pandas as pd

data = pd.read_csv('https://raw.githubusercontent.com/LearnPythonWithRune/MachineLearningWithPython/main/files/house_prices.csv')
print(data.head())

Notice: you can also download the file from GitHub and read it locally. This makes it faster to run every time.
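One possible way to get that speed-up is to keep a local copy and only download when it is missing. The sketch below is an assumption about how you might do it: the helper name load_house_prices and the default file name house_prices.csv are hypothetical, and only the URL comes from the code above.

```python
from pathlib import Path

import pandas as pd

URL = ('https://raw.githubusercontent.com/LearnPythonWithRune/'
       'MachineLearningWithPython/main/files/house_prices.csv')


def load_house_prices(local_path='house_prices.csv', url=URL):
    """Read the CSV from disk if it exists; otherwise download it once and save a copy.

    Hypothetical helper for illustration, not part of any library.
    """
    path = Path(local_path)
    if path.exists():
        return pd.read_csv(path)
    data = pd.read_csv(url)          # first run: fetch from GitHub
    data.to_csv(path, index=False)   # keep a local copy for later runs
    return data
```

With this in place, data = load_house_prices() downloads on the first call and reads from disk afterwards.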

The output should show the first five rows of the data.

The goal is to predict the House Unit Price given a row of data. That is, given all but the last column in a row, can we predict the last column (the House Unit Price)?

Step 3: Plot the data

Just for fun – let’s make a scatter plot of all the houses with Latitude and Longitude.

import matplotlib.pyplot as plt

fig, ax = plt.subplots()

# Plot each house by its coordinates
ax.scatter(x=data['Longitude'], y=data['Latitude'])
ax.set_xlabel('Longitude')
ax.set_ylabel('Latitude')
plt.show()

This gives the following plot.

This shows you where the houses are located, which can be interesting because house prices can be dependent on location.

Intuitively, the longitude and latitude should not be linearly correlated with the house price, at least not in the bigger picture.

Step 4: Correlation of the features

Before we make the Multiple Linear Regression, let’s see how the features (the columns) correlate.

data.corr()

Which gives.

This is interesting. Look at the bottom row for the correlations with House Unit Price. It shows that Distance to MRT station is negatively correlated with the price: the farther a house is from an MRT station, the lower the price. This might not be surprising.

More surprising is that Latitude and Longitude are actually comparatively highly correlated with the House Unit Price.

This might be the case for this particular dataset.
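If the full correlation matrix is hard to read, one option is to pull out only the column for the target and sort it. The sketch below uses a small invented table: only the column names mirror the dataset, the values are made up for illustration.

```python
import pandas as pd

# Invented values; only the column names mirror the dataset above
toy = pd.DataFrame({
    'Distance to MRT station': [100, 300, 500, 900, 1500, 2500],
    'Number of convenience stores': [9, 7, 6, 4, 2, 0],
    'House unit price': [55.0, 48.0, 45.0, 38.0, 30.0, 20.0],
})

# Pearson correlation of every feature with the target, sorted ascending
target_corr = (
    toy.corr()['House unit price']
    .drop('House unit price')   # the target always correlates perfectly with itself
    .sort_values()
)
print(target_corr)
```

Keep in mind that this measures linear association only, so a feature with a low correlation can still be useful in a model.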

Step 5: Check the Quality of the dataset

For the Linear Regression model to perform well, you need to check that the data quality is good. If the input data is of poor quality (missing data, outliers, wrong values, duplicates, etc.) then the model will not be very reliable.

Here we will only check for missing values.

data.isnull().sum()

Which gives.

Transaction                     0
House age                       0
Distance to MRT station         0
Number of convenience stores    0
Latitude                        0
Longitude                       0
House unit price                0
dtype: int64

This tells us that there are no missing values.
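Missing values are only one of the quality issues listed above. As a rough sketch, duplicates and outliers could be checked along the following lines. The table is invented for illustration, and the 1.5 × IQR rule is just one common rule of thumb, not the only way to define an outlier.

```python
import pandas as pd

# Invented values for illustration; 500.0 is a deliberately planted outlier
toy = pd.DataFrame({
    'House age': [5, 12, 12, 30, 8, 20, 15],
    'House unit price': [40.0, 35.0, 35.0, 25.0, 42.0, 500.0, 33.0],
})

# Count rows that are exact copies of an earlier row
n_duplicates = toy.duplicated().sum()

# Flag prices more than 1.5 * IQR outside the quartiles
price = toy['House unit price']
q1 = price.quantile(0.25)
q3 = price.quantile(0.75)
iqr = q3 - q1
outliers = toy[(price < q1 - 1.5 * iqr) | (price > q3 + 1.5 * iqr)]

print("Duplicate rows:", n_duplicates)
print(outliers)
```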

If you want to learn more about Data Quality, then check out the free course on Data Science. In that course you will learn more about Data Quality and how it impacts the accuracy of your model.

Step 6: Create a Multiple Linear Regression Model

First we need to divide the data into input variables X (explanatory variables) and output values y (response values).

Then we split it into training and test datasets. We create the model, fit it, use it to predict on the test dataset, and compute a score.

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

X = data.iloc[:, :-1]  # all columns except the last are inputs
y = data.iloc[:, -1]   # the last column (House unit price) is the target

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, test_size=.15)

lin = LinearRegression()
lin.fit(X_train, y_train)

y_pred = lin.predict(X_test)

print(r2_score(y_test, y_pred))

For this run it gave 0.68.

Is that good or bad? Well, good question. A perfect match gives 1, but that should not be expected. The score has no lower bound, since it can be arbitrarily negative, so we are far from the worst case.
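To see where these numbers come from, here is a small sketch that computes the R² score by hand as 1 minus the residual sum of squares divided by the total sum of squares. The helper r2_by_hand and the values are made up for illustration; the result is compared with scikit-learn's r2_score.

```python
import numpy as np
from sklearn.metrics import r2_score


def r2_by_hand(y_true, y_pred):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1 - ss_res / ss_tot


y_true = np.array([10.0, 20.0, 30.0, 40.0])  # invented values

perfect = r2_by_hand(y_true, y_true)                       # every prediction exact
mean_only = r2_by_hand(y_true, np.full(4, y_true.mean()))  # always predict the mean
reversed_order = r2_by_hand(y_true, y_true[::-1])          # predictions in reverse order

print(perfect, mean_only, reversed_order)
print(r2_score(y_true, y_true[::-1]))  # should agree with reversed_order
```

A model that always predicts the mean scores 0, so a negative score means the model does worse than that trivial baseline.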

To get an idea of how good it is, we need to compare it with variations of the model.
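One way such a comparison could look is sketched below. It uses synthetic stand-in data (an assumption, not the house price dataset) and compares a mean-only baseline, a model with one feature, and a model with all features on the same train/test split. DummyRegressor is scikit-learn's built-in trivial baseline.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in data: 3 features, the target depends on the first two only
X = rng.normal(size=(300, 3))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=300)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, test_size=.15)

# Each candidate is a model plus the feature columns it is allowed to use
candidates = {
    'baseline (mean)': (DummyRegressor(strategy='mean'), [0, 1, 2]),
    'first feature only': (LinearRegression(), [0]),
    'all features': (LinearRegression(), [0, 1, 2]),
}

scores = {}
for name, (model, cols) in candidates.items():
    model.fit(X_train[:, cols], y_train)
    scores[name] = model.score(X_test[:, cols], y_test)  # R^2 on the test set
    print(f"{name}: {scores[name]:.2f}")
```

The same pattern could be applied to the house data by selecting subsets of its columns, which gives a sense of whether a score like 0.68 is mostly due to one feature or to several.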

In the free Data Science course we explore how to select features and evaluate models. It is a great idea to look into that.

Want to learn more?

This is part of a FREE 10h Machine Learning course with Python.

  • 15 video lessons – which explain Machine Learning concepts, demonstrate models on real data, introduce projects and show a solution (YouTube playlist).
  • 30 Jupyter Notebooks – with the full code and explanation from the lectures and projects (GitHub).
  • 15 projects – with step guides to help you structure your solutions, and solutions explained at the end of the video lessons (GitHub).
