A Smooth Introduction to Linear Regression using pandas

What will we cover?

Show what Linear Regression is visually and demonstrate it on data.

Step 1: What is Linear Regression

Simply put, you can describe Linear Regression as follows.

  • Given data input (independent variables), can we predict the output (dependent variable)?
  • It is a mapping from input points to a continuous output value.

I like to show it visually.

The goal of Linear Regression is to find the best fitting line. Hence, some data points will be fitted better than others, as they lie closer to the line.

The predictions will be on the line. That is, when you have fitted your Linear Regression model, it will predict new values to be on the line.

While this sounds simple, Linear Regression is one of the most widely used models and delivers a lot of value in practice.
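
To make the idea concrete before we touch real data, here is a minimal sketch with made-up numbers: it fits the best fitting line y = a*x + b using NumPy's polyfit (an ordinary least-squares fit), and every prediction is simply a point on that line.

import numpy as np

# A few made-up observations: x is the input, y is the output
x = np.array([1, 2, 3, 4, 5])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Fit the best fitting straight line y = a*x + b (ordinary least squares)
a, b = np.polyfit(x, y, 1)
print(f"fitted line: y = {a:.2f}*x + {b:.2f}")

# A prediction for a new input is simply a point on that line
x_new = 6
print("prediction for x=6:", a * x_new + b)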

Step 2: Correlation and Linear Regression

There is often a bit of confusion between Linear Regression and Correlation, but they do different things.

Correlation is a single number describing the relationship between two variables, while Linear Regression is an equation used to predict values. The sketch after the list below makes the difference concrete.

  • Correlation
    • Single measure of relationship between two variables.
  • Linear Regression
    • An equation used for prediction.
  • Similarities
    • Describes relationship between variables
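
Here is a small sketch with made-up data that shows both sides: the correlation is a single number, while the regression is an equation y = a*x + b. For a single input variable the slope can even be derived from the correlation and the standard deviations.

import pandas as pd

# Made-up example data
df = pd.DataFrame({'x': [1, 2, 3, 4, 5],
                   'y': [2.0, 4.1, 5.9, 8.2, 9.9]})

# Correlation: one number between -1 and 1 describing the relationship
r = df['x'].corr(df['y'])
print("correlation:", r)

# Linear Regression: an equation y = a*x + b used for prediction
# (for one input variable the slope follows from the correlation)
a = r * df['y'].std() / df['x'].std()
b = df['y'].mean() - a * df['x'].mean()
print(f"regression equation: y = {a:.2f}*x + {b:.2f}")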

Step 3: Example

Let’s try an example.

import pandas as pd
data = pd.read_csv('https://raw.githubusercontent.com/LearnPythonWithRune/DataScienceWithPython/main/files/weight-height.csv')
data.plot.scatter(x='Height', y='Weight', alpha=.1)

This data looks correlated. What would a Linear Regression prediction of it look like?
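
As a quick sanity check, we can put a number on that impression with the Pearson correlation that pandas computes directly on the two columns (a small sketch using the data already loaded above).

# Quantify the visual impression: correlation between the two columns
print(data['Height'].corr(data['Weight']))

For this dataset it comes out around 0.92. For a simple regression with one input, the r-squared score we compute below is just this correlation squared.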

We can use the LinearRegression model from Sklearn (scikit-learn).

  • The Linear Regression model takes a collection of observations.
  • Each observation has features (or variables).
  • The features the model takes as input are called independent variables (often denoted X).
  • The feature the model outputs is called the dependent variable (often denoted y).

from sklearn.linear_model import LinearRegression
# Create and fit a Linear Regression model on our data
lin = LinearRegression()
lin.fit(data[['Height']], data['Weight'])
# Plot the data and draw the fitted line in red on top
ax = data.plot.scatter(x='Height', y='Weight', alpha=.1)
ax.plot(data['Height'], lin.predict(data[['Height']]), c='r')
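
The fitted model is just the equation Weight = a * Height + b, and you can read the slope and intercept directly off the model. Below is a small sketch that prints the equation and predicts the weight for a single new height; the value 70 is an arbitrary example (in whatever unit the CSV uses), and the prediction lands exactly on the red line above.

# The fitted equation: Weight = a * Height + b
a = lin.coef_[0]
b = lin.intercept_
print(f"Weight = {a:.2f} * Height + {b:.2f}")

# Predict the weight for a new, arbitrary height of 70
new_height = pd.DataFrame({'Height': [70]})
print(lin.predict(new_height))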

To measure how well the model fits the data, the r-squared score is often used. You can access it directly on the model with the following code.

lin.score(data[['Height']], data['Weight'])

This will give about 0.855. An r-squared of 1 means the line fits the data perfectly, while a value near 0 means it explains very little of the variation, so the score is mostly useful for comparing fits on the same data.
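
If you want to see where the number comes from, the same score can be computed from the predictions with sklearn's r2_score, or by hand as 1 - SS_res / SS_tot (the model's squared errors compared to a baseline that always predicts the mean weight). A small sketch:

from sklearn.metrics import r2_score

y_true = data['Weight']
y_pred = lin.predict(data[['Height']])

# Same value as lin.score(data[['Height']], data['Weight'])
print(r2_score(y_true, y_pred))

# Equivalent manual computation: 1 - SS_res / SS_tot
ss_res = ((y_true - y_pred) ** 2).sum()
ss_tot = ((y_true - y_true.mean()) ** 2).sum()
print(1 - ss_res / ss_tot)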

Want to learn more?

Want to learn more about Data Science to become a successful Data Scientist?

This is one lesson of a 15 part Expert Data Science Blueprint course with the following resources.

  • 15 video lessons – covering the Data Science Workflow and concepts, demonstrating everything on real data, introducing projects and showing a solution (YouTube video).
  • 30 Jupyter Notebooks – with the full code and explanations from the lectures and projects (GitHub).
  • 15 projects – structured with the Data Science Workflow and a solution explained at the end of the video lessons (GitHub).
