Categories: Data Science, Python

A Smooth Introduction to Linear Regression using pandas

What will we cover?

We will show visually what Linear Regression is and then demonstrate it on data.

Step 1: What is Linear Regression

Simply said, you can describe Linear Regression as follows.

  • Given input data (independent variables), can we predict the output (dependent variable)?
  • It is a mapping from an input point to a continuous value.

I like to show it visually.

The goal of Linear Regression is to find the best-fitting line. Some data points will be fitted better than others because they lie closer to the line.

The predictions lie on the line. That is, once you have fitted your Linear Regression model, every value it predicts falls on that line.
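To make "best-fitting line" concrete, here is a minimal sketch in plain Python. It assumes the usual least-squares criterion, that is, the line that minimizes the sum of squared vertical distances to the points. The data values and the helper name `fit_line` are made up for illustration.

```python
# Made-up sample data (illustration only, not the article's dataset)
heights = [1.0, 2.0, 3.0, 4.0, 5.0]
weights = [2.1, 3.9, 6.2, 7.8, 10.0]


def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and sum of squared x-deviations
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    # The least-squares line always passes through (mean_x, mean_y)
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(heights, weights)

# A prediction for a new input is simply the point on the fitted line
new_height = 6.0
predicted_weight = slope * new_height + intercept
print(slope, intercept, predicted_weight)
```

Whatever input you pass in, the prediction is `slope * x + intercept`, which is why all predictions end up on the line.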

While this sounds simple, Linear Regression is one of the most widely used models and is very valuable in practice.

Step 2: Correlation and Linear Regression

There is often some confusion between Linear Regression and Correlation, but they do different things.

Correlation is a single number describing the relationship between two variables, while Linear Regression is an equation used to predict values.

  • Correlation
    • Single measure of relationship between two variables.
  • Linear Regression
    • An equation used for prediction.
  • Similarities
    • Both describe the relationship between variables.
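
The difference can be sketched in a few lines of plain Python. The data and the helper names (`pearson_r`, `slope`) are made up for illustration. The point is that correlation gives one unit-free number that is the same in both directions, while the regression slope depends on which variable you predict and on the units used.

```python
from math import sqrt

# Made-up sample data (illustration only)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.0]


def _centered_sums(a, b):
    """Return the sums of co-deviations and squared deviations of a and b."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    sab = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    saa = sum((p - mean_a) ** 2 for p in a)
    sbb = sum((q - mean_b) ** 2 for q in b)
    return sab, saa, sbb


def pearson_r(a, b):
    """Pearson correlation: a single number between -1 and 1."""
    sab, saa, sbb = _centered_sums(a, b)
    return sab / sqrt(saa * sbb)


def slope(a, b):
    """Slope of the least-squares line that predicts b from a."""
    sab, saa, _ = _centered_sums(a, b)
    return sab / saa


r_xy = pearson_r(x, y)   # correlation of x with y
r_yx = pearson_r(y, x)   # same number with the roles swapped
slope_xy = slope(x, y)   # predicting y from x
slope_yx = slope(y, x)   # predicting x from y gives a different equation

# Changing units (multiplying y by 1000) leaves r unchanged but scales the slope
y_scaled = [value * 1000 for value in y]
r_scaled = pearson_r(x, y_scaled)
slope_scaled = slope(x, y_scaled)

print(r_xy, r_yx, slope_xy, slope_yx, r_scaled, slope_scaled)
```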

Step 3: Example

Let’s try an example.

import pandas as pd

data = pd.read_csv('')

data.plot.scatter(x='Height', y='Weight', alpha=.1)

This data looks correlated. What would a Linear Regression prediction of it look like?

We can use scikit-learn (sklearn).


Linear Regression

  • The Linear Regression model takes a collection of observations.
  • Each observation has features (or variables).
  • The features the model takes as input are called independent variables (often denoted X).
  • The feature the model outputs is called the dependent variable (often denoted y).

from sklearn.linear_model import LinearRegression

# Fit a Linear Regression model on our data.
# X must be 2-dimensional (hence the double brackets), y is a single column.
lin = LinearRegression().fit(data[['Height']], data['Weight'])

# Plot the data and draw the fitted line on top of it
ax = data.plot.scatter(x='Height', y='Weight', alpha=.1)
ax.plot(data['Height'], lin.predict(data[['Height']]), c='r')

To measure how well the model fits the data, the R-squared score is often used. You can access it directly on the model with the following code.

lin.score(data[['Height']], data['Weight'])

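This will give 0.855. R-squared is the share of the variation in Weight that the model explains from Height, so a value closer to 1 means a better fit. You can use it to compare against other models or samples.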
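
For reference, R-squared can be computed by hand as 1 minus the ratio of the residual sum of squares to the total sum of squares. The sketch below uses made-up numbers and a hypothetical helper `r_squared`. It illustrates the standard definition and does not reproduce the 0.855 result above.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    # Squared errors of the model's predictions
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    # Squared errors of always predicting the mean
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Made-up values (illustration only)
actual = [2.1, 3.9, 6.2, 7.8, 10.0]
line_predictions = [2.06, 4.03, 6.00, 7.97, 9.94]  # points on a fitted line
mean_predictions = [6.0] * len(actual)             # always predict the mean

score_line = r_squared(actual, line_predictions)
score_mean = r_squared(actual, mean_predictions)
print(score_line, score_mean)
```

A model that always predicts the mean scores 0 and a perfect fit scores 1, which is what makes the number useful for comparison.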

Want to learn more?

Want to learn more about Data Science to become a successful Data Scientist?

This is one lesson of a 15-part Expert Data Science Blueprint course with the following resources.

  • 15 video lessons – cover the Data Science Workflow and concepts, demonstrate everything on real data, introduce projects and show a solution (YouTube video).
  • 30 Jupyter Notebooks – with the full code and explanation from the lectures and projects (GitHub).
  • 15 projects – structured with the Data Science Workflow and a solution explained at the end of the video lessons (GitHub).
