# The Ultimate Statistical Guide for Data Science using pandas

## What will we cover?

In this tutorial you will learn all the statistics you need to get started with Data Science.

• What is statistics?
  • The analysis and interpretation of data.
  • A way to communicate findings.
• Why do you need statistics?
  • Statistics presents information in an accessible way.
  • It gives you an understanding of the data.

## Step 1: Example of statistics – the most important statistics you need

Most people are surprised by which statistic turns out to be the most important.

But let’s dive into an example.

```
import pandas as pd
```
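The snippets in this tutorial assume a DataFrame called `data` with the columns `Gender`, `Height` and `Weight`. The original dataset is not shown here, so the following is only a minimal sketch that builds a tiny stand-in with made-up values, so that the later snippets have something to run on.

```python
import pandas as pd

# Hypothetical stand-in data: the values are made up for illustration only.
data = pd.DataFrame({
    'Gender': ['Male', 'Male', 'Male', 'Female', 'Female', 'Female'],
    'Height': [70.0, 68.5, 72.0, 63.5, 65.0, 62.0],
    'Weight': [185.0, 170.0, 200.0, 130.0, 140.0, 125.0],
})

# Quick look at the first rows to confirm the structure.
print(data.head())
```

With only six rows this is far too small for real conclusions, which is exactly the point the next section makes about count.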

### Count

• Count is a descriptive statistic: it counts the observations.
• Count is the most used statistic and is essential for evaluating findings.
• Example: a study draws conclusions about childhood weights but only had 12 children (observations). Is that trustworthy?
• The count says something about the quality of the study.

As pointed out, count is the most important statistic in any study. If a study is based on 3 samples, can you draw any general conclusions from it? Say you measure male height and you have 3 samples. Can you conclude what the average height is from that study?

No, you need more samples. Hence, count is the most important statistic you need.

You can get the count of samples by using groupby.

```
data.groupby('Gender').count()
```

This shows the number of non-missing values per column in each group, which is the number of samples when no values are missing.
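One detail worth knowing is that `count()` skips missing values, while `size()` counts rows. A small sketch of the difference, using made-up data with one missing height (the values are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical data: one of the male heights is missing.
data = pd.DataFrame({
    'Gender': ['Male', 'Male', 'Female', 'Female', 'Female'],
    'Height': [70.0, np.nan, 63.5, 65.0, 62.0],
    'Weight': [185.0, 170.0, 130.0, 140.0, 125.0],
})

# count() gives the number of non-missing values per column and group.
counts = data.groupby('Gender').count()

# size() gives the number of rows per group, missing values included.
sizes = data.groupby('Gender').size()

print(counts)
print(sizes)
```

If the two disagree for a column, some observations are missing there, and any statistic for that column is based on fewer samples than the group has rows.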

## Step 2: Mean

Most people know what the average value is. It is also called the mean. Hence, if the mean height in the samples is 69, then 69 is the average height.

You can also get that with groupby.

```
data.groupby('Gender').mean()
```
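To make the definition concrete, here is a minimal sketch showing that the mean is just the sum divided by the count. The four heights are made-up values chosen for illustration.

```python
import pandas as pd

# Hypothetical heights, for illustration only.
heights = pd.Series([70.0, 68.0, 72.0, 66.0])

# The mean by hand: sum of the values divided by how many there are.
manual_mean = heights.sum() / heights.count()

# The same number from pandas.
pandas_mean = heights.mean()

print(manual_mean, pandas_mean)
```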

## Step 3: Standard Deviation

What the mean doesn’t tell you is the spread of the data.

Let’s try to visualize what I mean.

```
data[data['Gender'] == 'Male']['Height'].plot.hist(bins=20)
```

The data could be more spread out than in this histogram. On the other hand, the samples could also be packed more tightly together.

What the standard deviation tells you is how far the data typically lies from the mean value.

• Standard deviation is a measure of how dispersed (spread) the data is in relation to the mean.
• Low standard deviation means data is close to the mean.
• High standard deviation means data is spread out.

You can get the values with your DataFrame as well.

```
data.groupby('Gender').std()
```
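As a rough illustration, the sketch below compares two made-up samples that share the same mean but differ in spread, and then recomputes one standard deviation by hand. Note that pandas computes the sample standard deviation by default, which divides by n - 1 rather than n. All values here are hypothetical.

```python
import math
import pandas as pd

# Two hypothetical samples with the same mean but different spread.
tight = pd.Series([68.5, 69.0, 69.0, 69.5])
spread = pd.Series([60.0, 65.0, 73.0, 78.0])

print(tight.mean(), spread.mean())  # the means are equal
print(tight.std(), spread.std())    # the second value is much larger

# Sample standard deviation by hand: squared distances from the mean,
# summed, divided by n - 1, then the square root.
deviations = spread - spread.mean()
manual_std = math.sqrt((deviations ** 2).sum() / (len(spread) - 1))

print(manual_std)
```

Looking only at the mean, the two samples are indistinguishable. The standard deviation is what separates them.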

## Step 4: Describe

The method describe in a pandas DataFrame gives you a lot of useful information.

• Descriptive statistics include those that summarize the central tendency, dispersion and shape of a dataset’s distribution, excluding NaN values.
• See the pandas documentation for DataFrame.describe.

```
data.describe()
```

It gives the count, mean and standard deviation, as well as the min and max and the 25%, 50% and 75% percentiles, that is, the values below which 25%, 50% and 75% of the observations fall.

This is a detailed description of the data.
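Each row of the describe output can also be computed on its own. The following sketch checks that on a small made-up Series (the weights are hypothetical values, not from the tutorial’s dataset).

```python
import pandas as pd

# Hypothetical weights, for illustration only.
weights = pd.Series([125.0, 130.0, 140.0, 170.0, 185.0, 200.0])

summary = weights.describe()
print(summary)

# The percentile rows match quantile(), and 50% is the median.
q25 = weights.quantile(0.25)
median = weights.median()
q75 = weights.quantile(0.75)

print(q25, summary['25%'])
print(median, summary['50%'])
print(q75, summary['75%'])
```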

## Step 5: Box Plots

Understanding the describe statistics will make it easy to understand box plots.

• Box plots are a great way to visualize descriptive statistics.
• Notice that the quartiles match the percentiles from describe: Q1 is 25%, Q2 (the median) is 50% and Q3 is 75%.

You can get that from your DataFrame as well.

```
data['Weight'].plot.box(vert=False)
```

You can get a more convenient overview of several columns with this box plot instead.

```
data.boxplot(column=['Height', 'Weight'])
```

And even by gender like this.

```
data.boxplot(column=['Height', 'Weight'], by='Gender')
```
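The numbers behind a box plot can be computed directly. The sketch below assumes the common 1.5 × IQR convention for the whiskers, which is the default in matplotlib’s box plot, and uses made-up weights with one deliberately extreme value.

```python
import pandas as pd

# Hypothetical weights; the last value is deliberately extreme.
weights = pd.Series([125.0, 130.0, 135.0, 140.0, 150.0, 160.0, 170.0, 300.0])

# The box spans from Q1 to Q3; its width is the interquartile range (IQR).
q1 = weights.quantile(0.25)
q3 = weights.quantile(0.75)
iqr = q3 - q1

# Points beyond 1.5 * IQR from the box are drawn as individual outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = weights[(weights < lower_fence) | (weights > upper_fence)]

print(q1, weights.median(), q3)
print(outliers.tolist())
```

Here only the extreme value falls outside the fences, so it is the one point a box plot of this Series would draw separately.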

## Step 6: Correlation

Correlation is a great way to find out whether two variables are related.

Remember the saying: Correlation is not causation.

Correlation measures the strength of the linear relationship between two variables and ranges from -1 to 1.

A great way to understand the numbers is with scatter plots.

Let’s check our data.

```
data.plot.scatter(x='Height', y='Weight', alpha=.1)
```

And try to calculate the correlation.

```
# numeric_only=True skips the non-numeric Gender column
data.corr(numeric_only=True)
```

You can also group by Gender.

```
data.groupby('Gender').corr()
```
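To see where the correlation number comes from, here is a minimal sketch that recomputes it by hand. It assumes the pandas default, the Pearson correlation, which is the covariance of the two variables divided by the product of their standard deviations. The heights and weights are made-up values.

```python
import math
import pandas as pd

# Hypothetical data, for illustration only.
data = pd.DataFrame({
    'Height': [62.0, 63.5, 65.0, 68.5, 70.0, 72.0],
    'Weight': [125.0, 130.0, 140.0, 170.0, 185.0, 200.0],
})
height, weight = data['Height'], data['Weight']

# Pearson correlation by hand: covariance scaled by both standard deviations.
manual_r = height.cov(weight) / (height.std() * weight.std())

# The same number from pandas.
pandas_r = height.corr(weight)
print(manual_r, pandas_r)

# A perfectly linear relationship gives a correlation of 1 (up to rounding).
perfect_r = height.corr(2 * height + 1)
print(perfect_r)
```

A value close to 1 means the points in the scatter plot lie close to a rising straight line, and a value close to -1 means a falling one.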

This covers the basic statistics you need to know and how you can easily compute them with pandas DataFrames.

This is one lesson of a 15-part Expert Data Science Blueprint course with the following resources.

• 15 video lessons – cover the Data Science Workflow and concepts, demonstrate everything on real data, introduce projects and show a solution (YouTube video).
• 30 Jupyter Notebooks – with the full code and explanation from the lectures and projects (GitHub).
• 15 projects – structured with the Data Science Workflow, with a solution explained at the end of the video lessons (GitHub).
