What will we cover?
In this tutorial you will learn all the statistics you need to get started with Data Science.
- What is statistics?
  - An analysis and interpretation of data.
  - A way to communicate findings.
- Why do you need statistics?
  - Statistics presents information in an easy way.
  - Gives you an understanding of the data.
Step 1: Example of statistics – the most important statistic you need
Most people are surprised by which statistic is the most important.
But let’s dive into an example.
import pandas as pd
data = pd.read_csv('https://raw.githubusercontent.com/LearnPythonWithRune/DataScienceWithPython/main/files/weight-height.csv')
print(data.head())
- Count is a descriptive statistic that counts the observations.
- Count is the most used statistic and is highly important when evaluating findings.
- Example: Drawing conclusions about childhood weights from a study with only 12 children (observations). Is that trustworthy?
- The count says something about the quality of the study.
As pointed out, count is the most important statistic in any study. If you ran a study based on 3 samples, could you draw any general conclusions? Say you measure male height and you have 3 samples. Can you conclude what the average height is from that study?
No, you need more samples. Hence, count is the most important statistic you need.
You can get the count of samples by using groupby.
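A minimal sketch of counting by group. To keep it self-contained, it uses a tiny hand-made DataFrame with the same column names as the dataset (`Gender`, `Height`, `Weight`); the values are made up for illustration and do not come from the CSV file. Grouping by `Gender` is an assumption based on the columns used elsewhere in this tutorial.

```python
import pandas as pd

# Illustrative stand-in for the real dataset (values are made up).
data = pd.DataFrame({
    "Gender": ["Male", "Male", "Male", "Female", "Female"],
    "Height": [70.0, 68.0, 72.0, 64.0, 62.0],
    "Weight": [180.0, 170.0, 190.0, 130.0, 120.0],
})

# Number of non-missing observations per column, for each gender.
counts = data.groupby("Gender").count()
print(counts)
```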
This shows the number of samples in each group.
Step 2: Mean
Most people know what the average value is. This is also called the mean value. Hence, if the mean value of height in the samples is 69, then this is the average value.
You can also get that with groupby.
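As a sketch, the mean per group could be computed like this. The small DataFrame is a made-up stand-in with the same column names as the dataset, so the numbers are only illustrative.

```python
import pandas as pd

# Illustrative stand-in for the real dataset (values are made up).
data = pd.DataFrame({
    "Gender": ["Male", "Male", "Male", "Female", "Female"],
    "Height": [70.0, 68.0, 72.0, 64.0, 62.0],
    "Weight": [180.0, 170.0, 190.0, 130.0, 120.0],
})

# Average Height and Weight for each gender.
means = data.groupby("Gender").mean()
print(means)
```

With these made-up values, the mean male height is 70.0 and the mean female height is 63.0.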
Step 3: Standard Deviation
What the mean doesn’t tell you is the spread of the data.
Let’s try to visualize what I mean.
data[data['Gender'] == 'Male']['Height'].plot.hist(bins=20)
The data could be more spread out than you see in this picture. On the other hand, it could also be more tightly clustered.
What the standard deviation tells you is how far the data typically lies from the mean value.
- Standard deviation is a measure of how dispersed (spread) the data is in relation to the mean.
- Low standard deviation means data is close to the mean.
- High standard deviation means data is spread out.
You can get the values with your DataFrame as well.
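A hedged sketch of getting the standard deviation per group, again on a made-up stand-in DataFrame rather than the real CSV. Note that pandas computes the sample standard deviation by default (it divides by n - 1, i.e. `ddof=1`).

```python
import pandas as pd

# Illustrative stand-in for the real dataset (values are made up).
data = pd.DataFrame({
    "Gender": ["Male", "Male", "Male", "Female", "Female"],
    "Height": [70.0, 68.0, 72.0, 64.0, 62.0],
    "Weight": [180.0, 170.0, 190.0, 130.0, 120.0],
})

# Sample standard deviation (ddof=1) of Height and Weight per gender.
stds = data.groupby("Gender").std()
print(stds)

# The male heights 68, 70, 72 lie close to their mean of 70,
# so their standard deviation is small compared to the mean.
male_height_std = stds.loc["Male", "Height"]
print(male_height_std)
```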
Step 4: Describe
The method describe in a pandas DataFrame gives you a lot of useful information.
- Descriptive statistics include those that summarize the central tendency, dispersion and shape of a dataset’s distribution, excluding NaN values.
- See the pandas documentation for DataFrame.describe.
It gives the count, mean, and standard deviation, as well as the min and max, with the 25%, 50% and 75% percentiles in between.
This is a detailed description of the data.
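A minimal sketch of calling `describe`, using a made-up stand-in DataFrame with the dataset's column names. By default it summarizes only the numeric columns, so `Gender` is left out of the result.

```python
import pandas as pd

# Illustrative stand-in for the real dataset (values are made up).
data = pd.DataFrame({
    "Gender": ["Male", "Male", "Male", "Female", "Female"],
    "Height": [70.0, 68.0, 72.0, 64.0, 62.0],
    "Weight": [180.0, 170.0, 190.0, 130.0, 120.0],
})

# One row per statistic: count, mean, std, min, 25%, 50%, 75%, max.
summary = data.describe()
print(summary)

# Individual values can be looked up by statistic and column.
print(summary.loc["50%", "Height"])
```

Combining it with groupby, for example `data.groupby("Gender").describe()`, gives the same summary for each group.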
Step 5: Box Plots
Understanding the describe statistics will make it easy to understand box plots.
- Box plots are a great way to visualize descriptive statistics.
- Notice that Q1 is the 25% mark, Q2 is the 50% mark (the median), and Q3 is the 75% mark.
You can get that from your DataFrame as well.
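One way to get the quartile values themselves, sketched on a made-up stand-in DataFrame: the `quantile` method returns Q1, Q2 and Q3, which are the lines a box plot draws for the box and the median.

```python
import pandas as pd

# Illustrative stand-in for the real dataset (values are made up).
data = pd.DataFrame({
    "Gender": ["Male", "Male", "Male", "Female", "Female"],
    "Height": [70.0, 68.0, 72.0, 64.0, 62.0],
    "Weight": [180.0, 170.0, 190.0, 130.0, 120.0],
})

# Q1 (25%), Q2 (50%, the median) and Q3 (75%) of Height.
quartiles = data["Height"].quantile([0.25, 0.5, 0.75])
print(quartiles)

# The interquartile range (Q3 - Q1) is the height of the box in a box plot.
iqr = quartiles.loc[0.75] - quartiles.loc[0.25]
print(iqr)
```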
A box plot of several columns side by side is often handier, and you can even split it by gender like this.
data.boxplot(column=['Height', 'Weight'], by='Gender')
Step 6: Correlation
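Correlation is a great way to find out whether two variables are related.
Remember the saying: Correlation is not causation.
It measures the relationship between two variables and ranges from -1 to 1. A value near 1 means the two rise together, a value near -1 means one falls as the other rises, and a value near 0 means there is no linear relationship.
A great way to understand the numbers is with scatter plots.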
Let’s check our data.
data.plot.scatter(x='Height', y='Weight', alpha=.1)
And try to calculate the correlation.
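A hedged sketch of the calculation, on a made-up stand-in DataFrame with the dataset's column names. Selecting the numeric columns first is a choice made here, because recent pandas versions raise an error when `corr` meets a text column such as `Gender`. The default method is the Pearson correlation.

```python
import pandas as pd

# Illustrative stand-in for the real dataset (values are made up).
data = pd.DataFrame({
    "Gender": ["Male", "Male", "Male", "Male", "Female", "Female", "Female"],
    "Height": [68.0, 70.0, 72.0, 74.0, 62.0, 64.0, 66.0],
    "Weight": [170.0, 185.0, 180.0, 200.0, 120.0, 135.0, 130.0],
})

# Pairwise Pearson correlation between the numeric columns.
corr = data[["Height", "Weight"]].corr()
print(corr)

# The single number of interest: correlation between Height and Weight.
r = corr.loc["Height", "Weight"]
print(r)
```

The diagonal is always 1, since every column is perfectly correlated with itself.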
You can also group by Gender.
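As a sketch on the same kind of made-up stand-in data, the correlation within each gender could be computed like this. The result has one small correlation table per group.

```python
import pandas as pd

# Illustrative stand-in for the real dataset (values are made up).
data = pd.DataFrame({
    "Gender": ["Male", "Male", "Male", "Male", "Female", "Female", "Female"],
    "Height": [68.0, 70.0, 72.0, 74.0, 62.0, 64.0, 66.0],
    "Weight": [170.0, 185.0, 180.0, 200.0, 120.0, 135.0, 130.0],
})

# Correlation between Height and Weight, computed separately per gender.
by_gender = data.groupby("Gender")[["Height", "Weight"]].corr()
print(by_gender)

# Rows are indexed by (Gender, column name).
r_male = by_gender.loc[("Male", "Height"), "Weight"]
r_female = by_gender.loc[("Female", "Height"), "Weight"]
print(r_male, r_female)
```

The correlation within a group can differ from the correlation over the whole dataset, so it is worth checking both.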
This basically covers the statistics you need to know and how you can easily do them with pandas DataFrames.
Want to learn more?
Want to learn more about Data Science to become a successful Data Scientist?
This is one lesson of a 15-part Expert Data Science Blueprint course with the following resources.
- 15 video lessons – cover the Data Science Workflow and concepts, demonstrate everything on real data, introduce projects and show a solution (YouTube video).
- 30 Jupyter Notebooks – with the full code and explanation from the lectures and projects (GitHub).
- 15 projects – structured with the Data Science Workflow and a solution explained at the end of the video lessons (GitHub).