What will you learn?
When you are trying to understand data, visualization is the fastest way to get there.
Data visualization has 3 purposes.
- Data Quality: Finding outliers and missing data.
- Data Exploration: Understand the data.
- Data Presentation: Present the result.
Here you will learn 11 useful charts for understanding your data, each done in one line of code.
The data we will work with
We need some data to work with.
You can either download the Notebook and CSV file (GitHub repo) or read the file directly from the repository as follows.
import pandas as pd
import matplotlib.pyplot as plt

file_url = 'https://raw.githubusercontent.com/LearnPythonWithRune/pandas_charts/main/air_quality.csv'
# Use the first column as a datetime index
data = pd.read_csv(file_url, index_col=0, parse_dates=True)
# head() returns the first 5 rows
print(data.head())
This will output the first 5 rows of the data.
Now let’s use the data we have in the DataFrame data.
If you are new to pandas, I suggest you get an understanding of it from this guide.
#1 Simple plot
A simple plot is the default to use unless you know what you want. It will demonstrate the nature of the data.
Let’s try to do it here.
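The one-liner is most likely a plain call to plot() on the DataFrame. Here is a minimal sketch, assuming a small made-up DataFrame in place of the real data; the column names station_antwerp, station_paris and station_london are assumed to match the file.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import numpy as np
import pandas as pd

# Hypothetical stand-in for the air-quality DataFrame `data` (made-up values)
rng = np.random.default_rng(0)
data = pd.DataFrame(
    rng.uniform(10, 60, size=(30, 3)),
    index=pd.date_range("2019-05-01", periods=30, freq="D"),
    columns=["station_antwerp", "station_paris", "station_london"],
)

# The one-liner: one line per column, with the datetime index on the x-axis
ax = data.plot()
```

With the real data you only need the last line.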
As you can see, there are three columns of data, one for each of the 3 stations: Antwerp, Paris, and London.
The data is a time series: each data point has a timestamp, which is plotted along the x-axis.
It is a bit difficult to see if station Antwerp has a full dataset.
Let’s try to figure that out.
#2 Isolated plot
This leads us to making an isolated plot of only one column. This is handy to understand each individual column of data better.
Here we were curious whether the data for station Antwerp was given for all dates.
This shows that our suspicion was correct. The time series does not cover the full range for station Antwerp.
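Selecting a single column and calling plot() on it gives the isolated plot. This is a sketch on made-up data, assuming the Antwerp column is named station_antwerp; the missing values at the start are invented to mimic an incomplete series.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import numpy as np
import pandas as pd

# Hypothetical stand-in for the air-quality DataFrame `data` (made-up values)
rng = np.random.default_rng(0)
data = pd.DataFrame(
    rng.uniform(10, 60, size=(30, 3)),
    index=pd.date_range("2019-05-01", periods=30, freq="D"),
    columns=["station_antwerp", "station_paris", "station_london"],
)
# Invented gap: no Antwerp readings for the first 10 days
data.loc[data.index[:10], "station_antwerp"] = np.nan

# The one-liner: plot only the Antwerp column; gaps show up as missing line segments
ax = data["station_antwerp"].plot()
```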
This tells us about the data quality, which might be crucial for further analysis.
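If you want a number to go with the picture, you can count the missing values per column. The sketch below uses made-up data with an invented gap in an assumed station_antwerp column, so the counts are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the air-quality DataFrame `data` (made-up values)
rng = np.random.default_rng(0)
data = pd.DataFrame(
    rng.uniform(10, 60, size=(30, 3)),
    index=pd.date_range("2019-05-01", periods=30, freq="D"),
    columns=["station_antwerp", "station_paris", "station_london"],
)
# Invented gap: no Antwerp readings for the first 10 days
data.loc[data.index[:10], "station_antwerp"] = np.nan

# Number of missing values in each column
missing = data.isna().sum()
# Timestamp of the first non-missing Antwerp reading
first_valid = data["station_antwerp"].first_valid_index()

print(missing)
print(first_valid)
```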
You can do the same for the other two columns.
#3 Scatter Plot
A great way to see if there is a correlation in the data is to make a scatter plot.
Let’s demonstrate what that looks like.
data.plot.scatter(x='station_london', y='station_paris', alpha=.25)
You see that the data is not scattered all over, but it does not fall on a line either. This means that there is some weak correlation in the data: the two stations are not fully independent of each other.
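To put a number on what the scatter plot shows, you can compute the Pearson correlation coefficient between the two columns. This sketch uses made-up, loosely related values, so the result says nothing about the real stations.

```python
import numpy as np
import pandas as pd

# Hypothetical values: Paris loosely follows London plus random noise
rng = np.random.default_rng(0)
london = rng.uniform(10, 60, size=60)
paris = 0.5 * london + rng.normal(0, 10, size=60)
data = pd.DataFrame({"station_london": london, "station_paris": paris})

# Pearson correlation: near 0 means no linear relationship, near 1 or -1 a strong one
r = data["station_london"].corr(data["station_paris"])
print(round(r, 2))
```

Calling data.corr() instead gives the same coefficient for every pair of columns at once.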
#4 Box Plot
One way to understand data better is with a box plot. It requires a bit of understanding of simple statistics.
Let’s first take a look at it.
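A box plot of every column is available through plot.box(). This is a minimal sketch on made-up data with assumed column names.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import numpy as np
import pandas as pd

# Hypothetical stand-in for the air-quality DataFrame `data` (made-up values)
rng = np.random.default_rng(0)
data = pd.DataFrame(
    rng.uniform(10, 60, size=(30, 3)),
    index=pd.date_range("2019-05-01", periods=30, freq="D"),
    columns=["station_antwerp", "station_paris", "station_london"],
)

# The one-liner: one box per column
ax = data.plot.box()
```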
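The box plot shows the following for each column: the median (the line inside the box), the lower and upper quartiles (the edges of the box), the whiskers (the smallest and largest values that are not outliers), and the outliers as individual points.
To understand what outliers, min, median, max, and so forth mean, I would suggest you read this simple statistics guide.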
#5 Area Plot
An area plot is a visually easy way to see how the values follow each other. It helps you get an understanding of values, correlation, and missing data.
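An area plot comes from plot.area(). The sketch below uses made-up data, and the subplots=True and figsize arguments are my own choice to give each station its own panel; they are not necessarily what the original chart used.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import numpy as np
import pandas as pd

# Hypothetical stand-in for the air-quality DataFrame `data` (made-up values)
rng = np.random.default_rng(0)
data = pd.DataFrame(
    rng.uniform(10, 60, size=(30, 3)),
    index=pd.date_range("2019-05-01", periods=30, freq="D"),
    columns=["station_antwerp", "station_paris", "station_london"],
)

# The one-liner: one filled area per column, each in its own subplot
axes = data.plot.area(figsize=(12, 4), subplots=True)
```

Without subplots=True the areas are stacked on top of each other in a single chart.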
#6 Bar plots
Bar plots can be useful, but mostly when the amount of data is limited.
Here you see a bar plot of the first 15 rows of data.
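Limiting the data to the first 15 rows and calling plot.bar() can be sketched like this, again on made-up data with assumed column names.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import numpy as np
import pandas as pd

# Hypothetical stand-in for the air-quality DataFrame `data` (made-up values)
rng = np.random.default_rng(0)
data = pd.DataFrame(
    rng.uniform(10, 60, size=(30, 3)),
    index=pd.date_range("2019-05-01", periods=30, freq="D"),
    columns=["station_antwerp", "station_paris", "station_london"],
)

# The one-liner: take the first 15 rows, then draw one group of bars per row
ax = data.iloc[:15].plot.bar()
```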
#7 Histograms for single column
Histograms show you which values are most common. They show the frequencies of the data divided into bins. By default there are 10 bins.
It is a great tool for getting a fast view of the number of occurrences in each data range.
Here first for an isolated station.
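For a single column the call is plot.hist() on that column. This sketch uses made-up data, and picking station_paris as the isolated station is an arbitrary choice.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import numpy as np
import pandas as pd

# Hypothetical stand-in for the air-quality DataFrame `data` (made-up values)
rng = np.random.default_rng(0)
data = pd.DataFrame(
    rng.uniform(10, 60, size=(30, 3)),
    index=pd.date_range("2019-05-01", periods=30, freq="D"),
    columns=["station_antwerp", "station_paris", "station_london"],
)

# The one-liner: histogram of one column, using the default of 10 bins
ax = data["station_paris"].plot.hist()
```

Pass bins=20, for example, if you want a finer division.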
#8 Histograms for multiple columns
Then for all three stations, where you see it with transparency (alpha).
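Calling plot.hist() on the whole DataFrame overlays one histogram per column. In this sketch on made-up data, alpha=0.5 is an assumed transparency value.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import numpy as np
import pandas as pd

# Hypothetical stand-in for the air-quality DataFrame `data` (made-up values)
rng = np.random.default_rng(0)
data = pd.DataFrame(
    rng.uniform(10, 60, size=(30, 3)),
    index=pd.date_range("2019-05-01", periods=30, freq="D"),
    columns=["station_antwerp", "station_paris", "station_london"],
)

# The one-liner: overlapping histograms, made see-through with alpha
ax = data.plot.hist(alpha=0.5)
```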
#9 Pie charts
Pie charts are very powerful when you want to show how data is divided, that is, what percentage belongs to each category.
Here you see the mean value of each station.
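A pie chart of the mean values can be sketched by chaining mean() and plot.pie(). The data is made up and the column names are assumed.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import numpy as np
import pandas as pd

# Hypothetical stand-in for the air-quality DataFrame `data` (made-up values)
rng = np.random.default_rng(0)
data = pd.DataFrame(
    rng.uniform(10, 60, size=(30, 3)),
    index=pd.date_range("2019-05-01", periods=30, freq="D"),
    columns=["station_antwerp", "station_paris", "station_london"],
)

# The one-liner: mean() gives one value per column, plot.pie() draws one wedge per value
ax = data.mean().plot.pie()
```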
#10 Scatter Matrix Plot
This is a great tool for showing the columns combined in all possible pairs. It will show you correlations and how the data is distributed.
You need one additional import from pandas, but it gives you a fast understanding of the data.
from pandas.plotting import scatter_matrix

# One scatter plot for every pair of columns, and a histogram of each column on the diagonal
scatter_matrix(data, alpha=0.2, figsize=(6, 6))
#11 Secondary y-axis
Finally, sometimes you want two plots on the same chart. The problem can be that the two plots have very different ranges. Hence, you would like to have two different y-axes with different ranges.
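One way to do this in pandas is the secondary_y argument of plot(). In this sketch the two columns get deliberately different made-up ranges, and which column goes on the right-hand axis is an arbitrary choice.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import numpy as np
import pandas as pd

# Hypothetical values with very different ranges (made up for illustration)
rng = np.random.default_rng(0)
data = pd.DataFrame(
    {
        "station_london": rng.uniform(10, 60, size=30),
        "station_paris": rng.uniform(100, 600, size=30),
    },
    index=pd.date_range("2019-05-01", periods=30, freq="D"),
)

# The one-liner: station_paris is drawn against a second y-axis on the right
ax = data.plot(secondary_y=["station_paris"])
```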
This will enable you to have plots on the same chart with different ranges.
Want to learn more?
Want to learn more about Data Science to become a successful Data Scientist?
Then check my free Expert Data Science Blueprint course with the following resources.
- 15 video lessons – covers the Data Science Workflow and concepts, demonstrates everything on real data, introduces projects and shows a solution (YouTube video).
- 30 Jupyter Notebooks – with the full code and explanation from the lectures and projects (GitHub).
- 15 projects – structured with the Data Science Workflow and a solution explained at the end of the video lessons (GitHub).