What will you learn?
When trying to understand data, visualization is the key to understanding it fast!
Data visualization has 3 purposes.
- Data Quality: Finding outliers and missing data.
- Data Exploration: Understand the data.
- Data Presentation: Present the result.
Here you will learn 11 useful charts to understand your data, each of which can be made in one line of code.
The data we will work with
We need some data to work with.
You can either download the Notebook and CSV file (GitHub repo) or read the data directly from the repository as follows.
import pandas as pd
import matplotlib.pyplot as plt

file_url = 'https://raw.githubusercontent.com/LearnPythonWithRune/pandas_charts/main/air_quality.csv'
data = pd.read_csv(file_url, index_col=0, parse_dates=True)
print(data)
This will output a truncated view of the data: the first 5 and the last 5 rows.
Now let’s use the data we have in the DataFrame data.
If you are new to pandas, I suggest you get an understanding of it from this guide.
#1 Simple plot
A simple plot is the default to use unless you know what you want. It will demonstrate the nature of the data.
Let’s try to do it here.
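As a minimal sketch, the simple plot is a call to plot() on the whole DataFrame. To keep the example runnable offline, it uses random stand-in data instead of the real file. The column name station_antwerp is an assumption; station_paris and station_london are the names used in the scatter plot later on.

```python
import numpy as np
import pandas as pd

# Random stand-in data with a datetime index (NOT the real air quality data).
rng = np.random.default_rng(0)
index = pd.date_range("2019-05-01", periods=60, freq="D")
columns = ["station_antwerp", "station_paris", "station_london"]
data = pd.DataFrame(rng.normal(25, 8, size=(60, 3)), index=index, columns=columns)

# The one-liner: one line per column, with the index on the x-axis.
ax = data.plot()
```

In a Notebook the chart shows up right away. In a plain script you would also call plt.show() from matplotlib.pyplot.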
As you notice, there are three columns of data for the 3 stations: Antwerp, Paris, and London.
The data is indexed by datetime, meaning that each data point is part of a time series and the x-axis shows time.
It is a bit difficult to see if station Antwerp has a full dataset.
Let’s try to figure that out.
#2 Isolated plot
This leads us to making an isolated plot of only one column. This is handy to understand each individual column of data better.
Here we were a bit curious whether the data of station Antwerp was given for all dates.
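A sketch of how an isolated plot could look: select one column and call plot() on it. The data below is random stand-in data, and the gap in station_antwerp (an assumed column name) is simulated on purpose so the effect is visible.

```python
import numpy as np
import pandas as pd

# Random stand-in data (NOT the real air quality data).
rng = np.random.default_rng(0)
index = pd.date_range("2019-05-01", periods=60, freq="D")
columns = ["station_antwerp", "station_paris", "station_london"]
data = pd.DataFrame(rng.uniform(5, 45, size=(60, 3)), index=index, columns=columns)

# Simulate a station that has no readings for the first 30 days.
data.loc[index[:30], "station_antwerp"] = np.nan

# The one-liner: plot a single column. Missing values leave a gap in the line.
ax = data["station_antwerp"].plot()

# A quick numeric check of the same thing.
missing = data["station_antwerp"].isna().sum()
print(f"Missing values for station Antwerp: {missing}")
```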
This shows that our suspicion was correct. The time series does not cover the full range for station Antwerp.
This tells us about the data quality, which might be crucial for further analysis.
You can do the same for the other two columns.
#3 Scatter Plot
A great way to see if there is a correlation in the data is to make a scatter plot.
Let’s demonstrate what that looks like.
data.plot.scatter(x='station_london', y='station_paris', alpha=.25)
You see that the data is not scattered all over, but it is not fully correlated either. This means that there is some weak correlation between the two stations, and they are not fully independent of each other.
#4 Box Plot
One way to understand data better is by a box plot. It might need a bit of understanding of simple statistics.
Let’s first take a look at it.
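A minimal sketch of a box plot, assuming random stand-in data with the column names used elsewhere in this post (station_antwerp is an assumed name). The quartiles are also computed directly, since they are the numbers the boxes are drawn from.

```python
import numpy as np
import pandas as pd

# Random stand-in data (NOT the real air quality data).
rng = np.random.default_rng(0)
index = pd.date_range("2019-05-01", periods=60, freq="D")
columns = ["station_antwerp", "station_paris", "station_london"]
data = pd.DataFrame(rng.normal(25, 8, size=(60, 3)), index=index, columns=columns)

# The one-liner: one box per column.
ax = data.plot.box()

# The numbers behind each box: first quartile, median, third quartile.
quartiles = data.quantile([0.25, 0.5, 0.75])
print(quartiles)
```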
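The box plot shows the following for each column: the box spans the first to the third quartile, the line inside the box marks the median, the whiskers extend to the smallest and largest values that are not outliers, and outliers are drawn as individual points.
To understand what outliers, min, median, max, and so forth mean, I would suggest you read this simple statistics guide.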
#5 Area Plot
An area plot is a visually easy way to see how the values follow each other. It helps you get an understanding of values, correlation, and missing data.
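One way to sketch this is with plot.area(). The arguments figsize and subplots=True are a choice made here (one area chart per station), not necessarily what the original chart used. The data is random stand-in data, kept positive because stacked area plots expect values of one sign.

```python
import numpy as np
import pandas as pd

# Random, strictly positive stand-in data (NOT the real air quality data).
rng = np.random.default_rng(0)
index = pd.date_range("2019-05-01", periods=60, freq="D")
columns = ["station_antwerp", "station_paris", "station_london"]
data = pd.DataFrame(rng.uniform(5, 45, size=(60, 3)), index=index, columns=columns)

# The one-liner: with subplots=True each column gets its own area chart.
axes = data.plot.area(figsize=(12, 4), subplots=True)
```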
#6 Bar plots
Bar plots can be useful, but mostly when the number of data points is limited.
Here you see a bar plot of the first 15 rows of data.
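A sketch of such a bar plot: slice the first 15 rows with iloc and call plot.bar(). The data is random stand-in data, and station_antwerp is an assumed column name.

```python
import numpy as np
import pandas as pd

# Random stand-in data (NOT the real air quality data).
rng = np.random.default_rng(0)
index = pd.date_range("2019-05-01", periods=60, freq="D")
columns = ["station_antwerp", "station_paris", "station_london"]
data = pd.DataFrame(rng.uniform(5, 45, size=(60, 3)), index=index, columns=columns)

# The one-liner: a group of three bars for each of the first 15 rows.
ax = data.iloc[:15].plot.bar()
```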
#7 Histograms for single column
Histograms will show you what data is most common. It shows the frequencies of data divided into bins. By default there are 10 bins of data.
It is an amazing tool to get a fast view of the number of occurrences of each data range.
Here first for an isolated station.
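As a sketch, a histogram of one station is plot.hist() on a single column. Which station the original chart used is not known, so station_paris is picked here only as an example, and the data is random stand-in data.

```python
import numpy as np
import pandas as pd

# Random stand-in data (NOT the real air quality data).
rng = np.random.default_rng(0)
index = pd.date_range("2019-05-01", periods=200, freq="D")
columns = ["station_antwerp", "station_paris", "station_london"]
data = pd.DataFrame(rng.normal(25, 8, size=(200, 3)), index=index, columns=columns)

# The one-liner: frequencies of one column, divided into the default 10 bins.
ax = data["station_paris"].plot.hist()
```

Pass bins=20, for example, if you want a finer division.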
#8 Histograms for multiple columns
Then for all three stations, where you see it with transparency (alpha).
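A sketch for all columns at once: call plot.hist() on the DataFrame and set alpha so overlapping bars stay visible. The value 0.5 is an arbitrary choice, the data is random stand-in data, and station_antwerp is an assumed column name.

```python
import numpy as np
import pandas as pd

# Random stand-in data with slightly different centers per station
# (NOT the real air quality data).
rng = np.random.default_rng(0)
index = pd.date_range("2019-05-01", periods=200, freq="D")
columns = ["station_antwerp", "station_paris", "station_london"]
data = pd.DataFrame(
    rng.normal([22, 27, 24], 8, size=(200, 3)), index=index, columns=columns
)

# The one-liner: one semi-transparent histogram per column on the same axes.
ax = data.plot.hist(alpha=0.5)
```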
#9 Pie charts
Pie charts are very powerful when you want to show a division of data, that is, what percentage belongs to each category.
Here you see the mean value of each station.
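A sketch of a pie chart of the mean values: take data.mean() and call plot.pie() on the result. The autopct argument is an addition made here to print the percentages, and the data is random stand-in data with an assumed station_antwerp column.

```python
import numpy as np
import pandas as pd

# Random, strictly positive stand-in data (NOT the real air quality data).
rng = np.random.default_rng(0)
index = pd.date_range("2019-05-01", periods=60, freq="D")
columns = ["station_antwerp", "station_paris", "station_london"]
data = pd.DataFrame(rng.uniform(5, 45, size=(60, 3)), index=index, columns=columns)

# Mean value per station, then one wedge per station.
means = data.mean()
ax = means.plot.pie(autopct="%.1f%%")
```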
#10 Scatter Matrix Plot
This is a great tool for showing the data combined in all possible ways: every column is plotted against every other column. This will show you correlations and how the data is distributed.
You need one additional import from pandas, but it gives you a fast understanding of the data.
from pandas.plotting import scatter_matrix
scatter_matrix(data, alpha=0.2, figsize=(6, 6))
#11 Secondary y-axis
Finally, sometimes you want two plots on the same chart. The problem can be that the two plots have very different ranges. Hence, you would like to have two different y-axes with different ranges.
This will enable you to have plots on the same chart with different ranges.
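One way to sketch this in pandas is the secondary_y=True argument of plot(). Which two series the original chart combined is not known. Here the cumulative sum of station_paris is used only because it has a much larger range than station_london, and the data is random stand-in data.

```python
import numpy as np
import pandas as pd

# Random stand-in data (NOT the real air quality data).
rng = np.random.default_rng(0)
index = pd.date_range("2019-05-01", periods=60, freq="D")
columns = ["station_antwerp", "station_paris", "station_london"]
data = pd.DataFrame(rng.uniform(5, 45, size=(60, 3)), index=index, columns=columns)

# First series on the left y-axis.
ax = data["station_london"].plot()

# Second series, with a much larger range, on a secondary (right) y-axis.
cumulative = data["station_paris"].cumsum()
ax2 = cumulative.plot(ax=ax, secondary_y=True)
```

The two lines now share the x-axis, while each one is scaled against its own y-axis.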
Want to learn more?
Do you want to learn more about Data Science and become a successful Data Scientist?
Then check my free Expert Data Science Blueprint course with the following resources.
- 15 video lessons – covers the Data Science Workflow and concepts, demonstrates everything on real data, introduces projects, and shows a solution (YouTube video).
- 30 Jupyter Notebooks – with the full code and explanation from the lectures and projects (GitHub).
- 15 projects – structured with the Data Science Workflow and a solution explained at the end of video lessons (GitHub).