What will we cover?
We will investigate the 3 main purposes of Data Visualization as a Data Scientist.
- Data Quality: We will demonstrate with examples how you can identify faulty and wrongly formatted data with visualization.
- Data Exploration: This will teach you how to understand data better with visualization.
- Data Presentation: This is the purpose most newcomers associate with Data Visualization: presenting findings. We will focus on how to use Data Visualization to confirm your key findings.
But first, we will look at why Data Visualization is such a powerful tool to master.
Step 1: The Power of Data Visualization
Let’s consider some data, which you can load with the following code.
import pandas as pd
sample = pd.read_csv('https://raw.githubusercontent.com/LearnPythonWithRune/DataScienceWithPython/main/files/sample_corr.csv')
print(sample)
Look at the printed output. Can you spot any connection?
Let’s try to visualize the same data.
Matplotlib is an easy-to-use visualization library for Python.
import matplotlib.pyplot as plt
sample.plot.scatter(x='x', y='y')
plt.show()
This produces a scatter plot of the data.
Here it is easy to spot that there is some kind of correlation. In fact, you would be able to absorb this connection no matter how many data points there were, as long as the data has the same nature.
In short, Data Visualization helps you to:
- Absorb information quickly
- Improve insights
- Make faster decisions
Step 2: Data Quality with Visualization
Data Quality is something many like to talk about, but unfortunately there is no precise, universal definition of it; it is rather context specific.
That said, it is a concept you need to understand as a Data Scientist.
In general, Data Quality is about (but not limited to):
- Missing data (often represented as NA-values)
- Wrong data (data which cannot be used)
- Differently scaled data (e.g. data in different units without the unit being specified)
Sometimes Data Quality is mixed with aspects of Data Wrangling, for example extracting values from string representations.
Data Quality requires that you know something about the data.
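The first item in the list above, missing data, can be checked directly in pandas before any plotting. The following is a minimal sketch on a made-up DataFrame; the column names and values are assumptions chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; names and values are invented for this sketch.
df = pd.DataFrame({
    "height": [172.0, np.nan, 181.0, 165.0],
    "age": [34.0, 29.0, np.nan, np.nan],
})

# isna() marks missing cells as True; sum() counts them per column.
missing_per_column = df.isna().sum()
print(missing_per_column)
```

A non-zero count tells you which columns need a closer look before you trust a plot of them.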
Imagine we are considering a dataset of human heights in centimeters. We can check that with a histogram.
data = pd.read_csv('https://raw.githubusercontent.com/LearnPythonWithRune/DataScienceWithPython/main/files/sample_height.csv')
data.plot.hist()
plt.show()
We immediately realize that some of the data is not correct.
Then we can get the data below 50 cm as follows.
print(data[data['height'] < 50])
Looking at that data, you might realize that these values could have been entered in meters rather than centimeters. This happens often when data is entered by humans: some might wrongly type in meters instead of centimeters.
This could mean that the data is valid and just needs rescaling.
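If you decide that the small values really are heights in meters, the rescaling could look like the sketch below. It uses made-up heights, and the rule "anything below 50 was entered in meters" is an assumption you would need to confirm for your own data.

```python
import pandas as pd

# Made-up heights: most are in centimeters, two were entered in meters.
heights = pd.DataFrame({"height": [172.0, 1.81, 165.0, 1.59, 190.0]})

# Assumption: any value below 50 was entered in meters.
in_meters = heights["height"] < 50

# Convert only the flagged rows to centimeters.
heights.loc[in_meters, "height"] = heights.loc[in_meters, "height"] * 100
print(heights)
```

Plotting the histogram again after this step is a quick way to confirm that all values now fall in a plausible range.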
Another example is checking for outliers – in this case wrong data.
Consider this dataset of human age.
data = pd.read_csv('files/sample_age.csv')
data.plot.hist()
plt.show()
You can see someone with an age of around 300.
Similarly, you can retrieve it with data[data['age'] > 150] and see one entry with an age of 314 years.
As you can see, Data Visualization helps you quickly get an idea of Data Quality.
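Once an outlier like this is spotted, one option is to separate it from the plausible rows. This is a minimal sketch with made-up ages; whether to drop, correct, or keep such rows depends on your context, and the bound of 150 is simply the one used in the query above.

```python
import pandas as pd

# Made-up ages; 314 plays the role of the wrong entry.
ages = pd.DataFrame({"age": [23, 45, 314, 67, 31]})

MAX_PLAUSIBLE_AGE = 150  # assumed upper bound for a human age

# Rows that violate the bound, and the remaining rows.
outliers = ages[ages["age"] > MAX_PLAUSIBLE_AGE]
cleaned = ages[ages["age"] <= MAX_PLAUSIBLE_AGE]

print(outliers)
print(len(cleaned), "rows kept")
```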
Step 3: Data Exploration with Data Visualization
We have already seen that Data Visualization helps us absorb information quickly. Beyond that, it also improves our insight into the data, enabling us to make faster decisions.
Now we will consider data from the World Bank (The World Bank is a great source of datasets).
Let’s consider the dataset EN.ATM.CO2E.PC.
Now let’s consider some typical Data Visualizations. To get started, we need to read the data.
import pandas as pd
import matplotlib.pyplot as plt
data = pd.read_csv('https://raw.githubusercontent.com/LearnPythonWithRune/DataScienceWithPython/main/files/WorldBank-ATM.CO2E.PC_DS2.csv', index_col=0)
# head() has to be called to print the first rows
print(data.head())
We see that each year has a row, and each column represents a country (we only see part of them here).
To create a simple plot you can apply the following on a DataFrame.
A great thing about this is how simple it is to create.
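The plotting call itself is not shown here. A minimal sketch, assuming it is the default .plot() line plot, applied to a made-up Series shaped like one column of the dataset:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Made-up values indexed by year; not World Bank data.
series = pd.Series([19.0, 18.5, 17.2, 16.3],
                   index=[2000, 2005, 2010, 2015], name="USA")

# .plot() draws a line plot by default and returns the matplotlib Axes.
ax = series.plot()

# In a notebook or interactive session, plt.show() would display the figure.
```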
Adding a title and labels is straightforward.
- title='Title' adds the title
- xlabel='X label' adds or changes the X-label
- ylabel='Y label' adds or changes the Y-label
data['USA'].plot(title='CO2 per capita in USA', ylabel='CO2 per capita')
plt.show()
Another thing you can do is set the range of the axes.
- xlim=min or xlim=(min, max) sets the x-axis range
- ylim=min or ylim=(min, max) sets the y-axis range
data['USA'].plot(title='CO2 per capita in USA', ylabel='CO2 per capita', ylim=0)
plt.show()
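To see what these arguments do, you can inspect the Axes object that .plot() returns. The sketch below uses made-up values and sets both ends of the y-axis with a tuple; the numbers are assumptions for illustration only.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Made-up values indexed by year; not World Bank data.
series = pd.Series([19.0, 18.5, 17.2, 16.3],
                   index=[2000, 2005, 2010, 2015], name="USA")

# title, ylabel and ylim are passed straight to the plot call.
ax = series.plot(title="CO2 per capita in USA (made-up values)",
                 ylabel="CO2 per capita", ylim=(0, 25))

# The returned Axes reflects the settings.
print(ax.get_title(), ax.get_ylabel(), ax.get_ylim())
```

Starting the y-axis at 0 avoids exaggerating small changes, which is why the examples here pass ylim=0.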
If you want to compare two columns in the DataFrame you can do it as follows.
data[['USA', 'WLD']].plot(ylim=0)
plt.show()
If you want to set the figure size of the plot, this can be done as follows.
- figsize=(width, height) sets the figure size in inches
data[['USA', 'DNK', 'WLD']].plot(ylim=0, figsize=(20,6))
plt.show()
You can create a bar plot as follows.
- .plot.bar() creates a bar plot
Bar plot with two columns.
data[['USA', 'WLD']].plot.bar(figsize=(20,6))
plt.show()
Plot a range.
- .loc[from:to] applied to the DataFrame selects a range of rows (both ends inclusive)
data[['USA', 'WLD']].loc[2000:].plot.bar(figsize=(20,6))
plt.show()
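The inclusive behavior of .loc differs from ordinary Python slicing, so it is worth checking on a small example. This is a sketch with a made-up DataFrame indexed by year; the values are invented.

```python
import pandas as pd

# Made-up values with the year as index, as in the dataset above.
frame = pd.DataFrame(
    {"USA": [20.0, 19.8, 19.5, 19.1, 18.9],
     "WLD": [4.0, 4.1, 4.1, 4.2, 4.3]},
    index=[1998, 1999, 2000, 2001, 2002],
)

# Label-based slicing includes both end points.
subset = frame.loc[1999:2001]

# Leaving out the end selects everything from the start label onward.
open_ended = frame.loc[2000:]

print(list(subset.index))
print(list(open_ended.index))
```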
You can create a histogram as follows.
- .plot.hist() creates a histogram
- bins=<number of bins> specifies the number of bins in the histogram
data['USA'].plot.hist(figsize=(20,6), bins=7)
plt.show()
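If you want to see the numbers behind the bars, the bin counts can be computed without drawing anything. The sketch below uses numpy.histogram on made-up values, on the assumption that you want the same equal-width binning that an integer bins argument implies.

```python
import numpy as np
import pandas as pd

# Made-up values; not World Bank data.
values = pd.Series([15.2, 16.1, 17.8, 18.4, 19.0, 19.3, 20.1, 20.5, 21.0, 22.3])

# bins=7 splits the range from min to max into 7 equal-width bins.
counts, edges = np.histogram(values, bins=7)

print(counts)  # number of values in each bin
print(edges)   # 8 bin edges for 7 bins
```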
You can create a Pie Chart as follows.
- .plot.pie() creates a Pie Chart
df = pd.Series(data=[3, 5, 7], index=['Data1', 'Data2', 'Data3'])
df.plot.pie()
plt.show()
You can combine value counts with a Pie Chart.
- A simple chart of values above/below a threshold
  - .value_counts() counts occurrences of values in a Series (or DataFrame column)
- A few arguments to .plot.pie()
  - colors=<list of colors>
  - labels=<list of labels>
  - autopct='%1.1f%%' sets percentages on chart
# value_counts() sorts by frequency, so these labels assume the False group (>= 17.5) is the larger one
(data['USA'] < 17.5).value_counts().plot.pie(colors=['r', 'g'], labels=['>=17.5', '<17.5'], title='CO2 per capita', autopct='%1.1f%%')
plt.show()
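Because value_counts() orders its result by frequency, hard-coded labels can end up on the wrong slice if the data changes. One way around that, sketched here on made-up values, is to map the True/False result to readable labels before counting. The threshold is the one used above; everything else is an assumption.

```python
import pandas as pd

# Made-up values indexed by year; not World Bank data.
co2 = pd.Series([19.0, 18.5, 17.2, 16.3, 15.0],
                index=[2000, 2005, 2010, 2015, 2017])

THRESHOLD = 17.5

# Turn the boolean comparison into labels, then count each label.
groups = (co2 < THRESHOLD).map({True: "<17.5", False: ">=17.5"})
counts = groups.value_counts()
print(counts)

# counts.plot.pie(autopct='%1.1f%%') would now take the slice labels from the index.
```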
Assume we want to investigate whether GDP per capita and CO2 per capita are correlated. A great way to get an idea of that is to use a scatter plot.
Let’s try that. The data is available; we just need to load it.
import pandas as pd
import matplotlib.pyplot as plt
data = pd.read_csv('https://raw.githubusercontent.com/LearnPythonWithRune/DataScienceWithPython/main/files/co2_gdp_per_capita.csv', index_col=0)
data.plot.scatter(x='CO2 per capita', y='GDP per capita')
plt.show()
It seems there is some correlation. This can be confirmed by calculating it with data.corr(), which shows a correlation of 0.633178.
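For reference, this is roughly how the correlation value is read from the result of corr(). The sketch uses made-up numbers, so its result will not match the 0.633178 reported above; only the column names are taken from the dataset.

```python
import pandas as pd

# Made-up numbers; not the World Bank data used above.
frame = pd.DataFrame({
    "CO2 per capita": [0.5, 2.0, 4.5, 7.0, 10.0, 15.0],
    "GDP per capita": [1200, 5000, 9000, 30000, 42000, 38000],
})

# corr() returns a matrix of pairwise Pearson correlations.
corr_matrix = frame.corr()

# The off-diagonal entry is the correlation between the two columns.
r = corr_matrix.loc["CO2 per capita", "GDP per capita"]
print(corr_matrix)
print(r)
```

A value close to 1 or -1 indicates a strong linear relationship, while a value near 0 indicates little or none.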
Step 4: Data Presentation
Data Presentation is about making data easy to digest.
Let’s try to make an example.
Assume we want to give a picture of how US CO2 per capita compares to the rest of the world.
- Let’s take 2017 (as more recent data is incomplete)
- What is the mean, max, and min CO2 per capita in the world?
import pandas as pd
import matplotlib.pyplot as plt
data = pd.read_csv('https://raw.githubusercontent.com/LearnPythonWithRune/DataScienceWithPython/main/files/WorldBank-ATM.CO2E.PC_DS2.csv', index_col=0)
year = 2017
print(data.loc[year]['USA'])
This gives 14.8.
How can we tell a story?
- US is above the mean
- US is not the max
- It is above the 75th percentile
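The three statements above can be checked numerically before building a chart around them. This is a sketch on a made-up row of per-country values: the country codes other than USA are hypothetical placeholders, and only the USA figure of 14.8 comes from the text.

```python
import pandas as pd

# Made-up per-country values for one year; AAA..FFF are hypothetical codes.
year_values = pd.Series({
    "AAA": 0.4, "BBB": 1.9, "CCC": 4.7, "DDD": 6.2,
    "EEE": 9.1, "USA": 14.8, "FFF": 31.0,
})

usa = year_values["USA"]

# Compare the USA value against the summary statistics of the row.
above_mean = usa > year_values.mean()
is_max = usa == year_values.max()
above_q75 = usa > year_values.quantile(0.75)

print(above_mean, is_max, above_q75)
```

On the real data you would replace year_values with data.loc[year]; missing values are skipped by mean(), max() and quantile() by default.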
ax = data.loc[year].plot.hist(bins=15, facecolor='green')
ax.set_xlabel('CO2 per capita')
ax.set_ylabel('Number of countries')
# The arrow points at the bar containing the USA value (about 15)
ax.annotate("USA", xy=(15, 5), xytext=(15, 30), arrowprops=dict(arrowstyle="->", connectionstyle="arc3"))
plt.show()
This is one way to tell a story.
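The annotation coordinates above are typed in by hand. A small variation, sketched here on made-up values, is to take the x-position of the arrow from the data itself so that it follows the USA value. The country codes other than USA are hypothetical, and the arrow heights are chosen by hand for this toy data.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Made-up per-country values; AAA..FFF are hypothetical codes.
year_values = pd.Series({
    "AAA": 0.4, "BBB": 1.9, "CCC": 4.7, "DDD": 6.2,
    "EEE": 9.1, "USA": 14.8, "FFF": 31.0,
})
usa = year_values["USA"]

ax = year_values.plot.hist(bins=5, facecolor="green")
ax.set_xlabel("CO2 per capita")
ax.set_ylabel("Number of countries")

# The arrow tip uses the USA value as its x-coordinate.
annotation = ax.annotate("USA", xy=(usa, 1), xytext=(usa, 2.5),
                         arrowprops=dict(arrowstyle="->", connectionstyle="arc3"))

# In a notebook or interactive session, plt.show() would display the figure.
```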
Want to learn more?
Want to learn more about Data Science to become a successful Data Scientist?
This is one lesson of a 15-part Expert Data Science Blueprint course with the following resources.
- 15 video lessons: cover the Data Science Workflow and concepts, demonstrate everything on real data, introduce projects and show a solution (YouTube video).
- 30 Jupyter Notebooks: with the full code and explanation from the lectures and projects (GitHub).
- 15 projects: structured with the Data Science Workflow, with a solution explained at the end of the video lessons (GitHub).