Why It’s Great to Master Data Visualization:
But First, Understand the Power of Data Visualization:
Let’s consider some data, which you can load with the following code.
import pandas as pd
sample = pd.read_csv('https://raw.githubusercontent.com/LearnPythonWithRune/DataScienceWithPython/main/files/sample_corr.csv')
print(sample)
The output is – can you spot any connection?
Let’s try to visualize the same data.
Matplotlib is an easy-to-use visualization library for Python.
import matplotlib.pyplot as plt
sample.plot.scatter(x='x', y='y')
plt.show()
Giving the following output.
Here it is easy to spot that there is some kind of correlation. In fact, you would be able to absorb this connection no matter how many data points there were, as long as the data has the same nature.
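A minimal sketch of that claim, using synthetic data instead of the file above. The column names `x` and `y` match the sample, but the linear relationship, the noise level and the two sample sizes are assumptions chosen purely for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


def make_sample(n, seed=0):
    """Build n synthetic points where y follows x plus random noise."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(0, 10, size=n)
    y = 2 * x + rng.normal(0, 2, size=n)
    return pd.DataFrame({"x": x, "y": y})


small = make_sample(50)
large = make_sample(5000)

# Draw both samples side by side; the upward trend is visible in each panel.
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
small.plot.scatter(x="x", y="y", ax=axes[0], title="50 points")
large.plot.scatter(x="x", y="y", ax=axes[1], title="5000 points", s=2)
```

To view the figure, remove the `matplotlib.use("Agg")` line and call `plt.show()`, or save it with `fig.savefig(...)`.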
Data Quality is something many like to talk about, but unfortunately there is no precise universal definition of it; it is rather context specific.
That said, it is a concept you need to understand as a Data Scientist.
In general Data Quality is about (not only)
Sometimes Data Quality is mixed with aspects of Data Wrangling, such as extracting values from string representations.
Data Quality requires that you know something about the data.
Imagine we are considering a dataset of human heights in centimeters. Then we can check the distribution in a histogram.
data = pd.read_csv('https://raw.githubusercontent.com/LearnPythonWithRune/DataScienceWithPython/main/files/sample_height.csv')
data.plot.hist()
plt.show()
We immediately realize that some of the data is not correct.
Then we can get the data below 50 cm as follows.
print(data[data['height'] < 50])
Looking at that data, you might realize that these values could have been entered in meters and not centimeters. This happens often when data is entered by humans; some might wrongly type in meters instead of centimeters.
This could mean that the data is valid and just needs re-scaling.
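The re-scaling could be sketched as follows. The sample values are hypothetical, and the rule that anything below 50 was entered in meters is an assumption you would need to confirm for your own data.

```python
import pandas as pd

# Hypothetical sample: most heights are in centimeters, two were typed in meters.
data = pd.DataFrame({"height": [172.0, 1.80, 165.5, 1.68, 190.2]})

# Assumption: any value below 50 is a height in meters, not a 50 cm person.
in_meters = data["height"] < 50

# Convert only the flagged rows from meters to centimeters.
data.loc[in_meters, "height"] = data.loc[in_meters, "height"] * 100

print(data)
```

Keeping the boolean mask in its own variable makes it easy to inspect the flagged rows before changing them.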
Another example is checking for outliers – in this case wrong data.
Consider this dataset of human age.
data = pd.read_csv('files/sample_age.csv')
data.plot.hist()
plt.show()
And you see someone with age around 300.
Similarly, you can get it with data[data['age'] > 150] and see one entry with an age of 314 years.
As you can see, Data Visualization helps you quickly get an idea of Data Quality.
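One possible way to separate such rows from the rest is sketched below. The ages are hypothetical, and treating everything outside 0 to 150 years as an error is an assumption, not a rule from the dataset.

```python
import pandas as pd

# Hypothetical ages; 314 stands in for the implausible entry found above.
data = pd.DataFrame({"age": [23, 45, 314, 67, 31]})

# Assumption: ages outside 0 to 150 years are data-entry errors.
plausible = data["age"].between(0, 150)

suspect_rows = data[~plausible]  # rows to inspect or correct by hand
clean = data[plausible]          # rows kept for further analysis

print(suspect_rows)
```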
We have already seen that Data Visualization helps us absorb information quickly. It also helps us gain insight into the data, enabling us to make faster decisions.
Now we will consider data from the World Bank (The World Bank is a great source of datasets).
Let’s consider the dataset EN.ATM.CO2E.PC.
Now let’s consider some typical Data Visualizations. To get started, we need to read the data.
import pandas as pd
import matplotlib.pyplot as plt
data = pd.read_csv('https://raw.githubusercontent.com/LearnPythonWithRune/DataScienceWithPython/main/files/WorldBank-ATM.CO2E.PC_DS2.csv', index_col=0)
print(data.head())
We see that each year has a row, and each column represents a country (we only see part of them here).
To create a simple plot you can apply the following on a DataFrame.
data['USA'].plot()
plt.show()
A great thing about this is how simple it is to create.
Adding a title and labels is straightforward.
title='Title' adds the title
xlabel='X label' adds or changes the X-label
ylabel='Y label' adds or changes the Y-label
data['USA'].plot(title='CO2 per capita in USA', ylabel='CO2 per capita')
plt.show()
Another thing you can do is set the ranges of the axes.
xlim=(min, max) or xlim=min sets the x-axis range
ylim=(min, max) or ylim=min sets the y-axis range
data['USA'].plot(title='CO2 per capita in USA', ylabel='CO2 per capita', ylim=0)
plt.show()
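The examples above use title, ylabel and ylim. A small sketch that also uses xlabel and xlim is shown here. The series is synthetic and only stands in for one country column; the labels and limits are arbitrary choices.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical yearly values standing in for one country column.
years = range(1960, 2019)
series = pd.Series([15 + (year % 7) for year in years], index=years)

ax = series.plot(
    title="Synthetic series",
    xlabel="Year",
    ylabel="Value",
    xlim=(1990, 2010),  # show only 1990 to 2010 on the x-axis
    ylim=0,             # start the y-axis at zero; the upper bound is chosen automatically
)
```

The returned Axes object can be inspected with `ax.get_xlim()` and `ax.get_ylim()` to check which limits were applied.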
If you want to compare two columns in the DataFrame you can do it as follows.
data[['USA', 'WLD']].plot(ylim=0)
plt.show()
If you want to set the figure size of the plot, this can be done as follows.
figsize=(width, height) sets the figure size in inches
data[['USA', 'DNK', 'WLD']].plot(ylim=0, figsize=(20,6))
plt.show()
You can create a bar plot as follows.
.plot.bar() creates a bar plot
data['USA'].plot.bar(figsize=(20,6))
plt.show()
Bar plot with two columns.
data[['USA', 'WLD']].plot.bar(figsize=(20,6))
plt.show()
Plot a range.
.loc[from:to] applied to the DataFrame selects a range of rows (both ends inclusive)
data[['USA', 'WLD']].loc[2000:].plot.bar(figsize=(20,6))
plt.show()
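The inclusive behaviour of .loc can be checked on a tiny DataFrame. The column names and values below are hypothetical; only the year-based index mirrors the layout of the CO2 data.

```python
import pandas as pd

# Hypothetical values indexed by year.
data = pd.DataFrame(
    {"A": [1.0, 2.0, 3.0, 4.0, 5.0], "B": [10.0, 20.0, 30.0, 40.0, 50.0]},
    index=[1999, 2000, 2001, 2002, 2003],
)

subset = data.loc[2000:2002]  # label-based: 2000, 2001 and 2002 are all included
open_ended = data.loc[2002:]  # from 2002 to the last row
by_position = data.iloc[1:3]  # position-based: the end position is excluded

print(subset.index.tolist())       # [2000, 2001, 2002]
print(open_ended.index.tolist())   # [2002, 2003]
print(by_position.index.tolist())  # [2000, 2001]
```

The contrast with .iloc is included because position-based slicing follows the usual Python rule of excluding the end.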
You can create a histogram as follows.
.plot.hist() creates a histogram
bins=<number of bins> specifies the number of bins in the histogram
data['USA'].plot.hist(figsize=(20,6), bins=7)
plt.show()
You can create a Pie Chart as follows.
.plot.pie() creates a Pie Chart
df = pd.Series(data=[3, 5, 7], index=['Data1', 'Data2', 'Data3'])
df.plot.pie()
plt.show()
You can add value counts to your Pie Chart.
.value_counts() counts occurrences of values in a Series (or DataFrame column)
.plot.pie() accepts the following arguments:
colors=<list of colors>
labels=<list of labels>
title='<title>'
ylabel='<label>'
autopct='%1.1f%%' sets percentages on chart
(data['USA'] < 17.5).value_counts().plot.pie(colors=['r', 'g'], labels=['>=17.5', '<17.5'], title='CO2 per capita', autopct='%1.1f%%')
plt.show()
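Note that .value_counts() orders its result by frequency, so a hard-coded label list only matches when the more frequent group happens to come first. A hedged sketch of deriving the labels from the result instead is shown here; the values are hypothetical and the threshold 17.5 is taken from the example above.

```python
import pandas as pd

# Hypothetical values standing in for one country column.
values = pd.Series([16.0, 18.2, 21.5, 19.9, 17.0, 20.3, 18.8])

below = values < 17.5
counts = below.value_counts()  # index holds True/False, most frequent first

# Build each label from the index entry it belongs to.
labels = ["<17.5" if flag else ">=17.5" for flag in counts.index]

print(counts)
print(labels)
```

Passing `labels=labels` to `.plot.pie()` then keeps each slice and its label together regardless of which group is larger.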
Assume we want to investigate whether GDP per capita and CO2 per capita are correlated. A great way to get an idea of this is to use a scatter plot.
Let’s try that. The data is available; we just need to load it.
import pandas as pd
import matplotlib.pyplot as plt
data = pd.read_csv('https://raw.githubusercontent.com/LearnPythonWithRune/DataScienceWithPython/main/files/co2_gdp_per_capita.csv', index_col=0)
data.plot.scatter(x='CO2 per capita', y='GDP per capita')
plt.show()
It seems there is some positive correlation. This can be confirmed by calculating the correlation with data.corr(), which shows a correlation of 0.633178.
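For reference, here is a small sketch of how data.corr() and the single coefficient relate. The data is synthetic: only the column names are taken from the file above, and the relationship and noise level are assumptions, so the resulting number will differ from 0.633178.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real dataset.
rng = np.random.default_rng(42)
gdp = rng.uniform(1_000, 60_000, size=200)
co2 = gdp / 5_000 + rng.normal(0, 3, size=200)
data = pd.DataFrame({"CO2 per capita": co2, "GDP per capita": gdp})

# Pearson correlation between every pair of columns (a 2x2 matrix here).
corr_matrix = data.corr()

# The same coefficient for just the two columns of interest.
coefficient = data["CO2 per capita"].corr(data["GDP per capita"])

print(corr_matrix)
print(coefficient)
```

The off-diagonal entries of the matrix are the value to read; the diagonal is always 1 because each column correlates perfectly with itself.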
Data Presentation is about making data easy to digest.
Let’s try to make an example.
Assume we want to give a picture of how US CO2 per capita compares with the rest of the world.
import pandas as pd
import matplotlib.pyplot as plt
data = pd.read_csv('https://raw.githubusercontent.com/LearnPythonWithRune/DataScienceWithPython/main/files/WorldBank-ATM.CO2E.PC_DS2.csv', index_col=0)
year = 2017
print(data.loc[year]['USA'])
This gives 14.8.
How can we tell a story?
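# Histogram of CO2 per capita across all countries for the chosen year
ax = data.loc[year].plot.hist(bins=15, facecolor='green')
ax.set_xlabel('CO2 per capita')
ax.set_ylabel('Number of countries')
# Point an arrow at the bar that contains the USA value
ax.annotate("USA", xy=(15, 5), xytext=(15, 30),
            arrowprops=dict(arrowstyle="->",
                            connectionstyle="arc3"))
plt.show()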
This is one way to tell a story.
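The annotation coordinates in the example are typed in by hand. A possible variation, sketched here on synthetic data, computes the bar height from the histogram so the arrow follows the data. The distribution, the seed and the text offset are assumptions; 14.8 is the USA value printed earlier.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

highlight = 14.8  # the value to point at

# Synthetic per-country values; the highlighted value is part of the data.
rng = np.random.default_rng(1)
values = pd.Series(np.append(rng.gamma(shape=1.5, scale=3.0, size=200), highlight))

# Compute the bins once so the plot and the annotation use the same edges.
counts, edges = np.histogram(values, bins=15)

# Index of the bin containing the highlighted value, clipped to a valid bin.
bin_index = int(np.clip(np.searchsorted(edges, highlight, side="right") - 1,
                        0, len(counts) - 1))
bar_height = int(counts[bin_index])

ax = values.plot.hist(bins=edges, facecolor="green")
ax.set_xlabel("CO2 per capita")
ax.set_ylabel("Number of countries")
ax.annotate("USA", xy=(highlight, bar_height), xytext=(highlight, bar_height + 10),
            arrowprops=dict(arrowstyle="->", connectionstyle="arc3"))
```

Passing the precomputed edges as `bins` is a design choice that avoids the plot and the annotation disagreeing about where the bars are.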
Want to learn more about Data Science to become a successful Data Scientist?
In the next lesson you will learn how to Get Started with pandas for Data Science in this Data Science course.
This is one lesson of a 15 part Expert Data Science Blueprint course with the following resources.