Master Data Visualization for 3 Purposes as a Data Scientist with Full Code Examples

What will we cover?

We will investigate the 3 main purposes of Data Visualization as a Data Scientist.

  • Data Quality: We will demonstrate with examples how you can identify faulty and wrongly formatted data with visualization.
  • Data Exploration: This will teach you how to understand data better with visualization.
  • Data Presentation: Here we explore what newcomers often think of as the main purpose of Data Visualization: presenting findings. This will focus on how to use Data Visualization to communicate your key findings.

But first, we will look at the power of Data Visualization – why it is such a powerful tool to master.

Step 1: The Power of Data Visualization

Let’s consider some data – you can load it with the following code.

import pandas as pd

sample = pd.read_csv('https://raw.githubusercontent.com/LearnPythonWithRune/DataScienceWithPython/main/files/sample_corr.csv')
print(sample)

Look at the printed output – can you spot any connection between the columns?

Let’s try to visualize the same data.

Matplotlib is an easy-to-use visualization library for Python.

import matplotlib.pyplot as plt

sample.plot.scatter(x='x', y='y')
plt.show()

This gives the following output.

Here it is easy to spot that there is some kind of correlation. In fact, you would be able to absorb this connection no matter how many data points there were, given data of the same nature.
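
If you want to quantify the connection, you can also compute the correlation directly (a quick check using the sample DataFrame from above):

# Correlation matrix of the sample data – values close to 1 or -1
# indicate a strong linear relationship between the columns
print(sample.corr())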

What Data Visualization gives

  • Absorb information quickly
  • Improve insights
  • Make faster decisions

Step 2: Data Quality with Visualization

Data Quality is something many like to talk about – but unfortunately there is no precise universal definition of it; it is rather context specific.

That said, it is a concept you need to understand as a Data Scientist.

In general, Data Quality is about (among other things)

  • Missing data (often represented as NA-values) – see the sketch after this list
  • Wrong data (data which cannot be used)
  • Different scaled data (e.g. data in different units without being specified)
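
As a quick illustration of the first point, you can count NA-values per column before plotting (a minimal sketch on a small, hypothetical DataFrame):

import pandas as pd
import numpy as np

# Hypothetical data with missing values
df = pd.DataFrame({'height': [181.0, np.nan, 172.5],
                   'age': [34, 29, np.nan]})

# Count missing (NA) values per column – a quick Data Quality check
print(df.isna().sum())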

Sometimes Data Quality is mixed up with aspects of Data Wrangling – for example, extracting values from string representations.

Data Quality requires that you know something about the data.

Imagine we are considering a dataset of human heights in centimeters. Then we can inspect it with a histogram.

data = pd.read_csv('https://raw.githubusercontent.com/LearnPythonWithRune/DataScienceWithPython/main/files/sample_height.csv')
data.plot.hist()
plt.show()

We immediately realize that some of the data is not correct.

Then we can get the data below 50 cm as follows.

print(data[data['height'] < 50])

Looking at that data, you might realize it could be data entered in meters rather than centimeters. This happens often when data is entered by humans: some mistakenly type values in meters instead of centimeters.

This could mean that the data is valid – it just needs re-scaling.
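
If that is the case, the fix is a simple re-scaling of the affected rows (a sketch, assuming all values below 50 are in meters):

# Convert the suspected meter values to centimeters
data.loc[data['height'] < 50, 'height'] *= 100
data.plot.hist()
plt.show()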

Another example is checking for outliers – in this case wrong data.

Consider this dataset of human age.

# Assuming the file is available in the same repository as the other samples
data = pd.read_csv('https://raw.githubusercontent.com/LearnPythonWithRune/DataScienceWithPython/main/files/sample_age.csv')
data.plot.hist()
plt.show()

And you see someone with an age of around 300.

Similarly, you can get it with data[data['age'] > 150] and see one person with an age of 314 years.
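
Once identified, such wrong data can simply be filtered out (a minimal sketch):

# Keep only plausible ages – drop the outlier(s)
data = data[data['age'] <= 150]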

As you can see, Data Visualization quickly helps you get an idea of the Data Quality.

Step 3: Data Exploration with Data Visualization

We already have an idea that Data Visualization helps us absorb information quickly. Not only that, it also helps us improve our insights into the data, enabling us to make faster decisions.

Now we will consider data from the World Bank (The World Bank is a great source of datasets).

Let’s consider the dataset EN.ATM.CO2E.PC (CO2 emissions in metric tons per capita).

Now let’s consider some typical Data Visualizations. To get started, we need to read the data.

import pandas as pd
import matplotlib.pyplot as plt

data = pd.read_csv('https://raw.githubusercontent.com/LearnPythonWithRune/DataScienceWithPython/main/files/WorldBank-ATM.CO2E.PC_DS2.csv', index_col=0)

print(data.head())

We see that each year has a row, and each column represents a country (we only see part of them here).

Simple plot

To create a simple plot you can apply the following on a DataFrame.

data['USA'].plot()
plt.show()

A great thing about this is how simple it is to create.

Adding a title and labels is straightforward.

  • title='Title' adds the title
  • xlabel='X label' adds or changes the X-label
  • ylabel='Y label' adds or changes the Y-label
data['USA'].plot(title='CO2 per capita in USA', ylabel='CO2 per capita')
plt.show()

Another thing you can do is add ranges to the axes.

  • xlim=(min, max) or xlim=min Sets the x-axis range
  • ylim=(min, max) or ylim=min Sets the y-axis range
data['USA'].plot(title='CO2 per capita in USA', ylabel='CO2 per capita', ylim=0)
plt.show()

If you want to compare two columns in the DataFrame you can do it as follows.

data[['USA', 'WLD']].plot(ylim=0)
plt.show()

If you want to set the figure size of the plot, this can be done as follows.

  • figsize=(width, height) in inches
data[['USA', 'DNK', 'WLD']].plot(ylim=0, figsize=(20,6))
plt.show()

Bar Plot

You can create a bar plot as follows.

  • .plot.bar() Create a bar plot
data['USA'].plot.bar(figsize=(20,6))
plt.show()

Bar plot with two columns.

data[['USA', 'WLD']].plot.bar(figsize=(20,6))
plt.show()

Plot a range.

  • .loc[from:to] apply this to the DataFrame to get a range of rows (both endpoints inclusive)
data[['USA', 'WLD']].loc[2000:].plot.bar(figsize=(20,6))
plt.show()

Histograms

You can create a histogram as follows.

  • .plot.hist() Create a histogram
  • bins=<number of bins> Specify the number of bins in the histogram.
data['USA'].plot.hist(figsize=(20,6), bins=7)
plt.show()

Pie Chart

You create a Pie Chart as follows.

  • .plot.pie() Creates a Pie Chart
s = pd.Series(data=[3, 5, 7], index=['Data1', 'Data2', 'Data3'])
s.plot.pie()
plt.show()

You can also plot value counts as a Pie Chart.

  • A simple chart of values above/below a threshold
  • .value_counts() Counts occurrences of values in a Series (or DataFrame column)
  • A few arguments to .plot.pie()
    • colors=<list of colors>
    • labels=<list of labels>
    • title='<title>'
    • ylabel='<label>'
    • autopct='%1.1f%%' sets percentages on chart
(data['USA'] < 17.5).value_counts().plot.pie(colors=['r', 'g'], labels=['>=17.5', '<17.5'], title='CO2 per capita', autopct='%1.1f%%')
plt.show()

Scatter Plot

Assume we want to investigate whether GDP per capita and CO2 per capita are correlated. A great way to get an idea about it is a scatter plot.

Let’s try to do that. The data is available – we just need to load it.

import pandas as pd
import matplotlib.pyplot as plt

data = pd.read_csv('https://raw.githubusercontent.com/LearnPythonWithRune/DataScienceWithPython/main/files/co2_gdp_per_capita.csv', index_col=0)

data.plot.scatter(x='CO2 per capita', y='GDP per capita')
plt.show()

It seems there is a moderate positive correlation – this can also be confirmed by calculating it with data.corr(), which shows a correlation of 0.633178.
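
For reference, here is how that correlation is computed (using the DataFrame loaded above):

# Correlation matrix between the two columns
print(data.corr())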

Step 4: Data Presentation

Data Presentation is about making data easy to digest.

Let’s try to make an example.

Assume we want to give a picture of how US CO2 per capita compares to the rest of the world.

Preparation

  • Let’s take 2017 (as more recent data is incomplete)
  • What are the mean, max, and min CO2 per capita in the world? (see the sketch after the code below)
import pandas as pd
import matplotlib.pyplot as plt

data = pd.read_csv('https://raw.githubusercontent.com/LearnPythonWithRune/DataScienceWithPython/main/files/WorldBank-ATM.CO2E.PC_DS2.csv', index_col=0)

year = 2017
print(data.loc[year]['USA'])

This gives 14.8 – the US CO2 per capita in 2017.
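
The mean, max, and min across all countries for the same year can be read off with describe() (a quick sketch):

# Summary statistics across all countries for 2017
print(data.loc[year].describe())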

How can we tell a story?

  • US is above the mean
  • US is not the max
  • It is above the 75th percentile
ax = data.loc[year].plot.hist(bins=15, facecolor='green')

ax.set_xlabel('CO2 per capita')
ax.set_ylabel('Number of countries')
ax.annotate("USA", xy=(15, 5), xytext=(15, 30),
            arrowprops=dict(arrowstyle="->",
                            connectionstyle="arc3"))
plt.show()

This is one way to tell a story.
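
To strengthen the story, you could also mark the world mean on the histogram (a sketch, reusing the data loaded above):

# Same histogram, with the mean across countries marked
ax = data.loc[year].plot.hist(bins=15, facecolor='green')
ax.set_xlabel('CO2 per capita')
ax.set_ylabel('Number of countries')
ax.axvline(data.loc[year].mean(), color='black', linestyle='--', label='Mean')
ax.legend()
plt.show()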

Want to learn more?

Want to learn more about Data Science to become a successful Data Scientist?

This is one lesson of a 15-part Expert Data Science Blueprint course with the following resources.

  • 15 video lessons – cover the Data Science Workflow and concepts, demonstrate everything on real data, introduce projects, and show a solution (YouTube video).
  • 30 Jupyter Notebooks – with the full code and explanations from the lectures and projects (GitHub).
  • 15 projects – structured with the Data Science Workflow and a solution explained at the end of the video lessons (GitHub).