Data Science

Master Data Visualization for 3 Purposes as a Data Scientist with Full Code Examples

Unleash the Power of Data Visualization as a Data Scientist

Why It’s Great to Master Data Visualization:

  • Data Quality Assurance: Visualization can play a crucial role in identifying and addressing data quality issues. By mastering data visualization techniques, you can visually identify faulty and wrongly formatted data, ensuring the accuracy and reliability of your analyses.
  • Data Exploration: Visualization empowers you to gain deeper insights and a better understanding of your data. By leveraging visualization tools, you can visually analyze patterns, trends, and relationships within your dataset, uncovering hidden insights that may not be evident through raw data alone.
  • Effective Data Presentation: Data visualization is a powerful tool for presenting your findings to stakeholders. By mastering data visualization, you can effectively communicate your data-driven insights, making complex information more accessible and engaging to a wider audience.

Topics Covered in This Tutorial

  1. Data Quality: Explore how visualization can help identify faulty and wrongly formatted data. Learn techniques to visually identify and rectify data quality issues using real-life examples.
  2. Data Exploration: Dive into the world of data exploration with visualization. Discover how visualization techniques can enhance your understanding of complex datasets, enabling you to discover patterns, outliers, and correlations.
  3. Data Presentation: Harness the power of data visualization to present your findings effectively. Learn how to use visualizations to confirm key insights and create compelling narratives that resonate with your audience.

But First, Understand the Power of Data Visualization:

  • Gain insights into why data visualization is a powerful tool that every data scientist should master. Explore the benefits of visualizing data and understand how it can enhance your data analysis capabilities.

Step 1: The Power of Data Visualization

Let’s consider some data, which you can load with the following code.

import pandas as pd

sample = pd.read_csv('https://raw.githubusercontent.com/LearnPythonWithRune/DataScienceWithPython/main/files/sample_corr.csv')
print(sample)

Looking at the raw numbers in the output, can you spot any connection between x and y?

Let’s try to visualize the same data.

Matplotlib is an easy-to-use visualization library for Python.

import matplotlib.pyplot as plt

sample.plot.scatter(x='x', y='y')
plt.show()

This gives the following output.

Here it is easy to spot that there is some kind of correlation. In fact, you would be able to absorb this connection no matter how many data points there were, given the data has the same nature.

What Data Visualization gives

  • Absorb information quickly
  • Improve insights
  • Make faster decisions

Step 2: Data Quality with Visualization

Data Quality is something many like to talk about – but unfortunately there is no precise universal definition of it; it is rather context-specific.

That said, it is a concept you need to understand as a Data Scientist.

In general, Data Quality is about (among other things)

  • Missing data (often represented as NA-values)
  • Wrong data (data which cannot be used)
  • Different scaled data (e.g. data in different units without being specified)

Sometimes Data Quality is mixed with aspects of Data Wrangling – for example, extracting values from string representations.

Data Quality requires that you know something about the data.

Imagine we are considering a dataset of human heights in centimeters. Then we can check it with a histogram.

data = pd.read_csv('https://raw.githubusercontent.com/LearnPythonWithRune/DataScienceWithPython/main/files/sample_height.csv')
data.plot.hist()
plt.show()

We immediately realize that some of the data is not correct.

Then we can get the data below 50 cm as follows.

print(data[data['height'] < 50])

Looking at that data, you might realize it could be data entered in meters rather than centimeters. This happens often when data is entered by humans: some might wrongly type the value in meters instead of centimeters.

This could mean that the data is valid – it just needs re-scaling.
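A minimal sketch of that re-scaling, using a small hypothetical DataFrame standing in for the height data (the column name `height` matches the filter above): values below 50 are assumed to be meters and are converted to centimeters.

```python
import pandas as pd

# Hypothetical sample mimicking the height data: some rows wrongly entered in meters
data = pd.DataFrame({'height': [178.0, 1.65, 182.0, 1.80, 170.0]})

# Rows below 50 are assumed to be in meters - re-scale them to centimeters
data.loc[data['height'] < 50, 'height'] *= 100
print(data['height'].tolist())
```

After this, all heights are on the same scale and the histogram would show a single sensible cluster.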

Another example is checking for outliers – in this case wrong data.

Consider this dataset of human age.

data = pd.read_csv('files/sample_age.csv')
data.plot.hist()
plt.show()

And you see someone with age around 300.

Similarly, you can get it with data[data['age'] > 150] and see one entry with an age of 314 years.
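If the 314-year entry turns out to be plain wrong data, one option is to filter it out before further analysis. A sketch with hypothetical ages, assuming the column is named age as above:

```python
import pandas as pd

# Hypothetical ages including one impossible value
data = pd.DataFrame({'age': [23, 45, 67, 314, 31]})

# Keep only plausible human ages; the 314-year row is dropped
clean = data[data['age'] <= 150]
print(clean)
```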

As you can see, Data Visualization helps you quickly get an idea of Data Quality.

Step 3: Data Exploration with Data Visualization

We already got an idea that Data Visualization helps us absorb information quickly. But not only that – it also improves our insight into the data, enabling us to make faster decisions.

Now we will consider data from the World Bank (The World Bank is a great source of datasets).

Let’s consider the dataset EN.ATM.CO2E.PC (CO2 emissions in metric tons per capita).

Now let’s consider some typical Data Visualizations. To get started, we need to read the data.

import pandas as pd
import matplotlib.pyplot as plt

data = pd.read_csv('https://raw.githubusercontent.com/LearnPythonWithRune/DataScienceWithPython/main/files/WorldBank-ATM.CO2E.PC_DS2.csv', index_col=0)

print(data.head())

We see that each year has a row, and each column represents a country (we only see part of them here).

Simple plot

To create a simple plot you can apply the following on a DataFrame.

data['USA'].plot()
plt.show()

A great thing about this is how simple it is to create.

Adding a title and labels is straightforward.

  • title='Title' adds the title
  • xlabel='X label' adds or changes the x-label
  • ylabel='Y label' adds or changes the y-label
data['USA'].plot(title='CO2 per capita in USA', ylabel='CO2 per capita')
plt.show()

Another thing you can do is add ranges to the axes.

  • xlim=(min, max) or xlim=min Sets the x-axis range
  • ylim=(min, max) or ylim=min Sets the y-axis range
data['USA'].plot(title='CO2 per capita in USA', ylabel='CO2 per capita', ylim=0)
plt.show()

If you want to compare two columns in the DataFrame you can do it as follows.

data[['USA', 'WLD']].plot(ylim=0)
plt.show()

If you want to set the figure size of the plot, this can be done as follows.

  • figsize=(width, height) in inches
data[['USA', 'DNK', 'WLD']].plot(ylim=0, figsize=(20,6))
plt.show()

Bar Plot

You can create a bar plot as follows.

  • .plot.bar() Create a bar plot
data['USA'].plot.bar(figsize=(20,6))
plt.show()

Bar plot with two columns.

data[['USA', 'WLD']].plot.bar(figsize=(20,6))
plt.show()

Plot a range.

  • .loc[from:to] apply this on the DataFrame to get a range (both inclusive)
data[['USA', 'WLD']].loc[2000:].plot.bar(figsize=(20,6))
plt.show()

Histograms

You can create a histogram as follows.

  • .plot.hist() Create a histogram
  • bins=<number of bins> Specify the number of bins in the histogram.
data['USA'].plot.hist(figsize=(20,6), bins=7)
plt.show()

Pie Chart

You can create a Pie Chart as follows.

  • .plot.pie() Creates a Pie Chart
s = pd.Series(data=[3, 5, 7], index=['Data1', 'Data2', 'Data3'])
s.plot.pie()
plt.show()

You can add value counts to your Pie Chart.

  • A simple chart of values above/below a threshold
  • .value_counts() Counts occurrences of values in a Series (or DataFrame column)
  • A few arguments to .plot.pie()
    • colors=<list of colors>
    • labels=<list of labels>
    • title='<title>'
    • ylabel='<label>'
    • autopct='%1.1f%%' sets percentages on chart
(data['USA'] < 17.5).value_counts().plot.pie(colors=['r', 'g'], labels=['>=17.5', '<17.5'], title='CO2 per capita', autopct='%1.1f%%')
plt.show()

Scatter Plot

Assume we want to investigate whether GDP per capita and CO2 per capita are correlated. A great way to get an idea about it is by using a scatter plot.

Let’s try to do that. The data is available we just need to load it.

import pandas as pd
import matplotlib.pyplot as plt

data = pd.read_csv('https://raw.githubusercontent.com/LearnPythonWithRune/DataScienceWithPython/main/files/co2_gdp_per_capita.csv', index_col=0)

data.plot.scatter(x='CO2 per capita', y='GDP per capita')
plt.show()

It seems there is some weak correlation – this can also be confirmed by calculating the correlation with data.corr(), which shows a correlation of 0.633178.
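A minimal sketch of that correlation check, using synthetic stand-in values (the column names match the real dataset, but the numbers here are made up for illustration):

```python
import pandas as pd

# Synthetic stand-in for the CO2/GDP columns (names match the real dataset)
df = pd.DataFrame({'CO2 per capita': [1.0, 2.0, 3.0, 4.0, 5.0],
                   'GDP per capita': [10.0, 25.0, 22.0, 40.0, 48.0]})

# .corr() returns the full pairwise correlation matrix
print(df.corr())

# A single pair can be read off directly
r = df['CO2 per capita'].corr(df['GDP per capita'])
print(r)
```

By default .corr() computes the Pearson correlation, a value between -1 and 1 where 0 means no linear relationship.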

Step 4: Data Presentation

Data Presentation is about making data easy to digest.

Let’s try to make an example.

Assume we want to give a picture of how US CO2 per capita is compared to the rest of the world.

Preparation

  • Let’s take 2017 (as more recent data is incomplete)
  • What are the mean, max, and min CO2 per capita in the world?
import pandas as pd
import matplotlib.pyplot as plt

data = pd.read_csv('https://raw.githubusercontent.com/LearnPythonWithRune/DataScienceWithPython/main/files/WorldBank-ATM.CO2E.PC_DS2.csv', index_col=0)

year = 2017
print(data.loc[year]['USA'])

This gives 14.8.
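To answer the preparation questions (mean, max, and min for the year), the per-country values can be summarized directly. A sketch using a hypothetical stand-in for data.loc[2017] (the country codes and numbers below are illustrative, except the 14.8 US value from above):

```python
import pandas as pd

# Hypothetical stand-in for data.loc[2017]: CO2 per capita per country
row = pd.Series({'USA': 14.8, 'DNK': 5.8, 'QAT': 30.7, 'IND': 1.8, 'WLD': 4.5})

print(row.mean(), row.max(), row.min())

# Fraction of entries below the US value
print((row < row['USA']).mean())
```

On the real data, the same calls on data.loc[year] give the world-wide summary used in the story below.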

How can we tell a story?

  • The US is above the mean
  • The US is not the max
  • It is above 75% of the countries
ax = data.loc[year].plot.hist(bins=15, facecolor='green')

ax.set_xlabel('CO2 per capita')
ax.set_ylabel('Number of countries')
ax.annotate("USA", xy=(15, 5), xytext=(15, 30),
            arrowprops=dict(arrowstyle="->",
                            connectionstyle="arc3"))
plt.show()

This is one way to tell a story.

Want to learn more?

Want to learn more about Data Science to become a successful Data Scientist?

In the next lesson you will learn how to Get Started with pandas for Data Science in this Data Science course.

This is one lesson of a 15 part Expert Data Science Blueprint course with the following resources.

  • 15 video lessons – covering the Data Science Workflow and concepts, demonstrating everything on real data, introducing projects, and showing solutions (YouTube video).
  • 30 Jupyter Notebooks – with the full code and explanations from the lectures and projects (GitHub).
  • 15 projects – structured with the Data Science Workflow, with a solution explained at the end of the video lessons (GitHub).
Rune
