11 Useful pandas Charts with One Line of Code

What will you learn?

When trying to understand data, visualization is the key to understanding it fast!

Data visualization has 3 purposes.

  1. Data Quality: Finding outliers and missing data.
  2. Data Exploration: Understand the data.
  3. Data Presentation: Present the result.

Here you will learn 11 useful charts to understand your data, each done in one line of code.

The data we will work with

We need some data to work with.

You can either download the Notebook and csv-file (GitHub repo) or read it directly from the repository as follows.

import pandas as pd
import matplotlib.pyplot as plt
file_url = 'https://raw.githubusercontent.com/LearnPythonWithRune/pandas_charts/main/air_quality.csv'
data = pd.read_csv(file_url, index_col=0, parse_dates=True)
print(data)

This will print a preview of the data (the first and last rows of the DataFrame).

Now let’s use the data we have in the DataFrame data.

If you are new to pandas, I suggest you get an understanding of it from this guide.

#1 Simple plot

A simple plot is the default to use unless you know what you want. It will demonstrate the nature of the data.

Let’s try to do it here.

data.plot()

As you notice, there are three columns of data for the 3 stations: Antwerp, Paris, and London.

The data has a datetime index, meaning that each data point is part of a time series (shown on the x-axis).

It is a bit difficult to see if station Antwerp has a full dataset.

Let’s try to figure that out.

#2 Isolated plot

This leads us to making an isolated plot of only one column. This is handy to understand each individual column of data better.

Here we were a bit curious whether the data for station Antwerp covers all dates.

data['station_antwerp'].plot()

This shows that our suspicion was correct. The time series does not cover the full range for station Antwerp.

This tells us about the data quality, which might be crucial for further analysis.

You can do the same for the other two columns.
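
If you want to quantify the missing data rather than eyeball it, a quick sketch (assuming the DataFrame data loaded above) is to count the missing values per column.

data.isna().sum()

This returns the number of missing values for each station and complements the isolated plot.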

#3 Scatter Plot

A great way to see whether two columns are correlated is to make a scatter plot.

Let's demonstrate how that looks.

data.plot.scatter(x='station_london', y='station_paris', alpha=.25)

You see that the data is not scattered completely at random, but it is not fully correlated either. This means there is some weak correlation between the stations; they are not fully independent of each other.
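
To put a number on that impression, you could also compute the correlation (a small sketch, assuming the same DataFrame data):

data[['station_london', 'station_paris']].corr()

Values close to 1 indicate strong positive correlation, while values close to 0 indicate little linear relationship.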

#4 Box Plot

One way to understand data better is with a box plot. It requires a bit of understanding of simple statistics.

Let’s first take a look at it.

data.plot.box()

The box plot shows, for each column, the median, the quartiles (the box), the range (the whiskers), and any outliers.

To understand what outliers, min, median, max, and so forth mean, I would suggest you read this simple statistics guide.
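
If you prefer the same statistics as numbers, a small sketch (again assuming the DataFrame data from above) is:

data.describe()

This prints count, mean, standard deviation, min, quartiles, and max for each station, matching what the box plot visualizes.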

#5 Area Plot

An area plot shows how the values follow each other, which is a visually easy way to get an understanding of values, correlation, and missing data.

data.plot.area(figsize=(12,4), subplots=True)

#6 Bar plots

Bar plots can be useful, but mostly when the amount of data is limited.

Here you see a bar plot of the first 15 rows of data.

data.iloc[:15].plot.bar()

#7 Histograms for single column

A histogram shows you which values are most common. It shows the frequency of the data divided into bins. By default there are 10 bins.

It is an amazing tool to get a fast view of the number of occurrences of each data range.

Here first for an isolated station.

data['station_paris'].plot.hist()
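
If the default 10 bins feel too coarse, you can set the number of bins yourself (a small sketch on the same column):

data['station_paris'].plot.hist(bins=20)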

#8 Histograms for multiple columns

Then for all three stations, where you see it with transparency (alpha).

data.plot.hist(alpha=.5)

#9 Pie

Pie charts are very powerful when you want to show how data is divided.

That is, what percentage belongs to each category.

Here you see the mean value of each station.

data.mean().plot.pie()
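
If you also want the percentages printed on the chart, you can pass the autopct argument (a small sketch on the same data):

data.mean().plot.pie(autopct='%1.1f%%')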

#10 Scatter Matrix Plot

This is a great tool for showing the columns combined in all possible pairs. It will show you correlations and how the data is distributed.

You need an extra import, but it gives you a fast understanding of the data.

from pandas.plotting import scatter_matrix
scatter_matrix(data, alpha=0.2, figsize=(6, 6))

#11 Secondary y-axis

Finally, sometimes you want two plots on the same chart. The problem can be that the two plots have very different ranges. Hence, you would like to have two different y-axes with different ranges.

This will enable you to have plots on the same chart with different ranges.

data['station_london'].plot()
data['station_paris'].cumsum().plot(secondary_y=True)

Want to learn more?

Want to learn more about Data Science to become a successful Data Scientist?

Then check my free Expert Data Science Blueprint course with the following resources.

  • 15 video lessons – covering the Data Science Workflow and concepts, demonstrating everything on real data, introducing projects and showing a solution (YouTube video).
  • 30 Jupyter Notebooks – with the full code and explanation from the lectures and projects (GitHub).
  • 15 projects – structured with the Data Science Workflow and a solution explained at the end of the video lessons (GitHub).

15 Most Useful pandas Shortcut Methods

What will you learn?

Everybody likes the pandas data structure DataFrame, but most miss out on the powerful methods it provides.

pandas is a huge module, which makes it difficult to master. Most people just use the data structure (DataFrame) without utilizing the power of its methods. In this tutorial you will learn the 15 most useful shortcut methods that will help you when working with data in pandas data structures.

#1 groupby

The groupby method involves some combination of splitting the object, applying a function, and combining the result.

Wow. That sounds complex. But it is not. It can be used to group large amounts of data and compute operations on these groups.

The best way to learn is to see some examples.

import pandas as pd
data = {'Items': ['Apple','Orange', 'Pear', 'Orange', 'Apple'], 
        'Price': [12, 5, 3, 7, 24]}
df = pd.DataFrame(data)

This results in this DataFrame.

The groupby method can group the items together, and apply a function. Let’s try it here.

df.groupby(['Items']).mean()

This will result in this output.

As you see, it has grouped the Apples, Oranges, and the Pears together and for the price column, it has applied the mean() function on the values.

Hence, Apple has the value 18, as it is the mean of 12 and 24 ((12 + 24)/2). Similarly for Orange and Pear.
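
groupby() is not limited to a single aggregation. As a small sketch (reusing the DataFrame df from above), you can apply several functions at once with agg():

print(df.groupby(['Items']).agg(['mean', 'min', 'max']))

This returns one column per aggregation for each group.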

#2 memory_usage()

We get more and more data and our projects get bigger and bigger. At some point you will need to analyze how much memory your data is using.

What memory_usage() does is return the memory usage of each column in the DataFrame. Sometimes the data type of a column is object, which means it points to another object. To include the memory usage of these objects, you need to use the deep=True argument.

Let’s try both, to see the difference.

import pandas as pd
import numpy as np
dtypes = ['int64', 'float64', 'complex128', 'object', 'bool']
data = dict([(t, np.ones(shape=1000, dtype=int).astype(t)) for t in dtypes])
df = pd.DataFrame(data)
print(df.head())

Then we can get the memory usage as follows.

print(df.memory_usage())

Giving the following.

Index           128
int64          8000
float64        8000
complex128    16000
object         8000
bool           1000
dtype: int64

Also, with deep=True.

df.memory_usage(deep=True)

Giving the following, where you see the object column uses more space.

Index           128
int64          8000
float64        8000
complex128    16000
object        36000
bool           1000
dtype: int64

#3 clip()

clip() can trim values at the input threshold.

I find this is easiest to understand by inspecting an example.

import pandas as pd
data = {'col_0': [9, -3, 0, -1, 5], 'col_1': [-2, -7, 6, 8, -5]}
df = pd.DataFrame(data)
print(df)

Then we apply clip(), which ensures that values below -2 are replaced with -2, and values above 5 are replaced with 5. It clips the values.

print(df.clip(-2, 5))
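
You can also clip only one side by using the lower or upper keyword (a small sketch on the same DataFrame):

print(df.clip(lower=-2))

This only replaces values below -2 and leaves the large values untouched.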

#4 corr()

The correlation between the columns can be calculated with corr(). There are different methods to use: Pearson, Kendall, and Spearman. By default it uses the Pearson method, which will do fine for giving you an idea of whether columns are correlated.

Let’s try an example.

import pandas as pd
df = pd.DataFrame([(.2, .3), (.0, .6), (.6, .0), (.2, .1)],
                  columns=['dogs', 'cats'])

The correlation is given by.

print(df.corr())

The value 1.0 says it is a perfect correlation, which is shown on the diagonal. This makes sense, as the diagonal compares each column with itself.

To learn more about correlation and statistics, be sure to check this tutorial out, which also explains the correlation value and how to interpret it.

#5 argmin()

The name argmin is a bit strange. What it does is return the position (the index) of the smallest value in a Series (a column of a DataFrame).

import pandas as pd
s = pd.Series({'Corn Flakes': 100.0, 'Almond Delight': 110.0,
               'Cinnamon Toast Crunch': 120.0, 'Cocoa Puff': 110.0})
print(s)

Gives.

Corn Flakes              100.0
Almond Delight           110.0
Cinnamon Toast Crunch    120.0
Cocoa Puff               110.0
dtype: float64

And to get the position of the smallest value, just apply the method.

print(s.argmin())

Which will give 0. Remember that it is zero-indexed, meaning that the first element has index 0.

#6 argmax()

Just like argmin(), argmax() returns the position of the largest element in a Series.

Continue with the example from above.

print(s.argmax())

This will give 2, as that is the position of the largest element in the Series.

#7 compare()

Want to know the differences between DataFrames? Then compare does a great job at that.

import pandas as pd
import numpy as np
df = pd.DataFrame(
     {
         "col1": [1.0, 2.0, 3.0, np.nan, 5.0],
         "col2": [1.0, 2.0, 3.0, 4.0, 5.0]
     },
     columns=["col1", "col2"],
)

We can compare the columns here.

df['col1'].compare(df['col2'])

The output shows only the rows that differ; here that is row 3, where col1 (self) is NaN and col2 (other) is 4.0.

#8 replace()

Did you ever need to replace a value in a DataFrame? Well, it also has a method for that and it is called replace().

df = pd.DataFrame({'A': [0, 1, 2, 3, 4],
                   'B': [5, 6, 7, 8, 9],
                   'C': ['a', 'b', 'c', 'd', 'e']})

Let’s try to replace 5 with -10 and see what happens.

print(df.replace(5, -10))
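
replace() also accepts lists and dictionaries if you need several replacements at once. A small sketch on the same DataFrame:

print(df.replace({0: -10, 'a': 'z'}))

This replaces the value 0 with -10 and the string 'a' with 'z' in one call.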

#9 isna()

Wanted to find missing values? Then isna() can do that for you.

Let’s try it.

import pandas as pd
import numpy as np
df = pd.DataFrame(dict(age=[5, 6, np.nan],
                  born=[pd.NaT, pd.Timestamp('1939-05-27'),
                        pd.Timestamp('1940-04-25')],
                  name=['Alfred', 'Batman', ''],
                  toy=[None, 'Batmobile', 'Joker']))

Then you get the values as follows.

print(df.isna())

I often use it in combination with sum(), which tells you how many values are missing in each column. This is useful to get an idea about the quality of the dataset.

print(df.isna().sum())
age     1
born    1
name    0
toy     1
dtype: int64

#10 interpolate()

On the subject of missing values, what should you do? Well, there are many options, but one simple option is to interpolate the values.

import pandas as pd
import numpy as np
s = pd.Series([0, 1, np.nan, 3])

This gives the following series.

0    0.0
1    1.0
2    NaN
3    3.0
dtype: float64

Then you can interpolate and get the value between them.

print(s.interpolate())
0    0.0
1    1.0
2    2.0
3    3.0
dtype: float64

This is just one way to deal with it. Dealing with missing values is a big subject. To learn more read this tutorial on the subject.
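
Another simple option, sketched here on the same Series, is to fill the gaps with a fixed value such as the mean:

print(s.fillna(s.mean()))

Which strategy is right depends on the data, which is exactly why missing values are a big subject.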

#11 drop()

Ever needed to remove a column in a DataFrame? Well, again they made a method for that.

Let’s try the drop() method to remove a column.

import pandas as pd
data = {'Age': [-44,0,5, 15, 10, -3], 
        'Salary': [0,5,-2, -14, 19, 24]}
df = pd.DataFrame(data)

Then let’s remove the Age column.

df2 = df.drop('Age', axis='columns')
print(df2)

Notice that it returns a new DataFrame.

#12 drop_duplicates()

Dealing with data that has duplicate rows? Well, it is a common problem and pandas made a method to easily remove them from your DataFrame.

It is called drop_duplicates and does what it says.

Let’s try it.

import pandas as pd
df = pd.DataFrame({
    'brand': ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie'],
    'style': ['cup', 'cup', 'cup', 'pack', 'pack'],
    'rating': [4, 4, 3.5, 15, 5]
})

This DataFrame has duplicate rows. Let's see how they can be removed.

df2 = df.drop_duplicates()
print(df2)
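
By default all columns must match for a row to count as a duplicate. If you only care about some columns, you can pass subset (a small sketch on the same DataFrame):

print(df.drop_duplicates(subset=['brand']))

This keeps only the first row for each brand.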

#13 sum()

Ever needed to sum a column? Even with a MultiIndex?

Let’s try.

import pandas as pd
idx = pd.MultiIndex.from_arrays([
    ['warm', 'warm', 'cold', 'cold'],
    ['dog', 'falcon', 'fish', 'spider']],
    names=['blooded', 'animal'])
s = pd.Series([4, 2, 0, 8], name='legs', index=idx)
print(s)

This will output.

blooded  animal
warm     dog       4
         falcon    2
cold     fish      0
         spider    8
Name: legs, dtype: int64

Then this will sum the column.

print(s.sum())

And it will output 14, as expected.
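
If you only want the sum per level of the MultiIndex, one way (a small sketch on the same Series) is to group by that level:

print(s.groupby(level='blooded').sum())

This gives one sum for 'warm' (6) and one for 'cold' (8).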

#14 cumsum()

Wanted to make a cumulative sum? Then cumsum() does the job for you, even with missing numbers.

import pandas as pd
import numpy as np
s = pd.Series([2, np.nan, 5, -1, 0])
print(s)

This will give.

0    2.0
1    NaN
2    5.0
3   -1.0
4    0.0
dtype: float64

And then.

print(s.cumsum())

Gives.

0    2.0
1    NaN
2    7.0
3    6.0
4    6.0
dtype: float64

Where it makes a cumulative sum down the column. Notice that the missing value stays NaN, but the sum continues past it.

#15 value_counts()

The value_counts() method returns the count of each unique row in a DataFrame.

This requires an example to really understand.

import pandas as pd
df = pd.DataFrame({'num_legs': [2, 4, 4, 6],
                   'num_wings': [2, 0, 0, 0]},
                  index=['falcon', 'dog', 'cat', 'ant'])

Here we see we have two rows with 4 and 0 (in that order), while the other rows have unique values.

print(df.value_counts())
num_legs  num_wings
4         0            2
2         2            1
6         0            1
dtype: int64

We see there are two rows with 4 and 0, and one of each of the other rows.
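
If you want proportions instead of counts, value_counts() accepts normalize=True (a small sketch on the same DataFrame):

print(df.value_counts(normalize=True))

This returns the relative frequency of each unique row instead of the absolute count.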

Bonus: unique()

Wanted the unique elements in your Series?

Here you go.

import pandas as pd
s = pd.Series([2, 1, 3, 3], name='A')
print(s.unique())

This will give the unique elements.

array([2, 1, 3])

Want to learn more?

Want to learn more about Data Science to become a successful Data Scientist?

Then check my free Expert Data Science Blueprint course with the following resources.

  • 15 video lessons – covering the Data Science Workflow and concepts, demonstrating everything on real data, introducing projects and showing a solution (YouTube video).
  • 30 Jupyter Notebooks – with the full code and explanation from the lectures and projects (GitHub).
  • 15 projects – structured with the Data Science Workflow and a solution explained at the end of the video lessons (GitHub).

The Ultimate Data Science Workflow Template

What will you learn?

Data Science Workflow

When it comes to creating a good Data Science project you will need to cover a great number of aspects. This template will show you what to cover and where to find more information on a specific topic.

The common pitfall for most junior Data Scientists is to focus only on the very technical part of the Data Science Workflow. To add real value to the clients you need to focus on more steps, which are often neglected.

This guide will walk you through all the steps, elaborate on them, and link to in-depth content if you need more explanation.

Step 1: Acquire

  • Explore problem
  • Identify data
  • Import data

Step 1.a: Define Problem

If you are making a hobby project, there might not be a definition of what you are trying to solve. But it is always good practice to start with one. Otherwise, you will most likely just do what you usually do and feel comfortable with. Try to sit down and figure it out.

It should be clear that this step comes before you have the data. That said, it often happens that a company has data and doesn't know what to use it for.

Still, it all starts by defining a problem.

Here are some guidelines.

  • When defining a problem, don’t be too ambitious
    • Examples:
      • A green energy windmill producer needs to optimize distribution and needs better predictions of production based on weather forecasts
      • An online news media is interested in a story with how CO2 per capita around the world has evolved over the years
    • Both projects are difficult
      • For the windmill we would need data on production, maintenance periods, detailed weather data, just to get started.
      • The data for CO2 per capita is available on World Bank, but creating a visual story is difficult with our current capabilities
  • Hence, formulate a more manageable research problem
    • You can start by considering a dataset and get inspiration
    • Examples of datasets
    • Example of Problem
      • What is the highest rated movie genre?

Data Science: Understanding the Problem

  • Get the right question:
    • What is the problem we try to solve?
    • This forms the Data Science problem
    • Examples
      • Sales figures and call center logs: evaluate a new product
      • Sensor data from multiple sensors: detect equipment failure
      • Customer data + marketing data: better targeted marketing
  • Assess situation
    • Risks, Benefits, Contingencies, Regulations, Resources, Requirements
  • Define goal
    • What is the objective?
    • What is the success criteria?
  • Conclusion
    • Defining the problem is key to successful Data Science projects

Step 1.b: Import libraries

When you work on a project, you need somewhere to keep the data. A great place to start is with pandas.

If you work in a Jupyter notebook you can run this in a cell to get started and follow this guide.

import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

Step 1.c: Identify the Data

Great Places to Find Data

Step 1.d: Import Data

Read CSV files (Learn more here)
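
  • read_csv() Read a comma-separated values (csv) file into a DataFrame. A small sketch (the file path here is hypothetical):
    data = pd.read_csv('files/aapl.csv', index_col='Date', parse_dates=True)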

Excel files  (Learn more here)

  • Most widely used spreadsheet format
  • Learn more about Excel processing in this lecture
  • read_excel() Read an Excel file into a pandas DataFrame.
    data = pd.read_excel('files/aapl.xlsx', index_col='Date')

Parquet files  (Learn more here)

  • Parquet is a free open source format
  • Compressed format
  • read_parquet() Load a parquet object from the file path, returning a DataFrame.
    data = pd.read_parquet('files/aapl.parquet')

Web Scraping (Learn more here)

  • Extracting data from websites
  • Legal issues: wikipedia.org
  • read_html() Read HTML tables into a list of DataFrame objects.
    url = "https://en.wikipedia.org/wiki/Wikipedia:Fundraising_statistics"
    data = pd.read_html(url)

Databases (Learn more here)

  • read_sql() Read SQL query or database table into a DataFrame.
  • The sqlite3 module is an interface for SQLite databases.
    import sqlite3
    import pandas as pd
    conn = sqlite3.connect('files/dallas-ois.sqlite')
    data = pd.read_sql('SELECT * FROM officers', conn)

Step 1.e: Combine data

Also see guide here.

  • Often you need to combine data from different sources

pandas DataFrames

  • pandas DataFrames can combine data (pandas cheat sheet)
  • concat() Concatenate pandas objects along a particular axis
    pd.concat([df1, df2], axis=0)
  • join() Join columns of another DataFrame.
    df.join(other.set_index('key'), on='key')
  • merge() Merge DataFrame or named Series objects with a database-style join
    df1.merge(df2, how='inner', on='a')

Step 2: Prepare

  • Explore data
  • Visualize ideas
  • Cleaning data

Step 2.a: Explore data

  • head() Return the first n rows.
  • .shape Return a tuple representing the dimensionality of the DataFrame.
  • .dtypes Return the dtypes in the DataFrame.
  • info() Print a concise summary of a DataFrame.
  • describe() Generate descriptive statistics.
  • isna().any() Returns if any element is missing.
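
As a small sketch of how these fit together (assuming a DataFrame data has already been loaded, e.g. with read_csv above):

print(data.head())        # first rows
print(data.shape)         # (rows, columns)
print(data.dtypes)        # column types
data.info()               # concise summary
print(data.describe())    # descriptive statistics
print(data.isna().any())  # columns with missing values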

Step 2.b: Groupby, Counts and Statistics

Read the guide on statistics here.

  • Count groups to see the significance across results
    data.groupby('Gender').count()
  • Return the mean of the values over the requested axis.
    data.groupby('Gender').mean()
  • Standard Deviation
    • Standard deviation is a measure of how dispersed (spread) the data is in relation to the mean.
    • Low standard deviation means data is close to the mean.
    • High standard deviation means data is spread out.
    data.groupby('Gender').std()
  • Box plots
    • Box plots are a great way to visualize descriptive statistics
    • Notice that Q1: 25%, Q2: 50%, Q3: 75%
  • Make a box plot of the DataFrame columns with plot.box() or boxplot()
    data.boxplot()

Step 2.c: Visualize data

Read the guide on visualization for data science here.

Simple Plot

import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
data = pd.read_csv('files/WorldBank-ATM.CO2E.PC_DS2.csv', index_col=0)
data['USA'].plot()
  • Adding title and labels
    • title='Title' adds the title
    • xlabel='X label' adds or changes the X-label
    • ylabel='Y label' adds or changes the Y-label
      data['USA'].plot(title='US CO2 per capita', ylabel='CO2 (metric tons per capita)')
  • Adding ranges
    • xlim=(min, max) or xlim=min Sets the x-axis range
    • ylim=(min, max) or ylim=min Sets the y-axis range
      data['USA'].plot(title='US CO2 per capita', ylabel='CO2 (metric tons per capita)', ylim=0)
  • Comparing data
    data[['USA', 'WLD']].plot(ylim=0)

Scatter Plot

  • Good to see any connection
    data = pd.read_csv('files/sample_corr.csv')
    data.plot.scatter(x='x', y='y')

Histogram

  • Identifying quality
    data = pd.read_csv('files/sample_height.csv')
    data.plot.hist()
  • Identifying outliers
    data = pd.read_csv('files/sample_age.csv')
    data.plot.hist()
  • Setting bins and figsize
    data = pd.read_csv('files/WorldBank-ATM.CO2E.PC_DS2.csv', index_col=0)
    data['USA'].plot.hist(figsize=(20,6), bins=10)

Bar Plot

  • Normal plot
    data = pd.read_csv('files/WorldBank-ATM.CO2E.PC_DS2.csv', index_col=0)
    data['USA'].plot.bar()
  • Range and columns, figsize and label
    data[['USA', 'DNK']].loc[2000:].plot.bar(figsize=(20,6), ylabel='CO emmission per capita')

Pie Chart

  • Presenting
    df = pd.Series(data=[3, 5, 7], index=['Data1', 'Data2', 'Data3'])
    df.plot.pie()
  • Value counts in Pie Charts
    • colors=<list of colors>
    • labels=<list of labels>
    • title='<title>'
    • ylabel='<label>'
    • autopct='%1.1f%%' sets percentages on chart
      (data['USA'] < 17.5).value_counts().plot.pie(colors=['r', 'g'], labels=['>= 17.5', '< 17.5'], title='CO2', autopct='%1.1f%%')

Step 2.d: Clean data

Read the data cleaning guide here.

  • dropna() Remove missing values.
  • fillna() Fill NA/NaN values using the specified method.
    • Example: Fill missing values with mean.
      data = data.fillna(data.mean())
  • drop_duplicates() Return DataFrame with duplicate rows removed.
  • Working with time series
    • reindex() Conform Series/DataFrame to new index with optional filling logic.
    • interpolate() Fill NaN values using an interpolation method.
  • Resources
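
A small sketch of how these cleaning steps could look on a time series (the example data here is made up for illustration):

import pandas as pd
import numpy as np

# hypothetical daily values with gaps
s = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0],
              index=pd.date_range('2021-01-01', periods=5, freq='D'))
print(s.dropna())          # remove missing values
print(s.fillna(s.mean()))  # fill with the mean
print(s.interpolate())     # fill by interpolation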

Step 3: Analyze

  • Feature selection
  • Model selection
  • Analyze data

Step 3.a: Split into Train and Test

For an introduction to Machine Learning read this guide.

  • Assign the independent features (the predictors) to X
  • Assign the dependent feature (labels/classes, what we predict) to y
  • Divide into training and test sets
    from sklearn.model_selection import train_test_split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Step 3.b: Feature Scaling

Learn about Feature Scaling in this guide.

  • Feature Scaling transforms values into a similar range so machine learning algorithms behave optimally.
  • Features spanning different magnitudes can be a problem for Machine Learning algorithms working on multiple features.
  • Feature Scaling can also make it easier to compare results.

Feature Scaling Techniques

  • Normalization is a special case of MinMaxScaler
    • Normalization: Converts values between 0-1
      (values - values.min())/(values.max() - values.min())
    • MinMaxScaler: Between any values
  • Standardization (StandardScaler from sklearn)
    • Mean: 0, StdDev: 1
      (values - values.mean())/values.std()
    • Less sensitive to outliers

Normalization

  • MinMaxScaler Transform features by scaling each feature to a given range.
  • MinMaxScaler().fit(X_train) is used to create a scaler.
    • Notice: We only fit the scaler on the training data
      from sklearn.preprocessing import MinMaxScaler
      norm = MinMaxScaler().fit(X_train)
      X_train_norm = norm.transform(X_train)
      X_test_norm = norm.transform(X_test)

Standardization

  • StandardScaler Standardize features by removing the mean and scaling to unit variance.
    from sklearn.preprocessing import StandardScaler
    scale = StandardScaler().fit(X_train)
    X_train_stand = scale.transform(X_train)
    X_test_stand = scale.transform(X_test)

Step 3.c: Feature Selection

Learn about Feature Selection in this guide.

  • Feature selection is about selecting attributes that have the greatest impact towards the problem you are solving.

Why Feature Selection?

  • Higher accuracy
  • Simpler models
  • Reducing overfitting risk

Feature Selection Techniques

Filter methods
  • Independent of Model
  • Based on statistical scores
  • Easy to understand
  • Good for early feature removal
  • Low computational requirements
Examples
Wrapper methods
  • Compare different subsets of features and run the model on them
  • Basically a search problem
Examples

See more on wikipedia

Embedded methods
  • Find features that contribute most to the accuracy of the model while it is created
  • Regularization is the most common method – it penalizes higher complexity
Examples

Remove constant and quasi constant features

  • VarianceThreshold Feature selector that removes all low-variance features.
    from sklearn.feature_selection import VarianceThreshold
    sel = VarianceThreshold()
    sel.fit_transform(data)

Remove correlated features

  • The goal is to find and remove correlated features
  • Calculate the correlation matrix (assign it to corr_matrix)
  • A feature is correlated to any previous feature if the following is true
    • Notice that we use correlation 0.8
      corr_features = [feature for feature in corr_matrix.columns if (corr_matrix[feature].iloc[:corr_matrix.columns.get_loc(feature)] > 0.8).any()]

Step 3.d: Model Selection

Learn about Model Selection in this guide.

  • The process of selecting the model among a collection of candidate machine learning models

Problem type

  • What kind of problem are you looking into?
    • Classification: Predict labels on data with predefined classes
      • Supervised Machine Learning
    • Clustering: Identify similarities between objects and group them in clusters
      • Unsupervised Machine Learning
    • Regression: Predict continuous values
      • Supervised Machine Learning
  • Resource: Sklearn cheat sheet

Model Selection Techniques

  • Probabilistic Measures: Scoring by performance and complexity of model.
  • Resampling Methods: Splitting in sub-train and sub-test datasets and scoring by mean values of repeated runs.
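
As a small sketch of a resampling method (cross-validation with scikit-learn; X_train and y_train follow the examples below and are assumed to exist):

from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression

lin = LinearRegression()
scores = cross_val_score(lin, X_train, y_train, cv=5)  # 5 sub-train/sub-test splits
print(scores.mean())  # score by the mean value of the repeated runs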

A few models

  • LinearRegression Ordinary least squares Linear Regression (Lesson 08).
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import r2_score
    lin = LinearRegression()
    lin.fit(X_train, y_train)
    y_pred = lin.predict(X_test)
    r2_score(y_test, y_pred)
  • SVC C-Support Vector Classification (Lesson 10).
    from sklearn.svm import SVC, LinearSVC
    from sklearn.metrics import accuracy_score
    svc = LinearSVC()
    svc.fit(X_train, y_train)
    y_pred = svc.predict(X_test)
    accuracy_score(y_test, y_pred)
  • KNeighborsClassifier Classifier implementing the k-nearest neighbors vote (Lesson 10).
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.metrics import accuracy_score
    neigh = KNeighborsClassifier()
    neigh.fit(X_train.fillna(-1), y_train)
    y_pred = neigh.predict(X_test.fillna(-1))
    accuracy_score(y_test, y_pred)

Step 3.e: Analyze Result

This is the main check-point of your analysis.

  • Review the Problem and Data Science problem you started with.
    • The analysis should add value to the Data Science Problem
    • Sometimes our focus drifts – we need to ensure alignment with original Problem.
    • Go back to the Exploration of the Problem – does the result add value to the Data Science Problem and the initial Problem (which formed the Data Science Problem)
    • Example: As Data Scientist we often find the research itself valuable, but a business is often interested in increasing revenue, customer satisfaction, brand value, or similar business metrics.
  • Did we learn anything?
    • Does the Data-Driven Insights add value?
    • Example: Does it add value to have evidence for: Wealthy people buy more expensive cars.
      • It might add value to confirm this hypothesis, but does it add any value for a car manufacturer?
  • Can we make any valuable insights from our analysis?
    • Do we need more/better/different data?
    • Can we give any Actionable Data Driven Insights?
    • It is always easy to want better and more accurate high quality data.
  • Do we have the right features?
    • Do we need to eliminate features?
    • Is the data cleaning appropriate?
    • Is data quality as expected?
  • Do we need to try different models?
    • Data Analysis is an iterative process
    • Simpler models are more powerful
  • Can result be inconclusive?
    • Can we still give recommendations?

Quote

“It is a capital mistake to theorize before one has data. Insensibly one begins to twist facts to suit theories, instead of theories to suit facts.”

  • Sherlock Holmes

Iterative Research Process

  • Observation/Question: Starting point (could be iterative)
  • Hypothesis/Claim/Assumption: Something we believe could be true
  • Test/Data collection: We need to gather relevant data
  • Analyze/Evidence: Based on data collection did we get evidence?
    • Can our model predict? (a model is first useful when it can predict)
  • Conclude: Warning! E.g.: We can conclude a correlation (this does not mean A causes B)
    • Example: Based on the collected data we can see a correlation between A and B

Step 4: Report

  • Present findings
  • Visualize results
  • Credibility counts

Step 4.a: Present Findings

  • You need to sell or tell a story with the findings.
  • Who is your audience?
    • Focus on technical level and interest of your audience
    • Speak their language
    • Story should make sense to audience
    • Examples
      • Team manager: Might be technical, but often busy and only interested in high-level status and key findings.
      • Data engineer/science team: Technical exploration and similar interest as you
      • Business stakeholders: This might be end-customers or collaboration in other business units.
  • When presenting
    • Goal: Communicate actionable insights to key stakeholders
    • Outline (inspiration):
      • TL;DR (Too-long; Didn’t read) – clear and concise summary of the content (often one line) that frames key insights in the context of impact on key business metrics.
      • Start with your understanding of the business problem
      • How does it transform into a Data Science Problem
      • How you will measure impact – what business metrics are indicators of results
      • What data is available and used
      • Presenting the hypothesis of the research
      • A visual presentation of the insights (model/analysis/key findings)
        • This is where you present the evidence for the insights
      • How to use insight and create actions
      • Followup and continuous learning increasing value

Step 4.b: Visualize Results

  • Telling a story with the data
  • This is where you convince that the findings/insights are correct
  • The right visualization is important
    • Example: A correlation matrix might give a Data Engineer insights into how findings were discovered, but confuse business partners.

Resources for visualization

  • Seaborn Seaborn is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.
  • Plotly open-source for analytic apps in Python
  • Folium makes it easy to visualize data that’s been manipulated in Python on an interactive leaflet map.

Step 4.c: Credibility Counts

  • This is the check point if your research is valid
    • Are you hiding findings you did not like (not supporting your hypothesis)?
    • Remember it is the long-term relationship that counts
  • Don’t leave out results
    • We learn from data and find hidden patterns, to make data-driven decisions, with a long-term perspective

Step 5: Actions

  • Use insights
  • Measure impact
  • Main goal

Step 5.a: Use Insights

  • How do we follow up on the presented Insights?
  • No one-size-fits-all: It depends on the Insights and Problem
  • Examples:
    1. Problem: What customers are most likely to cancel subscription?
      • Say, we have insufficient knowledge of customers, and need to get more, hence we have given recommendations to gather more insights
      • But you should still try to add value
    2. Problem: Here is our data – find valuable insights!
      • This is a challenge as there is no given focus
      • An iterative process involving the customer can leave you with no surprises

Step 5.b: Measure Impact

  • If the customer cannot measure the impact of your work – they do not know what they pay for.
    • If you cannot measure it – you cannot know if the hypothesis is correct.
    • A model is first valuable when it can be used to predict with some certainty
  • There should be identified metrics/indicators to evaluate in the report
  • This can evolve – we learn along the way – or we could be wrong.
  • How long before we expect to see impact on identified business metrics?
  • What if we do not see expected impact?
  • Understanding of metrics
    • The metrics we measure are indicators that our hypothesis is correct
    • Other aspects can have impact on the result – but you need to identify that

Main Goal

  • Your success as a Data Scientist is to create valuable, actionable insights

A great way to think

  • Any business/organisation can be thought of as a complex system
    • Nobody understands it perfectly and it evolves organically
  • Data describes some aspect of it
  • It can be thought of as a black-box
  • Any insights you can bring is like a window that sheds light on what happens inside

Want to learn more?

Want to learn more about Data Science to become a successful Data Scientist?

This is one lesson of a 15 part Expert Data Science Blueprint course with the following resources.

  • 15 video lessons – covering the Data Science Workflow and concepts, demonstrating everything on real data, introducing projects and showing a solution (YouTube video).
  • 30 Jupyter Notebooks – with the full code and explanation from the lectures and projects (GitHub).
  • 15 projects – structured with the Data Science Workflow and a solution explained at the end of the video lessons (GitHub).