The Ultimate Data Science Workflow Template

What will you learn?

Data Science Workflow

When it comes to creating a good Data Science project, you need to cover a wide range of aspects. This template will show you what to cover and where to find more information on each topic.

A common pitfall for junior Data Scientists is to focus only on the very technical part of the Data Science Workflow. To add real value to clients you need to cover more steps, which are often neglected.

This guide will walk you through all the steps, elaborate on each, and link to in-depth content if you need more explanation.

Step 1: Acquire

  • Explore problem
  • Identify data
  • Import data

Step 1.a: Define Problem

If you are making a hobby project, there might not be a definition of what you are trying to solve. But it is always good practice to start with one. Otherwise, you will most likely just do what you usually do and feel comfortable with. Try to sit down and figure it out.

It should be clear that this step comes before you have the data. That said, it often happens that a company has data and doesn’t know what to use it for.

Still, it all starts by defining a problem.

Here are some guidelines.

  • When defining a problem, don’t be too ambitious
    • Examples:
      • A green energy windmill producer needs to optimize distribution and needs better predictions of production based on weather forecasts
      • An online news outlet is interested in a story on how CO2 per capita around the world has evolved over the years
    • Both projects are difficult
      • For the windmill we would need data on production, maintenance periods, and detailed weather data, just to get started.
      • The data for CO2 per capita is available from the World Bank, but creating a visual story is difficult with our current capabilities
  • Hence, formulate a more realistic research problem
    • You can start by considering a dataset and getting inspiration from it
    • Examples of datasets
    • Example of a problem
      • What is the highest rated movie genre?
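
To make the example concrete, here is a minimal sketch of how such a question could be answered with pandas; the tiny DataFrame and its 'Genre' and 'Rating' columns are made up purely for illustration:

import pandas as pd

# Hypothetical ratings data
movies = pd.DataFrame({'Genre': ['Drama', 'Drama', 'Comedy', 'Action'],
                       'Rating': [8.1, 7.9, 6.5, 7.0]})

# Average rating per genre, highest first
movies.groupby('Genre')['Rating'].mean().sort_values(ascending=False)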

Data Science: Understanding the Problem

  • Get the right question:
    • What is the problem we are trying to solve?
    • This forms the Data Science problem
    • Examples
      • Sales figures and call center logs: evaluate a new product
      • Sensor data from multiple sensors: detect equipment failure
      • Customer data + marketing data: better targeted marketing
  • Assess situation
    • Risks, Benefits, Contingencies, Regulations, Resources, Requirements
  • Define goal
    • What is the objective?
    • What are the success criteria?
  • Conclusion
    • Defining the problem is key to successful Data Science projects

Step 1.b: Import libraries

When you work on a project, you need somewhere to keep the data. A great place to start is with pandas.

If you work in a Jupyter Notebook you can run this in a cell to get started and follow this guide.

import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

Step 1.c: Identify the Data

Great Places to Find Data

Step 1.d: Import Data

Read CSV files (Learn more here)
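
  • read_csv() Read a comma-separated values (csv) file into DataFrame.
  • A minimal sketch, assuming a CSV version of the Apple data used in the Excel example below (the file path is an assumption):
data = pd.read_csv('files/aapl.csv', index_col='Date', parse_dates=True)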

Excel files  (Learn more here)

  • Most widely used spreadsheet format
  • Learn more about Excel processing in this lecture
  • read_excel() Read an Excel file into a pandas DataFrame.
data = pd.read_excel('files/aapl.xlsx', index_col='Date')

Parquet files  (Learn more here)

  • Parquet is a free open source format
  • Compressed format
  • read_parquet() Load a parquet object from the file path, returning a DataFrame.
data = pd.read_parquet('files/aapl.parquet')

Web Scraping (Learn more here)

  • Extracting data from websites
  • Legal issues: wikipedia.org
  • read_html() Read HTML tables into a list of DataFrame objects.
url = "https://en.wikipedia.org/wiki/Wikipedia:Fundraising_statistics"
data = pd.read_html(url)

Databases (Learn more here)

  • read_sql() Read SQL query or database table into a DataFrame.
  • The sqlite3 module is an interface for SQLite databases.
import sqlite3
import pandas as pd
conn = sqlite3.connect('files/dallas-ois.sqlite')
data = pd.read_sql('SELECT * FROM officers', conn)

Step 1.e: Combine data

Also see guide here.

  • Often you need to combine data from different sources

pandas DataFrames

  • pandas DataFrames can combine data (pandas cheat sheet)
  • concat() Concatenate pandas objects along a particular axis.
pd.concat([df1, df2], axis=0)
  • join() Join columns of another DataFrame.
df.join(other.set_index('key'), on='key')
  • merge() Merge DataFrame or named Series objects with a database-style join.
df1.merge(df2, how='inner', on='a')
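
A minimal, self-contained sketch of the three calls above; the two small DataFrames and their 'key', 'sales' and 'costs' columns are made up for illustration:

import pandas as pd

# Two made-up DataFrames sharing a 'key' column
df1 = pd.DataFrame({'key': ['a', 'b', 'c'], 'sales': [10, 20, 30]})
df2 = pd.DataFrame({'key': ['a', 'b', 'd'], 'costs': [5, 7, 9]})

pd.concat([df1, df2], axis=0)             # stack the rows of both frames
df1.join(df2.set_index('key'), on='key')  # join the columns of df2 onto df1 by 'key'
df1.merge(df2, how='inner', on='key')     # database-style inner join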

Step 2: Prepare

  • Explore data
  • Visualize ideas
  • Cleaning data

Step 2.a: Explore data

  • head() Return the first n rows.
  • .shape Return a tuple representing the dimensionality of the DataFrame.
  • .dtypes Return the dtypes in the DataFrame.
  • info() Print a concise summary of a DataFrame.
  • describe() Generate descriptive statistics.
  • isna().any() Returns whether any element is missing.
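
A minimal sketch applying these calls; it reuses the World Bank CO2 file loaded later in this guide, but any DataFrame works (assumes pandas is imported as in Step 1.b):

data = pd.read_csv('files/WorldBank-ATM.CO2E.PC_DS2.csv', index_col=0)
data.head()        # first rows
data.shape         # (rows, columns)
data.dtypes        # column types
data.info()        # concise summary
data.describe()    # descriptive statistics
data.isna().any()  # columns with missing values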

Step 2.b: Groupby, Counts and Statistics

Read the guide on statistics here.

  • Count groups to see the significance across results
data.groupby('Gender').count()
  • Return the mean of the values over the requested axis.
data.groupby('Gender').mean()
  • Standard Deviation
    • Standard deviation is a measure of how dispersed (spread) the data is in relation to the mean.
    • Low standard deviation means data is close to the mean.
    • High standard deviation means data is spread out.
  • data.groupby('Gender').std()
  • Box plots
    • Box plots are a great way to visualize descriptive statistics
    • Notice that Q1: 25%, Q2: 50%, Q3: 75%
  • Make a box plot of the DataFrame columns with plot.box()
data.boxplot()

Step 2.c: Visualize data

Read the guide on visualization for data science here.

Simple Plot

import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
data = pd.read_csv('files/WorldBank-ATM.CO2E.PC_DS2.csv', index_col=0)
data['USA'].plot()
  • Adding title and labels
    • title='Title' adds the title
    • xlabel='X label' adds or changes the X-label
    • ylabel='Y label' adds or changes the Y-label
data['USA'].plot(title='US CO2 per capita', ylabel='CO2 (metric tons per capita)')
  • Adding ranges
    • xlim=(min, max) or xlim=min Sets the x-axis range
    • ylim=(min, max) or ylim=min Sets the y-axis range
data['USA'].plot(title='US CO2 per capita', ylabel='CO2 (metric tons per capita)', ylim=0)
  • Comparing data
data[['USA', 'WLD']].plot(ylim=0)

Scatter Plot

  • Good to see any connection
data = pd.read_csv('files/sample_corr.csv')
data.plot.scatter(x='x', y='y')

Histogram

  • Identifying quality
data = pd.read_csv('files/sample_height.csv')
data.plot.hist()
  • Identifying outliers
data = pd.read_csv('files/sample_age.csv')
data.plot.hist()
  • Setting bins and figsize
data = pd.read_csv('files/WorldBank-ATM.CO2E.PC_DS2.csv', index_col=0)
data['USA'].plot.hist(figsize=(20,6), bins=10)

Bar Plot

  • Normal plot
data = pd.read_csv('files/WorldBank-ATM.CO2E.PC_DS2.csv', index_col=0)
data['USA'].plot.bar()
  • Range and columns, figsize and label
data[['USA', 'DNK']].loc[2000:].plot.bar(figsize=(20,6), ylabel='CO2 emission per capita')

Pie Chart

  • Presenting
df = pd.Series(data=[3, 5, 7], index=['Data1', 'Data2', 'Data3'])
df.plot.pie()
  • Value counts in Pie Charts
    • colors=<list of colors>
    • labels=<list of labels>
    • title='<title>'
    • ylabel='<label>'
    • autopct='%1.1f%%' sets percentages on the chart
(data['USA'] < 17.5).value_counts().plot.pie(colors=['r', 'g'], labels=['>= 17.5', '< 17.5'], title='CO2', autopct='%1.1f%%')

Step 2.d: Clean data

Read the data cleaning guide here.

  • dropna() Remove missing values.
  • fillna() Fill NA/NaN values using the specified method.
    • Example: Fill missing values with the mean.
data = data.fillna(data.mean())
  • drop_duplicates() Return DataFrame with duplicate rows removed.
  • Working with time series
    • reindex() Conform Series/DataFrame to new index with optional filling logic.
    • interpolate() Fill NaN values using an interpolation method.
  • Resources
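
A minimal sketch of a typical cleaning pass on a time series; the small DataFrame, its daily index and the gap are made up purely for illustration:

import pandas as pd

# Made-up daily series with a missing value, a missing date and a duplicate row
data = pd.DataFrame({'value': [1.0, None, 3.0, 3.0]},
                    index=pd.to_datetime(['2021-01-01', '2021-01-02', '2021-01-04', '2021-01-04']))

data = data.drop_duplicates()                                   # remove duplicate rows
data = data.reindex(pd.date_range('2021-01-01', '2021-01-04'))  # conform to a complete daily index
data = data.interpolate()                                       # fill NaN values by interpolation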

Step 3: Analyze

  • Feature selection
  • Model selection
  • Analyze data

Step 3.a: Split into Train and Test

For an introduction to Machine Learning read this guide.

  • Assign the independent features (the predictors) to X
  • Assign the classes (labels, the dependent variable) to y
  • Divide into training and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Step 3.b: Feature Scaling

Learn about Feature Scaling in this guide.

  • Feature Scaling transforms values into a similar range so that machine learning algorithms behave optimally.
  • Features spanning different magnitudes can be a problem for Machine Learning algorithms.
  • Feature Scaling can also make it easier to compare results.

Feature Scaling Techniques

  • Normalization is a special case of MinMaxScaler
    • Normalization: Converts values to between 0 and 1
(values - values.min())/(values.max() - values.min())
    • MinMaxScaler: Between any values
  • Standardization (StandardScaler from sklearn)
    • Mean: 0, StdDev: 1
(values - values.mean())/values.std()
    • Less sensitive to outliers

Normalization

  • MinMaxScaler Transform features by scaling each feature to a given range.
  • MinMaxScaler().fit(X_train) is used to create a scaler.
    • Notice: We only fit the scaler on the training data
from sklearn.preprocessing import MinMaxScaler
norm = MinMaxScaler().fit(X_train)
X_train_norm = norm.transform(X_train)
X_test_norm = norm.transform(X_test)

Standardization

  • StandardScaler Standardize features by removing the mean and scaling to unit variance.
from sklearn.preprocessing import StandardScaler
scale = StandardScaler().fit(X_train)
X_train_stand = scale.transform(X_train)
X_test_stand = scale.transform(X_test)

Step 3.c: Feature Selection

Learn about Feature Selection in this guide.

  • Feature selection is about selecting the attributes that have the greatest impact on the problem you are solving.

Why Feature Selection?

  • Higher accuracy
  • Simpler models
  • Reducing overfitting risk

Feature Selection Techniques

Filter methods
  • Independent of the model
  • Based on scores from statistical tests
  • Easy to understand
  • Good for early feature removal
  • Low computational requirements
  • Examples: variance threshold and correlation (both used below)
Wrapper methods
  • Compare different subsets of features and run the model on them
  • Basically a search problem
  • Examples

See more on wikipedia

Embedded methods
  • Find features that contribute most to the accuracy of the model while it is created
  • Regularization is the most common method – it penalizes higher complexity
  • Examples: see the sketch after this list
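
A minimal, hedged sketch of a filter method and an embedded method; the synthetic dataset and the choice of SelectKBest and an L1-regularized model are illustrative assumptions, not taken from the lectures:

from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif, SelectFromModel
from sklearn.linear_model import LogisticRegression

# Synthetic data: 10 features, only 4 of them informative (assumption for illustration)
X, y = make_classification(n_samples=200, n_features=10, n_informative=4, random_state=42)

# Filter method: score each feature with a statistical test, independent of any model
X_filtered = SelectKBest(score_func=f_classif, k=4).fit_transform(X, y)

# Embedded method: L1 regularization zeroes out weak features while the model is trained
embedded = SelectFromModel(LogisticRegression(penalty='l1', solver='liblinear'))
X_embedded = embedded.fit_transform(X, y)

print(X_filtered.shape, X_embedded.shape)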

Remove constant and quasi constant features

  • VarianceThreshold Feature selector that removes all low-variance features.
from sklearn.feature_selection import VarianceThreshold
sel = VarianceThreshold()
sel.fit_transform(data)

Remove correlated features

  • The goal is to find and remove correlated features
  • Calculate the correlation matrix (assign it to corr_matrix)
  • A feature is correlated to any previous feature if the following is true
    • Notice that we use a correlation threshold of 0.8
corr_features = [feature for feature in corr_matrix.columns if (corr_matrix[feature].iloc[:corr_matrix.columns.get_loc(feature)] > 0.8).any()]
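
A short usage sketch tying the pieces together; building corr_matrix with corr() and dropping the columns are assumptions not shown in the original snippet, and data is assumed to be a numeric DataFrame:

corr_matrix = data.corr()                # pairwise correlations between numeric features
# ...build corr_features with the list comprehension above...
data = data.drop(columns=corr_features)  # keep only features not highly correlated with an earlier one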

Step 3.d: Model Selection

Learn about Model Selection in this guide.

  • The process of selecting the model among a collection of candidate machine learning models

Problem type

  • What kind of problem are you looking into?
    • Classification: Predict labels on data with predefined classes
      • Supervised Machine Learning
    • Clustering: Identify similarities between objects and group them in clusters
      • Unsupervised Machine Learning
    • Regression: Predict continuous values
      • Supervised Machine Learning
  • Resource: Sklearn cheat sheet

Model Selection Techniques

  • Probabilistic Measures: Scoring by performance and complexity of model.
  • Resampling Methods: Splitting in sub-train and sub-test datasets and scoring by mean values of repeated runs.
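
A minimal, hedged sketch of the resampling approach, comparing two candidate classifiers by their mean cross-validation score; the synthetic dataset and the choice of candidates are illustrative assumptions:

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data just to make the sketch runnable
X, y = make_classification(n_samples=300, n_features=8, random_state=42)

# Resampling: score each candidate by the mean over repeated train/validation splits
for model in [LinearSVC(), KNeighborsClassifier()]:
    scores = cross_val_score(model, X, y, cv=5)
    print(type(model).__name__, scores.mean())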

A few models

  • LinearRegression Ordinary least squares Linear Regression (Lesson 08).
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
lin = LinearRegression()
lin.fit(X_train, y_train)
y_pred = lin.predict(X_test)
r2_score(y_test, y_pred)
  • SVC C-Support Vector Classification (Lesson 10).
from sklearn.svm import SVC, LinearSVC
from sklearn.metrics import accuracy_score
svc = LinearSVC()
svc.fit(X_train, y_train)
y_pred = svc.predict(X_test)
accuracy_score(y_test, y_pred)
  • KNeighborsClassifier Classifier implementing the k-nearest neighbors vote (Lesson 10).
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
neigh = KNeighborsClassifier()
neigh.fit(X_train.fillna(-1), y_train)
y_pred = neigh.predict(X_test.fillna(-1))
accuracy_score(y_test, y_pred)

Step 3.e: Analyze Result

This is the main check-point of your analysis.

  • Review the Problem and Data Science problem you started with.
    • The analysis should add value to the Data Science Problem
    • Sometimes our focus drifts – we need to ensure alignment with the original Problem.
    • Go back to the Exploration of the Problem – does the result add value to the Data Science Problem and the initial Problem (which formed the Data Science Problem)?
    • Example: As Data Scientists we often find the research itself valuable, but a business is often more interested in increasing revenue, customer satisfaction, brand value, or similar business metrics.
  • Did we learn anything?
    • Does the Data-Driven Insights add value?
    • Example: Does it add value to have evidence that wealthy people buy more expensive cars?
      • It might add value for you to confirm this hypothesis, but does it add any value for a car manufacturer?
  • Can we make any valuable insights from our analysis?
    • Do we need more/better/different data?
    • Can we give any Actionable Data Driven Insights?
    • It is always easy to want better, more accurate, higher-quality data.
  • Do we have the right features?
    • Do we need to eliminate features?
    • Is the data cleaning appropriate?
    • Is data quality as expected?
  • Do we need to try different models?
    • Data Analysis is an iterative process
    • Simpler models are often more powerful
  • Can the result be inconclusive?
    • Can we still give recommendations?

Quote

“It is a capital mistake to theorize before one has data. Insensibly one begins to twist facts to suit theories, instead of theories to suit facts.”

  • Sherlock Holmes

Iterative Research Process

  • Observation/Question: Starting point (could be iterative)
  • Hypothesis/Claim/Assumption: Something we believe could be true
  • Test/Data collection: We need to gather relevant data
  • Analyze/Evidence: Based on the data collection, did we get evidence?
    • Can our model predict? (a model is first useful when it can predict)
  • Conclude
    • Warning! E.g.: We can conclude a correlation (this does not mean A causes B)
    • Example: Based on the collected data we can see a correlation between A and B

Step 4: Report

  • Present findings
  • Visualize results
  • Credibility counts

Step 4.a: Present Findings

  • You need to sell or tell a story with the findings.
  • Who is your audience?
    • Focus on technical level and interest of your audience
    • Speak their language
    • Story should make sense to audience
    • Examples
      • Team manager: Might be technical, but often busy and only interested in high-level status and key findings.
      • Data engineer/science team: Technical exploration and similar interests to yours
      • Business stakeholders: These might be end-customers or collaborators in other business units.
  • When presenting
    • Goal: Communicate actionable insights to key stakeholders
    • Outline (inspiration):
      • TL;DR (Too-long; Didn’t read) – clear and concise summary of the content (often one line) that frames key insights in the context of impact on key business metrics.
      • Start with your understanding of the business problem
      • How does it transform into a Data Science Problem
      • How you will measure impact – which business metrics are indicators of results
      • What data is available and used
      • Present the hypothesis of the research
      • A visual presentation of the insights (model/analysis/key findings)
        • This is where you present the evidence for the insights
      • How to use insight and create actions
      • Follow-up and continuous learning to increase value

Step 4.b: Visualize Results

  • Telling a story with the data
  • This is where you convince that the findings/insights are correct
  • The right visualization is important
    • Example: A correlation matrix might give a Data Engineer insight into how findings were discovered, but confuse business partners.

Resources for visualization

  • Seaborn Seaborn is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.
  • Plotly is an open-source library for analytic apps in Python
  • Folium makes it easy to visualize data that’s been manipulated in Python on an interactive leaflet map.

Step 4.c: Credibility Counts

  • This is the checkpoint where you ensure your research is valid
    • Are you hiding findings you did not like (not supporting your hypothesis)?
    • Remember it is the long-term relationship that counts
  • Don’t leave out results
    • We learn from data and find hidden patterns, to make data-driven decisions, with a long-term perspective

Step 5: Actions

  • Use insights
  • Measure impact
  • Main goal

Step 5.a: Use Insights

  • How do we follow up on the presented Insights?
  • No one-size-fits-all: It depends on the Insights and Problem
  • Examples:
    1. Problem: What customers are most likely to cancel subscription?
      • Say we have insufficient knowledge of the customers and need more; hence we have given recommendations to gather more insights
      • But you should still try to add value
    2. Problem: Here is our data – find valuable insights!
      • This is a challenge as there is no given focus
      • An iterative process involving the customer can leave you with no surprises

Step 5.b: Measure Impact

  • If the customer cannot measure the impact of your work, they do not know what they are paying for.
    • If you cannot measure it, you cannot know whether the hypothesis is correct.
    • A model only becomes valuable when it can be used to predict with some certainty
  • The report should identify metrics/indicators to evaluate
  • This can evolve – we learn along the way – or we could be wrong.
  • How long before we expect to see impact on identified business metrics?
  • What if we do not see expected impact?
  • Understanding of metrics
    • The metrics we measure are indicators that our hypothesis is correct
    • Other aspects can have an impact on the result – but you need to identify them

Main Goal

  • Your success as a Data Scientist is measured by creating valuable, actionable insights

A great way to think

  • Any business/organisation can be thought of as a complex system
    • Nobody understands it perfectly and it evolves organically
  • Data describes some aspect of it
  • It can be thought of as a black-box
  • Any insight you can bring is like a window that sheds light on what happens inside

Want to learn more?

Want to learn more about Data Science to become a successful Data Scientist?

This is one lesson of a 15 part Expert Data Science Blueprint course with the following resources.

  • 15 video lessons – covering the Data Science Workflow and concepts, demonstrating everything on real data, introducing projects and showing a solution (YouTube video).
  • 30 Jupyter Notebooks – with the full code and explanations from the lectures and projects (GitHub).
  • 15 projects – structured with the Data Science Workflow and a solution explained at the end of the video lessons (GitHub).
