    Master the Data Science Workflow Blueprint to Get Measurable Data Driven Impact

    Unleash the Power of Data Science: Master the Data Science Workflow

    Why It’s Great to Master Data Science:

    • Data-driven Decision Making: Data science empowers you to make informed decisions by leveraging insights from large and complex datasets. Mastering data science techniques equips you with the skills to extract valuable information and uncover patterns, trends, and correlations within data, enabling you to drive data-driven decision making.
    • Solving Real-world Problems: Data science allows you to tackle real-world challenges by applying analytical and computational methods to extract actionable insights. By mastering the data science workflow, you can effectively address complex problems and provide practical solutions across various domains.
    • Unlocking Opportunities: Data science is in high demand across industries, offering abundant career opportunities. By mastering data science, you position yourself for success in a rapidly evolving field, opening doors to exciting roles such as data analyst, data scientist, or machine learning engineer.

    Topics Covered in This Tutorial

    1. Why Data Science: Understand the significance and benefits of data science in today’s data-driven world. Explore the value of leveraging data for decision making and problem-solving.
    2. Problem Understanding: Learn how to approach a problem as a data scientist. Gain insights into problem formulation, data requirements, and goal definition to ensure effective data analysis.
    3. The Data Science Workflow: Dive into the step-by-step data science workflow, which encompasses data acquisition, data exploration, data cleaning, feature engineering, model building, evaluation, and deployment.
    4. Practical Example: Apply the data science workflow to solve a Student Grade Prediction problem. Utilize Python and pandas to explore and analyze the data, perform necessary preprocessing, build predictive models, and evaluate their performance.
    5. Career Opportunities: Discover the vast range of career opportunities in the field of data science. Explore the skills and knowledge required to excel in data-driven roles and embark on a successful data science career journey.

    Unlock the potential of data science and embark on a rewarding journey as a skilled data scientist. Master the data science workflow and unleash the power of data-driven insights.

    Part 1: Why Data Science?

    Did you know you check your phone 58 times per day?

    Let’s say you are awake 16 hours a day – that means you check your phone roughly every 17 minutes during all your waking hours.

    Estimates suggest that 66% of all smartphone users are addicted to their phones.

    Does that surprise you?

    How do we know that?


    We live in a world where statements like the ones above need not be wild guesses: there is data to back them up.

    This tutorial is not about helping your phone addiction – it is about Data Science.

    In a world full of data you can learn about almost anything, make your own analysis, and understand a subject better. You can help make data-driven decisions and avoid blind guesses.

    This is one reason to love Data Science.

    How did Data Science start?

    Part 2: Understanding the problem in Data Science

    The key to success in Data Science is understanding the problem. Ask the right question.

    What is the problem we are trying to solve? The answer forms the Data Science problem.

    Some examples of data and the problem it can help solve:

    • Sales figures and call center logs: evaluate a new product
    • Sensor data from multiple sensors: detect equipment failure
    • Customer data + marketing data: better targeted marketing

    Part of understanding the problem is to assess the situation – this will help you understand your context and your problem better.

    In the end, it is all about defining the objective of your Data Science research. What are the success criteria?

    The key to a successful Data Science project is to understand the objective and the success criteria. They will guide you throughout your research.

    Part 3: Data Science Workflow

    Most get Data Science wrong!

    At least, at first.

    Deadly wrong!

    They assume – not to blame them – that Data Science is about knowing the most tools to solve the problem.

    This series of tutorials will teach you something different.

    The key to being a successful Data Scientist is to understand the Data Science Workflow.

    Data Science Workflow

    Looking at the above flow, you will realize that most beginners only focus on a narrow aspect of it.

    That is a big mistake – the real value is in step 5, where you turn data-driven insights into measurable goals.

    Let’s take an example of what a simple Data Science Workflow could look like.

    • Step 1
      • Problem: Predict weather tomorrow
      • Data: Time series on temperature, air pressure, humidity, rain, wind speed, wind direction, etc.
      • Import: Collect data from sources
    • Step 2
      • Explore: Data quality
      • Visualize: A great way to understand data
      • Cleaning: Handle missing or faulty data
    • Step 3
    • Step 4
      • Present: Weather forecast
      • Visualize: Charts, maps, etc.
      • Credibility: Inaccurate results, too high confidence, not presenting full findings
    • Step 5
      • Insights: What to wear, impact on outside events, etc.
      • Impact: Sales and weather forecast (umbrella, ice cream, etc.)
      • Main goal: This is what makes Data Science valuable

    Now, while this looks straightforward, there can be many iterations back to a previous step. Even at step 5, you can consult the client, realize you need more data, and start another iteration from step 1 to enrich the process again.

    Part 4: Student Grade Prediction

    To get started with a simple project, we will explore the Portuguese high school student dataset from Kaggle.

    It consists of features and targets.

    The features are column data for each student. That is, each student is a row in the dataset, and each row has a value for each of the features.


    The target is what we want to predict from the student data.

    That is, given a row of features, can we predict the targets?


    Here we will look at a smaller problem.

    Problem: Propose activities to improve G3 grades.

    Our Goal

    • To guide the school on how it can help students get higher grades

    Yes – we need to explore the data and get ideas on how to help the students to get higher grades.

    Now, let’s explore our Data Science Workflow.

    Step 1: Acquire

    • Explore problem
    • Identify data
    • Import data

    Get the right questions

    • This forms the data science problem
    • What is the problem

    We need to understand a bit about the context.

    Understand context

    • Student age?
    • What is possible?
    • What is the budget?

    We have an idea about these things, though not exact figures. We know the age group (high school students), and this tells us what kind of activities we should propose. If they were kids aged 8-10, we should propose something different.

    What is possible – well, let your imagination guide you, kept in check by your rational mind. As for the budget – we cannot propose ideas which are too expensive for a normal high school budget.

    Let’s get started with some code, to get acquainted with the data.

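    import pandas as pd
    # Load the student dataset (one row per student) directly from GitHub
    data = pd.read_csv('https://raw.githubusercontent.com/LearnPythonWithRune/DataScienceWithPython/main/files/student-mat.csv')
    print(len(data))  # number of rows, that is, number of students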

    We will see it has 395 students in the dataset.

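    A first look at the rows and columns could be sketched as below. To keep the sketch runnable offline, it uses a few made-up stand-in rows (an assumption, not the real data); in the tutorial, data is the DataFrame loaded with pd.read_csv above.

```python
import pandas as pd

# Made-up stand-in rows so the sketch runs without a download.
# In the tutorial, `data` comes from the pd.read_csv(...) call instead.
data = pd.DataFrame({
    "age": [15, 16, 17, 18, 16, 17],
    "higher": ["yes", "yes", "no", "yes", "yes", "no"],
    "G3": [12, 14, 6, 10, 15, 8],
})

first_rows = data.head()   # head() returns the first 5 rows by default
print(first_rows)
print(list(data.columns))  # the column names (features and targets)
```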

    Calling data.head() will show the first 5 rows of the dataset as well as the columns. The columns contain the features and the targets.

    Step 2: Prepare

    • Explore data
    • Visualize ideas
    • Cleaning data

    This step is also about understanding whether the data quality is as expected. We will learn a lot more about this later.

    For now, explore the data types of the columns.

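    A minimal sketch of that check, again on a few made-up stand-in rows (an assumption, so it runs offline); on the tutorial's dataset you would use data from pd.read_csv instead.

```python
import pandas as pd

# Made-up stand-in rows; replace with the tutorial's `data`.
data = pd.DataFrame({
    "age": [15, 16, 17, 18, 16, 17],
    "higher": ["yes", "yes", "no", "yes", "yes", "no"],
    "G3": [12, 14, 6, 10, 15, 8],
})

column_types = data.dtypes  # one entry per column
print(column_types)

# Numeric columns can be told apart from text columns like this:
numeric_columns = [c for c in data.columns
                   if pd.api.types.is_numeric_dtype(data[c])]
print(numeric_columns)
```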

    Printing data.dtypes lists the data type of each column. We see some are integers (int64) and others are objects (that is, strings/text in this case).

    school        object
    sex           object
    age            int64
    address       object
    famsize       object
    Pstatus       object
    Medu           int64
    Fedu           int64
    Mjob          object
    Fjob          object
    reason        object
    guardian      object
    traveltime     int64
    studytime      int64
    failures       int64
    schoolsup     object
    famsup        object
    paid          object
    activities    object
    nursery       object
    higher        object
    internet      object
    romantic      object
    famrel         int64
    freetime       int64
    goout          int64
    Dalc           int64
    Walc           int64
    health         int64
    absences       int64
    G1             int64
    G2             int64
    G3             int64
    dtype: object

    Also check whether there are any missing values.

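    One way to check for missing values is isnull() followed by any(), sketched here on made-up stand-in rows (an assumption; swap in the tutorial's data).

```python
import pandas as pd

# Made-up stand-in rows without gaps; replace with the tutorial's `data`.
data = pd.DataFrame({
    "age": [15, 16, 17, 18, 16, 17],
    "higher": ["yes", "yes", "no", "yes", "yes", "no"],
    "G3": [12, 14, 6, 10, 15, 8],
})

# True for a column if at least one of its values is missing
missing_per_column = data.isnull().any()
print(missing_per_column)

# The number of missing values per column gives a bit more detail
missing_counts = data.isnull().sum()
print(missing_counts)
```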

    The output below (all values are False) tells us that there is no missing data.

    school        False
    sex           False
    age           False
    address       False
    famsize       False
    Pstatus       False
    Medu          False
    Fedu          False
    Mjob          False
    Fjob          False
    reason        False
    guardian      False
    traveltime    False
    studytime     False
    failures      False
    schoolsup     False
    famsup        False
    paid          False
    activities    False
    nursery       False
    higher        False
    internet      False
    romantic      False
    famrel        False
    freetime      False
    goout         False
    Dalc          False
    Walc          False
    health        False
    absences      False
    G1            False
    G2            False
    G3            False
    dtype: bool

    Step 3: Analyze

    • Feature selection
    • Model selection
    • Analyze data

    We are interested in seeing what has an impact on the final grades (G3). We can use correlation for that.

    For now, correlation is just a number saying how strongly two columns move together.

    A correlation is a number between -1 and 1 (both included). If it is close to -1 or 1 (that is, not close to 0), then the two columns are correlated.
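
    A sketch of how the correlation with G3 could be computed. It runs on made-up stand-in rows (an assumption) and selects the numeric columns explicitly first, since newer pandas versions do not silently skip text columns in corr().

```python
import pandas as pd

# Made-up stand-in rows; replace with the tutorial's `data`.
data = pd.DataFrame({
    "age": [15, 16, 17, 18, 16, 17],
    "failures": [0, 0, 3, 1, 0, 2],
    "higher": ["yes", "yes", "no", "yes", "yes", "no"],
    "G3": [12, 14, 6, 10, 15, 8],
})

# Keep only numeric columns, then take the G3 column of the
# pairwise correlation matrix.
numeric = data.select_dtypes(include="number")
g3_corr = numeric.corr()["G3"]
print(g3_corr)
```

    On the tutorial's dataset, the same two lines should give one correlation value per numeric column.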

    age          -0.161579
    Medu          0.217147
    Fedu          0.152457
    traveltime   -0.117142
    studytime     0.097820
    failures     -0.360415
    famrel        0.051363
    freetime      0.011307
    goout        -0.132791
    Dalc         -0.054660
    Walc         -0.051939
    health       -0.061335
    absences      0.034247
    G1            0.801468
    G2            0.904868
    G3            1.000000
    Name: G3, dtype: float64

    This teaches us two things.

    First of all, the grades G1, G2, and G3 are highly correlated, while almost none of the others are.

    Second, it only considers the numeric features.

    But how can we use non-numeric features, you might ask?

    Let’s consider the feature higher (wants to take higher education (binary: yes or no)).

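    One way to compare a yes/no feature against the grades is to group by it and take the mean of G3. A sketch on made-up stand-in rows (an assumption; the numbers here are not the tutorial's):

```python
import pandas as pd

# Made-up stand-in rows; replace with the tutorial's `data`.
data = pd.DataFrame({
    "higher": ["yes", "yes", "no", "yes", "yes", "no"],
    "G3": [12, 14, 6, 10, 15, 8],
})

# Mean final grade for each answer to `higher`
mean_g3 = data.groupby("higher")["G3"].mean()
print(mean_g3)
```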

    Taking the mean G3 grade for each answer gives.

    no      6.800
    yes    10.608
    Name: G3, dtype: float64

    This suggests that higher is a good indicator of whether a student gets good or bad grades. That is, if we assume the question was asked at the beginning of high school, students answering no get 6.8 on average, while students answering yes get 10.6 on average (grades are in the range 0 to 20).

    That is a big indicator.

    But how many are in each group?

    You can get that by counting the rows in each group.

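    The group sizes can be sketched the same way, with count() instead of mean(). Again this uses made-up stand-in rows (an assumption).

```python
import pandas as pd

# Made-up stand-in rows; replace with the tutorial's `data`.
data = pd.DataFrame({
    "higher": ["yes", "yes", "no", "yes", "yes", "no"],
    "G3": [12, 14, 6, 10, 15, 8],
})

# Number of students (non-missing G3 values) for each answer
group_sizes = data.groupby("higher")["G3"].count()
print(group_sizes)
```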

    Resulting in.

    no      20
    yes    375
    Name: G3, dtype: int64

    Now, that is not many. But maybe this is good enough: we have found 20 students whose grades we might really be able to help improve.

    Later we will learn more about standard deviation, but for now we leave our analysis at this.

    Step 4: Report

    • Present findings
    • Visualize results
    • Credibility counts

    This is about how to present our results. We have learned nothing visual yet, so we will keep it simple.

    We cannot do much more than present the findings.

    higher    mean grade (G3)
    no         6.800
    yes       10.608

    higher    count
    no           20
    yes         375
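
    If you want both numbers in a single table for the report, one option is agg() with several functions. This is a sketch on made-up stand-in rows (an assumption), and the column names mean_G3 and students are only illustrative.

```python
import pandas as pd

# Made-up stand-in rows; replace with the tutorial's `data`.
data = pd.DataFrame({
    "higher": ["yes", "yes", "no", "yes", "yes", "no"],
    "G3": [12, 14, 6, 10, 15, 8],
})

# Mean grade and group size side by side, one row per answer
summary = (
    data.groupby("higher")["G3"]
    .agg(["mean", "count"])
    .rename(columns={"mean": "mean_G3", "count": "students"})
)
print(summary.round(3).to_string())
```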

    I am sure you can make a nicer PowerPoint presentation than this.

    Step 5: Actions

    • Use insights
    • Measure impact
    • Main goal

    Now this is where we need to find ideas. We have identified 20 students; now we need to find activities that the high school can use to improve their grades.

    This is where I will leave the ideas to you.

    How can you measure?

    Well, one way is to collect the same data each year and see if the activities have impact.
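
    As a rough sketch of such a yearly comparison: the records below are entirely made up, and the year column is hypothetical (the tutorial's dataset has no such column). The idea is simply to compare the mean G3 per group from one year to the next.

```python
import pandas as pd

# Entirely made-up records for two school years; `year` is a
# hypothetical column that would come from collecting data yearly.
records = pd.DataFrame({
    "year":   [2023, 2023, 2023, 2023, 2024, 2024, 2024, 2024],
    "higher": ["no", "no", "yes", "yes", "no", "no", "yes", "yes"],
    "G3":     [6, 8, 12, 14, 8, 10, 12, 14],
})

# Mean G3 per group and year, with one column per year
yearly = records.groupby(["higher", "year"])["G3"].mean().unstack("year")

# Change in mean grade from the first year to the second
yearly["change"] = yearly[2024] - yearly[2023]
print(yearly)
```

    A change in the mean does not by itself prove that the activities caused it, but it gives you something measurable to follow up on.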

    Now, you can probably do better than I did. Hence, I encourage you to play around with the dataset and find better indicators to get ideas for awesome activities.

    Want to learn more?

    Want to learn more about Data Science to become a successful Data Scientist?

    In the next lesson you will learn how to Master Data Visualization for 3 Purposes as a Data Scientist in this Data Science course.

    This is one lesson of a 15-part Expert Data Science Blueprint course with the following resources.

    • 15 video lessons – cover the Data Science Workflow and concepts, demonstrate everything on real data, introduce projects, and show a solution (YouTube video).
    • 30 Jupyter Notebooks – with the full code and explanation from the lectures and projects (GitHub).
    • 15 projects – structured with the Data Science Workflow, with a solution explained at the end of the video lessons (GitHub).
