What does it take to become a Data Scientist?
Data Science sits at the intersection of several fields.
This means you need a lot of different skills.
In my experience, most Data Scientists work in teams and do not need to be experts in every area. Treat the list of skills below as a guide to what you need.
A great way to look at it is to split the skills into hard and soft skills.
- Math and Statistics
- Domain knowledge
- Data visualization
- Storytelling Skills
- Structured Thinking
I would say that you learn the soft skills through experience, but you need an interest in them. The hard skills are the ones you need to get good, or at least decent, at.
Looking at the hard skills, you do not need to master every aspect of them.
Before we dive into the hard skills, let’s also understand what a Data Scientist does.
Math and Statistics
I understand that many get scared of this one, and if you take a formal education in Data Science, you will learn a lot of Statistics.
Experience shows that only a few specialists need a high level of statistics as a Data Scientist. That said, you still need to understand some aspects in depth.
What does that include?
The most important are:
- Count – a descriptive statistic that counts the number of observations. Count is the most used statistic and is essential for evaluating findings.
- Example: A study draws conclusions about childhood weights but only had 12 children (observations). Is that trustworthy?
- The count says something about the quality of the study.
- Mean – The average value.
- Standard Deviation – a measure of how dispersed (spread out) the data is in relation to the mean.
- Low standard deviation means data is close to the mean.
- High standard deviation means data is spread out.
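To make the three statistics above concrete, here is a minimal sketch using only Python's built-in `statistics` module. All numbers are made up for illustration; they do not come from a real study.

```python
import statistics

# Hypothetical sample: weights (kg) of 12 children in a small study.
weights = [18.2, 19.5, 17.8, 21.0, 20.3, 18.9, 19.1, 22.4, 17.5, 20.0, 19.7, 18.6]

count = len(weights)               # number of observations
mean = statistics.mean(weights)    # average value
std = statistics.stdev(weights)    # sample standard deviation

print(f"count={count}, mean={mean:.2f}, std={std:.2f}")

# Two made-up datasets with the same mean but different spread.
tight = [10, 10, 11, 9, 10]    # values close to the mean
spread = [2, 18, 10, 1, 19]    # values far from the mean

print(statistics.mean(tight), statistics.stdev(tight))
print(statistics.mean(spread), statistics.stdev(spread))
```

Both small datasets have a mean of 10, but the second has a much larger standard deviation, which is exactly the difference between "close to the mean" and "spread out".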
You should also understand box plots and what correlation means.
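As a rough illustration of both ideas, the sketch below computes the Pearson correlation by hand and the three quartiles that form the box of a box plot. The helper name `pearson` and the data are assumptions made for this example.

```python
import math
import statistics

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Made-up data: age in years and height in cm.
ages = [4, 5, 6, 7, 8, 9]
heights = [102, 109, 115, 121, 128, 133]

r = pearson(ages, heights)
print(f"correlation: {r:.3f}")  # close to +1: height rises with age

# The box in a box plot spans the first to the third quartile,
# with a line at the median.
q1, median, q3 = statistics.quantiles(heights, n=4)
print(f"Q1={q1}, median={median}, Q3={q3}")
```

A value near +1 means the two variables rise together, a value near -1 means one falls as the other rises, and a value near 0 means no linear relationship.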
You can learn more about it here.
Python is used in the scientific community for several reasons.
- Ease of use and simple syntax.
- Easy to adapt without engineering background.
- Many libraries.
- Wide community.
- General purpose makes it easy to collaborate.
Python is the most popular programming language in the Scientific Community including Data Science. It is a solid choice to learn.
But do you need to master Python programming on a high level?
No, you need to understand Python programming at a basic level, where you master the following.
- Basic understanding of programming – how Python code works.
- Variable and Data Types.
- Calculations with simple types
- Loop over Data Objects
- How functions can help you work.
- How methods can be applied on Data Objects.
- How to read and write data.
- Master Data Types: Lists, Dicts, NumPy, DataFrames.
- Use of Machine Learning Models
This sounds like a lot, but it can be broken down into steps.
Most beginner courses in Python will do fine, though some specialize too much. What you need to understand, and get a feeling for, is how Python code works.
Some common things you learn in a basic Python course:
- Variables and built-in Data Types like Lists and Dicts.
- How to calculate with variables.
- Looping over Data Objects like Lists and Dicts.
- Built-in Python functions to ease your work.
- How to work with files.
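The topics in this list fit into a few lines of code. The following is a small sketch with made-up sensor readings; the names `readings` and `average` are only illustrative.

```python
import json
import os
import tempfile

# Variables and built-in data types: a dict that maps names to lists.
readings = {"sensor_a": [21.5, 19.0, 23.2], "sensor_b": [18.1, 18.4]}

def average(values):
    """Calculate the mean of a list with a plain loop."""
    total = 0.0
    for value in values:
        total += value
    return total / len(values)

# Loop over a dict and calculate with its values.
averages = {}
for name, values in readings.items():
    averages[name] = round(average(values), 2)  # round() is a built-in function

# Work with files: write the result to disk and read it back.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "averages.json")
    with open(path, "w") as f:
        json.dump(averages, f)
    with open(path) as f:
        loaded = json.load(f)

print(loaded)
```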
Other things you learn that are good to understand, but that you do not need to master:
- Object Oriented Programming (OOP) – you need to understand the idea of OOP, as it will help you understand how a computer works, and how it works on your Data Objects.
A great source is this free course.
Learn NumPy and DataFrames
For the most part, you get really far with pandas DataFrames as a Data Scientist. If you understand them and can use them to work with data, you have come a long way.
Conceptually, you can treat NumPy as the next step after DataFrames, even though the implementation is the other way around: pandas DataFrames are built on top of NumPy arrays.
But what are DataFrames and NumPy?
They are data structures used to contain the data you work with as a Data Scientist.
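As a quick taste, here is a minimal sketch of both structures, assuming `numpy` and `pandas` are installed. The people and measurements are invented for the example.

```python
import numpy as np
import pandas as pd

# A NumPy array holds values of one type and supports element-wise math.
heights_m = np.array([1.62, 1.75, 1.80, 1.68])
heights_cm = heights_m * 100  # multiplies every element at once

# A DataFrame is a table with named columns, like a spreadsheet.
df = pd.DataFrame({
    "name": ["Ann", "Bob", "Cara", "Dan"],
    "height_cm": heights_cm,
    "weight_kg": [58.0, 72.5, 80.1, 65.3],
})

# Add a calculated column and filter rows by a condition.
df["bmi"] = df["weight_kg"] / (df["height_cm"] / 100) ** 2
tall = df[df["height_cm"] > 170]

# describe() returns count, mean, std, and quartiles for a column.
summary = df["height_cm"].describe()

print(df)
print(tall["name"].tolist())
print(summary)
```

Calling `df["height_cm"].to_numpy()` hands the column back as a NumPy array, which shows how closely the two libraries are connected.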
A great place to learn about DataFrames is to follow this free course.
The Machine Learning models you create are what produce the insights that deliver value to your clients. Therefore, you need the skills to use them and to understand how they work.
There are a lot of models and you don’t need to be an expert in all of them. But it is a great idea to understand them.
A few examples:
- k-Nearest-Neighbors Classifier.
- Linear Classifier
- Support Vector Machines
- Linear Regression
- k-Means clustering
- Deep Neural Network (DNN)
- Convolutional Neural Network (CNN)
- Recurrent Neural Network (RNN)
You should also be familiar with frameworks like:
- scikit-learn
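To give a feeling for what working with such a framework looks like, here is a minimal sketch of a k-Nearest-Neighbors classifier in scikit-learn. It assumes scikit-learn is installed, and the tiny dataset and its labels are made up. A real project would use far more data and a separate test set.

```python
from sklearn.neighbors import KNeighborsClassifier

# Made-up training data: [height_cm, weight_kg] with a label per row.
X_train = [[150, 50], [155, 55], [160, 58], [180, 85], [185, 90], [190, 95]]
y_train = ["small", "small", "small", "large", "large", "large"]

# The model labels a new point by a vote among its 3 nearest neighbors.
model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_train, y_train)

# Predict labels for two new, unseen observations.
predictions = model.predict([[152, 52], [188, 92]])
print(predictions)
```

Most scikit-learn models follow the same pattern of creating the model, calling `fit`, and then calling `predict`, so learning one model makes the next one easier.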
You can build up your skills in this free course.
Domain knowledge is often the key to getting a job as a Data Scientist.
If you know a lot about windmills, power prediction patterns, and so forth, it will be easier for you to get a job as a Data Scientist at a company predicting power production from windmills.
Or, if you are an expert in weather forecasting, you can also get a job as a Data Scientist predicting power production from windmills.
The point is twofold.
First, if you have worked in an industry for a few years, then you have deep domain knowledge about that field. Is there a crossover where you can apply Data Science? Find those jobs and you will have a great edge in getting them.
Most say that it is easier to train people to do Data Science than to give them 3-4 years of experience in a domain.
Take advantage of that.
Second, if you have an interest in a specific area of Data Science, focus on it. Become an expert.
Again, having Domain Knowledge is crucial to set yourself apart from the other applicants.
Data Visualization is often misunderstood by beginners in Data Science.
It is actually crucial in three different ways.
- Data Quality: Explore data quality including identifying outliers
- Data Exploration: Understand data with visualizing ideas
- Data Presentation: Present results
Most people focus only on Data Presentation, that is, presenting your findings. While this is an art in itself, most do not fully grasp the importance of the other two at first.
The human brain is not wired to understand data as digits, but when we see the data on a chart, we can understand it immediately.
Just look at this one.
What is wrong? Well, it looks like some heights do not fit with the others.
This tells you something about the Data Quality. Is there something wrong with it?
The chart would tell you something is wrong no matter how many data points you have. But imagine you had to look through 10,000 data points manually in a table. That would take hours and you might still miss it.
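A chart is the fastest way to spot such values, but you can also flag them in code. The sketch below uses the 1.5 × IQR rule that box plots conventionally use to mark outliers. The function name `iqr_outliers` and the heights are assumptions for this example, with two values deliberately entered in meters instead of centimeters.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return values further than k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Made-up heights in cm; two were typed in meters by mistake.
heights_cm = [172, 168, 181, 175, 1.79, 169, 177, 183,
              1.65, 174, 170, 179, 166, 176, 173, 180]

suspicious = iqr_outliers(heights_cm)
print(suspicious)
```

The rule only points at candidates. Whether a flagged value is a data entry error or a genuine extreme is still a judgment you make with your domain knowledge.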
When it comes to exploring data, seeing it visually on a chart shows you patterns.
Again, you would not notice that by looking at the data in a table.
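For example, a scatter plot takes only a few lines with matplotlib. This is a sketch that assumes matplotlib is installed. The data is made up, and the `Agg` backend line is only there so the script runs without a display; you can drop it in a notebook.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # render off-screen; not needed in a notebook
import matplotlib.pyplot as plt

# Made-up data: hours studied and exam score for eight students.
hours_studied = [1, 2, 3, 4, 5, 6, 7, 8]
exam_score = [52, 55, 61, 64, 70, 74, 79, 85]

fig, ax = plt.subplots()
ax.scatter(hours_studied, exam_score)
ax.set_xlabel("Hours studied")
ax.set_ylabel("Exam score")
ax.set_title("Exam score vs. hours studied (made-up data)")

# Save the chart as an image in the system's temporary folder.
output_path = os.path.join(tempfile.gettempdir(), "scores.png")
fig.savefig(output_path)
```

In the chart, the upward trend is visible at a glance, while the same sixteen numbers in a table would need careful reading.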
Finally, data presentation is an art in itself.
Does this one tell you a story?
A great resource to learn about Data Visualization can be found here.
Does that map out what you need as a Data Scientist?
This gives you the hard skills you need as a Data Scientist.
A great way to think of it is also to understand the Data Science Workflow.
It gives you an idea of what steps a Data Science Project goes through.
If you are new to Data Science, a great place to start is the Data Science Career Track bundle, which covers all of the above and gives you access to discuss your learnings and troubles with the instructor.