Data science has been called the “sexiest job of the 21st century”.
Terabytes of data are produced every day, and it is time to take action!
Many people are trying to break into one of the data-related fields;
however, with so much confusion between the subfields and so many
resources available on the web, it is easy to get lost on where to
start. Many people end up learning a general set of skills and becoming
data science generalists.
This is why we decided to create this article, which helps you discover
the main data-related fields and choose the one that best suits you. We
also summarized the competencies required for each subfield so you have
an action plan for what to do next!
The roadmap here covers the four most common jobs in data and the
required skills for each one. We will cover high-level details to help
you discover which skills you are still lacking.
Data Science
Data science can best be described as the “art of dealing with data”. As
a data scientist, you are not simply using a programmatic tool to get
from point A to point B. Instead, you start by defining point A, then
map out the possible paths from that point: explore your input data,
make assumptions, state hypotheses formally, test them using different
statistical and mathematical tools, design and run experiments if
needed, evaluate the current cycle, develop programmatic tools if
needed, and more.
Data science has three main components:
1. Machine learning & computer science skills
2. Math and statistics
3. Domain related knowledge
Data science can be practiced with different technology stacks and
tools. Here, we’ll start by listing the required skills in the Python
stack.
Skills required in the Python track
- Familiarity with NumPy, pandas, scikit-learn (sklearn), and matplotlib
- Strong SQL skills; NoSQL skills are in high demand too. This includes designing normalized schemas, good indexing techniques, and writing efficient queries.
- Data cleaning
- Good data visualization skills (tools like Tableau or libraries like matplotlib, seaborn, Bokeh, etc.)
- Statistical analysis skills. This includes familiarity with the different types of statistical questions.
- Experiment design and statistical testing (parametric and non-parametric testing)
- Familiarity with big data frameworks and infrastructure (Spark, Hive, Hadoop, MongoDB, etc.)
- Machine learning skills (the skill level required varies widely with the business logic)
- Strong understanding of the full cycle of data science (stating a sharp question, exploratory data analysis, inference, formal statistical modeling, interpretation, and communication)
- Storytelling skills (PowerPoint, etc.)
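To make the "statistical testing" item more concrete, here is a minimal sketch of a non-parametric test: a two-sided permutation test for a difference in means, written with the standard library only. The function name `permutation_test`, the checkout-time numbers, and the two page designs are all hypothetical and exist only for illustration.

```python
import random


def mean(values):
    """Arithmetic mean of a non-empty sequence."""
    return sum(values) / len(values)


def permutation_test(group_a, group_b, n_permutations=10_000, seed=0):
    """Two-sided permutation test for a difference in means.

    Returns the observed difference (mean of A minus mean of B) and an
    approximate p-value.
    """
    rng = random.Random(seed)
    observed = mean(group_a) - mean(group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)

    extreme = 0
    for _ in range(n_permutations):
        # Reassign group labels at random and recompute the difference.
        rng.shuffle(pooled)
        diff = mean(pooled[:n_a]) - mean(pooled[n_a:])
        if abs(diff) >= abs(observed):
            extreme += 1

    # Add-one smoothing keeps the estimate strictly above zero.
    p_value = (extreme + 1) / (n_permutations + 1)
    return observed, p_value


# Hypothetical checkout times (seconds) for two page designs.
control = [12.1, 11.8, 12.5, 13.0, 12.2, 11.9, 12.7, 12.4]
variant = [11.2, 11.0, 11.6, 11.9, 11.1, 11.4, 11.7, 11.3]

observed_diff, p_value = permutation_test(control, variant)
print(f"difference in means: {observed_diff:.3f}, p-value: {p_value:.4f}")
```

A permutation test makes no assumption about the shape of the underlying distribution, which is why it counts as non-parametric; a parametric alternative such as a t-test would assume roughly normal data.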
Data science is a very broad field, and you will usually need to acquire
new skills based on the task you are assigned (how to build recommender
systems, sequence modeling, etc.). I only covered the essential skill
set.
Data Analysis
Data analysis is basically about answering a business-related question using data. This question can be:
- descriptive: You are simply describing the data sample you have and its related statistics. You are not interested in data outside your sample.
- exploratory: You are exploring patterns in the data such as trends, seasonality, relationships, and distributions. This is usually done using exploratory data analysis and visualization tools.
- inferential: You are trying to answer a question about the wider population based on the sample you have, using hypothesis testing and other statistical testing techniques.
- predictive: You are using statistical tools to extrapolate values based on some variables, such as predicting revenue or the behavior of new users.
- causal: This type of question usually requires running one or more experiments to test for a causal relationship between two or more variables.
- mechanistic: This one asks about the underlying mechanism linking two sets of variables. It is usually hard to uncover in an uncontrolled environment.
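As a rough illustration of how two of these question types differ in practice, the sketch below answers a descriptive question (summarize the sample) and a predictive one (extrapolate beyond it) on the same toy data. The ad spend and revenue figures are made up, and `fit_line` is a hypothetical helper that implements ordinary least squares by hand.

```python
import statistics

# Hypothetical monthly figures: ad spend and revenue, both in thousands.
ad_spend = [10, 12, 15, 17, 20, 22, 25]
revenue = [41, 47, 58, 64, 75, 80, 92]

# Descriptive: summarize only the sample at hand.
print("mean revenue:", round(statistics.mean(revenue), 2))
print("revenue std dev:", round(statistics.stdev(revenue), 2))


def fit_line(x, y):
    """Ordinary least squares for y = slope * x + intercept."""
    x_mean, y_mean = statistics.mean(x), statistics.mean(y)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean


# Predictive: extrapolate revenue for a spend level outside the sample.
slope, intercept = fit_line(ad_spend, revenue)
predicted = slope * 30 + intercept
print(f"predicted revenue at spend 30: {predicted:.1f}")
```

Note that the fitted line only describes an association. Claiming that extra ad spend causes extra revenue would be a causal question and would need an experiment.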
Data analysis can be considered a subfield of data science, usually
for professionals with little or no technical background. It usually requires statistics and domain-related experience.
Until now, most data analysts have used tools like SPSS; however,
there is a growing trend of hiring data analysts with R or Python skills,
since those languages offer more powerful tools for predictive analytics and big data.
Skills required in the Python track
- Familiarity with NumPy, pandas, scikit-learn (sklearn), and matplotlib
- Strong SQL skills; NoSQL skills are in high demand too. Normally this includes writing efficient queries.
- Good data visualization skills (tools like Tableau or libraries like matplotlib, seaborn, etc.)
- Statistical analysis skills
- Experiment design and statistical testing
- Understanding of basic predictive analytics tools like regression models, clustering, cohort analysis, etc.
- Strong understanding of the full cycle of data science (stating a sharp question, exploratory data analysis, inference, formal statistical modeling, interpretation, and communication)
Machine Learning Engineering
Artificial intelligence (AI) is the field concerned with automating
tasks that usually require human intelligence, especially in vision and
language. Machine learning (ML) is the subfield of AI that does this by
learning from data. There are other, non-data-centric approaches in AI.
Machine learning is the most technically intensive of these tracks.
It requires a range of technical skills, such as writing efficient queries and implementing learning algorithms that are efficient in both time and accuracy.
Skills required in the Python track:
- Familiarity with NumPy, pandas, scikit-learn (sklearn), and matplotlib
- Strong SQL and NoSQL skills are essential.
- Good data visualization skills (tools like Tableau or libraries like matplotlib, seaborn, etc.)
- Familiarity with big data frameworks and infrastructure (Spark, Hive, Hadoop, MongoDB, etc.)
- Strong understanding of basic ML algorithms (regression, classification, clustering, and dimensionality reduction)
- Feature engineering and hyperparameter tuning
- Strong intuition for the different optimization algorithms and when to use each one
- Structuring and evaluating ML algorithms
- Understanding of different neural network structures and newly popular architectures
- Reinforcement learning
- Strong familiarity with one or more of the deep learning frameworks (TensorFlow, Keras, Caffe, Torch, etc.)
- Network analysis
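For a feel of what "intuition for optimization algorithms" means, here is a minimal sketch of the simplest one, batch gradient descent, fitting a straight line by minimizing mean squared error. It uses plain Python rather than any framework; the function name, learning rate, step count, and toy data are illustrative assumptions, not recommended defaults.

```python
def gradient_descent(xs, ys, learning_rate=0.05, steps=2000):
    """Fit y = w * x + b by batch gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Prediction error for every point under the current parameters.
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        # Gradients of the mean squared error with respect to w and b.
        grad_w = 2 / n * sum(e * x for e, x in zip(errors, xs))
        grad_b = 2 / n * sum(errors)
        # Step against the gradient.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Toy data generated from y = 3x + 1, so the fit should approach w = 3, b = 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 4.0, 7.0, 10.0, 13.0]
w, b = gradient_descent(xs, ys)
print(f"w = {w:.3f}, b = {b:.3f}")
```

The learning rate is itself a hyperparameter: set it too high and the updates diverge on this data, set it too low and convergence takes far more steps. Tuning it is a small-scale version of the hyperparameter tuning listed above.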
Data Engineering
Data engineering is the field concerned with building data pipelines and
infrastructure. This job is crucial to any company that has a huge
amount of data and is planning to hire a data scientist. Usually, hiring
a data engineer comes before hiring a data scientist.
Skills required in the Python track:
- In-depth knowledge of SQL and NoSQL solutions
- System architecture skills
- ETL and other data warehousing tools for efficient data storage and retrieval
- Familiarity with AWS or other cloud services for data lakes, data warehousing, etc.
- Big data analytics (i.e., frameworks on top of MongoDB or Hadoop, like Spark, Hive, and MapReduce)
- Basic understanding of data modeling, ML, and statistical analysis
- Building efficient data pipelines
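The ETL item above can be sketched as a tiny extract, transform, load pipeline. This version uses only the standard library, with an in-memory SQLite database standing in for a real warehouse; the CSV contents, table name, and index name are all hypothetical.

```python
import csv
import io
import sqlite3

# Hypothetical raw export; a real pipeline would read from files or an API.
RAW_CSV = """user_id,country,amount
1,us,10.50
2,DE,7.25
3,us,
4,fr,3.00
1,US,4.50
"""


def extract(text):
    """Parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Drop rows with a missing amount and normalize types and casing."""
    cleaned = []
    for row in rows:
        if not row["amount"]:
            continue
        cleaned.append(
            (int(row["user_id"]), row["country"].upper(), float(row["amount"]))
        )
    return cleaned


def load(conn, records):
    """Create the target table, index it, and insert the cleaned records."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS purchases "
        "(user_id INTEGER, country TEXT, amount REAL)"
    )
    # An index on the lookup column can speed up queries that filter by country.
    conn.execute(
        "CREATE INDEX IF NOT EXISTS idx_purchases_country ON purchases (country)"
    )
    conn.executemany("INSERT INTO purchases VALUES (?, ?, ?)", records)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(conn, transform(extract(RAW_CSV)))

totals = conn.execute(
    "SELECT country, SUM(amount) FROM purchases GROUP BY country ORDER BY country"
).fetchall()
print(totals)
```

Keeping extract, transform, and load as separate functions makes each stage easy to test and swap out, for example replacing the SQLite connection with a cloud warehouse client.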
After all, these fields are all fairly new in industry and not yet well
established. That’s why you need to keep up with new skills, popular
architectures, papers, etc.
We will follow up with another post covering the recommended online
courses and degrees for learning each skill, along with a quick dive
into each of those bullet points.