Mesin Belajar: All You Need to Know to Break into the Data World and Machine Learning

https://medium.com/beirut-ai/all-you-need-to-know-to-break-into-the-data-world-and-machine-learning-a2c3bd3879ac

Data science has been called the “sexiest job of the 21st century”. Terabytes of data are produced every day, and it is time to take action! Many people are trying to break into one of the data-related fields; however, with so much confusion between the subfields and so many resources available on the web, it is easy to get lost on where to start. Many people end up learning a general set of skills and becoming data science generalists.

This is why we wrote this article: to help you discover the main data-related fields and choose the one that suits you best. We also summarize the competencies required for each subfield so that you have an action plan for what to do next!

The roadmap here covers the four most common jobs in data and the skills required for each one. We stay at a high level to help you discover which skills you are still lacking.

Data Science

Data science can best be described as the “art of dealing with data”. As a data scientist, you are not simply using a programmatic tool to get from point A to point B. Instead, you start by defining point A, then map out all the possible paths from that point: explore your input data, make assumptions, state hypotheses formally, test them using different statistical and mathematical tools, design and run experiments if needed, evaluate the current cycle, develop programmatic tools if needed, and more.

Data science has three main components:
1. Machine learning & computer science skills
2. Math and statistics
3. Domain related knowledge

Data science can be practiced with different technology stacks and tools. Here, we’ll start by listing the skills required in the Python stack.

Skills required in the Python track

Familiarity with NumPy, pandas, scikit-learn (sklearn), and Matplotlib
Strong SQL skills; NoSQL skills are also in high demand. This includes designing normalized schemas, good indexing technique, and writing efficient queries.
Data cleaning
Good data visualization skills (tools like Tableau or libraries like Matplotlib, seaborn, Bokeh, etc.)
Statistical analysis skills. This includes familiarity with the different types of statistical questions.
Experiment design and statistical testing (parametric and non-parametric)
Familiarity with big data frameworks and infrastructure (Spark, Hive, Hadoop, MongoDB, etc.)
Machine learning skills (the level required varies widely with the business logic)
Strong understanding of the full data science cycle (stating a sharp question, exploratory data analysis, inference, formal statistical modeling, interpretation, and communication)
Storytelling skills (PowerPoint, etc.)
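
To make the SQL item in this list a bit more concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table, columns, index name, and data are all made up for illustration; the only point is that adding an index can change how the database executes the very same query.

```python
import sqlite3

# Hypothetical table and data, used only to illustrate indexing.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer_id, amount) VALUES (?, ?)",
    [(i % 100, float(i)) for i in range(10_000)],
)


def query_plan(sql):
    """Return SQLite's description of how it would run the query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    # The last column of each plan row holds the human-readable detail.
    return " | ".join(row[-1] for row in rows)


sql = "SELECT SUM(amount) FROM orders WHERE customer_id = 42"

plan_before = query_plan(sql)  # no index yet: the whole table is scanned
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
plan_after = query_plan(sql)   # with the index: matching rows are looked up

total = conn.execute(sql).fetchone()[0]

print(plan_before)
print(plan_after)
print(total)
```

The exact wording of the plan differs between SQLite versions, but the first plan should mention a scan of the table and the second should mention the index by name.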

Data science is a very broad field, and you will usually need to acquire new skills for the task you are assigned (how to build recommender systems, sequence modeling, etc.). I have only covered the essential skill set here.

Data Analysis

Data analysis is basically about answering a business-related question using data. This question can be:

descriptive: You are simply describing the data sample you have and its related statistics. You are not interested in data outside your sample.
exploratory: You are exploring patterns, trends, seasonality, relationships, and distributions in the data. This is usually done with exploratory data analysis and visualization tools.
inferential: You are trying to answer a question about the wider population based on the sample you have, using hypothesis testing and other statistical testing techniques.
predictive: You are using statistical tools to predict some values from other variables, such as revenue or new users’ behavior.
causal: This type of question usually requires running one or more experiments to test for a causal relationship between two or more variables.
mechanistic: This type of question asks about the underlying mechanism linking two sets of variables. It is usually hard to uncover in an uncontrolled environment.
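
As a rough illustration of how the first few question types differ in practice, here is a small sketch using only the Python standard library. All numbers are invented, and the permutation test and least-squares line are just one possible choice for each question type; in a real project you would more likely reach for pandas, SciPy, or scikit-learn.

```python
import random
import statistics

# Invented daily revenue for two hypothetical page variants.
variant_a = [102, 98, 110, 95, 105, 99, 101, 97]
variant_b = [111, 108, 115, 104, 112, 109, 107, 113]

# Descriptive: summarize the sample you have, nothing more.
observed_diff = statistics.mean(variant_b) - statistics.mean(variant_a)


# Inferential: could a difference this large arise by chance?
# A two-sided permutation test (a non-parametric test) on the difference of means.
def permutation_p_value(a, b, n_rounds=2000, seed=0):
    rng = random.Random(seed)
    observed = abs(statistics.mean(b) - statistics.mean(a))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_rounds):
        rng.shuffle(pooled)
        shuffled_a, shuffled_b = pooled[:len(a)], pooled[len(a):]
        if abs(statistics.mean(shuffled_b) - statistics.mean(shuffled_a)) >= observed:
            hits += 1
    return hits / n_rounds


p_value = permutation_p_value(variant_a, variant_b)


# Predictive: fit a least-squares line and extrapolate one step ahead.
def fit_least_squares(xs, ys):
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return slope, mean_y - slope * mean_x


months = [1, 2, 3, 4, 5]
revenue = [10.0, 12.5, 13.5, 16.0, 18.0]  # invented monthly revenue
slope, intercept = fit_least_squares(months, revenue)
forecast_month_6 = intercept + slope * 6

print(observed_diff, p_value, forecast_month_6)
```

Causal and mechanistic questions are left out on purpose: they depend on how the experiment is designed, not on a few lines of code.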

Data analysis can be considered a subfield of data science, usually suited to professionals with little or no technical background. It usually requires statistics and domain-related experience.

That lighter technical requirement is the main difference between data science and data analysis.

Until now, most data analysts have used tools like SPSS; however, there is a growing trend toward hiring data analysts with R or Python skills, since those languages offer more powerful tools for predictive analytics and big data.

Skills required in the Python track

Familiarity with NumPy, pandas, scikit-learn (sklearn), and Matplotlib
Strong SQL skills; NoSQL skills are also in high demand. Normally this includes writing efficient queries.
Good data visualization skills (tools like Tableau or libraries like Matplotlib, seaborn, etc.)
Statistical analysis skills
Experiment design and statistical testing
Understanding of basic predictive analytics tools like regression models, clustering, cohort analysis, etc.
Strong understanding of the full data science cycle (stating a sharp question, exploratory data analysis, inference, formal statistical modeling, interpretation, and communication)
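
Cohort analysis, mentioned in the list above, is easier to picture with a toy example. The sketch below groups users by signup month and computes what fraction of each cohort is still active in later months. The activity log and its format are hypothetical, and analysts would normally do this with pandas or SQL instead of plain Python.

```python
from collections import defaultdict

# Hypothetical activity log: (user_id, signup_month, active_month).
events = [
    ("u1", "2019-01", "2019-01"), ("u1", "2019-01", "2019-02"),
    ("u2", "2019-01", "2019-01"),
    ("u3", "2019-02", "2019-02"), ("u3", "2019-02", "2019-03"),
    ("u4", "2019-02", "2019-02"), ("u4", "2019-02", "2019-03"),
]


def months_between(start, end):
    """Number of whole months from 'YYYY-MM' start to 'YYYY-MM' end."""
    start_year, start_month = map(int, start.split("-"))
    end_year, end_month = map(int, end.split("-"))
    return (end_year - start_year) * 12 + (end_month - start_month)


def retention_by_cohort(events):
    """Map (cohort, months since signup) to the share of the cohort active."""
    cohort_users = defaultdict(set)
    active_users = defaultdict(set)
    for user, signup, month in events:
        cohort_users[signup].add(user)
        active_users[(signup, months_between(signup, month))].add(user)
    return {
        (cohort, offset): len(users) / len(cohort_users[cohort])
        for (cohort, offset), users in sorted(active_users.items())
    }


retention = retention_by_cohort(events)
for (cohort, offset), share in retention.items():
    print(cohort, offset, share)
```

In this made-up data, half of the January cohort comes back in its second month, while the whole February cohort does.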

Machine Learning Engineering

Artificial intelligence (AI) is the field we use to automate processes that usually require human intelligence, especially in vision and language. Machine learning (ML) is the subfield of AI that does this using data. There are other, non-data-centric approaches in AI.

Machine learning is the most technically intensive track of the four. It requires a range of technical skills, such as writing efficient queries and building learning algorithms that are efficient in both time and accuracy.

But always remember that computers can only get as smart as we program them!

Skills required in the Python track

Familiarity with NumPy, pandas, scikit-learn (sklearn), and Matplotlib
Strong SQL and NoSQL skills are essential
Good data visualization skills (tools like Tableau or libraries like Matplotlib, seaborn, etc.)
Familiarity with big data frameworks and infrastructure (Spark, Hive, Hadoop, MongoDB, etc.)
Strong understanding of basic ML algorithms (regression, classification, clustering, and dimensionality reduction)
Feature engineering and hyperparameter tuning
Strong intuition for the different optimization algorithms and when to use each one
Structuring and evaluating ML algorithms
Understanding of different neural network structures and newly popular architectures
Reinforcement learning
Strong familiarity with one or more of the deep learning frameworks (TensorFlow, Keras, Caffe, Torch, etc.)
Network analysis
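
To give a feel for what “intuition for optimization algorithms” and “hyperparameter tuning” mean, here is a minimal sketch of plain batch gradient descent fitting a straight line, written without any ML library. The data, the function name fit_line, and the two learning rates are assumptions picked for illustration; real projects would rely on a framework's optimizers.

```python
def fit_line(xs, ys, learning_rate=0.05, steps=2000):
    """Fit y = w * x + b by batch gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        # Gradients of mean((w * x + b - y) ** 2) with respect to w and b.
        grad_w = 2 * sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = 2 * sum(errors) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Toy data generated from y = 2x + 1, so the target parameters are known.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

# A small enough learning rate converges to roughly w = 2, b = 1.
w, b = fit_line(xs, ys)

# A learning rate that is too large overshoots further on every step.
w_bad, b_bad = fit_line(xs, ys, learning_rate=0.2, steps=50)

print(w, b)
print(w_bad, b_bad)
```

The learning rate here is a hyperparameter: the algorithm is identical in both runs, and only that one setting decides whether training converges or blows up.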

Data Engineering

Data engineering is the field concerned with building data pipelines and infrastructure. This job is crucial to any company that has a huge amount of data and is planning to hire a data scientist. Usually, hiring a data engineer comes before hiring a data scientist.

Skills required in the Python track

In-depth knowledge of SQL and NoSQL solutions
System architecture skills
ETL and other data warehousing tools for efficient data storage and retrieval
Familiarity with AWS or other cloud services for data lakes, data warehousing, etc.
Big-data-based analytics (e.g. frameworks on top of MongoDB or Hadoop, like Spark, Hive, and MapReduce)
Basic understanding of data modeling, ML, and statistical analysis
Building efficient data pipelines
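
For readers who have not seen one, an ETL (extract, transform, load) pipeline can be sketched in a few lines. The example below uses only the Python standard library, with an in-memory CSV string standing in for a source file and SQLite standing in for a warehouse. The column names, the cleaning rules, and the payments table are all hypothetical.

```python
import csv
import io
import sqlite3

# Invented raw export with some dirty rows (missing amount, missing user_id).
RAW_CSV = """user_id,country,amount
1,lb,10.5
2,LB,
3,us,7.25
,us,3.00
"""


def extract(text):
    """Read raw rows one at a time as dictionaries."""
    yield from csv.DictReader(io.StringIO(text))


def transform(rows):
    """Drop incomplete rows, normalize country codes, and cast types."""
    for row in rows:
        if not row["user_id"] or not row["amount"]:
            continue
        yield (int(row["user_id"]), row["country"].upper(), float(row["amount"]))


def load(records, conn):
    """Write cleaned records into the target table in one transaction."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS payments (user_id INTEGER, country TEXT, amount REAL)"
    )
    with conn:
        conn.executemany("INSERT INTO payments VALUES (?, ?, ?)", records)


conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)

loaded = conn.execute(
    "SELECT user_id, country, amount FROM payments ORDER BY user_id"
).fetchall()
print(loaded)
```

Because each stage is a generator, rows stream through the pipeline one at a time instead of being held in memory all at once, which is one simple way to keep a pipeline efficient as the data grows.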

All of these fields are still fairly new in industry and not yet well established. That’s why you need to keep up with new skills, popular architectures, papers, etc.

We will follow up with another post recommending online courses and degrees for learning each skill, along with a quick dive into each of those bullet points.


Wednesday, May 22, 2019
