How does a total beginner start to learn machine learning?
If you are a total beginner, in short your path should look like this:
- Learn SQL and Python.
- Then learn machine learning from a couple of basic courses.
- Learn probability theory and some computational mathematics.
- The world of statistics is vast, but very interesting: The World of Statistics.
- Then dive into Kaggle (Your Home for Data Science) and see what others are working on.
- Then spend your time on the scikit-learn website (a minimal starter sketch follows after this list).
- Then practice on your own, and grow bit by bit.
- If you need a curated list to follow, you can start here: https://github.com/bulutyazilim/...
- Here is a cool visualization of what a modern data scientist looks like: [visualization not included]
I work with people who write C/C++ programs that generate GBs of data, people who manage TBs of data distributed across giant databases, people who are top-notch programmers in SQL, Python, and R, and people who have set up organization-wide databases working with Hadoop, SAP, Business Intelligence, etc.
My advice to anyone and everyone would be the following:
- Learn all the basics from Coursera, but if I really have to compare what you would get out of Coursera to the vastness of data science, let us say Coursera is as good as eating a burrito at Chipotle Mexican Grill. You can certainly satiate yourself, and there are a few things to eat there.
- The pathway to value-adding data science is really quite deep; I consider it equivalent to a five-star buffet offering 20 cuisines and some 500 different recipes.
- Coursera is certainly a good starting point, and one should go over these courses, but I personally never paid any money to Coursera, and I could easily learn a variety of things bit by bit over time.
- Kaggle is a really good resource for budding engineers to look at various other people’s ideas and build on them.
My own learning came from actually building things. I started with SQL, then I learned Python, then R, then many libraries in Python and R. Then I learned HTML, decent GUI programming using VBScript, and C#. Then I learned scikit-learn. Finally, I talked to various statisticians at my workplace whose day-in, day-out job is to derive conclusions from data, and in the process I learned JMP/JSL scripting. I learned a lot of statistics along the way.
Here is the overall sequence of how I progressed.
The first thing I want to impress upon anyone and everyone is to learn the “science”. Data science is 90% science and 10% managing data. Without knowing the science, and without knowing what you want to achieve and why you want to achieve it, you would not be able to use whatever you learn on Coursera in any way. I can almost guarantee you that.
I have seen my friends go through some of those courses, but at the end of the day they do not build anything, they do not derive correct conclusions, and they do not really “use” anything that they learn. More than that, they do not even really apply the skills they acquire.
The way all this happened to me is as follows:
- I
dived deep into data, understood their structure, understood their
types. I understood why we were even collecting all those data, how we
were collecting them, how we were storing them, and how we were
processing them before storing them.
- I learned how data could be handled effectively with these programming languages. I learned to clean the data, process them as much as I wanted to, and plot them in every possible way I could (a small sketch of this clean-then-plot loop follows after this list). Just plotting the data took me hours and hours, to see how various plots could show the data in one way compared to another.
- I learned from my friends who manage databases how they did that and what went on in the background. I learned the structures of the database tables.
- Then I learned how to plot some relevant plots, and to calculate the return on investment for doing anything. Here is where data science started coming together. There is no plot that I cannot plot; basically, every plot I saw on the internet, I learned how to plot it. This is extremely important, and this is what will lead you to storytelling.
- Then I learned to automate things, and this is really amazing, because you would be able to do a few things automatically, which would save you a lot of time.
- Automation came really easily with Python, R, VBScript, and C#.
I can tell you that, roughly speaking, there is nothing that is not automated for me. I have a computer program for anything and everything, and most of my things are done with a button click, or let us say a few button clicks.
- Then I learned report writing. What I learned is that I had to send a lot of data and plots to others over email. And believe me, people have no time and no interest. But if you make colorful plots, write a coherent report demonstrating what you want to say, and pack enormous amounts of powerful information into a few really colorful plots, you can make a case.
- Then I learned storytelling. What this simply means is that you should be able to tell the vice president of the company what the topmost problems of your division are. And the way you should be able to derive these conclusions is by creating engaging plots that tell a story. Without this, you would not be able to convince anyone. People are not interested in numbers. All they remember is names, places, things, inspiration, and why someone wants to do something. A true data scientist is also a true presenter of the data.
- Then I read every possible blog on the internet to see how others were doing these things: how people were writing their programs, how they were creating various plots, how they were automating things, and so on. I also derived a lot of ideas from how someone used their skills to do an amazing project. This is a really nice way to see how others imagine. Then you can borrow their imagination and build things, and eventually, as things get easier for you, you will begin imagining things yourself.
Just take a look at the number of blogs available to you, from which you can learn a lot of things.
I have gone through many of these blogs, and I have read them in depth. This took weeks of effort, and multiple Saturdays and Sundays spent experimenting with data and programming languages.
I would now give you a more comprehensive approach, so that you have a lot of inspiration to hold on to.
What does a typical engineer’s job look like, and how can data science help along those lines?
- Decision making: In my job, I have several decisions to make and several actions to take in a day. In addition, I have various stakeholders to update, various people to give guidance to, various data sets to look at, and various tools and machines to handle. Some of these machines are physical machines making things, and some others are simply computer programs and software platforms creating settings for these machines.
- Data: Most
of the data we have is on various servers which are distributed across
various units, or is on some shared drive, or on some hard disk drive
available on a server.
- Databases: These database servers can be used to get data with SQL, by direct data pull, or by grabbing the data somehow (say, copying over FTP), sometimes even by manually copying and pasting into Excel, CSV, or Notepad. Usually we have multiple methods to do direct data pulls from the servers. There are various SQL platforms such as TOAD, Business Intelligence tools, and even in-house-built platforms.
- SQL can be learned easily using these platforms, and one can create plenty of SQL scripts.
- You can even create scripts that can write scripts.
- I would encourage you to learn SQL, as it is one of the most widely used languages for simply getting data.
- Data again: The data in these databases can be highly structured, or somewhat unstructured, such as free-form human comments.
- These data can often have a fixed number of variables, or a varying number of variables.
- Sometimes data can be missing too, and sometimes they can be incorrectly entered into the databases.
- Every time something like this is found, an immediate response is sent to the database managers, and they correct the bugs if there are any in the system.
- Usually, before setting up a whole giant project of building a database, multiple people get together and discuss what the data should look like, how they should be distributed into various tables, and how the tables should be connected.
- Such people are true data scientists, as they know what the end user is going to want on a daily basis over and over.
- They always try to structure the data as much as possible, because structure makes the data very easy to handle.
- Scripting and scheduling: Using multiple scripts that are scheduled to run at specific times, or sometimes set up to run on an ad hoc basis, I pull and dump data into various folders on a dedicated computer. I have a decently large HDD to store a lot of data.
- Usually I append new data to existing data sets, and purge out older data in a timely way.
- Sometimes I have programs running with sleep commands that wake at scheduled times, quickly check something, and go back to sleep.
- More scripting: Furthermore, there are multiple scripts that are set up to crunch these data sets and create a bunch of decisions from them.
- Cleaning data, creating valuable pivot tables, and making plots is one of the biggest time sinks for anyone trying to get value out of this.
- To achieve something like this, you would first have to understand your data inside and out, and you should be very capable of doing all sorts of hand calculations, generating Excel sheets, and visualizing data.
- Science: What I would urge upon you is this: before you do data science, do the science. Learn the physics behind your data, and understand it inside and out. Say you work in the T-shirt industry: you should know every aspect of a T-shirt inside and out, you should have access to all possible information around T-shirts, and you should know very well what the customers want and like, without even looking at any of the data.
- Without understanding the science, data science is valueless, and trying to achieve something with it may be a fruitless effort.
- Caveats: I have seen plenty of people who do not even know what to plot against what.
- The worst I have seen is people plotting just about any random variables against each other and deriving conclusions from them.
- True, correlations exist among many things, but you should always ask whether there is any causation.
- Example: There is a significant correlation between the number of Nobel laureates and the per-capita chocolate consumption of various countries; but is it causation? Maybe not!
- Back to programs: There is usually a sequence in which all the scripts run, and create all sorts of tables, and plots to look at.
- Some scripts are sequential, whereas some programs are mere executables. Executables are usually written for speed, and C, C++, C#, etc. can be used for them.
- Scripts can be written in Python, VB etc.
- Decision making: When certain {If/Then} conditions are met, more computer programs self-trigger and run more data analysis.
- Data science: This usually unfolds into a lot of statistics, classification, and regression.
- Here is where machine learning comes in. One can use programming languages such as Python or R to do this (a simplified sketch of one such cycle follows after this list).
- Based on the machine learning algorithms’ results, more computer programs are run, and more plots are generated or more programs are triggered.
- Plotting: Ultimately, a lot of plots are stored in a coherent fashion for humans to make decisions.
- Self-sustaining reports: The reports are self-triggering, self-sustained programs that tell me what to do.
- The feeling of being Iron Man: I usually look at the results from all the reports in 10 minutes, and then make decisions on what to do next for many hours. Every now and then I look at the reports again to re-evaluate the decisions, or change them on the fly if that has to be done.
What are the advantages of doing all this?
- First of all, when a computer does something, it would do it at a much faster speed than a human.
- A computer will do it tirelessly, and endlessly.
- Computer programs need a sufficient amount of training and multiple levels of testing for varying inputs, but once all that is done, a program would keep doing that job forever, until either the sample space itself changes or something about the input changes drastically.
- By programming things to the level that the entire output is set on a dashboard, it is very easy to see what the order of the projects should be.
How do you now create value from something like this?
- One should always be behind the science! By knowing your data as well as possible, you would be able to order the implementation of your projects.
- The decisions you would make, and the actions you would take, would be harder, better, faster, stronger.
- You would be able to derive conclusions and generate some Lean Six Sigma projects.
- You would be able to update the stakeholders well ahead of time, and stay on top of your projects.
- You would be able to focus only on the science aspect instead of just trying to manually create plots.
- You would be able to find trends in your data more easily, and say things one way or the other if the data tell you to make decisions in favor of one choice over another.
- Last but not least, you can reduce human effort significantly and automate everything for yourself.
- I even have scripts that push buttons for me or fill up forms for me.
- I have several image analysis programs that analyze images and make decisions on the fly without humans looking at them.
I hope this answer is elaborate enough and gives you some insight into what you can work on.