If you are a total beginner, in short your path should look like this:
- Learn SQL and Python.
- Then learn machine learning from a couple of basic courses.
- Learn probability theory, and some computational mathematics.
- The world of statistics is vast, but very interesting: THE WORLD OF STATISTICS.
- Then dive into Kaggle (Your Home for Data Science) and see what others are working on.
- Then spend your time on the scikit-learn website.
- Then practice on your own, and grow bit by bit.
- If you need a curated list to follow, you can start here: https://github.com/bulutyazilim/...
- There is also a cool visualization floating around of what a modern data scientist looks like.
I work with people who write C/C++ programs that generate GBs of data, people who manage TBs of data distributed across giant databases, people who are top-notch programmers in SQL, Python, and R, and people who have set up organization-wide databases working with Hadoop, SAP, Business Intelligence, etc.
My advice to anyone and everyone is the following:
- Learn all the basics from Coursera, but if I really have to compare what you get out of Coursera with the vastness of data science, let us say Coursera is as good as eating a burrito at Chipotle Mexican Grill: you can certainly satiate yourself, and there are a few things to eat there.
- The pathway to value-adding data science runs really deep, and I consider it equivalent to a five-star buffet offering 20 cuisines and some 500 different recipes.
- Coursera is certainly a good starting point, and one should certainly go over these courses, but I personally never paid any money to Coursera, and I could easily learn a variety of things bit by bit over time.
- Kaggle is a really good resource for budding engineers to look at various other people’s ideas and build on them.
My own learning came from actually building things. I started with SQL, then I learned Python, then R, then many libraries in Python and R. Then I learned HTML, decent GUI programming using VBScript, and C# programming. Then I learned scikit-learn. Finally, I talked to various statisticians at my workplace whose day-in, day-out job is to derive conclusions out of data, and along the way I picked up JMP/JSL scripting and a lot of statistics.
Here is the overall sequence of how I progressed.
The first thing I want to impress on anyone and everyone is to learn the “science”. Data science is 90% science and 10% managing data. Without knowing the science, and without knowing what you want to achieve and why you want to achieve it, you would not be able to use whatever you learn on Coursera in any way. I can almost guarantee you that.
I have seen my friends go through some of those courses, but at the end of the day, they do not build anything, they do not derive correct conclusions, and they do not really “use” anything they learn or the skills they acquire.
The way all this happened to me is as follows:
- I dived deep into data, understood their structure, understood their types. I understood why we were even collecting all those data, how we were collecting them, how we were storing them, and how we were processing them before storing them.
- I learned how data could be handled effectively with these programming languages. I learned to clean the data, process them as much as I wanted to, and plot them in every possible way I could. Just plotting the data took me hours and hours, seeing how various plots could show the data in one way compared to another.
- I learned from my friends who manage databases how they did that and what went in the background. I learned the structures of the database tables.
- Then I learned how to make the relevant plots, and how to calculate the return on investment for doing anything. Here is where data science started coming together. There is no plot that I cannot plot: basically, every plot I saw on the internet, I learned how to plot it (see the plotting sketch right after this list). This is extremely important, and this is what will lead you to storytelling.
- Then I learned automating things, and this is really amazing, because you would be able to do a few things automatically, which would save you a lot of time.
- Automation came really easily with Python, R, VBScript, and C# programming. I can tell you that, roughly speaking, there is nothing that is not automated for me. I have a computer program for anything and everything, and most of my tasks are done with a button click ~ or let's say, a few button clicks.
- Then I learned report writing. What I learned is that I had to send a lot of data and plots to others over email. And believe me, people have no time and no interest. But if you make colorful plots, write a coherent report demonstrating what you want to say, and pack enormous amounts of powerful information into a few really colorful plots, you can make a case.
- Then I learned storytelling. What this simply means is that you should be able to tell the vice president of the company what the topmost problems of your division are. And the way you should derive these conclusions is by creating engaging plots that tell a story. Without this, you would not be able to convince anyone. People are not interested in numbers. All they remember is names, places, things, inspiration, and why someone wants to do something. A true data scientist is also a true presenter of the data.
- Then I read every possible blog on the internet to see how others were doing these things: how people were writing their programs, how they were creating various plots, how they were automating things, and so on. I also derived a lot of ideas from how someone used their skills to do an amazing project. This is a really nice way to see how others imagine. You can borrow their imagination and build things, and eventually, as things get easier for you, you will begin imagining things yourself.
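As for the plotting step above: here is a minimal sketch of what I mean by looking at the same data several ways, using pandas and seaborn. The file name and column names (measurements.csv, yield_pct, machine, temperature) are made up for illustration; substitute your own data.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("measurements.csv")  # hypothetical raw data

# Look at the same variable several ways before settling on one plot.
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
sns.histplot(df["yield_pct"], ax=axes[0])                             # distribution
sns.boxplot(x="machine", y="yield_pct", data=df, ax=axes[1])          # group comparison
sns.scatterplot(x="temperature", y="yield_pct", data=df, ax=axes[2])  # relationship
fig.tight_layout()
fig.savefig("yield_overview.png", dpi=150)  # save for a report
```

Putting a histogram, a box plot, and a scatter plot side by side is often enough to see which view tells the story best.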
Just take a look at the number of blogs available to you from where you can learn a lot of things.
I have gone through many of these blogs, and I have read them in depth. This took weeks of effort and multiple Saturdays and Sundays experimenting with data and programming languages.
My most frequently used websites:
- Stack Overflow
- Python Programming Tutorials
- The Comprehensive R Archive Network
- Seaborn: statistical data visualization
- Kaggle: Your Home for Data Science
- 16+ Free Data Science Books
- ipython/ipython
- vinta/awesome-python
- scikit-learn: machine learning in Python
- Grace: Gallery
- Check out the amazing plots here.
- Practical Programming for Total Beginners
- Learn Python the Hard Way
- YouTube. Yes! Just type your question here, and you should get an answer.
- Toad World
- SQL Tutorial
- Codecademy: Python
- http://mahout.apache.org/
- http://www.netlib.org/lapack/
- http://www.netlib.org/eispack/
- http://www.netlib.org/scalapack/
- RegExr: Learn, Build, & Test RegEx
- Regex for JavaScript, Python, PHP, and PCRE
- StatsModels: Statistics in Python: This one is a killer one! You can do a lot with this.
- Installing NLTK - NLTK 3.0 documentation
- Most read books:
- Learning Python
- The Art of R Programming: http://shop.oreilly.com/product/...
- I really don’t think I am a books person, but I do like to read them once in a while when I am in “There-is-no-way-but-to-read-the-manual” mode. I have read many statistics books, and I will update them here.
I would now give you a more comprehensive approach, so that you have a lot of inspiration to hold on to.
What does a typical engineer's job look like, and how can data science help along those lines?
- Decision making: In my job, I have several decisions to make and several actions to take in a day. In addition, I have various stakeholders to update, various people to give guidance to, various data sets to look at, and various tools and machines to handle. Some of these machines are physical machines making things, and others are simply computer programs and software platforms creating settings for those machines.
- Data: Most of the data we have is on various servers which are distributed across various units, or is on some shared drive, or on some hard disk drive available on a server.
- Databases: These database servers can be queried with SQL, or via direct data pulls, or by grabbing the data some other way (say, copying over FTP), sometimes even by manually copying and pasting into Excel, CSV, or Notepad. Usually we have multiple methods for direct data pulls from the servers. There are various SQL platforms such as TOAD and Business Intelligence, and even in-house-built platforms.
- SQL can be learned easily using these platforms, and one can create plenty of SQL scripts.
- You can even create scripts that can write scripts.
- I would encourage you to learn SQL, as it is one of the most-used languages for just getting data.
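As an illustration of a script that writes scripts, here is a minimal Python sketch that generates one SQL extract script per table. The table names are made up, and the SYSDATE filter assumes an Oracle-style database (the kind you would query through TOAD); adapt both to your own schema.

```python
# Hypothetical table names -- adapt to your own schema.
tables = ["lot_history", "tool_state", "defect_counts"]

# Oracle-style "everything since yesterday" extract (SYSDATE is Oracle SQL).
template = (
    "SELECT *\n"
    "FROM {table}\n"
    "WHERE load_date >= SYSDATE - 1;\n"
)

for table in tables:
    with open(f"extract_{table}.sql", "w") as f:
        f.write(template.format(table=table))  # one SQL script per table
```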
- Data again: The data in these databases can be highly structured, or somewhat unstructured - such as human comments and so on.
- These data can often have a fixed number of variables, or a varying number of variables.
- Sometimes data can be missing too, and sometimes they can be incorrectly entered into the databases (the pandas sketch after this group shows one way to flag such rows).
- Every time something like this is found, an immediate response is sent to the database managers, and they correct the bugs if there are any in the system.
- Usually, before setting up a whole giant database project, multiple people get together and discuss how the data should look, how they should be distributed into various tables, and how the tables should be connected.
- Such people are true data scientists, as they know what the end user is going to want on a daily basis over and over.
- They always try to structure the data as much as possible, because that makes them very easy to handle.
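Here is a minimal pandas sketch of the kind of check that flags missing or incorrectly entered data, as mentioned above. The file and column names are hypothetical.

```python
import pandas as pd

df = pd.read_csv("lot_history.csv")  # hypothetical extract

print(df.isna().sum())  # count missing values per column

# Flag physically impossible entries (a negative cycle time, say).
bad = df[df["cycle_time_s"] < 0]
if not bad.empty:
    bad.to_csv("suspect_rows.csv", index=False)  # hand this to the DB team
```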
- Scripting and scheduling: Using multiple scripts that are scheduled to run at specific times, or sometimes set up to run on an ad-hoc basis, I pull and dump data into various folders on a dedicated computer. I have a decently large HDD to store a lot of data.
- Usually I append new data to existing data sets, and purge out older data in a timely way.
- Sometimes I have programs running with sleep commands that, at scheduled times, merely check something quickly and go back to sleep.
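A minimal sketch of such a sleep-and-check loop in Python; the retention window, the folder, and the commented-out pull step are placeholders for your own logic.

```python
import time
from pathlib import Path

RETENTION_DAYS = 90          # assumed purge window
DATA_DIR = Path("data")
DATA_DIR.mkdir(exist_ok=True)

def purge_old_files() -> None:
    """Delete CSV files older than the retention window."""
    cutoff = time.time() - RETENTION_DAYS * 86400
    for f in DATA_DIR.glob("*.csv"):
        if f.stat().st_mtime < cutoff:
            f.unlink()

while True:
    # pull_and_append()      # placeholder for your own data pull
    purge_old_files()
    time.sleep(3600)         # sleep for an hour, then check again
```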
- More scripting: Furthermore, there are multiple scripts set up to crunch these data sets and derive a bunch of decisions from them.
- Cleaning data and creating valuable pivot tables and plots is one of the biggest time sinks for anyone trying to get value out of this.
- To achieve something like this, you first have to understand your data in and out, and you should be very capable of doing all sorts of hand calculations, generating Excel sheets, and visualizing data (a pivot-table sketch follows this group).
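Here is a minimal pandas sketch of the kind of pivot table such a crunching script might produce; the column names are invented for illustration.

```python
import pandas as pd

df = pd.read_csv("lot_history.csv")  # hypothetical extract

# Mean yield per machine and shift -- the pivot you would otherwise
# build by hand in Excel.
pivot = pd.pivot_table(
    df,
    values="yield_pct",
    index="machine",
    columns="shift",
    aggfunc="mean",
)
pivot.to_excel("yield_by_machine_and_shift.xlsx")  # needs openpyxl installed
```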
- Science: What I would impress on you is that before you do data science, do the science: learn the physics behind your data, and understand it in and out. Say you work in the T-shirt industry ~ you should know every aspect of a T-shirt in and out, you should have access to all possible information around T-shirts, and you should know very well what the customers want and like, without even looking at any of the data.
- Without understanding the science, data science is valueless, and trying to achieve something with it may be a fruitless effort.
- Caveats: I have seen plenty of people who do not even know what to plot against what.
- The worst I have seen is people plotting just about any random variables against each other and then deriving conclusions from them.
- True, correlations exist in many things, but you should always ask whether there is any causation.
- Example: There is a significant correlation between the number of Nobel laureates and the per-capita chocolate consumption of various countries; but is it causation? Maybe not! The sketch below shows how easily such a correlation can arise.
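To see how easily a strong correlation can appear with no causation behind it, here is a small sketch: two independent noisy series that merely share a time trend come out highly correlated.

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.arange(100)

# Two independent noisy series that both happen to trend upward over time.
chocolate = 2.0 * t + rng.normal(0, 10, size=100)
laureates = 0.5 * t + rng.normal(0, 10, size=100)

r = np.corrcoef(chocolate, laureates)[0, 1]
print(f"correlation: {r:.2f}")  # close to 1, yet neither causes the other
```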
- Back to programs: There is usually a sequence in which all the scripts run, creating all sorts of tables and plots to look at.
- Some scripts are sequential, whereas some programs are mere executables. Executables are usually written for speed, and C, C++, C#, etc. can be used for them.
- Scripts can be written in Python, VB etc.
- Decision making: When certain {If/Then} conditions are met, more computer programs trigger themselves and run more data analysis.
- Data science: This usually unfolds into a lot of statistics, classification, regression.
- Here is where machine learning comes in. One can use programming languages such as Python or R to do this.
- Based on the machine learning algorithms' results, more computer programs are run, more plots are generated, or more programs are triggered.
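Here is a minimal scikit-learn sketch of that classification step. The feature and label column names are hypothetical, and a random forest is just one reasonable model choice, not the only one.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

df = pd.read_csv("lot_history.csv")              # hypothetical extract
X = df[["temperature", "pressure", "cycle_time_s"]]  # assumed features
y = df["pass_fail"]                                   # assumed label

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))
```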
- Plotting: Ultimately, a lot of plots are stored in a coherent fashion for humans to make decisions.
- Self-sustaining reports: The reports are self-triggering, self-sustained programs that tell me what to do (a report-generation sketch follows this list).
- The feeling of being Iron Man: I usually look at the results from all the reports in 10 minutes, and then make decisions on what to do next for many hours. Every now and then I look at the reports again to redefine the decisions or change them on the fly if that has to be done.
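Here is a minimal sketch of how such a self-sustained report could be stitched together: it collects whatever plots the earlier scripts saved into one HTML page that you can review in ten minutes. The folder and file names are assumptions.

```python
from pathlib import Path
import datetime

plots = sorted(Path("plots").glob("*.png"))  # whatever earlier scripts saved
body = "\n".join(f'<img src="{p}" width="600"><br>' for p in plots)

html = f"""<html><body>
<h1>Daily status report - {datetime.date.today()}</h1>
{body}
</body></html>"""

Path("report.html").write_text(html)  # open in a browser or attach to email
```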
What are the advantages of doing all this?
- First of all, when a computer does something, it would do it at a much faster speed than a human.
- A computer will do it tirelessly, and endlessly.
- Computer programs need a sufficient amount of training, and multiple levels of testing for varying inputs, but once all that is done, they will keep doing the job forever, until either the sample space itself changes or something about the input changes drastically.
- By programming things to the point where the entire output lands on a dashboard, it is very easy to see what the order of the projects should be.
How do you now create value from something like this?
- One should always be behind the science! And by knowing your data as well as possible, you will be able to order the implementation of your projects.
- The decision you would make, and the actions you would take would be harder, better, faster, stronger.
- You would be able to derive conclusions and generate some lean sigma projects.
- You would be able to update the stakeholders well ahead of time, and stay on top of your projects.
- You would be able to focus only on the science aspect instead of just trying to manually create plots.
- You would be able to find trends in your data more easily, and say things one way or the other if the data tell you to decide in favor of one choice over another.
- Last but not least, you can reduce human effort significantly and automate everything.
- I even have scripts that push buttons for me or fill in forms for me.
- I have several image analysis programs that analyze images and make decisions on the fly, without humans looking at them.
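As a small sketch of such an on-the-fly image decision, here is a Pillow-based check; the file name and brightness threshold are made-up stand-ins for a real image-analysis model.

```python
from PIL import Image
import numpy as np

# Load the inspection image in grayscale; the file name is a made-up example.
img = np.asarray(Image.open("inspection.png").convert("L"))
mean_brightness = img.mean()

# A simple threshold rule standing in for a real image-analysis model.
if mean_brightness < 40:  # assumed cutoff
    print("image too dark -- flag for re-capture")
else:
    print("image OK -- continue processing")
```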
I hope this answer is elaborate and gives you some insight into what you can work on.