Thursday, May 25, 2017

Principal Component Analysis Explained Visually

http://setosa.io/ev/principal-component-analysis/

Principal component analysis (PCA) is a technique used to emphasize variation and bring out strong patterns in a dataset. It's often used to make data easy to explore and visualize.

2D example

First, consider a dataset in only two dimensions, like (height, weight). This dataset can be plotted as points in a plane. But if we want to tease out variation, PCA finds a new coordinate system in which every point has a new (x,y) value. The axes don't actually mean anything physical; they're combinations of height and weight called "principal components" that are chosen to give one axes lots of variation.
Drag the points around in the following visualization to see PC coordinate system adjusts.

PCA is useful for eliminating dimensions. Below, we've plotted the data along a pair of lines: one composed of the x-values and another of the y-values.

If we're going to only see the data along one dimension, though, it might be better to make that dimension the principal component with most variation. We don't lose much by dropping PC2 since it contributes the least to the variation in the data set.

3D example

With three dimensions, PCA is more useful, because it's hard to see through a cloud of data. In the example below, the original data are plotted in 3D, but you can project the data into 2D through a transformation no different than finding a camera angle: rotate the axes to find the best angle. To see the "official" PCA transformation, click the "Show PCA" button. The PCA transformation ensures that the horizontal axis PC1 has the most variation, the vertical axis PC2 the second-most, and a third axis PC3 the least. Obviously, PC3 is the one we drop.

Eating in the UK (a 17D example)

Original example from Mark Richardson's class notes Principal Component Analysis What if our data have way more than 3-dimensions? Like, 17 dimensions?! In the table is the average consumption of 17 types of food in grams per person per week for every country in the UK.
The table shows some interesting variations across different food types, but overall differences aren't so notable. Let's see if PCA can eliminate dimensions to emphasize how countries differ.

Here's the plot of the data along the first principal component. Already we can see something is different about Northern Ireland.

Now, see the first and second principal components, we see Northern Ireland a major outlier. Once we go back and look at the data in the table, this makes sense: the Northern Irish eat way more grams of fresh potatoes and way fewer of fresh fruits, cheese, fish and alcoholic drinks. It's a good sign that structure we've visualized reflects a big fact of real-world geography: Northern Ireland is the only of the four countries not on the island of Great Britain. (If you're confused about the differences among England, the UK and Great Britain, see: this video.)

For more explanations, visit the Explained Visually project homepage.
Or subscribe to our mailing list.

Tuesday, May 16, 2017

Roboschool

https://github.com/openai/roboschool

Roboschool

Release blog post is here:
https://blog.openai.com/roboschool/
Roboschool is a long-term project to create simulations useful for research. The roadmap is as follows:

Replicate Gym MuJoCo environments.
Take a step away from trajectory-centric fragile MuJoCo tasks.
Explore multiplayer games.
Create tasks with camera RGB image and joints in a tuple.
Teach robots to follow commands, including verbal commands.

Environments List

The list of Roboschool environments is as follows:

RoboschoolInvertedPendulum-v0
RoboschoolInvertedPendulumSwingup-v0
RoboschoolInvertedDoublePendulum-v0
RoboschoolReacher-v0
RoboschoolHopper-v0
RoboschoolWalker2d-v0
RoboschoolHalfCheetah-v0
RoboschoolAnt-v0
RoboschoolHumanoid-v0
RoboschoolHumanoidFlagrun-v0
RoboschoolHumanoidFlagrunHarder-v0
RoboschoolPong-v0

To obtain this list:

import roboschool, gym; 
print("\n".join(['- ' + spec.id for spec in gym.envs.registry.all() if 
spec.id.startswith('Roboschool')]))

Installation

First, define a ROBOSCHOOL_PATH variable in the current shell. It will be used in this README but not anywhere in the Roboschool code.

ROBOSCHOOL_PATH=/path/to/roboschool

The dependencies are gym, Qt5, assimp, tinyxml, and bullet (from a branch). For the non-bullet deps, there are several options, depending on what platform and package manager you are using.

Ubuntu:

apt install cmake ffmpeg pkg-config qtbase5-dev libassimp-dev libpython3.5-dev libboost-python-dev libtinyxml-dev

Linuxbrew

brew install boost-python --without-python --with-python3 --build-from-source
export C_INCLUDE_PATH=/home/user/.linuxbrew/include:/home/user/.linuxbrew/include/python3.6m
export CPLUS_INCLUDE_PATH=/home/user/.linuxbrew/include:/home/user/.linuxbrew/include/python3.6m
export LIBRARY_PATH=/home/user/.linuxbrew/lib
export PKG_CONFIG_PATH=/home/user/.linuxbrew/lib/pkgconfig:/usr/lib/pkgconfig:/usr/lib/x86_64-linux-gnu/pkgconfig

(still use Qt from Ubuntu, because it's known to work)

Mac, homebrew python:

# Will not work on Mavericks: unsupported by homebrew, some libraries won't compile, upgrade first
brew install python3
brew install cmake tinyxml assimp ffmpeg qt
brew install boost-python --without-python --with-python3 --build-from-source
export PATH=/usr/local/bin:/usr/local/opt/qt5/bin:$PATH
export PKG_CONFIG_PATH=/usr/local/opt/qt5/lib/pkgconfig

Mac, Anaconda with Python 3

brew install cmake tinyxml assimp ffmpeg
brew install boost-python --without-python --with-python3 --build-from-source
conda install qt
export PKG_CONFIG_PATH=$(dirname $(dirname $(which python)))/lib/pkgconfig

Compile and install bullet as follows. Note that make install will merely copy files into the roboschool directory.

git clone https://github.com/olegklimov/bullet3 -b roboschool_self_collision
mkdir bullet3/build
cd    bullet3/build
cmake -DBUILD_SHARED_LIBS=ON -DUSE_DOUBLE_PRECISION=1 -DCMAKE_INSTALL_PREFIX:PATH=$ROBOSCHOOL_PATH/roboschool/cpp-household/bullet_local_install -DBUILD_CPU_DEMOS=OFF -DBUILD_BULLET2_DEMOS=OFF -DBUILD_EXTRAS=OFF  -DBUILD_UNIT_TESTS=OFF -DBUILD_CLSOCKET=OFF -DBUILD_ENET=OFF -DBUILD_OPENGL3_DEMOS=OFF ..
make -j4
make install
cd ../..

Finally, install project itself:

pip3 install -e $ROBOSCHOOL_PATH

Now, check to see if it worked by running a pretrained agent from the agent zoo.

Agent Zoo

We have provided a number of pre-trained agents in the agent_zoo directory.
To see a humanoid run towards a random varying target:

python $ROBOSCHOOL_PATH/agent_zoo/RoboschoolHumanoidFlagrun_v0_2017may.py

To see three agents in a race:

python $ROBOSCHOOL_PATH/agent_zoo/demo_race2.py

Monday, May 15, 2017

How does a total beginner start to learn machine learning?

If you are a total beginner, in short your path should look like this:

Learn SQL, and Python.
Then learn Machine learning from a couple of basic courses.
Learn probability theory, and some computational mathematics.
The world of statistics is vast, but very interesting:
THE WORLD OF STATISTICS
Then dive into Your Home for Data Science and see what others are working on.
Then spend your time on Scikit learn website:

Then practice on your own, and grow bit by bit.
If you need a curated list to follow, you can start here: https://github.com/bulutyazilim/...
Here is a cool visualization on how a modern data scientist looks like:

I work with people who write C/C++ programs that generate GBs of data, people who manage TBs of data distributed across giant databases, people who are top notch programmers in SQL, Python, R, and people who have setup an organization wide databases working with Hadoop, Sap, Business Intelligence etc.

My inspiration to anyone and everyone would be following:

Learn all the basics from Coursera, but if I really have to compare what you would get out of Coursera compared to the vastness of data science, let us say ~ Coursera is as good as eating a burrito at Chipotle Mexican Grill. You certainly can satiate yourself, and you have a few things to eat there.
The pathway to value adding data science is really quite deep, and I consider it equivalent to a five star buffet offering 20 cuisines and some 500 different recipes.
Coursera is certainly a good starting point, and one should certainly go over these courses, but I personally never paid any money to Coursera, and I could easily learn a variety of things bit by bit over time.
Kaggle is a really good resource for budding engineers to look at various other people’s ideas and build on them.

My own learning came from actually building things. I started with SQL, then I learned Python, then I learned R, then I learned many libraries in Python and R. Then I learned html, decent GUI programming using VB script, C# programming. Then I learned Scikit learn. Finally I talked to various statisticians at my work place whose day in day out job is to derive conclusions out of data, and in the process I learned JMP/JSL scripting. I learned a lot of statistics in the process.

Here is an overall sequence of how I progressed myself.

The first thing I want to inspire anyone and everyone is to learn the “science”. Data science is 90 % Science, and 10 % managing data. Without knowing science, and without knowing what you want to achieve and why you want to achieve it, you would not be able to use whatever you learn on Coursera in any way. I can almost guarantee you that.

I have seen my friends going through some of those courses, but at the end of the day, they do not build anything, they do not derive correct conclusions, and they do not really “use” anything that they learn. More than that, they do not even really use the skills they acquire.

The way all this happened to me is as follows:

I dived deep into data, understood their structure, understood their types. I understood why we were even collecting all those data, how we were collecting them, how we were storing them, and how we were processing them before storing them.
I learned how data could be handled with these programming languages effectively. I learned to clean the data, process them as much as I wanted to, and plot them with with every possible way I could. Just plotting the data took me hours and hours to see how various plots could show the data in one way compared to another.
I learned from my friends who manage databases how they did that and what went in the background. I learned the structures of the database tables.
Then I learned how to plot some relevant plots, and calculate the return on investment for doing anything. Here is where Data science started coming together. There is no plot that I cannot plot. Basically - every plot I saw on the internet, I learned how to plot it. This is extremely important, and this is what will lead you to story telling.
Then I learned automating things, and this is really amazing, because you would be able to do a few things automatically, which would save you a lot of time.
Automation came really easily with Python, R, VBscript, C# programming.
I can tell you that there is roughly speaking nothing that is not automated for me. I have a computer program for anything and everything, and most of my things are done by a button click ~ Or lets say - a few button clicks.
Then I learned report writing. What I learned is that I had to send a lot of data and plots to others over a mail. And believe me, people have no time, and no interest. But if you make colorful plots, write down a coherent report demonstrating what you want to say, and pack enormous and powerful information in few really colorful plots, you can make a case.
Then I learned story telling. What this simply means is that you should be able to tell the vice president of the company what the topmost problems of your division are. And they way you should be able to derive these conclusions are by creating engaging plots that tell a story. Without this, you would not be able to convince anyone. People are not interested in numbers. All they remember is names, places, things, inspiration, and why someone wants to do something. A true data scientist is also a true presenter of the data.
Then I read every possible blog on the internet to see how others were doing these things. How people were writing their programs, how they were creating various plots, how they were automating things and so on. I also derived a lot of ideas from how someone used their skills to do an amazing project. This is a really nice way to see how others imagine. Then you can borrow their imagination and build things, and eventually as things are easier for you, you would begin imagining things yourself.

Just take a look at the number of blogs available to you from where you can learn a lot of things.

The Ultimate Guide to Data Science Blogs.

I have gone through many of these blogs, and I have read them in depth. This took weeks of efforts and multiple Saturdays and Sundays experimenting with data, and programming languages.

My most frequently used websites:

Stack Overflow
Python Programming Tutorials
The Comprehensive R Archive Network
Seaborn: statistical data visualization
Your Home for Data Science
16+ Free Data Science Books
ipython/ipython
vinta/awesome-python
scikit-learn: machine learning in Python
Grace: Gallery

Check out the amazing plots here.

Practical Programming for Total Beginners
Learn python the hard way
YouTube. Yes! Just type your question here, and you should get an answer.
Toad World
SQL Tutorial
CodeAcademy: Python
http://mahout.apache.org/
http://www.netlib.org/lapack/
http://www.netlib.org/eispack/
http://www.netlib.org/scalapack/
RegExr: Learn, Build, & Test RegEx
Regex for JavaScript, Python, PHP, and PCRE
StatsModels: Statistics in Python: This one is a killer one! You can do a lot with this.
Installing NLTK - NLTK 3.0 documentation
Most read books:
Learning Python
The Art of R ProgrammingArt of R programming http://shop.oreilly.com/product/...
I really don’t think I am a books person, but I do like to read them once in a while when I am in “There-is-no-way-but-to-read-the-manual” mode. I have read many statistics books, and I will update them here.

I would now give you a more comprehensive approach, so that you have a lot of inspiration to hold on to.

How does a typical engineer’s job look like, and how can data science help on those lines?

Decision making: In my job, I have several decisions to make and several actions to take in a day. In addition, I have various stake holders to update, various people to give guidance to, various data sets to look at, and various tools and machines to handle. Some of these machines are physical machines making things, and some others are simply computer programs and software platforms creating settings for these machines.
Data: Most of the data we have is on various servers which are distributed across various units, or is on some shared drive, or on some hard disk drive available on a server.
Databases: These database servers can be used to get data with SQL, or direct data pull, or by grabbing them somehow (Say copying by FTP), sometimes even manually copying, and pasting into excel, CSV or notepad. Usually we have multiple methods to do direct data pull from the servers. There are various SQL platforms such as TOAD, Business Intelligence, and even in house in built platforms.

SQL can be learned easily using these platforms, and one can create plenty of SQL scripts.
You can even create scripts that can write scripts.
I would inspire you to learn SQL as it is one of the most used language for just getting data.

Data again: The data on these databases can be highly structured, or somewhat unstructured - such as human comments or so on.

These data can often have a fixed number of variables, or varying number of variables.
Sometimes data can be missing too, and sometimes they can be incorrectly entered in the databases.

Every time something like this is found, and immediate response is sent to database managers, and they correct the bugs if there are any in the system.
Usually before setting up a whole giant project of setting up a database, multiple people unite and discuss how the data should look like, how they should be distributed into various tables, and how the tables should be connected.
Such people are true data scientists, as they know what the end user is going to want on a daily basis over and over.
They always try to structure the data as much as possible, because it makes it very easy to handle it.

Scripting and scheduling: Using multiple scripts that are scheduled to run at specific timings, or sometimes setup to run on an adhoc basis, I get and dump data in various folders on a dedicated computer. I have a decently large HDD to store a lot of data.

Usually I append new data to existing data sets, and purge out older data in a timely way.
Sometimes I have programs running with sleep commands, that at scheduled timings merely check something quickly, and sleep back again.

More scripting: Furthermore, there are multiple scripts that are setup to crunch these data sets and create a bunch of decisions from them.

Cleaning data, creating valuable pivot tables, and plots is one of the biggest time hold ups for anyone trying to achieve value out of this.
To achieve something like this, first you would have to understand your data in and out, and you should be very capable of doing all sorts of hand calculations, generating excel sheets, and visualizing data.
Science: What I would inspire you with is that before you do data-science, do the science, learn the physics behind your data, and understand it in and out. Say ~ If you work in a T-Shirt industry, you should know every aspect of a T-shift in and out, you should have access to all possible information around T-shirts, and you should know very well what the customers want and like, without even looking at any of the data.
Without understanding the science, data-science is valueless, and trying to achieve something with it may be a fruitless effort.
Caveats: I have seen plenty of people not even knowing what to plot against what.

The worst I have seen is that people plot just about some random variables against each other and they derive conclusions out of them.
True, that correlations exist in many things, but you should always know if there is any causation.
Example: There is a significant correlation between number of Nobel Laureates and per-capita chocolate consumption of various countries; But is it a causation? May be not!

Back to programs: There is usually a sequence in which all the scripts run, and create all sorts of tables, and plots to look at.

Some scripts are sequential, whereas some programs are mere executables. Executables usually are written for speed, and C, C++, C# etc can be used for them.
Scripts can be written in Python, VB etc.

Decision making: When certain {If/Then} conditions are met, more computer programs self trigger and run more data analysis.
Data science: This usually unfolds into a lot of statistics, classification, regression.

Here is where machine learning comes in. One can use programming languages such as Python or R to do this.
Based off the machine learning algorithms’ results, more computer programs are ran and more plots are generated or more programs are triggered.

Plotting: Ultimately, a lot of plots are stored in a coherent fashion for humans to make decisions.
Self sustaining reports: The reports are self triggering, self sustained programs that tell me what to do.
The feeling of being ironman: I usually look at the results from all the reports in 10 mins, and make decisions on what to do next for many hours. Every now and then I look at the reports again to re-define the decisions or change them on the fly if this has to be done.

What are the advantages of doing all this?

First of all, when a computer does something, it would do it at a much faster speed than a human.
A computer will do it tirelessly, and endlessly.
Computer programs need sufficient amount of training, and multiple levels of testing for varying inputs, but once all that is done, it would keep doing that job for ever until either the sample space itself changes, or something drastically changes to the input itself.
By programming it to the level that the entire output is set on a dashboard, it is very easy to see what the order of the projects should be.

How do you now create value from something like this?

One should always be behind science! and by knowing your data as well as possible you would be able to order the implementation of your projects.
The decision you would make, and the actions you would take would be harder, better, faster, stronger.
You would be able to derive conclusions and generate some lean sigma projects.
You would be able to update the stakeholders well ahead of time, and be able to be on the top of your projects.
You would be able to focus only on the science aspect instead of just trying to manually create plots.
You would be able to find out trends in your data more easily, and say things one way or the other if the data tell you to make decisions in favor of one choice over other.
Last but not the least, you can reduce human efforts significantly and automate all the things for you.

I even have scripts that push buttons for me or fill up forms for me.
I have several image analysis programs that analyze images and make decisions on the fly without humans looking at them.

I hope this answer is elaborate and gives you some insight on what you can work on.

Wednesday, May 10, 2017

Flower Species Recognition using Global Feature Descriptors

https://gogul09.github.io/software/flower-recognition

Sunday, May 7, 2017

Neural Network Classification Model for Handwritten Digit Recognition

https://www.kaggle.com/statinstilettos/neural-network-approach

Exploratory Analysis

Summary statistics and visualizations of data. The data is first preprocessed by visualizing the sample size for each digit in the dataset, plotting a few of the digits using the data provided to get an understanding of exactly what the data represents, normalizing the data, and reducing the features using PCA. It is important to note that this dataset is sparse, meaning that there are mostly 0's in the feature matrix. Some pixels carry a lot of information about the digit written, while other pixel features such as the edges and usually 0 and not very informative. The scale of the data used to represent the pixels is not numerically meaningful, which leads to the need to normalize the data so that the values do not contribute to the model in an improper way.

Saturday, May 6, 2017

image database

https://www.architecturaldesigns.com/house-plans/collections/photo-gallery?utf8=%E2%9C%93&sort=mp&num=&tla_min=&tla_max=&w_min=&w_max=&d_min=&d_max=&st%5B%5D=172&cl%5B%5D=208

https://en.wikipedia.org/wiki/List_of_house_styles

http://www.nmhistoricpreservation.org/assets/files/arms/HCPIArchitecturalStyles20131115.pdf

Genetic Fuzzy based Artificial Intelligence for Unmanned Combat Aerial Vehicle Control in Simulated Air Combat Missions

https://www.omicsgroup.org/journals/genetic-fuzzy-based-artificial-intelligence-for-unmanned-combat-aerialvehicle-control-in-simulated-air-combat-missions-2167-0374-1000144.pdf