Thursday, May 25, 2017

Principal Component Analysis Explained Visually

http://setosa.io/ev/principal-component-analysis/

Principal component analysis (PCA) is a technique used to emphasize variation and bring out strong patterns in a dataset. It's often used to make data easy to explore and visualize.

2D example

First, consider a dataset in only two dimensions, like (height, weight). This dataset can be plotted as points in a plane. But if we want to tease out variation, PCA finds a new coordinate system in which every point has a new (x,y) value. The axes don't actually mean anything physical; they're combinations of height and weight called "principal components" that are chosen to give one axis lots of variation.
Drag the points around in the following visualization to see how the PC coordinate system adjusts.
[Figure: two interactive scatter plots. Left: original data set (x vs. y, axes from 0 to 10). Right: output from PCA (pc1 vs. pc2, axes from -6 to 6).]
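For readers who want to see the new coordinate system computed rather than dragged around, here is a minimal sketch in Python with NumPy. It assumes the common recipe of centering the data and taking the eigenvectors of the covariance matrix; the (height, weight) numbers are invented for illustration and are not the points from the visualization.

```python
import numpy as np

# Hypothetical (height in cm, weight in kg) samples, invented for illustration.
data = np.array([
    [150.0, 52.0],
    [160.0, 58.0],
    [165.0, 63.0],
    [170.0, 68.0],
    [175.0, 70.0],
    [180.0, 79.0],
    [185.0, 83.0],
])

# Center each column so the new axes pass through the mean of the data.
centered = data - data.mean(axis=0)

# Eigen-decomposition of the 2x2 covariance matrix.
cov = np.cov(centered, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# eigh returns eigenvalues in ascending order; reorder so PC1 comes first.
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
components = eigenvectors[:, order]  # columns are the PC1 and PC2 directions

# The new (pc1, pc2) coordinates of every point.
scores = centered @ components

print("PC1 direction (mix of height and weight):", components[:, 0])
print("variance along PC1 and PC2:", eigenvalues)
```

Each column of `components` is a combination of height and weight, which is why the new axes have no physical meaning of their own. The sign of each direction is arbitrary, so a plot may come out mirrored.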
PCA is useful for eliminating dimensions. Below, we've plotted the data along a pair of lines: one composed of the x-values and another of the y-values.
If we're going to only see the data along one dimension, though, it might be better to make that dimension the principal component with most variation. We don't lose much by dropping PC2 since it contributes the least to the variation in the data set.
[Figure: the same data plotted along single axes: x and y (0 to 10), and pc1 and pc2 (-6 to 6).]
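To put a number on "we don't lose much," one option is to compare how much of the total variance a single axis keeps. The sketch below uses synthetic correlated points (made up, seeded for repeatability) and the singular value decomposition of the centered data, which is one standard way to get principal components.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic correlated 2D points, invented for illustration.
x = rng.normal(5.0, 2.0, size=200)
y = 0.8 * x + rng.normal(0.0, 0.7, size=200)
data = np.column_stack([x, y])

centered = data - data.mean(axis=0)

# SVD of the centered data: the rows of vt are the principal directions.
_, singular_values, vt = np.linalg.svd(centered, full_matrices=False)

# Variance along each principal component, and its share of the total.
variance = singular_values**2 / (len(data) - 1)
explained = variance / variance.sum()

# The one-dimensional view of the data: coordinates along PC1 only.
pc1_scores = centered @ vt[0]

# For comparison, the share of variance kept by x alone or y alone.
total = centered.var(axis=0, ddof=1).sum()
share_x = centered[:, 0].var(ddof=1) / total
share_y = centered[:, 1].var(ddof=1) / total

print(f"kept by x only:   {share_x:.1%}")
print(f"kept by y only:   {share_y:.1%}")
print(f"kept by PC1 only: {explained[0]:.1%}")
```

PC1 is the direction of maximum variance, so its share can never be smaller than what either original axis keeps on its own.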

3D example

With three dimensions, PCA is more useful, because it's hard to see through a cloud of data. In the example below, the original data are plotted in 3D, but you can project the data into 2D through a transformation no different than finding a camera angle: rotate the axes to find the best angle. To see the "official" PCA transformation, click the "Show PCA" button. The PCA transformation ensures that the horizontal axis PC1 has the most variation, the vertical axis PC2 the second-most, and a third axis PC3 the least. Obviously, PC3 is the one we drop.
[Figure: interactive 3D plot of the original data (x, y, z axes from -10 to 10) next to its PCA output (pc1, pc2, pc3 axes from -10 to 10), with a 2D view of pc1 vs. pc2.]
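The "Show PCA" view can be approximated in a few lines. This is a sketch under stated assumptions, not the code behind the visualization: `pca_project` is a hypothetical helper name, and the 3D cloud is synthetic.

```python
import numpy as np

def pca_project(points, n_components):
    """Return (scores, variances): coordinates on the leading PCs and the variance of every PC."""
    centered = points - points.mean(axis=0)
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    variances = s**2 / (len(points) - 1)      # sorted from largest to smallest
    scores = centered @ vt[:n_components].T   # project onto the first n_components directions
    return scores, variances

rng = np.random.default_rng(1)

# Synthetic 3D cloud: wide in one latent direction, flat in another,
# then linearly mixed so no original axis lines up with the spread.
latent = rng.normal(size=(300, 3)) * np.array([5.0, 2.0, 0.5])
mixing = np.array([
    [0.6, 0.7, 0.4],
    [-0.7, 0.2, 0.7],
    [0.4, -0.7, 0.6],
])
cloud = latent @ mixing

# Keep PC1 and PC2 (the "camera angle" PCA picks) and drop PC3.
scores_2d, variances = pca_project(cloud, n_components=2)

print("variance along PC1, PC2, PC3:", variances)
print("share lost by dropping PC3:", variances[2] / variances.sum())
```

The columns of `scores_2d` are the horizontal and vertical coordinates of the 2D view, and the last line reports how much variation that view leaves out.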

Eating in the UK (a 17D example)

Original example from Mark Richardson's class notes, Principal Component Analysis.

What if our data have way more than 3 dimensions? Like, 17 dimensions?! The table below shows the average consumption of 17 types of food, in grams per person per week, for every country in the UK.
The table shows some interesting variations across different food types, but overall differences aren't so notable. Let's see if PCA can eliminate dimensions to emphasize how countries differ.
Food (grams per person per week)  England  N Ireland  Scotland  Wales
Alcoholic drinks                      375        135       458    475
Beverages                              57         47        53     73
Carcase meat                          245        267       242    227
Cereals                              1472       1494      1462   1582
Cheese                                105         66       103    103
Confectionery                          54         41        62     64
Fats and oils                         193        209       184    235
Fish                                  147         93       122    160
Fresh fruit                          1102        674       957   1137
Fresh potatoes                        720       1033       566    874
Fresh Veg                             253        143       171    265
Other meat                            685        586       750    803
Other Veg                             488        355       418    570
Processed potatoes                    198        187       220    203
Processed Veg                         360        334       337    365
Soft drinks                          1374       1506      1572   1256
Sugars                                156        139       147    175
Here's the plot of the data along the first principal component. Already we can see something is different about Northern Ireland.
[Figure: England, Wales, Scotland and N Ireland plotted along pc1 (axis from -300 to 500).]
Now, plotting the first and second principal components together, we see that Northern Ireland is a major outlier. Once we go back and look at the data in the table, this makes sense: the Northern Irish eat way more grams of fresh potatoes and way fewer of fresh fruits, cheese, fish and alcoholic drinks. It's a good sign that the structure we've visualized reflects a big fact of real-world geography: Northern Ireland is the only one of the four countries not on the island of Great Britain. (If you're confused about the differences among England, the UK and Great Britain, see: this video.)


[Figure: scatter plot of pc1 (axis from -300 to 500) against pc2 (axis from -400 to 400) for England, Wales, Scotland and N Ireland.]
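The same analysis can be reproduced from the table with NumPy. The sketch below makes one assumption worth flagging: it centers each food column but does not rescale it, so foods eaten in larger quantities carry more weight. The exact preprocessing behind the plots above may differ.

```python
import numpy as np

countries = ["England", "N Ireland", "Scotland", "Wales"]
foods = [
    "Alcoholic drinks", "Beverages", "Carcase meat", "Cereals", "Cheese",
    "Confectionery", "Fats and oils", "Fish", "Fresh fruit", "Fresh potatoes",
    "Fresh Veg", "Other meat", "Other Veg", "Processed potatoes",
    "Processed Veg", "Soft drinks", "Sugars",
]

# Rows are countries, columns are foods (grams per person per week), from the table.
grams = np.array([
    [375, 57, 245, 1472, 105, 54, 193, 147, 1102, 720, 253, 685, 488, 198, 360, 1374, 156],
    [135, 47, 267, 1494, 66, 41, 209, 93, 674, 1033, 143, 586, 355, 187, 334, 1506, 139],
    [458, 53, 242, 1462, 103, 62, 184, 122, 957, 566, 171, 750, 418, 220, 337, 1572, 147],
    [475, 73, 227, 1582, 103, 64, 235, 160, 1137, 874, 265, 803, 570, 203, 365, 1256, 175],
], dtype=float)

# Center each food column (assumption: no rescaling of the columns).
centered = grams - grams.mean(axis=0)

# SVD of the centered data; the rows of vt are the principal directions.
u, s, vt = np.linalg.svd(centered, full_matrices=False)
scores = u * s  # country coordinates on each PC, same as centered @ vt.T

pc1 = scores[:, 0]
outlier = countries[int(np.argmax(np.abs(pc1)))]

for name, a, b in zip(countries, scores[:, 0], scores[:, 1]):
    print(f"{name:10s} pc1={a:8.1f} pc2={b:8.1f}")
print("farthest from the center along PC1:", outlier)

# Foods with the largest weight in PC1, to see what drives the separation.
top = np.argsort(np.abs(vt[0]))[::-1][:4]
for i in top:
    print(f"{foods[i]:18s} loading={vt[0, i]: .2f}")
```

The signs of the principal components are arbitrary, so the printed coordinates may be mirrored relative to the plots. What matters is which country sits apart from the others and which foods have loadings of opposite sign.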


For more explanations, visit the Explained Visually project homepage.
